Are bitwise operations still practical? - c

Wikipedia, the one true source of knowledge, states:
On most older microprocessors, bitwise
operations are slightly faster than
addition and subtraction operations
and usually significantly faster than
multiplication and division
operations. On modern architectures,
this is not the case: bitwise
operations are generally the same
speed as addition (though still faster
than multiplication).
Is there a practical reason to learn bitwise operation hacks or it is now just something you learn for theory and curiosity?

Bitwise operations are worth studying because they have many applications. It is not their main use to substitute arithmetic operations. Cryptography, computer graphics, hash functions, compression algorithms, and network protocols are just some examples where bitwise operations are extremely useful.
The lines you quoted from the Wikipedia article just tried to give some clues about the speed of bitwise operations. Unfortunately the article fails to provide some good examples of applications.

Bitwise operations are still useful. For instance, they can be used to create "flags" using a single variable, and save on the number of variables you would use to indicate various conditions. Concerning performance on arithmetic operations, it is better to leave the compiler do the optimization (unless you are some sort of guru).

They're useful for getting to understand how binary "works"; otherwise, no. In fact, I'd say that even if the bitwise hacks are faster on a given architecture, it's the compiler's job to make use of that fact — not yours. Write what you mean.

The only case where it makes sense to use them is if you're actually using your numbers as bitvectors. For instance, if you're modeling some sort of hardware and the variables represent registers.
If you want to perform arithmetic, use the arithmetic operators.

Depends what your problem is. If you are controlling hardware you need ways to set single bits within an integer.
Buy an OGD1 PCI board (open graphics card) and talk to it using libpci. http://en.wikipedia.org/wiki/Open_Graphics_Project

It is true that in most cases when you multiply an integer by a constant that happens to be a power of two, the compiler optimises it to use the bit-shift. However, when the shift is also a variable, the compiler cannot deduct it, unless you explicitly use the shift operation.

Funny nobody saw fit to mention the ctype[] array in C/C++ - also implemented in Java. This concept is extremely useful in language processing, especially when using different alphabets, or when parsing a sentence.
ctype[] is an array of 256 short integers, and in each integer, there are bits representing different character types. For example, ctype[;A'] - ctype['Z'] have bits set to show they are upper-case letters of the alphabet; ctype['0']-ctype['9'] have bits set to show they are numeric. To see if a character x is alphanumeric, you can write something like 'if (ctype[x] & (UC | LC | NUM))' which is somewhat faster and much more elegant than writing 'if ('A' = x <= 'Z' || ....'.
Once you start thinking bitwise, you find lots of places to use it. For instance, I had two text buffers. I wrote one to the other, replacing all occurrences of FINDstring with REPLACEstring as I went. Then for the next find-replace pair, I simply switched the buffer indices, so I was always writing from buffer[in] to buffer[out]. 'in' started as 0, 'out' as 1. After completing a copy I simply wrote 'in ^= 1; out ^= 1;'. And after handling all the replacements I just wrote buffer[out] to disk, not needing to know what 'out' was at that time.
If you think this is low-level, consider that certain mental errors such as deja-vu and its twin jamais-vu are caused by cerebral bit errors!

Working with IPv4 addresses frequently requires bit-operations to discover if a peer's address is within a routable network or must be forwarded onto a gateway, or if the peer is part of a network allowed or denied by firewall rules. Bit operations are required to discover the broadcast address of a network.
Working with IPv6 addresses requires the same fundamental bit-level operations, but because they are so long, I'm not sure how they are implemented. I'd wager money that they are still implemented using the bit operators on pieces of the data, sized appropriately for the architecture.

Of course (to me) the answer is yes: there can be practical reasons to learn them. The fact that nowadays, e.g., an add instruction on typical processors is as fast as an or/xor or an and just means that: an add is as fast as, say, an or on those processors.
The improvements in speed of instructions like add, divide, and so on, just means that now on those processors you can use them and being less worried about performance impact; but it is true now as in the past that you usually won't change every adds to bitwise operations to implement an add. That is, in some cases it may depend on which hacks: likely some hack now must be considered educational and not practical anymore; others could have still their practical application.

Related

Can a C compiler implement signed right shift "unreasonably"?

The title field isn't long enough to a capture a detailed question, so for the record, my actual question defines "unreasonable" in a specific way:
Is it legal1 for a C implementation have an arithmetic right
shift operator that returns different results, over time, for the
identical argument values? That is, must >> be a true function?
Imagine you wanted to write portable code using right-shift >> on signed values in C. Unfortunately, for you, the most efficient implementation of some critical part of your algorithm is fastest when signed right shifts fill the sign bit (i.e., they are arithmetic right shifts). Now since this behavior is implementation defined you are kind of screwed if you want to write portable code that takes advantage of it.
Just reading the compiler's documentation (which it is required to provide in the standard, but may or may not actually be easy to access or even exist) is great if you know all the compilers you will run on and will ever run on, but since that is often impossible, you might look for a way to do this portably.
One way I thought of is to simply test the compiler's behavior at runtime: it it appears to implement arithmetic2 right shift, use the optimized algorithm, but if not use a fallback that doesn't rely on it. Of course, just checking that say (short)0xFF00 >> 4 == 0xFFF0 isn't enough since it doesn't exclude that perhaps char or int values work differently, or even the weird case that it fills for some shift amounts or values but not for others3.
So given that a comprehensive approach would be to exhaustively check all shift values and amounts. For all 8-bit char inputs that's only 28 LHS values and 23 RHS values for a total of 211, while short (typically, but let's say int16_t if you want to be pedantic) totals only 220, which could be validated in a fraction of a second on modern hardware4. Doing all 237 32-bit values would take a few seconds on decent hardware, but still possibly reasonable. 64-bit is out for the foreseeable future however.
Let's say you did that and found that the >> behavior exactly matches the desired arithmetic shift behavior. Are you safe to rely on it according to the standard: does the standard constrain the implementation not to change its behavior at runtime? That is, does the behavior have to be expressible as a function of it's inputs, or can (short)0xFF00 >> 4 be 0xFFF0 one moment and then 0x0FF0 (or any other value) the next?
Now this question is mostly theoretical, but violating this probably isn't as crazy as it might seem, given the presence of hybrid architectures such as big.LITTLE that dynamically move processes from on CPU to another with possible small behavioral differences, or live VM migration between say chips manufactured by Intel and AMD with subtle (usually undocumented) instruction behavior differences.
1 By "legal" I mean according to the C standard, not according to the laws of your country, of course.
2 I.e., it replicates the sign bit for newly shifted in bits.
3 That would be kind of insane, but not that insane: x86 has different behavior (overflow flag bit setting) for shifts of 1 rather than other shifts, and ARM masks the shift amount in a surprising way that a simple test probably wouldn't detect.
4 At least anything better than a microcontroller. The inner validation loop is a few simple instructions, so a 1 GHz CPU could validate all ~1 million short values in ~1 ms at one instruction-per-cycle.
Let's say you did that and found that the >> behavior exactly matches the desired arithmetic shift behavior. Are you safe to rely on it according to the standard: does the standard constrain the implementation not to change its behavior at runtime? That is, does the behavior have to be expressible as a function of it's inputs, or can (short)0xFF00 >> 4 be 0xFFF0 one moment and then 0x0FF0 (or any other value) the next?
The standard does not place any requirements on the form or nature of the (required) definition of the behavior of right shifting a negative integer. In particular, it does not forbid the definition to be conditional on compile-time or run-time properties beyond the operands. Indeed, this is the case for implementation-defined behavior in general. The standard defines the term simply as
unspecified behavior where each implementation documents how the choice is made.
So, for example, an implementation might provide a macro, a global variable, or a function by which to select between arithmetic and logical shifts. Implementations might also define right-shifting a negative number to do less plausible, or even wildly implausible things.
Testing the behavior is a reasonable approach, but it gets you only a probabilistic answer. Nevertheless, in practice I think it's pretty safe to perform just a small number of tests -- maybe as few as one -- for each LHS data type of interest. It is extremely unlikely that you'll run into an implementation that implements anything other than standard arithmetic (always) or logical (always) right shift, or that you'll encounter one where the choice between arithmetic and logical shifting varies for different operands of the same (promoted) type.
The authors of the Standard made no effort to imagine and forbid every unreasonable way implementations might behave. In the days prior to the Standard, implementations almost always defined many behaviors based upon how the underlying platform worked, and the authors of the Standard almost certainly expected that most implementations would do likewise. Since the authors of the Standard didn't want to rule out the possibility that it might sometimes be useful for implementations to behave in other ways (e.g. to support code which targeted other platforms), however, they left many things pretty much wide open.
The Standard is somewhat vague as to the level of precision with which "Implementation Defined" behaviors need to be documented. I think the intention is that a person of reasonable intelligence reading the specification should be able to predict how a piece of code with Implementation-Defined behavior would behave, but I'm not sure that defining an operation as "yielding whatever happens to be in register 7" would be non-conforming.
From a practical perspective, what should matter is whether one is targeting quality implementations for platforms that have been developed within the last quarter century, or whether one is trying to force even deliberately-obtuse compilers to behave in controlled fashion. If the former, it may be worth
using a static assert to ensure the -1>>1==-1, but quality compilers for modern
hardware where that test passes are going to use arithmetic right shift
consistently. While targeting code to obtuse compilers might possibly have
some purpose, it's not generally possible to guard against all the ways that
a pathological-but-conforming compiler could sabotage one's code, and efforts
spent attempting to do so could often be more effectively spent on getting a
quality compiler.

Can bit-level operations ever be "fast" in software?

Let me clarify the soft-sounding title straight away. This is actually something that has been nagging me for quite a while now, despite feeling like a pretty basic question.
Many languages give a faulty impression of efficiency by letting the developer play with bits, such as thebool.h C header which, as I understand it, is essentially just an int with a wrapper around it. Essentially, the byte seems to be the absolute lowest atomic unit of computation in C - bool x = 0 is not faster/more memory efficient than int x = 0.
What I'm wondering is then, what do we do when we want to implement an algorithm that is inherently tied to loading and manipulating single bits, such as decoding binary codes, unweighted graph connectivity problems and many others? In other words, is the atomicity of the byte an inherent property of modern CPUs or could we theoretically rival the efficiency of an ASIC just by using machine code?
EDIT: Pretty surprised by the downvotes, but I suppose people just didn't understand what I was asking. I think a really good, canonical example is traversing a binary tree (or any other sequential list of yes/no questions really). What I was wondering is if modern cpu architectures are fundamentally poorly equipped to do this (as compared to an ASIC/FPGA, that is), or if this is an artifact of some abstraction layer (language/kernel/etc). Mark's answer was good though (although I'd love a reference to the mentioned architecture extension)
No you can't rival the efficiency of an ASIC. An ASIC means you can replicate parallel bit streams as much as you have budget for on the chip. You just cut and paste your HDL until you fill your die space. A CPU only has a limited number of cores.
I'm guessing that you think that bit operations like z = (x|(1<<y)>>4 are slow and yes, all that bit shifting is extra overhead. But that is just accessing the bits. The bit operations (OR, AND, etc) are all as fast as you can get on modern CPU, i.e. 1 cycle throughput.
The 8051 architecture has a way of accessing individual bits directly, without using byte registers, but if you are worried about speed, you wouldn't consider a 8051.
By convention, a byte is the smallest addressable piece of memory in a computer. The number of bits that a byte has can differ from one system to another.
In the case of x86, there are instructions to move bytes from memory to a register and back, and instructions to manipulate values in registers. I can't speak to other architectures, but they most likely work in a similar way.
So anytime you need to manipulate some number of bits you need to do so a byte (or word, i.e. multiple bytes) at a time.
I also don't know why this question got so many downvotes, the question:
In other words, is the atomicity of the byte an inherent property of modern CPUs or could we theoretically rival the efficiency of an ASIC just by using machine code?
seems reasonable to me. It's certainly not a bad compared to many questions on stackoverflow.
The answer is: no CPUs can't match the efficiency of an ASIC.
However, the reason is not because CPUs are manipulating bytes instead of bits. Instead it's because most of the work that CPUs do to process an instruction is involved with loading it from memory, decoding it, tracking dependencies, etc., rather than performing the actual arithmetic operations on bits or bytes that the instruction directs the CPU to perform.
A good explanation of this is shown in the following presentation from the 2014 LLVM developers meeting. The presentation shows how OpenCL can be used to generate custom FPGA hardware. Slides 12 to 28 show a nice pictorial example of overhead associated with a CPU algorithm and how custom hardware can remove much of this overhead.

Bitwise operation over a simple piece of code

Recently I came across a code that can compute the largest number given two numbers using XOR. While this looks nifty, the same thing can be achieved by a simple ternary operator or an if else. Not pertaining to just this example, but do bitwise operations have any advantage over normal code? If so, is this advantage in speed of computation or memory usage? I am assuming in bitwise operations the assembly code will look much simpler than normal code. On a related note, while programming embedded systems which is more efficient?
*Normal code refers to how you'd normally do it. For example a*2 is normal code and I can achieve the same thing with a<<1
do bitwise operations have any advantage over normal code?
Bitwise operations are normal code. Most compilers these days have optimizers that generate the same instruction for a << 1 as for a * 2. On some hardware, especially on low-powered microprocessors, shift operations take fewer CPU cycles than multiplication, but there is hardware on which this makes no difference.
In your specific case there is an advantage, though: the code with XOR avoids branching, which has a great potential of speeding up the code. When there is no branching, CPU can use pipelining to perform the same operations much faster.
when programming embedded systems which is more efficient?
Embedded systems often have less powerful CPUs, so bitwise operations do have an advantage. For example, on 68HC11 CPU multiplication takes 10 cycles, while shifting left takes only 3.
Note, however, that it does not mean that you should be using bitwise operations explicitly. Most compilers, including embedded ones, will convert multiplication by a constant to a sequence of shifts and additions in case it saves CPU cycles.
Bitwise operators generally have the advantage of being constant time, regardless of input values. Conditional moves and branches may be the target of timing attacks in certain applications, such as crypto libraries, while bitwise operations are not subject to such attacks. (Disregarding cache timing attacks, etc.)
Generally, if a processor is capable of pipelining, it would be more efficient to use bitwise operations than conditional moves or branches, bypassing the entire branch prediction problem. This may or may not speed up your resulting code.
You do have to be careful, though, as some operations constitute undefined behavior in C, such as shifting signed integers, etc. For this reason, it may be to your advantage to do things the "normal" way.
On some platforms branches are expensive, so finding a way to get the min(x,y) without branching has some merit. I think this is particularly useful in CUDA, where the pipelines in the hardware are long.
Of course, on other platforms (like ARM) with conditional execution and compilers that emit those op-codes, it boils down to a compare and a conditional move (two instructions) with no pipeline bubble. Almost certainly better than a compare, and a few logical operations.
Since the poster asks it with the Embedded tag listed, I will try to reflect primarily that in my answer.
In short, usually you shouldn't try to be "creative" with your coding, since it becomes harder to understand later! (The old saying, "premature optimization is the root of all evils")
So only do anything alike when you know what you are doing, precisely, and in any other case, try to write the most understandable C code.
Well, this was the general part, now lets get on what such tricks could do, how they could affect the execution time.
First thing, in embedded, it is good to check the disassembly listing. If you use a variant of GCC with -O2 optimizations, you can usually assume it is quite clever understanding what the code is meant to do, and will produce the result which is likely fine. It can even use such tricks by itself figuring out the code, if it "sees" that it will be faster on the target CPU, so you don't need to ruin the understandability of your code with tricks. With other compilers, results may vary, in doubt, the assembly listing should be observed to see if execution times could be improved utilizing such bit hack tricks.
On the usual embedded platform, especially at 8 bits, you don't need to care that much about pipeline (and related, branch mispredictions) since it is short (or nonexistent). So you usually gain nothing by eliminating a conditional at the cost of an arithmetic operation, and could actually ruin performance by utilizing some elaborate hacks.
On faster 32 bit CPUs usually there is a longer pipeline and branch predictor to eliminate flushes (costing many cycles), so eliminating conditionals may pay off. But only if they are of such nature that the branch predictor can not guess them right (such as comparisons on "random" data), otherwise the conditionals may still be the better, taking the most minimal time (single cycle or even "less" if the CPU is capable to process more than one operation per cycle) when they were predicted right.

Looking for Ansi C89 arbitrary precision math library

I wrote an Ansi C compiler for a friend's custom 16-bit stack-based CPU several years ago but I never got around to implementing all the data types. Now I would like to finish the job so I'm wondering if there are any math libraries out there that I can use to fill the gaps. I can handle 16-bit integer data types since they are native to the CPU and therefore I have all the math routines (ie. +, -, *, /, %) done for them. However, since his CPU does not handle floating point then I have to implement floats/doubles myself. I also have to implement the 8-bit and 32-bit data types (bother integer and floats/doubles). I'm pretty sure this has been done and redone many times and since I'm not particularly looking forward to recreating the wheel I would appreciate it if someone would point me at a library that can help me out.
Now I was looking at GMP but it seems to be overkill (library must be absolutely huge, not sure my custom compiler would be able to handle it) and it takes numbers in the form of strings which would be wasteful for obvious reasons. For example :
mpz_set_str(x, "7612058254738945", 10);
mpz_set_str(y, "9263591128439081", 10);
mpz_mul(result, x, y);
This seems simple enough, I like the api... but I would rather pass in an array rather than a string. For example, if I wanted to multiply two 32-bit longs together I would like to be able to pass it two arrays of size two where each array contains two 16-bit values that actually represent a 32-bit long and have the library place the output into an output array. If I needed floating point then I should be able to specify the precision as well.
This may seem like asking for too much but I'm asking in the hopes that someone has seen something like this.
Many thanks in advance!
Let's divide the answer.
8-bit arithmetic
This one is very easy. In fact, C already talks about this under the term "integer promotion". This means that if you have 8-bit data and you want to do an operation on them, you simply pad them with zero (or one if signed and negative) to make them 16-bit. Then you proceed with the normal 16-bit operation.
32-bit arithmetic
Note: so long as the standard is concerned, you don't really need to have 32-bit integers.
This could be a bit tricky, but it is still not worth using a library for. For each operation, you would need to take a look at how you learned to do them in elementary school in base 10, and then do the same in base 216 for 2 digit numbers (each digit being one 16-bit integer). Once you understand the analogy with simple base 10 math (and hence the algorithms), you would need to implement them in assembly of your CPU.
This basically means loading the most significant 16 bit on one register, and the least significant in another register. Then follow the algorithm for each operation and perform it. You would most likely need to get help from overflow and other flags.
Floating point arithmetic
Note: so long as the standard is concerned, you don't really need to conform to IEEE 754.
There are various libraries already written for software emulated floating points. You may find this gcc wiki page interesting:
GNU libc has a third implementation, soft-fp. (Variants of this are also used for Linux kernel math emulation on some targets.) soft-fp is used in glibc on PowerPC --without-fp to provide the same soft-float functions as in libgcc. It is also used on Alpha, SPARC and PowerPC to provide some ABI-specified floating-point functions (which in turn may get used by GCC); on PowerPC these are IEEE quad functions, not IBM long double ones.
Performance measurements with EEMBC indicate that soft-fp (as speeded up somewhat using ideas from ieeelib) is about 10-15% faster than fp-bit and ieeelib about 1% faster than soft-fp, testing on IBM PowerPC 405 and 440. These are geometric mean measurements across EEMBC; some tests are several times faster with soft-fp than with fp-bit if they make heavy use of floating point, while others don't make significant use of floating point. Depending on the particular test, either soft-fp or ieeelib may be faster; for example, soft-fp is somewhat faster on Whetstone.
One answer could be to take a look at the source code for glibc and see if you could salvage what you need.

Is "faked subtraction" ever used in the real world?

I'm taking a Computer Systems class as a pre-req for my Masters and came across something I found fascinating and hard to see practical use of and that is "faking subtraction" and the fact that there doesn't need to be a subtraction instruction.
Something like:
x - y
Can be written as:
x + (~y + 1)
Now, that's all well and good but it seems like that is overly complicated for a simple subtraction, especially when you could just easily put "x - y". Are there situations where it would be necessary to do this, or is it just something that CAN be done but isn't.
This is often how it's done at the hardware level (i.e. inside the ALU).
At the software level, it's generally useless, as it can never be more efficient than the straightfoward subtraction (unless you have a truly bizarre compiler/platform combination).
The two's complement implementation is done in hardware, so you do not need to implement them like that for builtin datatypes.
If you are making an n-bit integer arithmetic library, then you need to emulate the integer addition, subtraction, multiplication and division etc operations, in which case such a technique might be implemented to add the n-bit length numbers, but using the carry flag to do so is a better implementation in my opinion.
It should be obvious that that is how substraction is done internally, so I'm not sure what you mean by "being used in the real world". This is why two's complement was chosen in the first place, because subtraction is just overflowing negative addition.
I do not see any reason to do it in your C code. Doing it in software is no faster than subtracting using the minus operator - and is a lot more unclear.
However, that is the way processors execute subtraction. I bet you have seen this code as an example of what hardware does, since it is easier to see how x + (~y + 1) will become a logic circuit.
So... no, you will not use this code in real world, but this operation is executed a lot of times in your processor.
I couldn't see the point of doing this. It is not anymore efficient. In fact if it's not optimised out by the compiler it ends up generating more opcodes.
Stuff like this was more common back before CPU's had billions of transistors to play with. A particular CPU might not implement a specific subtract opcode, and so a compiler (or assembly program) targeting it would have to know that trick.
These manipulations can also help you understand the internal implementation of CPU's. For example, CPU's division operations are sometimes accomplished by taking the reciprocal of the divisor and multiplying it by the dividend; the reciprocal is the only actual "division" being performed.

Resources