Context
I am using a lot of bitwise operations but I don't even know how they are implemented at the lowest level possible.
I would like to see how the Intel/AMD devs manage to implement such operations. Not to replace them in my code, that would be dumb, but to get a broader grasp of what is going on.
I tried to find some info, but most of the time people ask how to use them or how to replace them with other bitwise operations, which is not what I'm asking here.
Questions
Is it doing basic iterations in assembly (SSE) over the 32 bits and comparing them?
Are there some tricks to get it up to speed?
Thanks
Almost all are implemented directly on the CPU, as basic, native instructions, not part of SSE. These are the oldest, most basic operations on the CPU register.
As to how AND, OR, XOR, etc. are implemented, if you are really interested, look up digital logic design or discrete math. Look up flip-flops, AND gates, and NAND / NOR / XOR gates:
https://en.wikipedia.org/wiki/NAND_logic
Also look up K-maps (Karnaugh maps); these are what you can use to implement a logic circuit by hand:
https://en.wikipedia.org/wiki/Karnaugh_map
If you really enjoy the reading, you can sign up for a digital logic design class if you have access to an engineering or computer science university. You will get to build logic circuits with large ICs on a breadboard, but nowadays most CPUs are "written" with code, like software, and "printed" onto a silicon wafer.
Of particular interest are NAND and NOR, due to their functional completeness (you can use NAND or NOR alone to construct any truth table).
NAND (its logic symbol looks like =Do-):

A
  =Do- Q      where Q = NOT(A AND B)
B

Truth table:
A B Q
0 0 1
0 1 1
1 0 1
1 1 0
You can rewrite any logic with NAND.
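To make that functional completeness concrete, here is a small sketch of my own (not from the references above) that builds NOT, AND, and OR for single bits out of nothing but a NAND helper, mirroring how the gates compose in hardware:

#include <stdio.h>

/* One-bit NAND; everything below is derived from it. */
static unsigned nand(unsigned a, unsigned b) { return !(a && b); }

static unsigned not_(unsigned a)             { return nand(a, a); }
static unsigned and_(unsigned a, unsigned b) { return nand(nand(a, b), nand(a, b)); }
static unsigned or_(unsigned a, unsigned b)  { return nand(nand(a, a), nand(b, b)); }

int main(void) {
    for (unsigned a = 0; a <= 1; a++)
        for (unsigned b = 0; b <= 1; b++)
            printf("a=%u b=%u  NOT a=%u  a AND b=%u  a OR b=%u\n",
                   a, b, not_(a), and_(a, b), or_(a, b));
    return 0;
}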
As you can also see, it's pretty efficient; you can't get any lower level than a single gate with binary (though there is ternary / tri-state logic), so it's a single clock state change. So for a 64-bit CPU register, you'll need 64 of these babies side by side, PER register... PER core... PER instruction. And that is only the "logical" registers. Because advanced processors (like Intel Core) do register renaming, you have more physical registers in silicon than logically available to you by name.
AND, OR, XOR, and NOT operations are implemented quite efficiently in silicon, and so are generally a single-cycle native instruction on most processors. That is, for a 16-bit processor, whole 16-bit registers are ANDed at once; on a 32-bit processor, 32 bits at once, etc. The only performance issue you might want to be aware of is alignment: on an ARM processor, for example, if a 32-bit value starts at a memory address that is a multiple of 4, then a read-modify-write can be done in two or three cycles. If it's at an odd address, it has to do two reads at the neighboring aligned addresses and two writes, and so is slower.
Bit shifting in some older processors may involve looping over single shifts. That is, 1 << 5 will take longer than 1 << 2. But most modern processors have what is called a "barrel shifter" that equalizes all shifts up to the register size, so on a Pentium, 1 << 31 takes no longer than 1 << 2.
Addition and subtraction are fast primitives as well. Multiplication and division are tricky: these are mostly implemented as microcode loops. Multiplication can be sped up by unrolling the loops into huge swaths of silicon in a high-end processor, but division cannot, so generally division is the slowest basic operation in a microprocessor.
Bitwise operations are what processors are made of, so it is natural to expose those operations with instructions. Operations like AND, OR, XOR, NOR, NAND and NOT can be performed by the ALU with only a few logic gates per bit. Importantly, each bit of the result only relies on two bits of the input (unlike multiplication or addition), so the entire operation can proceed in parallel without any complication.
As you know, data in computers is represented in a binary format.
For example, if you have the integer 13 it's represented as 1101b (where b means binary). This works out to (1) * 8 + (1) * 4 + (0) * 2 + (1) * 1 = 13, just like (1) * 10 + (3) * 1 = 13 -- different bases.
However, for basic operations computers need to know how much data you're working with. A typical integer size is 32 bits. So it's not just 1101b, it's 00000000000000000000000000001101b -- 32 bits, most of them unused.
Bitwise operations are just that -- they operate only on a bit level. Adding, multiplying, and other operations consider multiple bits at a time to perform their function, but bitwise operators do not. For example:
What's 12 bitwise-and 7? (in C vernacular, 12 & 7)
1100b  12 &
0111b   7
-----   =
0100b   4
Why? Think vertically! Go column by column from the left -- 1 and 0 is 0. Then, 1 and 1 is 1. Then, 0 and 1 is 0. Finally, 0 and 1 is 0.
This is based on the AND truth table, which states that only true (aka 1) and true (aka 1) results in true (aka 1). All other combinations result in false (aka 0).
Likewise, the OR truth table states that all results are true (aka 1) except for false (aka 0) or false (aka 0), which results in false (aka 0).
Let's do the same example, but this time let's compute 12 bitwise-or 7. (Or in C vernacular, 12 | 7)
1100b  12 |
0111b   7
-----   =
1111b  15
And finally, let's consider one other principal bitwise operator: not. This is a unary operator where you simply flip each bit. Let's compute bitwise-not 7 (or in C vernacular, ~7)
0111b ~7
----- =
1000b 8
But wait.. What about all those leading zeroes? Well, yes, before I was omitting them because they weren't important, but now they surely are:
00000000000000000000000000000111b ~7
--------------------------------- =
11111111111111111111111111111000b ... big number?
If you're instructing the computer to treat the result as an unsigned integer (32-bit), that's a really big number. (Little less than 4 billion). If you're instructing the computer to treat the result as a signed integer (32-bit) that's -8.
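A quick sketch of my own (using fixed-width types for clarity) that prints both interpretations of that same bit pattern:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    uint32_t u = ~(uint32_t)7;   /* 0xFFFFFFF8 */
    int32_t  s = ~(int32_t)7;    /* same bits, read as two's complement */
    printf("unsigned: %" PRIu32 "\n", u);   /* 4294967288 -- a little under 4 billion */
    printf("signed:   %" PRId32 "\n", s);   /* -8 */
    return 0;
}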
As you may have guessed, since the logic is really quite simple for all these operations, there's not much you can do to make them individually faster. However, bitwise operations obey the same logic as boolean logic, and thus you can use boolean logic reduction techniques to reduce the number of bitwise operations you may need.
e.g. (A & B) | (A & C) results in the same as A & (B | C)
However, that's a much larger topic. Karnaugh maps are one technique, but boolean algebra is usually what I end up using while programming.
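For instance, a throwaway brute-force check (my own sketch) over all 8-bit values confirms that the reduction above is exact, so the three operations really can be replaced by two:

#include <stdio.h>

int main(void) {
    /* Exhaustively verify (A & B) | (A & C) == A & (B | C) for 8-bit values. */
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            for (unsigned c = 0; c < 256; c++)
                if (((a & b) | (a & c)) != (a & (b | c)))
                    printf("counterexample: %u %u %u\n", a, b, c);
    printf("no counterexamples\n");   /* only this line is printed */
    return 0;
}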
Related
We need to find new bits turned ON in the interlock status received from the device, compared to the last status read. This is for firing error codes for bits that are newly set. I am using the following statement:
bits_on = ~last_status & new_status;
Is there any better way to do this?
It's only 2 operations and an assignment, so the only way to improve it would be to do it in 1 operation and an assignment. It doesn't correspond to any of the simple C bit manipulation operators, so doing it in 1 operation is not possible.
However, depending on your architecture your compiler might actually already be compiling it to a single instruction.
ANDN (Logical AND NOT) is part of the BMI1 instruction set, and is equivalent to ~x & y.
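As a sketch (the names and the 32-bit type are mine, not from the question), the idiom looks like this; with BMI1 enabled (e.g. -mbmi or -march=haswell on GCC/Clang), the compiler is free to fold the two operations into a single ANDN:

#include <stdint.h>

/* Bits that are set in new_status but were clear in last_status. */
static uint32_t newly_set(uint32_t last_status, uint32_t new_status)
{
    return ~last_status & new_status;
}

/* Example: newly_set(0x0F, 0x3C) == 0x30 -- only the bits that went 0 -> 1. */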
I'm looking for the fastest possible way to permute bits in a 64-bit integer.
Given a table called "array" that represents a permutation, meaning it has a size of 64 and is filled with unique numbers (i.e. no repetition) ranging from 0 to 63, corresponding to bit positions in a 64-bit integer, I can permute bits this way:
bit = GetBitAtPos(integer_, array[i]);
SetBitAtPos(integer_, array[i], GetBitAtPos(integer_, i));
SetBitAtPos(integer_, i, bit);
(by looping i from 0 to 63)
GetBitAtPos being
GetBitAtPos(integer_, pos) { return (integer_ >> pos) & 1; }
SetBitAtPos is also based on the same principle (i.e. using C operators),
under the form SetBitAtPos(integer, position, bool_bit_value)
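For reference, the two helpers look roughly like this (a sketch with fixed-width types; my actual definitions use the same plain C operators but may differ in form, e.g. modifying the integer in place):

#include <stdint.h>
#include <stdbool.h>

static bool GetBitAtPos(uint64_t v, unsigned pos)
{
    return (v >> pos) & 1u;
}

/* Returns the value with the bit at `pos` forced to `bit`. */
static uint64_t SetBitAtPos(uint64_t v, unsigned pos, bool bit)
{
    return (v & ~(UINT64_C(1) << pos)) | ((uint64_t)bit << pos);
}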
I was looking for a faster way, if possible, to perform this task. I'm open to any solution, including inline assembly if necessary. I have difficulty figuring out a better way than this, so I thought I'd ask.
I'd like to perform such a task to hide data in a generated 64-bit integer (where the first 4 bits can reveal information). It's a bit better than, say, a XOR mask imo (unless I'm missing something), mostly if someone tries to find a correlation.
It also permits the inverse operation, so as not to lose the precious bits...
However I find the operation to be a bit costly...
Thanks
Since the permutation is constant, you should be able to come up with a better way than moving the bits one by one (if you're OK with publishing your secret permutation, I can have a go at it). The simplest improvement is to move, in one step, all the bits that are displaced by the same distance between input and output (that can be a modular distance, because you can use rotates). This is a very good method if there are few such groups.
If that didn't work out as well as you'd hoped, see if you can use bit_permute_steps to move all or most of the bits. See the rest of that site for more ideas.
If you can use PDEP and PEXT, you can move bits in groups where the distance between bits can arbitrarily change (but their order can not). It is, afaik, unknown how fast they will be though (and they're not available yet).
The best method is probably going to be a combination of these and other tricks mentioned in other answers.
There are too many possibilities to explore them all, really, so you're probably not going to find the best way to do the permutation, but using these ideas (and the others that were posted) you can doubtless find a better way than you're currently using.
PDEP and PEXT have been available for a while now, so their performance is known: at 3-cycle latency and 1/cycle throughput, they're faster than most other useful permutation primitives (except trivial ones).
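A hedged sketch using the BMI2 intrinsics (requires a BMI2-capable CPU and e.g. -mbmi2; the function name and masks are illustrative): PEXT gathers the bits selected by one mask down to the low end, and PDEP scatters them back out to a second mask's positions, so a whole group of bits whose relative order is preserved moves in two instructions.

#include <stdint.h>
#include <immintrin.h>

static uint64_t move_bit_group(uint64_t value, uint64_t src_mask, uint64_t dst_mask)
{
    uint64_t packed = _pext_u64(value, src_mask);  /* compress the selected bits to the bottom */
    return _pdep_u64(packed, dst_mask);            /* expand them to the destination positions */
}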
Split your bits into subsets where this method works:
Extracting bits with a single multiplication
Then combine the results using bitwise OR.
For a 64-bit number, I believe the problem (of finding the best algorithm) may be intractable due to the huge number of possibilities. One of the most scalable approaches, and the easiest to automate, would be a lookup table:
result = LUT0[ value        & 0xff] +
         LUT1[(value >>  8) & 0xff] +
         LUT2[(value >> 16) & 0xff] + ...
       + LUT7[(value >> 56) & 0xff];
Each LUT entry must be 64 bits wide, and it just spreads each 8-bit subgroup over the full range of 64 possible bit positions. This configuration uses 16 kB of memory.
The scalability comes from the fact that one can use any number of lookup tables (practical range perhaps 3 to 32). This method is vulnerable to cache misses, and it can't easily be parallelized (for large table sizes at least).
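A sketch of how the tables could be generated from the permutation described in the question (the array name perm is mine; perm[i] gives the destination position of source bit i). Since each table covers a disjoint set of destination bits, the partial results can be combined with either + or |:

#include <stdint.h>

static uint64_t LUT[8][256];   /* 8 * 256 * 8 bytes = 16 kB */

static void build_luts(const unsigned perm[64])
{
    for (int t = 0; t < 8; t++)                /* one table per source byte */
        for (unsigned v = 0; v < 256; v++) {   /* every value of that byte  */
            uint64_t out = 0;
            for (int b = 0; b < 8; b++)
                if (v & (1u << b))
                    out |= UINT64_C(1) << perm[t * 8 + b];
            LUT[t][v] = out;
        }
}

static uint64_t permute(uint64_t value)
{
    uint64_t result = 0;
    for (int t = 0; t < 8; t++)
        result |= LUT[t][(value >> (8 * t)) & 0xff];
    return result;
}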
If there are certain symmetries, there are some clever tricks available --
e.g. swapping two bits in Intel:
test eax, (1 << BIT0 | 1 << BIT1)   ; parity of the two masked bits
jpe skip                            ; parity even -> the bits are equal -> nothing to swap
xor eax, (1 << BIT0 | 1 << BIT1)    ; flip both bits, i.e. swap them
skip:
This OTOH is highly vulnerable to branch mispredictions.
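A branch-free C counterpart of the same idea (my own sketch) avoids the misprediction risk, and also sidesteps the fact that the parity flag only reflects the low byte of the result:

#include <stdint.h>

/* Swap bits i and j of x only if they differ. */
static uint64_t swap_bits(uint64_t x, unsigned i, unsigned j)
{
    uint64_t diff = ((x >> i) ^ (x >> j)) & 1u;  /* 1 when the two bits differ */
    return x ^ (diff << i) ^ (diff << j);        /* flip both only in that case */
}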
What is the best way, in C, to see if a number is divisible by another? I use this:
if (!(a % x)) {
    // this will be executed if a is divisible by x
}
Is there any way which is faster? I know that doing, e.g., 130 % 13 will result in doing 130 / 13 some 10 times. So there are 10 cycles when just one is needed (I just want to know if 130 is divisible by 13).
Thanks!
I know that doing, e.g., 130 % 13 will result in doing 130 / 13 some 10 times
Balderdash. % does no such thing on any processor I've ever used. It does 130/13 once, and returns the remainder.
Use %. If your application runs too slowly, profile it and fix whatever is too slow.
For two arbitrary numbers, the best way to check is to check whether a % b == 0. The modulus operator has different performance based on the hardware, but your compiler can figure this out much better than you can. The modulus operator is universal, and your compiler will figure out the best sequence of instructions to emit for whatever hardware you're running on.
If one of the numbers is a constant, your compiler might optimize by doing some combination of bit shifts and subtractions (mostly for powers of two), since hardware div/mod is slower than addition or subtraction, but on modern processors the latency (already only a few nanoseconds) is hidden by tons of other performance tricks, so you needn't worry about it. No hardware computes modulus by repeated division (some old processors did division by repeated bit shifts and subtraction, but they still used specialized hardware for this, so it's still faster to have the hardware do it than to try to emulate it in software). Most modern ISAs actually compute both division and remainder in one instruction.
The only optimization that might be useful is if the divisor is a power of two. Then you can use & with divisor - 1 to mask off all but the low-order bits and check the result against zero. For example, to check if a is divisible by 8, (a & 7) == 0 is equivalent (note the parentheses: == binds tighter than & in C). A good compiler will do this for you, so just stick with %.
In the general case, using the modulo operator is likely to be the fastest method available. There are exceptions, particularly if you are interested in whether numbers are divisible by powers of two (in which case bitwise operations are available), but the compiler should choose them automatically for you if you just use %. You are unlikely to be able to do any better for arbitrary values such as 13.
Also, what do you mean by "doing 130 / 13 per 10 times"? It does 130 / 13 once. Which is exactly what is required.
If x is a constant, then yes:
if (a * 0x4ec4ec4ec4ec4ec5 < 0x13b13b13b13b13b2) {
    // this will be executed if a is divisible by 13
}
0x4ec4ec4ec4ec4ec5 is the modular multiplicative inverse of 13 (modulo 2^64), so if a is really a multiple of 13 then the product will be less than 2^64 / 13. (Because a is 13 times some integer n, and a must have fit into a 64-bit word, which implies that n was less than 2^64 / 13.)
This only works for odd values of x. For even numbers (i.e. multiples of 2^y for y > 0) you can combine this test with a bitwise-AND test (the last y bits of a should be zero; if they are, then divide a by 2^y and proceed with the multiplication test).
This is only worthwhile if x is a constant, because computing the multiplicative inverse is more expensive than integer division.
Edit: I am also assuming a and x are unsigned.
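A worked sketch of the test for x = 13 with explicit unsigned 64-bit types (the constants are from the answer above; the function name is mine):

#include <stdint.h>
#include <stdio.h>

static int divisible_by_13(uint64_t a)
{
    /* 0x4EC4EC4EC4EC4EC5 * 13 == 1 (mod 2^64); multiples of 13 land below 2^64 / 13. */
    return a * UINT64_C(0x4EC4EC4EC4EC4EC5) < UINT64_C(0x13B13B13B13B13B2);
}

int main(void)
{
    printf("%d %d %d\n",
           divisible_by_13(130),   /* 1 */
           divisible_by_13(131),   /* 0 */
           divisible_by_13(0));    /* 1 -- zero is a multiple of everything */
    return 0;
}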
When the machine does % it just does a division instruction, and that automatically generates a remainder.
However, be aware that if the dividend is negative, % can give a negative remainder. If you only care about a remainder of zero, this is not a problem, but if you happen to be looking for another remainder, like 1, it can really trip you up.
For a specific need, I am building a four-byte integer out of four one-byte chars, using nothing too special (on my little-endian platform):
return ((v1 << 24) | (v2 << 16) | (v3 << 8) | v4);
I am aware that an integer stored on a big-endian machine would look like AB BC CD DE instead of the DE CD BC AB of little-endianness. Would it affect my operation completely, in that I will be shifting incorrectly, or will it just produce a correct result that is stored in reverse and needs to be reversed?
I was wondering whether to create a second version of this function to do (yet unknown) bit manipulation for a big-endian machine, or possibly to use an ntohl-related function, though I am unclear how that would know whether my number is in the correct order or not.
What would be your suggestion to ensure compatibility, keeping in mind I do need to form integers in this manner?
As long as you are working at the value level, there will be absolutely no difference in the results you obtain regardless of whether your machine is little-endian or big-endian. I.e. as long as you are using language-level operators (like | and << in your example), you will get exactly the same arithmetical result from the above expression on any platform. The endianness of the machine is not detectable and not visible at this level.
The only situation in which you need to care about endianness is when the data you are working with is examined at the object representation level, i.e. in situations when its raw memory representation is important. What you said above about "AB BC CD DE instead of DE CD BC AB" is specifically about the raw memory layout of the data. That's what functions like ntohl do: they convert one memory layout to another memory layout. So far you gave no indication that the actual raw memory layout is in any way important to you. Is it?
Again, if you only care about the value of the above expression, it is fully and totally endianness-independent. Basically, you are not supposed to care about endianness at all when you write C programs that don't attempt to access and examine the raw memory contents.
would it affect my operation completely in that I will be shifting incorrectly (?)
No.
The result will be the same regardless of the endian architecture. Bit shifting and twiddling are just like regular arithmetic operations. Is 2 + 2 the same on little endian and big endian architectures? Of course. 2 << 2 would be the same as well.
Little and big endian problems arise when you are dealing directly with the memory. You will run into problems when you do the following:
char bytes[] = {1, 0, 0, 0};
int n = *(int*)bytes;
On little endian machines, n will equal 0x00000001. On big endian machines, n will equal 0x01000000. This is when you will have to swap the bytes around.
ntohl (and ntohs, etc.) is used primarily for moving data from one machine to another. If you're simply manipulating data on one machine, then it's perfectly fine to do bit-shifting without any further ceremony -- bit-shifting (at least in C and C++) is defined in terms of multiplying/dividing by powers of 2, so it works the same whether the machine is big-endian or little-endian.
When/if you need to (at least potentially) move data from one machine to another, it's typically sensible to use htonl before you send it, and ntohl when you receive it. This may be entirely nops (in the case of BE to BE), two identical transformations that cancel each other out (LE to LE), or actually result in swapping bytes around (LE to BE or vice versa).
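For instance, a minimal sketch (the function name and layout are mine) of sending the question's four bytes over the wire: build the value with shifts exactly as before, and only convert at the boundary:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl; on Windows it lives in winsock2.h instead */

static void pack_be(unsigned char v1, unsigned char v2,
                    unsigned char v3, unsigned char v4,
                    unsigned char out[4])
{
    uint32_t host = ((uint32_t)v1 << 24) | ((uint32_t)v2 << 16) |
                    ((uint32_t)v3 << 8)  |  (uint32_t)v4;
    uint32_t wire = htonl(host);        /* no-op on big-endian, byte swap on little-endian */
    memcpy(out, &wire, sizeof wire);    /* out[0..3] is now v1 v2 v3 v4 on every platform */
}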
FWIW, I think a lot of what has been said here is correct. However, if the programmer has coded with endianness in mind, say using masks for bitwise inspection and manipulation, then cross-platform results could be unexpected.
You can determine 'endianness' at runtime as follows:
#define LITTLE_ENDIAN 0
#define BIG_ENDIAN    1

int endian() {
    int i = 1;
    char *p = (char *)&i;
    if (p[0] == 1)
        return LITTLE_ENDIAN;
    else
        return BIG_ENDIAN;
}
... and proceed accordingly.
I borrowed the code snippet from here: http://www.ibm.com/developerworks/aix/library/au-endianc/index.html?ca=drs- where there is also an excellent discussion of these issues.
hth -
Perry
I'm working with embedded C for the first time. Although my C is rusty, I can read the code, but I don't really have a grasp on why certain lines are the way they are. For example, I want to know if a variable is true or false and send it back to another application. Rather than setting the variable to 1 or 0, the original implementor chose 0xFF.
Is he trying to set it to an address space? Or else, why set a boolean variable to 255?
0xFF sets all the bits in a char.
The original implementer probably decided that the standard 0 and 1 wasn't good enough and decided that if all bits off is false then all bits on is true.
That works because in C any value other than 0 is true.
Though this sets all the bits in a char, it will also work for any other variable type, since any one bit being set in a variable makes it true.
If you are in desperate need of memory, you might want to store 8 booleans in one byte (or 32 in a long, or whatever)
This can easily be done by using a flag mask:
// FLAGMASK = 1 << n, for n in 0..7
FLAGMASK = 0x10;     // e.g. n = 4
flags &= ~FLAGMASK;  // clear bit
flags |= FLAGMASK;   // set bit
flags ^= FLAGMASK;   // flip bit
flags = (flags & ~FLAGMASK) | (booleanFunction() & FLAGMASK);  // clear, then maybe set
The last line only works when booleanFunction() returns 0 (all bits clear) or -1 (all bits set).
0xFF is the hex representation of ~0 (i.e. 11111111)
In, for example, VB and Access, -1 is used as True.
These young guys, what do they know?
In one of the original embedded languages - PL/M (-51 yes as in 8051, -85, -86, -286, -386) - there was no difference between logical operators (!, &&, || in C) and bitwise (~, &, |, ^). Instead PL/M has NOT, AND, OR and XOR taking care of both categories. Are we better off with two categories? I'm not so sure. I miss the logical ^^ operator (xor) in C, though. Still, I guess it would be possible to construct programs in C without having to involve the logical category.
In PL/M False is defined as 0. Booleans are usually represented in byte variables. True is defined as NOT False which will give you 0ffh (PL/M-ese for C's 0xff).
To convert the carry status flag before it was stored in a byte variable (boolean wasn't available as a type), PL/M could use the assembly instruction "sbb al,al" before storing. If carry was set, al would contain 0ffh; if it wasn't, it would contain 0h. If the opposite value was required, PL/M would insert a "cmc" before the sbb or append a "not al" after (actually xor -- one or the other).
So the 0xff for TRUE is a direct compatibility port from PL/M. Necessary? Probably not, unless you're unsure of your skills (in C) AND playing it super safe.
As I would have.
PL/M-80 (used for the 8080, 8085 and Z80) did not have support for integers or floats, and I suspect it was the same for PL/M-51. PL/M-86 (used for the 8086, 8088, 80188 and 80186) added integers, single precision floating point, segment:offset pointers and the standard memory models small, medium, compact and large. For those so inclined there were special directives to create do-it-yourself hybrid memory models. Microsoft's huge memory model was equivalent to intel's large. MS also sported tiny, small, compact, medium and large models.
Often in embedded systems there is one programmer who writes all the code and his/her idiosyncrasies are throughout the source. Many embedded programmers were HW engineers and had to get a system running as best they could. There was no requirement nor concept of "portability". Another consideration in embedded systems is the compiler is specific for the CPU HW. Refer to the ISA for this CPU and check all uses of the "boolean".
As others have said, it's setting all the bits to 1. And since this is embedded C, you might be storing this into a register where each bit is important for something, so you want to set them all to 1. I know I did similar when writing in assembler.
What's really important to know about this question is the type of "var". You say "boolean", but is that a C++/C99's bool, or is it (highly likely, being an embedded C app), something of a completely different type that's being used as a boolean?
Also, adding 1 to 0xFF wraps it to 0 (assuming unsigned char), and the check might have been in a loop that increments the value and breaks on that wrap-around.
Here's a likely reason: 0xff is the binary complement of 0. It may be that on your embedded architecture, storing 0xff into a variable is more efficient than storing, say, 1 which might require extra instructions or a constant stored in memory.
Or perhaps the most efficient way to check the "truth value" of a register in your architecture is with a "check bit set" instruction. With 0xff as the TRUE value, it doesn't matter which bit gets checked... they're all set.
The above is just speculation, of course, without knowing what kind of embedded processor you're using. 8-bit, 16-bit, 32-bit? PIC, AVR, ARM, x86???
(As others have pointed out, any integer value other than zero is considered TRUE for the purposes of boolean expressions in C.)