Performing bit level permutations on a quadword - c

I'm looking for the fastest possible way to permute bits in a 64 bit integer.
Given a table called "array" corresponding to a permutation array, meaning it has a size of 64 and is filled with unique numbers (i.e. no repetition) ranging from 0 to 63, corresponding to bit positions in a 64 bit integer, I can permute bits this way:
bit = GetBitAtPos(integer_, array[i]);
SetBitAtPos(integer_, array[i], GetBitAtPos(integer_, i));
SetBitAtPos(integer_, i, bit);
(by looping i from 0 to 63)
GetBitAtPos being
GetBitAtPos(integer_, pos) { return (integer_ >> pos) & 1; }
SetBitAtPos is also based on the same principle (i.e. using C operators),
under the form SetBitAtPos(integer, position, bool_bit_value)
I was looking for a faster way, if possible, to perform this task. I'm open to any solution, including inline assembly if necessary. I have difficulty figuring out a better way than this, so I thought I'd ask.
I'd like to perform such a task to hide data in a 64 bit generated integer (where the first 4 bits can reveal information). It's a bit better than, say, a XOR mask imo (unless I'm missing something), mostly if someone tries to find a correlation.
It also allows the inverse operation, so the precious bits are not lost...
However I find the operation to be a bit costly...
Thanks

Since the permutation is constant, you should be able to come up with a better way than moving the bits one by one (if you're OK with publishing your secret permutation, I can have a go at it). The simplest improvement is moving bits that have the same distance (that can be a modular distance because you can use rotates) between them in the input and output at the same time. This is a very good method if there are few such groups.
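A minimal sketch of the "same distance" idea, under some assumptions that are not in the question: perm[i] is taken to be the destination position of source bit i, and the names rot_plan, rot_plan_build and rot_plan_apply are made up for illustration. Bits that travel the same modular distance share one mask and one rotate.

```c
#include <stdint.h>

/* Rotate left by d bits (d is reduced modulo 64). */
static uint64_t rotl64(uint64_t x, unsigned d) {
    d &= 63;
    return d ? (x << d) | (x >> (64 - d)) : x;
}

/* masks[d] selects the source bits that must move left by d (mod 64). */
typedef struct { uint64_t masks[64]; } rot_plan;

/* Assumed convention: perm[i] is the destination of source bit i. */
static rot_plan rot_plan_build(const uint8_t perm[64]) {
    rot_plan p = {{0}};
    for (unsigned i = 0; i < 64; i++) {
        unsigned d = (perm[i] - i) & 63;      /* modular distance */
        p.masks[d] |= (uint64_t)1 << i;
    }
    return p;
}

/* One AND, one rotate and one OR per distinct distance. */
static uint64_t rot_plan_apply(const rot_plan *p, uint64_t x) {
    uint64_t r = 0;
    for (unsigned d = 0; d < 64; d++)
        if (p->masks[d])
            r |= rotl64(x & p->masks[d], d);
    return r;
}
```

With a fixed permutation the masks are compile-time constants, so the loop would normally be unrolled by hand into one line per non-empty group. The fewer distinct distances the permutation has, the cheaper this gets.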
If that didn't work out as well as you'd hoped, see if you can use bit_permute_steps to move all or most of the bits. See the rest of that site for more ideas.
If you can use PDEP and PEXT, you can move bits in groups where the distance between bits can arbitrarily change (but their order can not). It is, afaik, unknown how fast they will be though (and they're not available yet).
The best method is probably going to be a combination of these and other tricks mentioned in other answers.
There are too many possibilities to explore them all, really, so you're probably not going to find the best way to do the permutation, but using these ideas (and the others that were posted) you can doubtlessly find a better way than you're currently using.
Update: PDEP and PEXT have been available for a while now, so their performance is known. At 3 cycle latency and 1/cycle throughput (on Intel processors with BMI2) they're faster than most other useful permutation primitives (except trivial ones).
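For readers without BMI2 hardware, here is a portable reference sketch of what the two instructions compute. The function names pext64_ref and pdep64_ref are invented for this example; on a BMI2 machine the same results come from the _pext_u64 and _pdep_u64 intrinsics in immintrin.h, and these loops are far slower than the real instructions.

```c
#include <stdint.h>

/* PEXT: gather the bits of x selected by mask into the low bits,
 * keeping their relative order. */
static uint64_t pext64_ref(uint64_t x, uint64_t mask) {
    uint64_t r = 0;
    for (uint64_t out = 1; mask; mask &= mask - 1, out <<= 1)
        if (x & mask & -mask)      /* mask & -mask = lowest set bit */
            r |= out;
    return r;
}

/* PDEP: scatter the low bits of x to the positions selected by mask,
 * keeping their relative order. */
static uint64_t pdep64_ref(uint64_t x, uint64_t mask) {
    uint64_t r = 0;
    for (uint64_t in = 1; mask; mask &= mask - 1, in <<= 1)
        if (x & in)
            r |= mask & -mask;
    return r;
}
```

Moving one group of bits is then pdep64_ref(pext64_ref(x, src_mask), dst_mask), where src_mask and dst_mask (hypothetical names) have the same number of set bits. A full permutation ORs together one such term per order-preserving group.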

Split your bits into subsets where this method works:
Extracting bits with a single multiplication
Then combine the results using bitwise OR.

For a 64-bit number I believe the problem (of finding the best algorithm) may be unsolvable due to the huge number of possibilities. One of the most scalable and easiest to automate would be a lookup table:
result = LUT0[ value & 0xff] +
LUT1[(value >> 8) & 0xff] +
LUT2[(value >> 16) & 0xff] + ...
+ LUT7[(value >> 56) & 0xff];
Each LUT entry must be 64-bit wide and it just spreads each 8 bits in a subgroup to the full range of 64 possible bins. This configuration uses 16k of memory.
The scalability comes from the fact that one can use any number of look up tables (practical range from 3 to 32?). This method is vulnerable to cache misses and it can't be parallelized (for large table sizes at least).
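As a rough sketch of how the eight tables could be generated and used: the names lut, lut_build and lut_apply are illustrative, and perm[i] is assumed to be the destination position of source bit i. The partial results never overlap, so OR is used here where the snippet above uses +.

```c
#include <stdint.h>

/* 8 tables x 256 entries x 8 bytes = 16 KiB. */
static uint64_t lut[8][256];

/* Assumed convention: perm[i] is the destination of source bit i. */
static void lut_build(const uint8_t perm[64]) {
    for (unsigned t = 0; t < 8; t++)
        for (unsigned b = 0; b < 256; b++) {
            uint64_t r = 0;
            for (unsigned k = 0; k < 8; k++)
                if ((b >> k) & 1)                 /* bit k of byte t is set */
                    r |= (uint64_t)1 << perm[8 * t + k];
            lut[t][b] = r;
        }
}

/* One table lookup per input byte. */
static uint64_t lut_apply(uint64_t v) {
    uint64_t r = 0;
    for (unsigned t = 0; t < 8; t++)
        r |= lut[t][(v >> (8 * t)) & 0xff];
    return r;
}
```

The inverse permutation only needs a second set of tables built from the inverse of perm.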
If there are certain symmetries, there are some clever tricks available --
e.g. swapping two bits on Intel (the parity flag only covers the low 8 bits of the result, so both bits must lie in the low byte):
test eax, (1<<BIT0 | 1<<BIT1)
jpe skip
xor eax, (1<<BIT0 | 1<<BIT1)
skip:
This OTOH is highly vulnerable to branch mispredictions.
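A branch-free alternative in C, shown only as a sketch (swap_bits is a name chosen for this example): compute whether the two bits differ, then flip both or neither with XOR.

```c
#include <stdint.h>

/* Swap bit i and bit j of x (both in 0..63) without branching. */
static uint64_t swap_bits(uint64_t x, unsigned i, unsigned j) {
    uint64_t diff = ((x >> i) ^ (x >> j)) & 1;   /* 1 if the bits differ */
    return x ^ ((diff << i) | (diff << j));       /* flip both, or neither */
}
```

This costs a few more operations than the test/jpe/xor sequence but has no branch to mispredict and works for any two bit positions.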

Related

bitwise - how are the bitmasks operations implemented?

Context
I am using a lot of bitwise operations but I don't even know how they are implemented at the lowest level possible.
I would like to see how the Intel/AMD devs manage to implement such operations. Not to replace them in my code, that would be dumb... but to get a broader grasp of what is going on.
I tried to find some info but most of the time, people ask about its use or to replace it with other bitwise operations, which is not the case here.
Questions
Is it doing basic iterations in assembly (SSE) over the 32 bits and comparing?
Are there some tricks to get it up to speed?
Thanks
Almost all are implemented directly on the CPU, as basic, native instructions, not part of SSE. These are the oldest, most basic operations on the CPU register.
As to how and, or, xor, etc. are implemented, if you are really interested, look up digital logic design, or discrete math. Look up flip-flops, AND gates, or NAND / NOR / XOR gates
https://en.wikipedia.org/wiki/NAND_logic
Also look up K-maps (Karnaugh maps), these are what you can use to implement a logic circuit by hand.
https://en.wikipedia.org/wiki/Karnaugh_map
If you really enjoy the reading, you can sign up for a digital logic design class if you have access to an engineering or computer science university. You will get to build logic circuits with large ICs on a breadboard, but nowadays most CPUs are "written" with code, like software, and "printed" on a silicon wafer.
Of particular interest is NAND and NOR due to their functional completeness (you can use NAND or NOR to construct any truth table).
NAND (logical symbol looks like =Do-)
A
=Do- Q is Q = NOT(A AND B)
B
Truth table
A B Q
0 0 1
0 1 1
1 0 1
1 1 0
You can rewrite any logic with NAND.
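To make the functional completeness claim concrete, here is a small sketch that builds NOT, AND, OR and XOR out of nothing but NAND, applied to all 64 bits of a word at once. The function names are made up for this example, and real hardware does not necessarily build its gates this way.

```c
#include <stdint.h>

/* The only primitive: 64 NAND gates side by side. */
static uint64_t nand64(uint64_t a, uint64_t b) { return ~(a & b); }

/* Everything below uses nand64 only. */
static uint64_t not64(uint64_t a)             { return nand64(a, a); }
static uint64_t and64(uint64_t a, uint64_t b) { return not64(nand64(a, b)); }
static uint64_t or64(uint64_t a, uint64_t b)  { return nand64(not64(a), not64(b)); }
static uint64_t xor64(uint64_t a, uint64_t b) {
    uint64_t n = nand64(a, b);
    return nand64(nand64(a, n), nand64(b, n));   /* classic 4-NAND XOR */
}
```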
As you can also see, it's pretty efficient: you can't get any lower level than a single gate with binary (though there is ternary / tri-state logic), so it's a single clock state change. So for a 64-bit CPU register, you'll need 64 of these babies side by side, PER register... PER core... PER instruction. And that is only the "logical" registers. Because advanced processors (like Intel Core) do register renaming, you have more physical registers in silicon than logically available to you by name.
AND, OR, XOR, and NOT operations are implemented quite efficiently in silicon, and so are generally a single-cycle native instruction on most processors. That is, for a 16-bit processor, whole 16-bit registers are ANDed at once; on a 32-bit processor, 32 bits at once, etc. The only performance issue you might want to be aware of is alignment: on an ARM processor, for example, if a 32-bit value starts at a memory address that is a multiple of 4, then a read-modify-write can be done in two or three cycles. If it's at an odd address, it has to do two reads at the neighboring aligned addresses and two writes, and so is slower.
Bit shifting in some older processors may involve looping over single shifts. That is, 1 << 5 will take longer than 1 << 2. But most modern processors have what is called a "barrel shifter" that equalizes all shifts up to the register size, so on a Pentium, 1 << 31 takes no longer than 1 << 2.
Addition and subtraction are fast primitives as well. Multiplication and division are tricky: these are mostly implemented as microcode loops. Multiplication can be sped up by unrolling the loops into huge swaths of silicon in a high-end processor, but division cannot, so generally division is the slowest basic operation in a microprocessor.
Bitwise operations are what processors are made of, so it is natural to expose those operations with instructions. Operations like AND, OR, XOR, NOR, NAND and NOT can be performed by the ALU with only a few logic gates per bit. Importantly, each bit of the result only relies on two bits of the input (unlike multiplication or addition), so the entire operation can proceed in parallel without any complication.
As you know, data in computers is represented in a binary format.
For example, if you have the integer 13 it's represented as 1101b (where b means binary). This works out to (1) * 8 + (1) * 4 + (0) * 2 + (1) * 1 = 13, just like (1) * 10 + (3) * 1 = 13 -- different bases.
However, for basic operations computers need to know how much data you're working with. A typical integer size is 32 bits. So it's not just 1101b, it's 00000000000000000000000000001101b -- 32 bits, most of them unused.
Bitwise operations are just that -- they operate only on a bit level. Adding, multiplying, and other operations consider multiple bits at a time to perform their function, but bitwise operators do not. For example:
What's 12 bitwise-and 7? (in C vernacular, 12 & 7)
1100b 12 &
0111b 7
----- =
0100b 4
Why? Think vertically! Look at the leftmost column of digits -- 1 and 0 is 0. Then, 1 and 1 is 1. Then, 0 and 1 is 0. Finally, 0 and 1 is 0.
This is based on the and truth table that states these rules -- that only true (aka 1) and true (aka 1) results in true (aka 1). All other resultant values are false (aka 0).
Likewise, the or truth table states that all results are true (aka 1) except for false (aka 0) and false (aka 0) which results in false (aka 0).
Let's do the same example, but this time let's compute 12 bitwise-or 7. (Or in C vernacular, 12 | 7)
1100b 12 |
0111b 7
----- =
1111b 15
And finally, let's consider one other principal bitwise operator: not. This is a unary operator where you simply flip each bit. Let's compute bitwise-not 7 (or in C vernacular, ~7)
0111b ~7
----- =
1000b 8
But wait.. What about all those leading zeroes? Well, yes, before I was omitting them because they weren't important, but now they surely are:
00000000000000000000000000000111b ~7
--------------------------------- =
11111111111111111111111111111000b ... big number?
If you're instructing the computer to treat the result as an unsigned integer (32-bit), that's a really big number. (Little less than 4 billion). If you're instructing the computer to treat the result as a signed integer (32-bit) that's -8.
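The signed versus unsigned reading of the same bit pattern can be checked with a few lines of C. This is only a sketch: the helper names are invented, and the signed result assumes a two's-complement machine, which covers essentially every current platform.

```c
#include <stdint.h>

/* Writes the 32 binary digits of x, most significant first, into out. */
static void to_binary(uint32_t x, char out[33]) {
    for (int i = 0; i < 32; i++)
        out[i] = ((x >> (31 - i)) & 1) ? '1' : '0';
    out[32] = '\0';
}

/* Same bit flip, two interpretations of the result. */
static uint32_t not_as_unsigned(uint32_t x) { return ~x; }  /* ~7 -> 4294967288 */
static int32_t  not_as_signed(int32_t x)    { return ~x; }  /* ~7 -> -8 (two's complement) */
```

Passing not_as_unsigned(7) to to_binary yields 29 ones followed by 000, the pattern shown above.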
As you may have guessed, since the logic is really quite simple for all these operations, there's not much you can do to make them individually faster. However, bitwise operations obey the same logic as boolean logic, and thus you can use boolean logic reduction techniques to reduce the number of bitwise operations you may need.
e.g. (A & B) | (A & C) results in the same as A & (B | C)
However, that's a much larger topic. Karnaugh maps are one technique, but boolean algebra is usually what I end up using while programming.

Using bitwise operations

How often do you use bitwise operation "hacks" to do some kind of
optimization? In what kind of situations is it really useful?
Example: instead of using if:
if (data[c] >= 128) //in a loop
sum += data[c];
you write:
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
Of course assuming it does the same intended result for this specific situation.
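For comparison, here are both variants as complete functions. This is a sketch under assumptions the snippet leaves implicit: int is 32 bits wide, the values in data lie in 0..255, and right-shifting a negative int is an arithmetic shift (implementation-defined in C, but what mainstream compilers do). The function names are made up.

```c
#include <stddef.h>

/* Readable version: one conditional branch per element. */
static long sum_branch(const int *data, size_t n) {
    long sum = 0;
    for (size_t c = 0; c < n; c++)
        if (data[c] >= 128)
            sum += data[c];
    return sum;
}

/* Branch-free version of the same loop. */
static long sum_branchless(const int *data, size_t n) {
    long sum = 0;
    for (size_t c = 0; c < n; c++) {
        /* t is all ones (-1) when data[c] < 128, otherwise 0. */
        int t = (data[c] - 128) >> 31;
        sum += ~t & data[c];   /* adds data[c] or 0 */
    }
    return sum;
}
```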
Is it worth it? I find it unreadable. How often do you come across
this?
Note: I saw this code here in the chosen answers :Why is processing a sorted array faster than an unsorted array?
While that code was an excellent way to show what's going on, I usually wouldn't use code like that. If it had to be fast, there are usually even faster solutions, such as using SSE on x86 or NEON on ARM. If none of that is available, sure, I'll use it, provided it helps and it's necessary.
By the way, I explain how it works in this answer
Like Skylion, one thing I've used a lot is figuring out whether a number is a power of two. Think a while about how you'd do that.. then look at this: (x & (x - 1)) == 0 && x != 0
It's tricky the first time you see it, I suppose, but once you get used to it it's just so much simpler than any alternative that doesn't use bitmath. It works because subtracting 1 from a number means that the borrow starts at the rightmost end of the number and runs through all the zeroes, then stops at the first 1 which turns into a zero. ANDing that number with the original then makes the rightmost 1 zero. Powers of two only have one 1, which disappears, leaving zero. All other numbers will have at least one 1 left, except zero, which is a special case. A common variant doesn't test for zero, and is OK with treating it as power of two or knows that zero can't happen.
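Wrapped in a function, the test reads as follows (is_power_of_two is just an illustrative name):

```c
#include <stdint.h>

/* Nonzero if x has exactly one bit set. */
static int is_power_of_two(uint64_t x) {
    /* x & (x - 1) clears the lowest set bit; a power of two has no other. */
    return x != 0 && (x & (x - 1)) == 0;
}
```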
Similarly there are other things that you can easily do with bitmath, but not so easy without. As they say, use the right tool for the job. Sometimes bitmath is the right tool.
Bitwise operations are so useful that prof. Knuth wrote a book about them: http://www.amazon.com/The-Computer-Programming-Volume-Fascicle/dp/0321580508
Just to mention a few simplest ones: int multiplication and division by a power of two (using left and right shift), mod with respect to a power of two, masking and so on. When using bitwise ops just be sure to provide sufficient comments about what's going on.
However, your example, data[c] >= 128, is not applicable IMO, just keep it that way.
But if you want to compute data[c] % 128 for non-negative values, then data[c] & 0x7f is much faster (where & represents bitwise AND).
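A small sketch of the masking trick, with mod_pow2 as a made-up helper name. It assumes unsigned operands and a modulus that is a power of two. For unsigned types an optimizing compiler will typically emit the AND by itself when the modulus is a constant.

```c
#include <stdint.h>

/* x mod m for unsigned x, where m is assumed to be a power of two. */
static uint32_t mod_pow2(uint32_t x, uint32_t m) {
    return x & (m - 1);   /* keeps only the low log2(m) bits */
}
```

The two forms differ for negative signed values: in C, -1 % 128 is -1, while -1 & 0x7f is 127 on a two's-complement machine, so the replacement is only safe when the value cannot be negative.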
There are several instances where using such hacks may be useful. For instance, they can remove some Java Virtual Machine "Optimizations" such as branch predictors. I have found them useful in only a few cases. The main one is multiplying by -1. If you are doing it hundreds of times across a massive array, it is more efficient to simply flip the sign bit than to actually multiply (this works for sign-magnitude formats such as IEEE floating point; a two's-complement integer needs a negation instead). Another example where I have used it is to know if a number is a power of 2 (since it's so easy to figure out in binary). Basically, bit hacks are useful when you want to cheat. Here is a human analogy. If you have a list of two-digit numbers and you need to know if they are greater than 29, you can automatically know that if the first digit is 3 or larger, then the whole thing is at least 30, and vice versa. Bitwise operations simply allow you to perform similar cheats in binary.

x86-64 integer vectorisation optimise

I am trying to vectorize a logical validation problem to run on Intel 64.
I will first try to describe the problem:
I have a static array v[] of 70-bit integers (appx 400,000 of them) which are all known at compile time.
A producer creates 70-bit integers a, a lot of them, very quickly.
For each a I need to find out if there exists an element from v for which (v[i] & a) == 0.
So far my implementation in C is something like this (simplified):
for (; *v; v++) {
if (!(a & *v))
return FOUND;
}
// a had no matching element in v
return NOT_FOUND;
I am looking into optimizing this using SSE/AVX to speed up the process and do more of those tests in parallel. I got as far as loading a and *v into an XMM register each and calling the PTEST instruction to do the validation.
I am wondering if there is a way to expand this to use all 256 bits of the new YMM registers?
Maybe packing 3x70 bits into a single register?
I can't quite figure out though how to pack/unpack them efficient enough to justify not just using one register per test.
A couple things that we know about the nature of the input:
All elements in v[] have very few bits set
It is not possible to permute/compress v[] in any way to make it use less than 70 bits
The FOUND condition is expected to be satisfied after checking appx 20% of v[] on average.
It is possible to buffer more than one a before checking them in a batch.
I do not necessarily need to know which element of v[] matched, only that one did or not.
Producing a requires very little memory, so anything left in L1 from the previous call is likely to still be there.
The resulting code is intended to be run on the newest generation of Intel Xeon processors supporting SSE4.2, AVX instructions.
I will be happy to accept assembly or C that compiles with Intel C compiler or at least GCC.
This sounds like what you really need is a better data structure to store the v[], so that searches take less than linear time.
Consider that if (v[0] & v[1]) & a is not zero, then neither (v[0] & a) nor (v[1] & a) can be zero. This means it is possible to create a tree structure where the v[] are the leaves, and the parent nodes are the AND combination of their children. Then, if parentNode & a gives you a non-zero value, you can skip looking at the children.
However, this isn't necessarily helpful - the parent node only ends up testing the bits common between the children, so if there are only a few of those, you still end up testing lots of leaf nodes. But if you can find clusters in your data set and group many similar v[] under a common parent, this may drastically reduce the number of comparisons you have to do.
On the other hand, such a tree search involves a lot of conditional branches (expensive), and would be hard to vectorize. I'd first try if you can get away with just two levels: first do a vectorized search among the cluster parent nodes, then for each match do a search for the entries in that cluster.
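The two-level variant could look roughly like this. Everything here is an illustrative assumption rather than the asker's code: the 70-bit values are stored as a pair of 64-bit words (bits70), each cluster carries the AND of all its members, and how to form good clusters is left open. The sketch is scalar; vectorizing the outer loop is a separate step.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } bits70;   /* hi holds the top 6 bits */

typedef struct {
    bits70 common;         /* AND of every member of the cluster */
    const bits70 *items;   /* the members themselves */
    size_t count;
} cluster;

/* Nonzero if v and a have no set bit in common. */
static int disjoint(bits70 v, bits70 a) {
    return ((v.lo & a.lo) | (v.hi & a.hi)) == 0;
}

/* Returns 1 if some member v of some cluster satisfies (v & a) == 0. */
static int find_disjoint(const cluster *cl, size_t n, bits70 a) {
    for (size_t c = 0; c < n; c++) {
        if (!disjoint(cl[c].common, a))
            continue;   /* every member shares a bit with a: skip cluster */
        for (size_t i = 0; i < cl[c].count; i++)
            if (disjoint(cl[c].items[i], a))
                return 1;
    }
    return 0;
}
```

How much this saves depends entirely on how many bits the members of a cluster have in common, as noted above.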
Actually here's another idea, to help with the fact that 70 bits don't fit well into registers:
You could split v[] into 64 (=2^6) different arrays. Of the 70 bits in the original v[], the 6 most significant bits are used to determine which array will contain the value, and only the remaining 64 bits are actually stored in the array.
By testing the mask a against the array indices, you will know which of the 64 arrays to search (in the worst case, if a doesn't have any of the 6 highest bits set, that'll be all of them), and each individual array search deals only with 64 bits per element (much easier to pack).
In fact this second approach could be generalized into a tree structure as well, which would give you some sort of trie.
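A sketch of the 64-array split, with hypothetical names (bucket, find_disjoint_bucketed) and the assumption that a has already been divided into its low 64 bits and its top 6 bits:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint64_t *low;   /* low 64 bits of each value in this bucket */
    size_t count;
} bucket;

/* buckets[h] holds every v whose top 6 bits equal h.
 * Returns 1 if some v satisfies (v & a) == 0. */
static int find_disjoint_bucketed(const bucket buckets[64],
                                  uint64_t a_low, unsigned a_high) {
    for (unsigned h = 0; h < 64; h++) {
        if (h & a_high)
            continue;   /* top bits already collide with a: skip bucket */
        for (size_t i = 0; i < buckets[h].count; i++)
            if ((buckets[h].low[i] & a_low) == 0)
                return 1;
    }
    return 0;
}
```

The inner loop now works on plain 64-bit elements, which pack four to a YMM register without any unpacking.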

Practical applications of bit shifting

I totally understand how to shift bits. I've worked through numerous examples on paper and in code and don't need any help there.
I'm trying to come up with some real world examples of how bit shifting is used. Here are some examples I've been able to come up with:
Perhaps the most important example I could conceptualize had to do with endianness. In big endian systems, least significant bits are stored from the left, and in little endian systems, least significant bits are stored from the right. I imagine that for files and networking transmissions between systems which use opposite endian strategies, certain conversions must be made.
It seems certain optimizations could be made by compilers and processors when dealing with any multiplications by a power of two (n*2, n*4, etc.). The bits are just being shifted to the left. (Conversely, I suppose the same would apply for division, n/2, n/4, etc.)
In encryption algorithms, i.e. using a series of bit shifts, reverses and combinations to obfuscate something.
Are all of these accurate examples? Is there anything you would add? I've spent quite a bit of time learning about how to implement bit shifting / reordering / byte swapping and I want to know how it can be practically applied = )
I would not agree that the most important example is endianness but it is useful. Your examples are valid.
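For the endianness case, the usual pattern is to assemble a value from individual bytes with shifts, which works the same on any host. Note that endianness concerns the order of bytes, not of bits within a byte. The helper names below are illustrative.

```c
#include <stdint.h>

/* Read a 32-bit big-endian (network order) value from a byte buffer. */
static uint32_t load_be32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Reverse the byte order of a 32-bit value. */
static uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}
```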
Hash functions often use bitshifts as a way to get a chaotic behavior; not dissimilar to your cryptographic algorithms.
One common use is to use an int/long as a series of flag values, that can be checked, set, and cleared by bitwise operators.
Not really widely used, but in (some) chess games the board and moves are represented with 64 bit integer values (called bitboards) so evaluating legal moves, making moves, etc. is done with bitwise operators. Lots of explanations of this on the net, but this one seems like a pretty good explanation: http://www.frayn.net/beowulf/theory.html#bitboards.
And finally, you might find that you need to count the number of bits that are set in an int/long, in some technical interviews!
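One common answer to that interview question, sketched with a made-up function name, clears the lowest set bit until nothing is left:

```c
#include <stdint.h>

/* Counts set bits; the loop runs once per set bit, not once per bit. */
static unsigned count_set_bits(uint64_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```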
The most common example of bitwise shift usage I know is for setting and clearing bits.
uint8_t bla = INIT_VALUE;
bla |= (1U << N); // Set N-th bit
bla &= ~(1U << N); // Clear N-th bit
Quick multiplication and division by a power of 2 - Especially important in embedded applications
CRC computation - Handy for networks e.g. Ethernet
Mathematical calculations that require very large numbers
Just a couple off the top of my head
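As an illustration of the CRC item, here is a bit-at-a-time sketch of the reflected CRC-32 used by Ethernet and zlib (polynomial 0xEDB88320). The function name is made up, and production code would normally use a table-driven or hardware-assisted version instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, initial value and final XOR of 0xFFFFFFFF). */
static uint32_t crc32_bitwise(const unsigned char *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++) {
            /* mask is all ones when the low bit is set, otherwise zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

The standard check value for the nine ASCII characters "123456789" is 0xCBF43926.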

Why is that data structures usually have a size of 2^n?

Is there a historical reason or something? I've seen quite a few times something like char foo[256]; or #define BUF_SIZE 1024. Even I mostly use only 2^n sized buffers, mostly because I think it looks more elegant and that way I don't have to think of a specific number. But I'm not quite sure if that's the reason most people use them, more information would be appreciated.
There may be a number of reasons, although many people will as you say just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8 bit micro-controllers). By using a 2^n buffer in this case, the modulo, is simply a case of bit-masking the upper bits, or in the case of say a 256 byte buffer, simply using an 8-bit index and letting it wraparound.
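A minimal ring buffer sketch along these lines, with invented names (ring, ring_put, ring_get) and a size fixed at 256 bytes. The head and tail counters run freely and are masked only when indexing, which relies on the size being a power of two.

```c
#define RING_SIZE 256u   /* must be a power of two */

typedef struct {
    unsigned char data[RING_SIZE];
    unsigned head;   /* total number of bytes written so far */
    unsigned tail;   /* total number of bytes read so far */
} ring;

/* Returns 1 on success, 0 if the buffer is full. */
static int ring_put(ring *r, unsigned char b) {
    if (r->head - r->tail == RING_SIZE)
        return 0;
    r->data[r->head++ & (RING_SIZE - 1)] = b;   /* mask instead of % */
    return 1;
}

/* Returns 1 on success, 0 if the buffer is empty. */
static int ring_get(ring *r, unsigned char *out) {
    if (r->head == r->tail)
        return 0;
    *out = r->data[r->tail++ & (RING_SIZE - 1)];
    return 1;
}
```

With a 256-byte buffer, declaring the two counters as 8-bit unsigned types would make the mask unnecessary, at the cost of needing another way to tell a full buffer from an empty one.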
In other cases alignment with page boundaries, caches etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. But it may just be that such buffers provide the compiler with optimisation possibilities, so all other things being equal, why not?
Cache lines are usually some power of 2 in size (often 32 or 64 bytes). Data that is an integral multiple of that number would be able to fit into (and fully utilize) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance.. so I think people who design their structures in that way are optimizing for that.
Another reason in addition to what everyone else has mentioned is, SSE instructions take multiple elements, and the number of elements input is always some power of two. Making the buffer a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.
Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise-and or & rather than using a much slower divide-class instruction implementing the % operator.
Looking at an old Intel i386 book, the and instruction takes 2 cycles and div takes 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the 1000x faster overall cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when malloc overhead was occasionally avoided at great length. Allocations available directly from the operating system would be (and still are) a specific number of pages, and so a power of two would be likely to make the most use of the allocation granularity.
And, as others have noted, programmers like powers of two.
I can think of a few reasons off the top of my head:
2^n is a very common value in all of computer sizes. This is directly related to the way bits are represented in computers (2 possible values), which means variables tend to have ranges of values whose boundaries are 2^n.
Because of the point above, you'll often find the value 256 as the size of the buffer. This is because it is the number of distinct values that can be stored in a byte (0 to 255). So, if you want to store a string together with a size of the string, then you'll be most efficient if you store it as: SIZE_BYTE+ARRAY, where the size byte tells you the size of the array. This means the array can be any size from 0 to 255.
Many other times, sizes are chosen based on physical things (for example, the size of the memory an operating system can choose from is related to the size of the registers of the CPU etc) and these are also going to be a specific amount of bits. Meaning, the amount of memory you can use will usually be some value of 2^n (for a 32bit system, 2^32).
There might be performance benefits/alignment issues for such values. Most processors can access a certain number of bytes at a time, so even if you have a variable whose size is (let's say) 20 bits, a 32 bit processor will still read 32 bits, no matter what. So it's often more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain number of bytes (because they can't read memory from, for example, addresses that are odd). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4 or 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.
Because of the simplicity (read also cost) of base 2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
In the CPU domain, lots of constructs revolve around base 2 arithmetic. Busses (control & data) to access memory structure are often aligned on power 2. The cost of logic implementation in electronics (e.g. CPU) makes for arithmetics in base 2 compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X are a direct consequence of the server layer attributes of the system sitting below, i.e. layer < X. The reason I am stating this stems from some comments I received with regards to my posting.
E.g. the properties that can be manipulated at the "compiler" level are inherited & derived from the properties of the system below it i.e. the electronics in the CPU.
I was going to use the shift argument, but couldn't think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index; // increment the index.
index &= (BUFSIZE - 1); // Make sure it stays in the buffer.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.
It's also common for pagesizes to be powers of 2.
On linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.
It makes a nice, round number in base 2. Just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it wasn't a power of 2 (or something close such as 96=64+32 or 192=128+64), then you could wonder why there's the added precision. A size that isn't round in base 2 can come from external constraints or programmer ignorance. You'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.
In hash tables, 2^n makes it easier to handle key collisions in a certain way. In general, when there is a key collision, you either make a substructure, e.g. a list, of all entries with the same hash value; or you find another free slot. You could just add 1 to the slot index until you find a free slot; but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n,h2)=1; then add h2 to the slot index until you find a free slot (with wrap around). If n is a power of 2, finding an h2 that fulfills gcd(n,h2)=1 is easy, every odd number will do.
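The probing rule can be sketched as a single function. The name probe_slot and its parameters are illustrative; n is assumed to be a power of two, and the second hash is forced odd by setting its low bit.

```c
#include <stddef.h>

/* Slot visited on the k-th probe (k = 0, 1, 2, ...).
 * n must be a power of two. */
static size_t probe_slot(size_t h1, size_t h2, size_t k, size_t n) {
    size_t step = h2 | 1;                /* odd step, so gcd(n, step) = 1 */
    return (h1 + k * step) & (n - 1);    /* wrap around with a mask */
}
```

Because the step is coprime to n, the first n probes visit every slot exactly once, so a free slot is always found if one exists.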

Resources