Bitwise permutation of multiple 64bit values in parallel / combined


This question is NOT about "How do I do a bitwise permutation?" We know how to do that; what we are looking for is a faster way with fewer CPU instructions, inspired by the bitslice implementation of S-boxes in DES.
To speed up some cipher code we want to reduce the number of permutation calls. The main cipher functions do multiple bitwise permutations based on lookup arrays.
As the permutation operations are only bit shifts, our basic idea is to take multiple input values that need the same permutation and shift them in parallel. For example, if input bit 1 must be moved to output bit 6, that move would be done for all input values at once.
Is there any way to do this? We have no example code right now, because we have absolutely no idea how to accomplish this in a performant way.
The maximum value size we have on our platforms is 128 bits; the longest input value is 64 bits. Therefore the code must be faster than doing the whole permutation 128 times.
EDIT
Here is a simple 8bit example of a permutation
+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | <= Bits
+---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | <= Input
+---+---+---+---+---+---+---+---+
| 3 | 8 | 6 | 2 | 5 | 1 | 4 | 7 | <= Output
+---+---+---+---+---+---+---+---+
The cipher makes use of multiple input keys. It's a block cipher, so the same pattern must be applied to all 64-bit blocks of the input.
As the permutations are the same for each input block, we want to process multiple input blocks in one step, i.e. combine the operations for multiple input sequences. Instead of moving one bit per call 128 times, we want to move 128 bits at once.
EDIT2
We can NOT use threads, as we have to run the code on embedded systems without threading support. For the same reason we have no access to external libraries and have to keep it plain C.
SOLUTION
After testing and playing with the given answers we have done it the following way:
We put the single bits of 128 64-bit values into a uint128_t[64]* array.
For the permutation we just have to copy pointers.
After all is done, we revert the first operation and get 128 permuted values back.
Yeah, it is really that simple. We were testing this way early in the project, but it was too slow. It seems we had a bug in the test code.
Thank you all, for the hints and the patience.
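The three steps of the solution above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's actual code: it uses uint64_t slices (so 64 values per batch rather than 128) to stay within standard C, and the names transpose64, permute64x64 and dst_of are made up for the example.

```c
#include <stdint.h>

/* Bit transpose of a 64x64 bit matrix: bit j of t[i] is bit i of a[j].
   Applying it twice gives back the original array. */
static void transpose64(const uint64_t a[64], uint64_t t[64])
{
    for (int i = 0; i < 64; ++i) {
        uint64_t row = 0;
        for (int j = 0; j < 64; ++j)
            row |= ((a[j] >> i) & 1u) << j;
        t[i] = row;
    }
}

/* Permute 64 values at once.
   dst_of[b] is the output bit position that input bit b moves to
   (hypothetical table layout; it must be a true permutation of 0..63). */
static void permute64x64(const uint64_t in[64], uint64_t out[64],
                         const unsigned char dst_of[64])
{
    uint64_t slice[64], moved[64];

    transpose64(in, slice);              /* slice[b] = bit b of every input */
    for (int b = 0; b < 64; ++b)
        moved[dst_of[b]] = slice[b];     /* one word copy moves bit b of all inputs */
    transpose64(moved, out);             /* back to one word per value */
}
```

The naive transpose shown here costs 64*64 single-bit operations, so the approach only pays off when several permutation rounds are applied between the two transposes.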

You could make Stan's bit-by-bit code faster by using eight look-up tables mapping bytes to 64-bit words. To process a 64-bit word from input, split it into eight bytes and look up each from a different look-up table, then OR the results. On my computer the latter is 10 times faster than the bit-by-bit approach for 32-bit permutations. Obviously if your embedded system has little cache, then 16 kB of look-up tables (8 tables of 256 eight-byte entries) may be a problem. If you process 4 bits at a time, you only need 16 look-up tables of 16*8=128 bytes each, i.e. 2 kB of look-up tables.
EDIT: The inner loop could look something like this:
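#include <stddef.h>
#include <stdint.h>
/* map[k][v] is the permuted 64-bit word produced by byte k of the input having value v */
void permute(uint64_t* input, uint64_t* output, size_t n, uint64_t map[8][256])
{
    for (size_t i = 0; i < n; ++i) {
        uint8_t* p = (uint8_t*)(input+i);  /* the eight bytes of input[i], in memory order */
        output[i] = map[0][p[0]] | map[1][p[1]] | map[2][p[2]] | map[3][p[3]]
                  | map[4][p[4]] | map[5][p[5]] | map[6][p[6]] | map[7][p[7]];
    }
}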
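The tables themselves are not shown above. One way to fill them might look like the sketch below; build_map, permute_one and the dst_of layout are assumptions for illustration. Byte k is taken to hold bits 8k..8k+7, which matches the pointer cast in the loop above only on a little-endian host, so a shift-based lookup is included as an endian-independent alternative.

```c
#include <stdint.h>

/* dst_of[b] = output bit position for input bit b (hypothetical layout). */
static void build_map(const unsigned char dst_of[64], uint64_t map[8][256])
{
    for (int k = 0; k < 8; ++k) {            /* which input byte */
        for (int v = 0; v < 256; ++v) {      /* every possible value of that byte */
            uint64_t word = 0;
            for (int j = 0; j < 8; ++j)
                if (v & (1 << j))
                    word |= (uint64_t)1 << dst_of[8 * k + j];
            map[k][v] = word;
        }
    }
}

/* Same lookup as the inner loop above, but the bytes are extracted with
   shifts, so the result does not depend on the host byte order. */
static uint64_t permute_one(uint64_t x, uint64_t map[8][256])
{
    uint64_t r = 0;
    for (int k = 0; k < 8; ++k)
        r |= map[k][(x >> (8 * k)) & 0xff];
    return r;
}
```

The tables are built once per permutation; after that each 64-bit word costs eight loads and seven ORs.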

I think you might be looking for a bit-slicing implementation. This is how the fastest DES-cracking implementations work. (Or it was before SSE instructions existed, anyway.)
The idea is to write your function in a "bit-wise" manner, representing each output bit as a Boolean expression over the input bits. Since each output bit depends only on the input bits, any function can be represented this way, even things like addition, multiplication, or S-box lookups.
The trick is to use the actual bits of a single register to represent a single bit from multiple input words.
I will illustrate with a simple four-bit function.
Suppose, for example, you want to take four-bit inputs of the form:
x3 x2 x1 x0
...and for each input, compute a four-bit output:
x2 x3 x2^x3 x1^x2
And you want to do this for, say, eight inputs. (OK for four bits a lookup table would be fastest. But this is just to illustrate the principle.)
Suppose your eight inputs are:
A = a3 a2 a1 a0
B = b3 b2 b1 b0
...
H = h3 h2 h1 h0
Here, a3 a2 a1 a0 represent the four bits of the A input, etc.
First, encode all eight inputs into four bytes, where each byte holds one bit from each of the eight inputs:
X3 = a3 b3 c3 d3 e3 f3 g3 h3
X2 = a2 b2 c2 d2 e2 f2 g2 h2
X1 = a1 b1 c1 d1 e1 f1 g1 h1
X0 = a0 b0 c0 d0 e0 f0 g0 h0
Here, a3 b3 c3 ... h3 is the eight bits of X3. It consists of the high bits of all eight inputs. X2 is the next bit from all eight inputs. And so on.
Now to compute the function eight times in parallel, you just do:
Y3 = X2;
Y2 = X3;
Y1 = X2 ^ X3;
Y0 = X1 ^ X2;
Now Y3 holds the high bits from all eight outputs, Y2 holds the next bit from all eight outputs, and so on. We just computed this function on eight different inputs using only four machine instructions!
Better yet, if our CPU is 32-bit (or 64-bit), we could compute this function on 32 (or 64) inputs, still using only four instructions.
Encoding the input and decoding the output to/from the "bit slice" representation takes some time, of course. But for the right sort of function, this approach offers massive bit-level parallelism and thus a massive speedup.
The basic assumption is that you have many inputs (like 32 or 64) on which you want to compute the same function, and that the function is neither too hard nor too easy to represent as a bunch of Boolean operations. (Too hard makes the raw computation slow; too easy makes the time dominated by the bit-slice encoding/decoding itself.) For cryptography in particular, where (a) the data has to go through many "rounds" of processing, (b) the algorithm is often expressed in terms of bit munging already, and (c) you are, for example, trying many keys on the same data, it often works pretty well.
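To make the four-bit example concrete, here is a small sketch of the encode, compute and decode steps. The helper names (slice4, unslice4, f_sliced, f_scalar) are invented for the illustration, and input 0 is placed in bit 0 of each slice, which is the reverse of the left-to-right order drawn above but otherwise equivalent.

```c
#include <stdint.h>

/* Encode: bit i of x[k] is bit k of in[i]. */
static void slice4(const uint8_t in[8], uint8_t x[4])
{
    for (int k = 0; k < 4; ++k) {
        x[k] = 0;
        for (int i = 0; i < 8; ++i)
            x[k] |= (uint8_t)(((in[i] >> k) & 1) << i);
    }
}

/* Decode: the inverse of slice4. */
static void unslice4(const uint8_t y[4], uint8_t out[8])
{
    for (int i = 0; i < 8; ++i) {
        out[i] = 0;
        for (int k = 0; k < 4; ++k)
            out[i] |= (uint8_t)(((y[k] >> i) & 1) << k);
    }
}

/* The function from the text, evaluated on all eight inputs at once. */
static void f_sliced(const uint8_t x[4], uint8_t y[4])
{
    y[3] = x[2];
    y[2] = x[3];
    y[1] = x[2] ^ x[3];
    y[0] = x[1] ^ x[2];
}

/* Reference version working on one four-bit input, for comparison. */
static uint8_t f_scalar(uint8_t v)
{
    unsigned x1 = (v >> 1) & 1, x2 = (v >> 2) & 1, x3 = (v >> 3) & 1;
    return (uint8_t)((x2 << 3) | (x3 << 2) | ((x2 ^ x3) << 1) | (x1 ^ x2));
}
```

Running slice4, f_sliced and unslice4 in sequence should give the same eight results as calling f_scalar on each input separately.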

It seems difficult to do the permutation in only one call. A special case of your problem, reversing bits in an integer, needs more than one 'call' (what do you mean by call?). See Bit Twiddling Hacks by Sean Eron Anderson for information on this example.
If your mapping pattern is not complicated, maybe you can find a fast way to calculate the answer:) However, I don't know whether you like this direct way:
#include <stdio.h>
unsigned char mask[8];
//map bit to position
//0 -> 2
//1 -> 7
//2 -> 5
//...
//7 -> 6
unsigned char map[8] = {
    2,7,5,1,4,0,3,6
};
int main()
{
    int i;
    //input:
    //--------------------
    //bit 7 6 5 4 3 2 1 0
    //--------------------
    //val 0 0 1 0 0 1 1 0
    //--------------------
    unsigned char input = 0x26;
    //so the output should be 0xA1:
    // 1 0 1 0 0 0 0 1
    unsigned char output;
    for(i=0; i<8; i++){ //initialize mask once
        mask[i] = 1<<i;
    }
    //do permutation
    output = 0;
    for(i=0; i<8; i++){
        output |= (input&mask[i])?mask[map[i]]:0;
    }
    printf("output=%x\n", output);
    return 0;
}

Your best bet would be to look into some type of threading scheme ... either you can use a message-passing system where you send each block to a fixed set of worker threads, or you can possibly set up a pipeline with non-locking single producer/consumer queues that perform multiple shifts in a "synchronous" manner. I say "synchronous" because a pipeline on a general-purpose CPU would not be a truly synchronous pipeline operation like you would have on a fixed-function device, but basically for a given "slice" of time, each thread would be working on one stage of the multi-stage problem at the same time, and you would "stream" the source data into and out of the pipeline.

Related

Packing bits after masking in C

Assume I have a number and I want to interpret every other bit as a new number, e.g.
uint16_t a = 0b1111111000000001;
uint16_t mask = 0xAAAA; // 0b1010101010101010
I now want to be able to get every other bit packed into two 8 bit variables, like
uint8_t b = a & mask ... // = 0b11110000
uint8_t c = a & ~mask ... // = 0b11100001
Is there an efficient way of accomplishing this? I know that I can loop and shift but I am going to do this for a lot of numbers. Even better if I can get both b and c at the same time.

You can precompute some tables if you want to avoid too much shifting.
I'll show it for a & mask; the a & ~mask case is identical.
First, compute a & mask to clear the bits of a at the unused positions.
Suppose you have a = a1 0 a2 0 a3 0 a4 0. You want to get the number a1 a2 a3 a4. There are not many possibilities.
You can have a precomputed vector v of short integers that stores, for each entry, the corresponding packed value.
For example, v[0b10100010] will be 13 if the mask is 0b10101010.
If the precomputed vector is not too large it will stay in the L1 cache, so it will be very fast, for example, if you split your number into groups of 8 or 16 bits.
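A minimal sketch of the table idea, working on 8-bit groups, could look like this. The names even_bits, init_even_bits and split_alternate are made up for the example. Rather than one table per mask, it uses a single 256-entry table for the even positions and shifts the input right by one to reuse it for the odd positions.

```c
#include <stdint.h>

/* even_bits[v] = bits 0, 2, 4, 6 of v packed into the low nibble */
static uint8_t even_bits[256];

static void init_even_bits(void)
{
    for (int v = 0; v < 256; ++v) {
        uint8_t r = 0;
        for (int j = 0; j < 4; ++j)
            r |= (uint8_t)(((v >> (2 * j)) & 1) << j);
        even_bits[v] = r;
    }
}

/* Split a 16-bit value into its odd-position bits (mask 0xAAAA)
   and its even-position bits (mask 0x5555), each packed into 8 bits. */
static void split_alternate(uint16_t a, uint8_t *odd, uint8_t *even)
{
    uint16_t e = a & 0x5555;          /* bits 0, 2, 4, ... */
    uint16_t o = (a >> 1) & 0x5555;   /* bits 1, 3, 5, ... moved to even positions */

    *even = (uint8_t)((even_bits[e >> 8] << 4) | even_bits[e & 0xff]);
    *odd  = (uint8_t)((even_bits[o >> 8] << 4) | even_bits[o & 0xff]);
}
```

After calling init_even_bits() once, split_alternate(0xFE01, &b, &c) gives b = 0xF0 and c = 0xE1, matching the values asked for in the question.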

Structure for an array of bits in C

It has come to my attention that there is no builtin structure for a single bit in C. There is (unsigned) char and int, which are 8 bits (one byte), and long which is 64+ bits, and so on (uint64_t, bool...)
I came across this while coding up a huffman tree, and the encodings for certain characters were not necessarily exactly 8 bits long (like 00101), so there was no efficient way to store the encodings. I had to find makeshift solutions such as strings or boolean arrays, but this takes far more memory.
But anyways, my question is more general: is there a good way to store an array of bits, or some sort of user-defined struct? I scoured the web for one but the smallest structure seems to be 8 bits (one byte). I tried things such as int a : 1 but it didn't work. I read about bit fields but they do not simply achieve exactly what I want to do. I know questions have already been asked about this in C++ and if there is a struct for a single bit, but mostly I want to know specifically what would be the most memory-efficient way to store an encoding such as 00101 in C.

If you're mainly interested in accessing a single bit at a time, you can take an array of unsigned char and treat it as a bit array. For example:
unsigned char array[125];
Assuming 8 bits per byte, this can be treated as an array of 1000 bits. The first 16 logically look like this:
---------------------------------------------------------------------------------
byte | 0 | 1 |
---------------------------------------------------------------------------------
bit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---------------------------------------------------------------------------------
Let's say you want to work with bit b. You can then do the following:
Read bit b:
value = (array[b/8] & (1 << (b%8))) != 0;
Set bit b:
array[b/8] |= (1 << (b%8));
Clear bit b:
array[b/8] &= ~(1 << (b%8));
Dividing the bit number by 8 gets you the relevant byte. Similarly, mod'ing the bit number by 8 gives you the relevant bit inside of that byte. You then left shift the value 1 by the bit number to give you the necessary bit mask.
While there is integer division and modulus at work here, the divisor is a power of 2 so any decent compiler should replace them with bit shifting/masking.

It has come to my attention that there is no builtin structure for a single bit in C.
That is true, and it makes sense because substantially no machines have bit-addressable memory.
But anyways, my question is more general: is there a good way to store an array of bits, or some sort of user-defined struct?
One generally uses an unsigned char or another unsigned integer type, or an array of such. Along with that you need some masking and shifting to set or read the values of individual bits.
I scoured the web for one but the smallest structure seems to be 8 bits (one byte).
Technically, the smallest addressable storage unit ([[un]signed] char) could be larger than 8 bits, though you're unlikely ever to see that.
I tried things such as int a : 1 but it didn't work. I read about bit fields but they do not simply achieve exactly what I want to do.
Bit fields can appear only as structure members. A structure object containing such a bitfield will still have a size that is a multiple of the size of a char, so that doesn't map very well onto a bit array or any part of one.
I know questions have already been asked about this in C++ and if there is a struct for a single bit, but mostly I want to know specifically what would be the most memory-efficient way to store an encoding such as 00101 in C.
If you need a bit pattern and a separate bit count -- such as if some of the bits available in the bit-storage object are not actually significant -- then you need a separate datum for the significant-bit count. If you want a data structure for a small but variable number of bits, then you might go with something along these lines:
struct bit_array_small {
unsigned char bits;
unsigned char num_bits;
};
Of course, you can make that larger by choosing a different data type for the bits member and, maybe, the num_bits member. I'm sure you can see how you might extend the concept to handling arbitrary-length bit arrays if you should happen to need that.
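As one possible extension to arbitrary lengths, the sketch below appends variable-length codes to a caller-provided byte buffer. The struct bit_buffer and its two functions are hypothetical names for this illustration; storing each code most significant bit first is an arbitrary choice.

```c
#include <stddef.h>
#include <stdint.h>

struct bit_buffer {
    unsigned char *bytes;     /* caller-provided storage, zero-initialised */
    size_t capacity_bits;     /* total number of bits available in bytes */
    size_t used_bits;         /* number of bits written so far */
};

/* Append the low num_bits bits of `bits`, most significant bit first.
   Returns 0 on success, -1 if the buffer has too little room left. */
static int bit_buffer_append(struct bit_buffer *bb, unsigned bits, unsigned num_bits)
{
    if (num_bits > bb->capacity_bits - bb->used_bits)
        return -1;
    for (unsigned i = num_bits; i-- > 0; ) {
        if ((bits >> i) & 1u)
            bb->bytes[bb->used_bits / 8] |= (unsigned char)(1u << (bb->used_bits % 8));
        bb->used_bits++;
    }
    return 0;
}

/* Read back the bit at position pos (0 or 1). */
static int bit_buffer_get(const struct bit_buffer *bb, size_t pos)
{
    return (bb->bytes[pos / 8] >> (pos % 8)) & 1;
}
```

Appending the code 00101 this way uses exactly five bits of the buffer, and the next code starts at bit 5.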

If you really want the most memory efficiency, you can encode the Huffman tree itself as a stream of bits. See, for example:
https://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
Then just encode those bits as an array of bytes, with a possible waste of 7 bits.
But that would be a horrible idea. For the structure in memory to be useful, it must be easy to access. You can still do that very efficiently. Let's say you want to encode up to 12-bit codes. Use a 16-bit integer and bitfields:
struct huffcode {
    uint16_t length: 4,
             value: 12;
};
C will store this as a single 16-bit value, and allow you to access the length and value fields separately. The complete Huffman node would also contain the input code value, and tree pointers (which, if you want further compactness, can be integer indices into an array).

You can make your own bit array in no time.
#include <string.h>
#define ba_set(ptr, bit) { (ptr)[(bit) >> 3] |= (char)(1 << ((bit) & 7)); }
#define ba_clear(ptr, bit) { (ptr)[(bit) >> 3] &= (char)(~(1 << ((bit) & 7))); }
#define ba_get(ptr, bit) ( ((ptr)[(bit) >> 3] & (char)(1 << ((bit) & 7))) ? 1 : 0 )
#define ba_setbit(ptr, bit, value) { if (value) { ba_set((ptr), (bit)) } else { ba_clear((ptr), (bit)); } }
#define BITARRAY_BITS (120)
int main()
{
    char mybits[(BITARRAY_BITS + 7) / 8];
    memset(mybits, 0, sizeof(mybits));
    ba_setbit(mybits, 33, 1);
    if (!ba_get(mybits, 33))
        return 1;
    return 0;
}

decoding data from old school measurement instrument

I am trying to recover raw data from an older measurement instrument, that is interfaced through a printer port.
For example, the instrument's software will produce a text output file like this:
S 11/08/08 22:27:58 100 2 U 061
D ___^PR_^_^_]PP_]_^_]_^_____^_^_____^_[_\_\_[_Z_Z_X
D _W_U_T_Q^]^]^Z^V^S^T^S]]]Y]U]R]T]Q]V]Z]\]]^R^]_ZPX
D QSQYQ^RSRYSQSWS\S]SZSWSSSPR\RZRXRTQ^QWQPP[PUPRPQ_^
D _\_]_^_____\_\_Z_X_W_Y_X_X_Z_W_U_V_W_X_[_X_W_W_W
F 2
S 11/08/08 22:35:03 100 2 E 049
D QSQQP_P^QPQPQRQUQUQUQVQZQ[Q\Q]RSR\STSXSWSQR_SQSRR[
D RTQ_QWQUQWQUQZRSSQR]RTRSRQQZQRPZPVPTPTPSPWPTPQPQ_^
D _^_^__PPPPPP__PP__PR__PPPQ_____^_]_]PP_^_]_]_]_Y_^
D ___^_^_\_______^PP__PRPQPPPRPP__PPPP___]_^_^__PP
F 2
The "S" line is all good - provides the appropriate time the measurement
was taken along with some other values.
I'm interested in recovering whatever is hidden in
the "D" lines. The software generates a plot using this data, but
does not provide the raw data.
The only code I have detailing the data encoding contains the comment:
/* Packs the 8-bit data into two 7-bit ASCII chars, encoding the channel
* number into it as well, in the format:
*
* 1CCMMMM and 1CCLLLL, where CC = chn, MMMM/LLLL = Most/Least sig nibble
*/
I can send the actual packing code too if it helps - just trying to keep the
question as small as possible.
Any help - even a point in the right direction would be appreciated...

The encoding is actually pretty clever*: every combination of two letters (2*8 bits or 2*7 bits, depending how you look at it) is a single measurement. The comment tells us how the encoding works. For example, if we take 'QS' as an example:
Pattern: 01CCMMMM 01CCLLLL
Example: 01010001 01010011 = Q S
Channel: ..CC.... ..CC....
..01.... ..01.... = Channel 1
Data: ....0001 ....0011 = 10011 = 19
You simply have to take the bits labeled M and the bits labeled L, put them after each other, treat the whole thing as a single-byte number and you've got the original data. Conversely, extract the bits labeled C to get the channel number.
Here's an example of how you could parse a single measurement, assuming two bytes of input are in a and b:
/* To get the channel, mask with 00110000 = 0x30 then shift */
char channel = (a & 0x30) >> 4;
/* To get data, mask both with 00001111 = 0xF then combine */
char orgdata = ((a & 0xF) << 4) | (b & 0xF);
Putting all that together here gives the following data for the first 'frame' in your example, all on channel 1:
I'm hoping that matches what you're seeing on your plot :)
*: I'm not being sarcastic, either - this encoding packs 10 bits of useful data into 14 bits of usable space, while being a good deal simpler than something like base64 and probably faster.
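For a whole "D" line, the two masks above can be wrapped in a loop. The following is only a sketch: decode_pair and decode_line are made-up names, and the check that both characters carry the same channel bits is an assumption that goes beyond what the comment in the question states. Whether the resulting byte should be read as signed or unsigned is not stated in the question either, so the raw byte is returned.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one measurement from two characters (1CCMMMM, 1CCLLLL).
   Returns 0 on success, -1 if the pair does not look valid. */
static int decode_pair(char a, char b, unsigned *channel, uint8_t *data)
{
    unsigned ua = (unsigned char)a, ub = (unsigned char)b;

    if ((ua & 0x40) == 0 || (ub & 0x40) == 0)   /* leading marker bit must be set */
        return -1;
    if ((ua & 0x30) != (ub & 0x30))             /* assumed: same channel in both halves */
        return -1;
    *channel = (ua & 0x30) >> 4;
    *data = (uint8_t)(((ua & 0x0F) << 4) | (ub & 0x0F));
    return 0;
}

/* Decode the payload of a "D" line (the text after "D ").
   Writes at most max_out samples to out and returns how many were written. */
static size_t decode_line(const char *payload, uint8_t *out, size_t max_out)
{
    size_t n = 0;
    unsigned channel;

    while (payload[0] && payload[1] && n < max_out) {
        if (decode_pair(payload[0], payload[1], &channel, &out[n]) != 0)
            break;
        ++n;
        payload += 2;
    }
    return n;
}
```

With this, the pair 'Q','S' decodes to channel 1 and the value 19, as in the worked example above.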

Need help understanding bitmaps, bitwise operations, and C

Disclaimer: I am asking these questions in relation to an assignment. The assignment itself calls for implementing a bitmap and doing some operations with that, but that is not what I am asking about. I just want to understand the concepts so I can try the implementation for myself.
I need help understanding bitmaps/bit arrays and bitwise operations. I understand the basics of binary and how left/right shift work, but I don't know exactly how that use is beneficial.
Basically, I need to implement a bitmap to store the results of a prime sieve (of Eratosthenes.) This is a small part of a larger assignment focused on different IPC methods, but to get to that part I need to get the sieve completed first. I've never had to use bitwise operations nor have I ever learned about bitmaps, so I'm kind of on my own to learn this.
From what I can tell, bitmaps are arrays of a bit of a certain size, right? By that I mean you could have an 8-bit array or a 32-bit array (in my case, I need to find the primes for a 32-bit unsigned int, so I'd need the 32-bit array.) So if this is an array of bits, 32 of them to be specific, then we're basically talking about a string of 32 1s and 0s. How does this translate into a list of primes? I figure that one method would evaluate the binary number and save it to a new array as decimal, so all the decimal primes exist in one array, but that seems like you're using too much data.
Do I have the gist of bitmaps? Or is there something I'm missing? I've tried reading about this around the internet but I can't find a source that makes it clear enough for me...

Suppose you have a list of primes: {3, 5, 7}. You can store these numbers as a character array: char c[] = {3, 5, 7} and this requires 3 bytes.
Instead let's use a single byte such that each set bit indicates that the number is in the set. For example, 01010100. If we can set the bit we want and later test it, we can use this to store the same information in a single byte. To set it:
char b = 0;
// want to set `3` so shift 1 twice to the left
b = b | (1 << 2);
// also set `5`
b = b | (1 << 4);
// and 7
b = b | (1 << 6);
And to test these numbers:
// is 3 in the map:
if (b & (1 << 2)) {
    // it is in...
}

You are going to need a lot more than 32 bits.
You want a sieve for up to 2^32 numbers, so you will need a bit for each one of those. Each bit will represent one number, and will be 0 if the number is prime and 1 if it is composite. (You can save one bit by noting that the first bit must be 2 as 1 is neither prime nor composite. It is easier to waste that one bit.)
2^32 = 4,294,967,296
Divide by 8
536,870,912 bytes, or 1/2 GB.
So you will want an array of 2^29 bytes, or 2^27 4-byte words, or whatever you decide is best, and also a method for manipulating the individual bits stored in the chars (ints) in the array.
It sounds like eventually, you are going to have several threads or processes operating on this shared memory. You may need to store it all in a file if you can't allocate all that memory to yourself.
Say you want to find the bit for x. Then let a = x / 8 and b = x - 8 * a. Then the bit is at arr[a] & (1 << b). (Avoid the modulus operator % wherever possible.)
//mark composite
a = x / 8;
b = x - 8 * a;
arr[a] |= 1 << b;
This sounds like a fun assignment!
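Put together, a small sieve using this bit layout (bit set means composite) might look like the sketch below. The function names and the choice to also mark 0 and 1 as "not prime" are assumptions for the example; shifts and masks are used in place of division and modulus.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns a bitmap covering 0..limit in which bit x is set when x is NOT
   prime (0 and 1 are marked too), or NULL if allocation fails.
   The caller frees the result. */
static unsigned char *sieve(uint32_t limit)
{
    size_t nbytes = (size_t)limit / 8 + 1;
    unsigned char *arr = calloc(nbytes, 1);
    if (!arr)
        return NULL;

    arr[0] |= 3;                                   /* 0 and 1 are not prime */
    for (uint64_t p = 2; p * p <= limit; ++p) {
        if (arr[p >> 3] & (1u << (p & 7)))
            continue;                              /* p already marked composite */
        for (uint64_t x = p * p; x <= limit; x += p)
            arr[x >> 3] |= (unsigned char)(1u << (x & 7));
    }
    return arr;
}

/* Nonzero if x is prime according to the bitmap. */
static int is_prime(const unsigned char *arr, uint32_t x)
{
    return !(arr[x >> 3] & (1u << (x & 7)));
}
```

For the full 32-bit range this allocates the 512 MB mentioned above, so it is worth trying it with a small limit first.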

A bitmap allows you to construct a large predicate function over the range of numbers you're interested in. If you just have a single 8-bit char, you can store Boolean values for each of the eight values. If you have 2 chars, it doubles your range.
So, say you have a bitmap that already has this information stored, your test function could look something like this:
bool num_in_bitmap (int num, char *bitmap, size_t sz) {
    if (num/8 >= sz) return 0;
    return (bitmap[num/8] >> (num%8)) & 1;
}

Hash function for 64 bit to 10 bits

I want a hash function that takes a long number (64 bits) and produces result of 10 bits. What is the best hash function for such purpose. Inputs are basically addresses of variables (Addresses are of 64 bits or 8 bytes on Linux), so my hash function should be optimized for that purpose.

I would say something like this:
uint32_t hash(uint64_t x)
{
    x >>= 3;
    return (x ^ (x>>10) ^ (x>>20)) & 0x3FF;
}
The least significant 3 bits are not very useful, as most variables are 4-byte or 8-byte aligned, so we remove them.
Then we take the next 30 bits and mix them together (XOR) in blocks of 10 bits each.
Naturally, you could also take the (x>>30)^(x>>40)^(x>>50) but I'm not sure if they'll make any difference in practice.

I wrote a toy program to see some real addresses on the stack, data area, and heap. Basically I declared 4 globals, 4 locals and did 2 mallocs. I dropped the last two bits when printing the addresses. Here is an output from one of the runs:
20125e8
20125e6
20125e7
20125e4
3fef2131
3fef2130
3fef212f
3fef212c
25e4802
25e4806
What this tells me:
The LSB in this output (3rd bit of the address) is frequently 'on' and 'off'. So I wouldn't drop it when calculating the hash. Dropping 2 LSBs seems enough.
We also see that there is more entropy in the lower 8-10 bits. We must use that when calculating the hash.
We know that on a typical 64-bit machine (for example x86-64 with 4-level paging), virtual addresses are at most 48 bits wide.
What I would do next:
/* Drop two LSBs. */
a >>= 2;
/* Get rid of the MSBs. Keep 46 bits. */
a &= 0x3fffffffffff;
/* Get the 14 MSBs and fold them in to get a 32 bit integer.
The MSBs are mostly 0s anyway, so we don't lose much entropy. */
msbs = (a >> 32) << 18;
a ^= msbs;
Now we pass this through a decent 'half avalanche' hash function, instead of rolling our own. 'Half avalanche' means each bit of the input gets a chance to affect bits at the same position and higher:
uint32_t half_avalanche( uint32_t a)
{
    a = (a+0x479ab41d) + (a<<8);
    a = (a^0xe4aa10ce) ^ (a>>5);
    a = (a+0x9942f0a6) - (a<<14);
    a = (a^0x5aedd67d) ^ (a>>3);
    a = (a+0x17bea992) + (a<<7);
    return a;
}
For a 10-bit hash, use the 10 MSBs of the uint32_t returned. The hash function continues to work fine if you pick N MSBs for an N-bit hash, effectively doubling the bucket count with each additional bit.
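Combining the folding steps and the mixing function, a complete 10-bit hash of a pointer could be sketched like this. The wrapper name hash_address_10 is made up, and half_avalanche is repeated only so the block compiles on its own.

```c
#include <stdint.h>

static uint32_t half_avalanche(uint32_t a)
{
    a = (a + 0x479ab41d) + (a << 8);
    a = (a ^ 0xe4aa10ce) ^ (a >> 5);
    a = (a + 0x9942f0a6) - (a << 14);
    a = (a ^ 0x5aedd67d) ^ (a >> 3);
    a = (a + 0x17bea992) + (a << 7);
    return a;
}

/* Hash an address down to 10 bits (0..1023). */
static uint32_t hash_address_10(const void *p)
{
    uint64_t a = (uint64_t)(uintptr_t)p;

    a >>= 2;                       /* drop two LSBs */
    a &= 0x3fffffffffffULL;        /* keep 46 bits */
    a ^= (a >> 32) << 18;          /* fold the 14 MSBs into bits 18..31 */
    return half_avalanche((uint32_t)a) >> 22;   /* keep the 10 MSBs of 32 */
}
```

The result can be used directly as an index into a table of 1024 buckets.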
I was a little bored, so I wrote a toy benchmark for this. Nothing fancy, it allocates a bunch of memory on the heap and tries out the hash I described above. The source can be had from here. An example result:
1024 buckets, 256 values generated, 29 collissions
1024 buckets, 512 values generated, 103 collissions
1024 buckets, 1024 values generated, 370 collissions
Next: I tried out the other two hashes answered here. They both have similar performance. Looks like: Just pick the fastest one ;)

Best for most distributions is mod by a prime, 1021 is the largest 10-bit prime. There's no need to strip low bits.
static inline int hashaddress(void *v)
{
    return (uintptr_t)v % 1021;
}
If you think performance might be a concern, have a few alternates on hand and race them in your actual program. Microbenchmarks are a waste; a difference of a few cycles is almost certain to be swamped by cache effects, and size matters.
