I'm writing a Langton's ant sim (for rulestring RLR) and am trying to optimise it for speed. Here's the pertinent code as it stands:
#define AREA_X 65536
#define AREA_Y 65536
#define TURN_LEFT 3
#define TURN_RIGHT 1
int main()
{
uint_fast8_t* state;
uint_fast64_t ant=((AREA_Y/2)*AREA_X) + (AREA_X/2);
uint_fast8_t ant_orientation=0;
uint_fast8_t two_pow_five=32;
uint32_t two_pow_thirty_two=0;/*not fast, relying on exact width for overflow*/
uint_fast8_t change_orientation[4]={0, TURN_RIGHT, TURN_LEFT, TURN_RIGHT};
int_fast64_t move_ant[4]={AREA_X, 1, -AREA_X, -1};
... initialise empty state
while(1)
{
while(two_pow_five--)/*removing this by doing 32 steps per inner loop, ~16% longer*/
{
while(--two_pow_thirty_two)
{
/*one iteration*/
/* 54 seconds for init + 2^32 steps
ant_orientation = ( ant_orientation + (117>>((++state[ant])*2 )) )&3;
state[ant] = (36 >> (state[ant] *2) ) & 3;
ant+=move_ant[ant_orientation];
*/
/* 47 seconds for init + 2^32 steps
ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
state[ant] += (state[ant]==2)?-2:1;
ant+=move_ant[ant_orientation];
*/
/* 46 seconds for init + 2^32 steps
ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
if(state[ant]==2)
{
--state[ant];
--state[ant];
}
else
++state[ant];
ant+=move_ant[ant_orientation];
*/
/* 44 seconds for init + 2^32 steps
ant_orientation = ( ant_orientation + ((++state[ant])==2?3:1) )&3;
if(state[ant]==3)state[ant]=0;
ant+=move_ant[ant_orientation];
*/
// 37 seconds for init + 2^32 steps
// handle every situation with nested switches and constants
switch(ant_orientation)
{
case 0:
switch(state[ant])
{
case 0:
ant_orientation=1;
state[ant]=1;
++ant;
break;
case 1:
ant_orientation=3;
state[ant]=2;
--ant;
break;
case 2:
ant_orientation=1;
state[ant]=0;
++ant;
break;
}
break;
case 1:
switch(state[ant])
{
...
}
break;
case 2:
switch(state[ant])
{
...
}
break;
case 3:
switch(state[ant])
{
...
}
break;
}
}
}
two_pow_five=32;
... dump to file every 2^37 steps
}
return 0;
}
I have two questions:
I've tried to optimise as best I can in C by trial-and-error testing. Are there any tricks I haven't taken advantage of? Please try to talk in C rather than assembly, although I'll probably try assembly at some point.
Is there a better way to model the problem to increase speed?
More info: Portability doesn't matter. I'm on 64 bit linux, using gcc, an i5-2500k and 16 GB of ram. The state array as it stands uses 4GiB, the program could feasibly use 12GiB of ram. sizeof(uint_fast8_t)=1. Bounds checks are intentionally not present, corruption is easy to spot manually from the dumps.
edit: Perhaps counter-intuitively, piling on the switch statements instead of eliminating them has yielded the best efficiency so far.
edit: I've re-modelled the problem and come up with something quicker than a single step per iteration. Before, each state element used two bits and described a single cell in the Langton's ant grid. The new way uses all 8 bits, and describes a 2x2 section of the grid. Every iteration a variable number of steps are done, by looking up pre-computed values of step count, new orientation and new state for the current state+orientation. Assuming everything is equally likely it averages to 2 steps taken per iteration. As a bonus it uses 1/4 of the memory to model the same area:
while(--iteration)
{
// roughly 31 seconds per 2^32 steps
table_offset=(state[ant]*24)+(ant_orientation*3);
it+=twoxtwo_table[table_offset+0];
state[ant]=twoxtwo_table[table_offset+2];
ant+=move_ant2x2[(ant_orientation=twoxtwo_table[table_offset+1])];
}
Haven't tried optimising it yet, the next thing to try is eliminating the offset equation and lookups with nested switches and constants like before (but with 648 inner cases instead of 12).
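When trying variants like the ones above, a slow reference implementation helps to check that an optimised step still produces the same trajectory. The following is a minimal Python sketch, not the poster's code: the small wrapping grid, the names `run_plain` and `run_table`, and the way the table is built are all assumptions made for illustration. It keeps the conventions of the switch version (orientation 0..3, state 0..2, rule string RLR).

```python
# Reference sketch of a Langton's ant with rule string RLR.
# Assumptions (not from the original post): a small grid whose linear
# index wraps around, and the function names used here.

WIDTH = 64
HEIGHT = 64
CELLS = WIDTH * HEIGHT
TURN_RIGHT = 1
TURN_LEFT = 3
MOVE = [WIDTH, 1, -WIDTH, -1]               # same layout as move_ant
TURN = [TURN_RIGHT, TURN_LEFT, TURN_RIGHT]  # R, L, R for states 0, 1, 2


def run_plain(steps):
    """Straightforward version: look up the turn, then advance the state."""
    state = bytearray(CELLS)
    ant = (HEIGHT // 2) * WIDTH + WIDTH // 2
    orientation = 0
    for _ in range(steps):
        orientation = (orientation + TURN[state[ant]]) & 3
        state[ant] = (state[ant] + 1) % 3
        ant = (ant + MOVE[orientation]) % CELLS
    return ant, orientation, bytes(state)


def run_table(steps):
    """Table-driven version: one precomputed entry per (orientation, state)."""
    table = [[None] * 3 for _ in range(4)]
    for o in range(4):
        for s in range(3):
            new_o = (o + TURN[s]) & 3
            table[o][s] = (new_o, (s + 1) % 3, MOVE[new_o])
    state = bytearray(CELLS)
    ant = (HEIGHT // 2) * WIDTH + WIDTH // 2
    orientation = 0
    for _ in range(steps):
        new_o, new_s, delta = table[orientation][state[ant]]
        state[ant] = new_s
        orientation = new_o
        ant = (ant + delta) % CELLS
    return ant, orientation, bytes(state)


if __name__ == "__main__":
    print(run_plain(5000) == run_table(5000))
```

Comparing the full grid after a few thousand steps is usually enough to catch a wrong entry in a hand-written switch or lookup table.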
Or, you can use a single unsigned byte constant as an artificial register instead of branching:
value: 1 3 1 1
bits: 01 11 01 01 ----> 117 decimal value for an unsigned byte
index 3 2 1 0 ---> get first 2 bits to get "1" (no shift)
--> get second 2 bits to get "1" (shift right by 2)
--> get third 2 bits to get "3" (shift right by 4)
--> get last 2 bits to get "1" (shift right by 6)
Then "AND" the result with binary 11 (decimal 3) to get your value.
(117>>( (++state[ant])*2 ) ) & 3 would give you the turn-right or turn-left value.
Example:
++state[ant]= 0: ( 117>>( (0)*2 ) )&3 --> 117 & 3 = 1
++state[ant]= 1: ( 117>>( (1)*2 ) )&3 --> (117>>2) & 3 = 1
++state[ant]= 2: ( 117>>( (2)*2 ) )&3 --> (117>>4) & 3 = 3 -->turn left
++state[ant]= 3: ( 117>>( (3)*2 ) )&3 --> (117>>6) & 3 = 1
At most a shift by six, one multiplication and one AND may be faster than branching.
Don't forget that the constant can be promoted to int automatically, so you may want to add a suffix or a cast.
Since you are using an unsigned type for the %4 modulus, you can use an AND operation instead:
state[ant]=state[ant]&3; instead of state[ant]=state[ant]%4;
With a compiler that does not perform this optimisation itself, this should increase speed.
The hardest part: modulo-3
C = A % B is equivalent to C = A - B * (A / B)
We need state[ant]%3
Result = state[ant] - 3 * (state[ant]/3)
state[ant]/3 is always <=1 for your valid direction states.
Only when state[ant] is 3 then state[ant]/3 is 1, other values give 0.
When multiplied by 3, that part is 0 or 3 (only 3 when state[ant] is 3 otherwise 0)
Result = state[ant] - (0 or 3)
Let's look at all possibilities:
state[ant]=0: 0 - 0 ---> 0 ----> 00100100 shifted by 0 times &3 --> 00000000
state[ant]=1: 1 - 0 ---> 1 ----> 00100100 shifted by 2 times &3 --> 00000001
state[ant]=2: 2 - 0 ---> 2 ----> 00100100 shifted by 4 times &3 --> 00000010
state[ant]=3: 3 - 3 ---> 0 ----> 00100100 shifted by 6 times &3 --> 00000000
00100100 is 36 in decimal.
(36 >> (state[ant] *2) ) & 3 will give you state[ant]%3 for your valid states (0,1,2,3)
Example:
state[ant]=0: 36 >> 0 --> 36 ----> 36& 3 ----> 0 satisfies 0%3
state[ant]=1: 36 >> 2 --> 9 -----> 9 & 3 ----> 1 satisfies 1%3
state[ant]=2: 36 >> 4 --> 2 -----> 2 & 3 ----> 2 satisfies 2%3
state[ant]=3: 36 >> 6 --> 0 -----> 0 & 3 ----> 0 satisfies 3%3
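The two packed constants can be checked quickly outside the simulation. This is a small Python sketch rather than part of the answer; the helper name `packed_lookup` is made up for illustration. It uses 117 (binary 01110101), the constant that appears in the question's code, for the turn values and 36 (binary 00100100) for the modulo-3 values.

```python
# Sketch: treat a byte constant as four packed 2-bit entries.
# packed_lookup is a hypothetical helper name, not from the original posts.

TURN_BITS = 0b01110101   # 117: entries 1, 1, 3, 1 for index 0, 1, 2, 3
MOD3_BITS = 0b00100100   # 36:  entries 0, 1, 2, 0 for index 0, 1, 2, 3


def packed_lookup(constant, index):
    """Return the 2-bit entry stored at position `index` (0..3)."""
    return (constant >> (index * 2)) & 3


turns = [packed_lookup(TURN_BITS, i) for i in range(4)]
mod3 = [packed_lookup(MOD3_BITS, i) for i in range(4)]

print(turns)  # [1, 1, 3, 1]
print(mod3)   # [0, 1, 2, 0]
```

The second list agrees with `i % 3` for i = 0..3, which is the only range the state can take after the increment.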
Edit: I wish SO let me accept 2 answers because neither is complete without the other. I suggest reading both!
I am trying to come up with a fast implementation of a function that, given an unsigned 32-bit integer x, returns the sum of 2^trailing_zeros(i) for i=1..x-1, where trailing_zeros is the count-trailing-zeros operation, defined as returning the number of 0 bits after the least significant 1 bit. This seems like the kind of problem that should lend itself to a clever bit manipulation implementation that takes the same number of instructions regardless of the input, but I haven't been able to derive it.
Mathematically, 2^trailing_zeros(i) is equivalent to the largest factor of 2 that exactly divides i. So we are summing those largest factors for 1..x-1.
i | 1 2 3 4 5 6 7 8 9 10
-----------------------------------------------------------------------
2^trailing_zeroes(i) | 1 2 1 4 1 2 1 8 1 2
-----------------------------------------------------------------------
Sum (desired value) | 0 1 3 4 8 9 11 12 20 21
It is a little easier to see the structure of 2^trailing_zeroes(i) if we 'plot' the values -- horizontal position increasing from left to right corresponding to i and vertical position increasing from top to bottom corresponding to trailing_zeroes(i).
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
16 16 16 16 16 16 16 16
32 32 32 32
64 64
Here it is easier to see the pattern that 2's are always 4 apart, 8's are always 16 apart, etc. However, each pattern starts at a different time -- 8's don't begin until i=8, 16 doesn't begin until i=16, etc. If you don't take into account that the patterns don't start right away you can come up with formulas that don't work -- for example you might think to determine the number of 8's going into the total you should just compute floor(x/16) but i=25 is far enough to the right to include both of the first two 8s.
The best solution I have come up with so far is:
Set n = floor(log2(x)). This can be computed quickly using the count leading zeros operation. This tells us the highest power of two that is going to be involved in the sum.
Set sum = 0
for i = 0..n
    sum += floor((x - 2^i) / 2^(i+1))*2^i + 2^i
The way this works is that for each power, it calculates the horizontal distance on the plot between x and the first appearance of that power, e.g. the distance between x and the first 8 is (x-8). It then divides by the distance between repeating instances of that power, e.g. floor((x-8)/16), which gives us how many more times that power appeared, and multiplies by the power to get the sum for that power, e.g. floor((x-8)/16)*8. Then we add one instance of the given power because that calculation excludes the very first time that power appears.
In practice this implementation should be pretty fast because the division/floor can be done by right bit shift and powers of two can be done with 1 bit-shifted to the left. However it seems like it should still be possible to do better. This implementation will loop more for larger inputs, up to 32 times (it's O(log2(n)), ideally we want O(1) without a gigantic lookup table using up all the CPU cache). I've been eyeing the BMI/BMI2 intrinsics but I don't see an obvious way to apply them.
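As a rough illustration of the shift-based form described here, the loop can be written with integer operations only. This is a sketch under a few assumptions: the function names are invented, the loop starts at i = 0 (as the prototype's z(x) does) so that the 1s are counted, and the result is the sum for i = 1..x inclusive, so the value asked for in the question is obtained by passing x-1.

```python
# Sketch of the per-power loop using shifts instead of floor/pow.
# pow2_tz_sum_through and brute_through are hypothetical names.


def pow2_tz_sum_through(x):
    """Sum of 2^trailing_zeros(i) for i = 1..x (inclusive)."""
    n = x.bit_length() - 1          # floor(log2(x)); -1 when x == 0
    total = 0
    for i in range(n + 1):
        power = 1 << i
        # Occurrences after the first one, times the power, plus the first.
        total += (((x - power) >> (i + 1)) << i) + power
    return total


def brute_through(x):
    """Reference: i & -i isolates the lowest set bit, i.e. 2^trailing_zeros(i)."""
    return sum(i & -i for i in range(1, x + 1))


if __name__ == "__main__":
    for x in (1, 2, 8, 9, 10):
        # Desired value from the question: sum over i = 1..x-1.
        print(x, pow2_tz_sum_through(x - 1))
```

For x = 10 this prints 21, in line with the table in the question.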
Although my goal is to implement this in a compiled language like C++ or Rust with real bit shifting and intrinsics, I've been prototyping in Python. Included below is my script that includes the implementation I described, z(x), and the code for generating the plot, tower(x).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from math import pow, floor, log, ceil

def trailing_zeros(x):
    # Number of 0 bits after the least significant 1 bit of x.
    return len(bin(x).split('b')[-1].split('1')[-1])

def f(x):
    # Brute-force reference: sum of 2^trailing_zeros(i) for i = 1..x-1.
    s = 0
    for i in range(1, x):
        s += pow(2, trailing_zeros(i))
    return s

def g(x): return sum([pow(2,i)*floor((x+pow(2,i)-1)/pow(2,i+1)) for i in range(0,32)])

def h(x):
    s = 0
    extra = 0
    extra_s = 0
    for i in range(0,32):
        num = (x+pow(2,i)-1)
        den = pow(2,i+1)
        fraction = num/den
        floored = floor(num/den)
        power = pow(2,i)
        product = power*floored
        if product == 0:
            break
        s += product
        extra += (fraction - floored)
        extra_s += power*fraction
        #print(f"i={i} s={s} num={num} den={den} fraction={fraction} floored={floored} power={power} product={product} extra={extra} extra_s={extra_s}")
    return s

def z(x):
    # The implementation described above; sums over i = 1..x inclusive.
    upper_bound = floor(log(x,2)) if x > 0 else 0
    s = 0
    for i in range(upper_bound+1):
        num = (x - pow(2,i))
        den = pow(2,i+1)
        fraction = num/den
        floored = floor(fraction)
        added = pow(2,i)
        s += floored * added
        s += added
        print(f"i={i} s={s} upper_bound={upper_bound} num={num} den={den} floored={floored} added={added}")
    return s
    # return sum([floor((x - pow(2,i))/pow(2,i+1) + pow(2,i)) for i in range(floor(log(x, 2)))])

def tower(x):
    # Print the plot: one row per power of two, one column per i.
    table = [[" " for i in range(x)] for j in range(ceil(log(x,2)))]
    for i in range(1,x):
        p = trailing_zeros(i)
        table[p][i] = 2**p
    for row in table:
        for col in row:
            print(col,end='')
        print()

# h(9000)
for i in range(1,16):
    tower(i)
    print((i, f(i), g(i), h(i), z(i-1)))
Based on the method of Eric Postpischil, here is a way to do it without a loop.
Note that every bit is being multiplied by its position, and the results are summed (sort of, except there is also a factor of 0.5 in it, let's put that aside for now). Let's call those values that are being added up "the partial products" just to call them something, it's not really accurate to call them that, I can't come up with anything better. If we transpose that a little bit, then it's built up like this: the lowest bit of every partial product is the lowest bit of the position of every bit multiplied by that bit. Single-bit-products are bitwise-AND, and the values of the lowest bits of the positions are 0,1,0,1 etc, so it works out to x & 0xAAAAAAAA, the second bit of every partial product is x & 0xCCCCCCCC (and has a "weight" of 2, so this must be multiplied by 2) etc.
Then the whole thing needs to be shifted right by 1, to account for the factor of 0.5
So in total:
unsigned CountCumulativeTrailingZeros(unsigned x)
{
--x;
unsigned sum = x;
sum += (x >> 1) & 0x55555555;
sum += x & 0xCCCCCCCC;
sum += (x & 0xF0F0F0F0) << 1;
sum += (x & 0xFF00FF00) << 2;
sum += (x & 0xFFFF0000) << 3;
return sum;
}
For an additional explanation, here is a more visual example. Let's temporarily drop the factor of 0.5 again, it doesn't fundamentally change the algorithm but adds some complication.
First I write above every bit of v (some example value), the position of that bit in binary (p0 is the least significant bit of the position, p1 the second bit etc). Read the ps vertically, every column is a number:
p0: 10101010101010101010101010101010
p1: 11001100110011001100110011001100
p2: 11110000111100001111000011110000
p3: 11111111000000001111111100000000
p4: 11111111111111110000000000000000
v : 00000000100001000000001000000000
So for example bit 9 is set, and it has (reading from bottom to top) 01001 above it (9 in binary).
What we want to do (why this works has been explained by Eric's answer), is take the indexes of the bits that are set, shift them to their corresponding positions, and add them. In this case, they are already at their own positions (by construction, the numbers were written at their own positions), so there is no shift, but they still need to be filtered so only the numbers that correspond to set bits survive. This is what I meant by the "single bit products": take a bit of v and multiply it by the corresponding bits of p0, p1, etc.
You can look at that as multiplying the bit value by its index as well so 2^bit * bit as mentioned in the comments. That is not how it is done here, but that is effectively what is done.
Back to the example, applying bitwise-AND results in these partial products:
pp0: 00000000100000000000001000000000
pp1: 00000000100001000000000000000000
pp2: 00000000100000000000000000000000
pp3: 00000000000000000000001000000000
pp4: 00000000100001000000000000000000
v : 00000000100001000000001000000000
The only values that are left are 01001, 10010, 10111, and they are at their corresponding positions (so, already shifted to where they need to go).
Those values must be added, while keeping them at their positions. They don't need to be extracted from the strange form which they are in: addition is freely reorderable (associative and commutative), so it's OK to add all the least significant bits of the partial products to the sum first, then all the second bits, and so on. But they have to be added with the right "weight"; after all, a set bit in pp0 corresponds to a 1 at that position but a set bit in pp1 really corresponds to a 2 at that position (since it's the second bit of the number that it is part of). So pp0 is used directly, but pp1 is shifted left by 1, pp2 is shifted left by 2, etc.
The factor of 0.5 must still be accounted for, which I did mostly by shifting over the bits of the partial products by one less than what their weight would imply. pp0 was shifted left by zero, so it must be shifted right by 1 now. This could be done with less complication by just putting return sum >> 1; at the end, but that would reduce the range of values that the function can handle before running into integer wrapping modulo 2^32 (also it would cost an extra operation, and doing it the weird way does not).
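To convince oneself that the loop-free version matches the definition, it can be ported and compared with a brute-force sum. The following is a Python sketch, not part of the answer: the names are made up, and the explicit masking with 0xFFFFFFFF is an assumption added to imitate 32-bit unsigned wrapping, since Python integers do not wrap.

```python
# Sketch: Python port of the loop-free C function, checked against brute force.
# cumulative_pow2_tz and brute are hypothetical names.

MASK32 = 0xFFFFFFFF


def cumulative_pow2_tz(x):
    """Sum of 2^trailing_zeros(i) for i = 1..x-1, modulo 2^32."""
    x = (x - 1) & MASK32
    total = x                           # each set bit contributes its own value
    total += (x >> 1) & 0x55555555      # position bit 0, weight 1, halved
    total += x & 0xCCCCCCCC             # position bit 1, weight 2, halved
    total += (x & 0xF0F0F0F0) << 1      # position bit 2, weight 4, halved
    total += (x & 0xFF00FF00) << 2      # position bit 3, weight 8, halved
    total += (x & 0xFFFF0000) << 3      # position bit 4, weight 16, halved
    return total & MASK32


def brute(x):
    """Reference: i & -i equals 2^trailing_zeros(i)."""
    return sum(i & -i for i in range(1, x))


if __name__ == "__main__":
    print(cumulative_pow2_tz(10), brute(10))  # both 21
```

The comparison only makes sense for inputs small enough that the true sum stays below 2^32; beyond that the port wraps like the C code would.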
Observe that if we count from 1 to x instead of to x−1, we have a pattern:
x     sum   sum/x
1     1     1
2     3     1.5
4     8     2
8     20    2.5
16    48    3
So we can easily calculate the sum for any power of two p as p • (1 + ½b), where b is the power (equivalently, the number of the bit that is set or the log2 of the power). We can see this by induction: If the sum from 1 to 2^b is 2^b•(1+½b) (which it is for b=0), then the sum from 1 to 2^(b+1) reprises the individual term contributions twice except that the last term adds 2^(b+1) instead of 2^b, so the sum is 2•2^b•(1+½b) − 2^b + 2^(b+1) = 2^(b+1)•(1+½b) + ½•2^(b+1) = 2^(b+1)•(1+½(b+1)).
Further, between any two powers of two, the lower bits reprise the previous partial sums. Thus, for any x, we can compute the cumulative sum by adding up the sums for the set bits in it. Recalling that this provides the sum for numbers from 1 to x, we adjust to get the desired sum from 1 to x−1 by subtracting one from x before computation:
unsigned CountCumulative(unsigned x)
{
--x;
unsigned sum = 0;
for (unsigned bit = 0; bit < sizeof x * CHAR_BIT; ++bit)
sum += (x & 1u << bit) * (1 + bit * .5);
return sum;
}
We can terminate the loop when x is exhausted:
unsigned CountCumulative(unsigned x)
{
--x;
unsigned sum = 0;
for (unsigned bit = 0; x; ++bit, x >>= 1)
sum += ((x & 1) << bit) * (1 + bit * .5);
return sum;
}
As harold points out, we can factor out the 1, as summing the value of each bit of x equals x:
unsigned CountCumulative(unsigned x)
{
--x;
unsigned sum = x;
for (unsigned bit = 0; x; ++bit, x >>= 1)
sum += ((x & 1) << bit) * bit * .5;
return sum;
}
Then eliminate the floating-point:
unsigned CountCumulative(unsigned x)
{
unsigned sum = --x;
for (unsigned bit = 0; x; ++bit, x >>= 1)
sum += ((x & 1) << bit) / 2 * bit;
return sum;
}
Note that when bit is zero, ((x & 1) << bit) / 2 will lose the fraction, but this is irrelevant as * bit makes the contribution zero anyway. For all other values of bit, (x & 1) << bit is even, so the division does not lose anything.
This will overflow unsigned at some point, so one might want to use a wider type for the calculations.
More Code Golf
Another way to add half the values of the bits of x repeatedly depending on their bit position is to shift x (to halve its bit values) and then add that repeatedly while removing successive bits from low to high:
unsigned CountCumulative(unsigned x)
{
unsigned sum = --x;
for (unsigned bit = 0; x >>= 1; ++bit)
sum += x << bit;
return sum;
}
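The closed form for powers of two and the final loop can both be checked numerically. This is a Python sketch with invented names, not the answer's code; because Python integers are unbounded, it does not show the unsigned overflow mentioned above.

```python
# Sketch: check p * (1 + b/2) for powers of two and port the last C loop.
# sum_through, closed_form_power_of_two and count_cumulative are
# hypothetical names chosen for this example.


def sum_through(x):
    """Brute force: sum of 2^trailing_zeros(i) for i = 1..x (inclusive)."""
    return sum(i & -i for i in range(1, x + 1))


def closed_form_power_of_two(b):
    """p * (1 + b/2) for p = 2^b, written with integers only."""
    p = 1 << b
    return p + (b * p) // 2


def count_cumulative(x):
    """Port of the final loop: sum for i = 1..x-1."""
    x -= 1
    total = x
    bit = 0
    x >>= 1                  # the C loop shifts before testing the condition
    while x:
        total += x << bit
        bit += 1
        x >>= 1
    return total


if __name__ == "__main__":
    for b in range(5):
        print(1 << b, closed_form_power_of_two(b), sum_through(1 << b))
    print(count_cumulative(10))  # 21
```

The first loop reproduces the x and sum columns of the table in this answer (1, 3, 8, 20, 48).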
I have recorded data containing a vector of bit-sequences, which I would like to re-arrange efficiently. One value in the vector of data could look like this:
bit0, bit1, bit2, ... bit7
I would like to re-arrange this bit-sequence into this order:
bit0, bit7, bit1, bit6, bit2, bit5, bit3, bit4
If I had only one value this would work nicely via:
sum(uint32(bitset(0,1:8,bitget(uint32(X), [1 8 2 7 3 6 4 5]))))
Unfortunately bitset and bitget are not capable of handling vectors of bit-sequences. Since I have a fairly large dataset I am interested in efficient solutions.
Any help would be appreciated, thanks!
dec2bin and bin2dec can process vectors, you can input all numbers at once and permute the matrix:
input=1:23;
pattern = [1 8 2 7 3 6 4 5];
bit=dec2bin(input(:),numel(pattern));
if size(bit,2)>numel(pattern)
warning('input numbers too large for pattern, leading bits will be cut off')
end
output=bin2dec(bit(:,pattern));
If available, I would use de2bi and bi2de instead.
I don't know if I may get the question wrong, but isn't it just solvable by indexing wrapped into cellfun?
%// example data
BIN{1} = dec2bin(84,8)
BIN{2} = dec2bin(42,8)
%// pattern and reordering
pattern = [1 8 2 7 3 6 4 5];
output = cellfun(@(x) x(pattern), BIN, 'uni', 0)
Or what is the format of your input and desired output?
BIN =
'01010100' '00101010'
output =
'00100110' '00011001'
The most efficient way is probably to use bitget and bitset as you did in your question, although you only need an 8-bit integer. Suppose you have a uint8 array X which describes your recorded data (for the example below, X = uint8([169;5]), for no particular reason). We can inspect the bits by creating a useful anonymous function:
>> dispbits = @(W) arrayfun(@(X) disp(bitget(X,1:8)),W)
dispbits =
@(W)arrayfun(@(X)disp(bitget(X,1:8)),W)
>> dispbits(X)
1 0 0 1 0 1 0 1
1 0 1 0 0 0 0 0
and suppose you have some pattern pattern according to which you want to reorder the bits stored in this vector of integers:
>> pattern
pattern =
1 8 2 7 3 6 4 5
You can use arrayfun and find to reorder the bits according to pattern:
Y = arrayfun(@(X) uint8(sum(bitset(uint8(0),find(bitget(X,pattern))))), X)
Y =
99
17
We get the desired answer stored efficiently in a vector of 8 bit integers:
>> class(Y)
ans =
uint8
>> dispbits(Y)
1 1 0 0 0 1 1 0
1 0 0 0 1 0 0 0
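Since a byte has only 256 possible values, another option for large datasets is to precompute the permutation once and then apply it by table lookup. The sketch below is in Python rather than MATLAB and is only meant to illustrate that idea; the names `permute_byte`, `LUT` and `permute_all` are assumptions, and the pattern is the 1-based one from the question.

```python
# Sketch: permute the bits of every byte through a 256-entry lookup table.
# All names here are hypothetical; the pattern is the one from the question.

PATTERN = [1, 8, 2, 7, 3, 6, 4, 5]   # 1-based source bit for each output bit


def permute_byte(value, pattern=PATTERN):
    """Build the output byte bit by bit (output bit k takes source bit pattern[k])."""
    out = 0
    for dst, src in enumerate(pattern):
        out |= ((value >> (src - 1)) & 1) << dst
    return out


# Precompute the result for every possible byte once.
LUT = bytes(permute_byte(v) for v in range(256))


def permute_all(data):
    """Apply the permutation to a whole sequence of bytes via the table."""
    return bytes(data).translate(LUT)


if __name__ == "__main__":
    print(list(permute_all([169, 5])))  # [99, 17]
```

The printed values agree with the Y computed for X = uint8([169;5]) in the answer above.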
I'm trying to calculate CRC32 values in parallel using OpenCL.
The GPU code is:
__kernel void crc32_Sarwate( __global int* lenghtIn,
__global unsigned char *In,
__global int *OutCrc32,
int size ) {
int i, j, len;
i = get_global_id( 0 );
if( i >= size )
return;
len = j = 0;
while( j != i )
len += lenghtIn[ j++ ];
OutCrc32[ i ] = crc32( In + len, lenghtIn[ i ] ); }
I got these results (time) with a thousand repetitions:
for 4 using work-item: 29.82
for 8 using work-item: 29.9
for 16 using work-item: 35.16
for 32 using work-item: 35.93
for 64 using work-item: 38.69
for 128 using work-item: 52.83
for 256 using work-item: 152.08
for 512 using work-item: 333.63
I have Intel HD Graphics at 350 MHz with 3 work-groups of 256 work-items each.
I assumed that increasing the number of work-items from 128 to 256 would cause only a slight increase in time, but the time tripled. Why?
( I'm sorry for my very bad English ).
The
while( j != i )
len += lenghtIn[ j++ ];
part runs get_global_id( 0 ) times.
With 128 work items, the last work item to complete does 128 loop iterations.
With 256, it does 256 iterations, so from the memory's point of view that is a 100% increase, but only for the last work item. When we add up the memory accesses of all work items:
1 item: 0 to 0 ---> 1 access
2 items: 0 to 0 and 0 to 1 ---> 3 accesses
4 items: 0 to 0, 0 to 1, 0 to 2 and 0 to 3 ---> 10 accesses
8 items: SUM(1 to 8) => 36 accesses
16 items: SUM(1 to 16) => 136 accesses (even more than +200%)
32 items: => 528 (~400%)
64 items: => 2080 (~400%)
128 items: => 8256 (~400%) (the cache of your iGPU starts failing here)
256 items: => 32896 (~400%) (now caching is saturated and you start seeing 400% per doubling of work items)
512 items: => uses the second compute unit too! But 400% of the work is done, so it does not take only 200% of the time.
So each time you increase the work items by 100%, you increase the total memory accesses to 400%. Caching helps up to a point, but once you cross it, the cost of the memory accesses increases sharply. Also, the execution overhead (drivers, ...) becomes unimportant.
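The access counts listed above follow the triangular-number pattern n(n+1)/2. A short Python sketch can reproduce them; the function names are invented, and the model (work item i reads i earlier lengths plus its own) is an assumption that matches the counts given in this answer rather than a measurement.

```python
# Sketch: count the lenghtIn reads made by the kernel for n work items.
# reads_for_item and total_reads are hypothetical names.


def reads_for_item(i):
    """Work item i loops over entries 0..i-1, then reads its own length."""
    return i + 1


def total_reads(n):
    """Total reads over work items 0..n-1; equals n * (n + 1) // 2."""
    return sum(reads_for_item(i) for i in range(n))


if __name__ == "__main__":
    previous = None
    for n in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
        reads = total_reads(n)
        ratio = "" if previous is None else f"x{reads / previous:.2f}"
        print(n, reads, ratio)
        previous = reads
```

The ratio between consecutive rows approaches 4, which is the roughly 400% growth per doubling described here.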
You are not accessing memory in parallel. You need to cache it first, but that may not be possible on that hardware, so you should distribute the job equally among work items and make memory accesses contiguous between cores (vectorize). This should give better performance.
For now, each vector unit does:
unit : v0 v1 v2 v3 v4 ... v7
read address: 0 0 0 0 0 0
- 1 1 1 1 1
- - 2 2 2 2
- - - 3 3 3
- - - - 4 4
....
- - - - - ... 7
done in 8 steps on 8 streaming cores.
At the last step, only a single work item is actually computing something. This should be something like:
Some Optimization
unit : v0 v1 v2 v3 no need other work items
read address: 0 0 0 0 \
1 1 1 1 \
2 2 2 2 \
3 3 3 3 / this is 5th work item's work
4 4 4 4 /
5 5 5 0 \
6 6 0 1 \ this is 0 to 3 as 4th work
7 0 1 2 /
first item<-- 0 1 2 3 /
done in 8 steps in only 4 streaming cores and is doing same job for the first
half part(probably faster).
Further Optimization Suggestion
I think it would be better with a prefix-scan(sum) algorithm on another kernel before getting to crc32() part. (probably in just 3 steps for this example rather than 8 and also more efficient)
Using precomputed values of
while( j != i )
len += lenghtIn[ j++ ];
should make crc32 immune to the current algorithm complexity (O(n²)).
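What the prefix scan would produce can be shown on the host side. This is a Python sketch of the idea only, not OpenCL and not a parallel scan: the function names are made up, and `itertools.accumulate` with the `initial` argument (Python 3.8 or later) stands in for whatever scan kernel would be used on the device.

```python
# Sketch: the per-work-item offset loop versus one exclusive prefix sum.
# offsets_by_loop and offsets_by_scan are hypothetical names.
from itertools import accumulate


def offsets_by_loop(lengths):
    """What each work item currently computes on its own (quadratic in total)."""
    result = []
    for i in range(len(lengths)):
        total = 0
        j = 0
        while j != i:
            total += lengths[j]
            j += 1
        result.append(total)
    return result


def offsets_by_scan(lengths):
    """Exclusive prefix sum: entry i is the sum of lengths[0..i-1]."""
    return list(accumulate(lengths, initial=0))[:-1]


if __name__ == "__main__":
    lengths = [3, 5, 2, 7]
    print(offsets_by_loop(lengths))  # [0, 3, 8, 10]
    print(offsets_by_scan(lengths))  # [0, 3, 8, 10]
```

With the offsets available in a buffer, each work item would need a single read to find where its data starts, instead of a loop over all earlier lengths.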
This was asked in one of the interviews I gave. I couldn't answer it properly.
I want to find out how many bits are enabled based on a number.
Suppose the number is 2: I should return 3.
If the number is 3, I should return 7.
8 4 2 1
1 1
8 4 2 1
1 1 1
Is there any easy way of doing it?
Yes, there is: subtract 1 from the corresponding power of 2, like this:
int allBitsSet = (1U << n) - 1;
The expression 1U << n computes the value of 2 to the power of n, which always has this form in binary:
1000...00
i.e. one followed by n zeros. When you subtract 1 from a number of that form, you "borrow" from the bit that is set to 1 making it zero, and flip the remaining bits to 1.
You can visualize this by solving an analogous problem in decimal system: "make a number that has n nines". The solution is the same, except now you need to use 10 instead of 2.
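Both the binary trick and its decimal analogue are easy to try out. The following is a small Python sketch with invented function names; note that Python integers have no fixed width, whereas in C the shift count must stay below the width of the type.

```python
# Sketch: a number with the n lowest bits set, plus the decimal analogue.
# low_bits_set and nines are hypothetical names.


def low_bits_set(n):
    """2^n - 1: the borrow turns 1000...0 (n zeros) into n ones."""
    return (1 << n) - 1


def nines(n):
    """Decimal analogue: 10^n - 1 is a number made of n nines."""
    return 10 ** n - 1


if __name__ == "__main__":
    for n in (2, 3, 8):
        print(n, low_bits_set(n), bin(low_bits_set(n)), nines(n))
```

For n = 2 and n = 3 this gives 3 and 7, the values asked for in the question.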
I found the lookup table here. The table is generated as a reverse bits table of 8 bits.
I can not figure out why it works. Please explain the theory behind it. Thanks
static const unsigned char BitReverseTable256[256] =
{
# define R2(n) n, n + 2*64, n + 1*64, n + 3*64
# define R4(n) R2(n), R2(n + 2*16), R2(n + 1*16), R2(n + 3*16)
# define R6(n) R4(n), R4(n + 2*4 ), R4(n + 1*4 ), R4(n + 3*4 )
R6(0), R6(2), R6(1), R6(3)
};
First off, a comment: this kind of thing is normally only done in the IOCCC. Code like this should not be used in production environments because it is non-obvious. The reason I mention this is to remove the false impression that it has any performance or space benefit: the compiled code will contain the same (number of) bytes you would get by writing the 256 numbers directly into the array.
Ok, now to how it works. It works recursively of course, defining two bits at a top level R6, then two more at the next... But how in detail? Ok:
The first clue you get is the interesting 0->2->1->3 sequence. You should ask yourself "why?". This is the building block that is required for the construction. The numbers 0 1 2 3 in binary are 00 01 10 11 and if you reverse each: 00 10 01 11 which is 0 2 1 3!
Now let's take a look at what we want the table to do: It should become something like this:
00000000 10000000 01000000 11000000
00100000 10100000 01100000 11100000
00010000 10010000 01010000 11010000
00110000 10110000 01110000 11110000 ...
because you want it to map index 0 to 0, index 00000001 to 10000000 and so on.
Notice the most significant (leftmost) 2 bits of each number: they are 00 10 01 11 on every line!
Now notice that the second most significant 2 bits of each number increase the same way (00 10 01 11) but for the "columns".
The reason why I chose to order the array in rows of length 4 is that we found out that 2 bits are written at a time, and 2 bits can create 4 patterns.
If you then continue observing the remaining numbers of the table (256 entries total) you will see that the 3rd 2 bits can be found having the 00 10 01 11 sequence if you order the table in columns of 16 and the last 2 bits when you order it in columns of 64.
Now I have implicitly told you where the numbers 16 and 64 in the original macro expansion came from.
Those are the details. To generalize: the highest level of the recursion generates the least significant 2 bits, the middle two levels do their thing and the lowest level generates the most significant 2 bits.
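The level-by-level description can be turned into four nested loops and compared with a direct bit reversal. This is an illustrative Python sketch, not the macro itself; the function names are assumptions, and each loop level stands for one macro level contributing a reversed 2-bit group with its weight.

```python
# Sketch: rebuild the table from the 0, 2, 1, 3 building block and compare
# it with a straightforward reversal. Function names are hypothetical.

ORDER = (0, 2, 1, 3)   # 2-bit values 00 01 10 11 with their bits reversed


def table_from_levels():
    """Outer loop = top-level R6 arguments, inner loop = R2 expansion."""
    table = []
    for a in ORDER:                  # least significant 2 bits (weight 1)
        for b in ORDER:              # added inside R6 (weight 4)
            for c in ORDER:          # added inside R4 (weight 16)
                for d in ORDER:      # added inside R2 (weight 64)
                    table.append(a + 4 * b + 16 * c + 64 * d)
    return table


def reverse_byte(value):
    """Direct reversal via the 8-character binary string."""
    return int(format(value, "08b")[::-1], 2)


if __name__ == "__main__":
    table = table_from_levels()
    print(table[:8])   # [0, 128, 64, 192, 32, 160, 96, 224]
    print(table == [reverse_byte(v) for v in range(256)])
```

The innermost loop varies fastest, which is why consecutive table entries differ in their most significant bits first.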
If you are working with Python and happen to end up here, this is what generating the lookup table could look like. Nevertheless, Bernd Elkemann's explanation still stands.
# Generating the REVERSE_BIT_LUT while pre-processing
# LUT is shorthand for lookup table
def R2(n, REVERSE_BIT_LUT):
    # Innermost level: appends four entries, adding the most significant 2 bits
    REVERSE_BIT_LUT.extend([n, n + 2 * 64, n + 1 * 64, n + 3 * 64])


def R4(n, REVERSE_BIT_LUT):
    return (
        R2(n, REVERSE_BIT_LUT),
        R2(n + 2 * 16, REVERSE_BIT_LUT),
        R2(n + 1 * 16, REVERSE_BIT_LUT),
        R2(n + 3 * 16, REVERSE_BIT_LUT),
    )


def R6(n, REVERSE_BIT_LUT):
    return (
        R4(n, REVERSE_BIT_LUT),
        R4(n + 2 * 4, REVERSE_BIT_LUT),
        R4(n + 1 * 4, REVERSE_BIT_LUT),
        R4(n + 3 * 4, REVERSE_BIT_LUT),
    )


def LOOK_UP(REVERSE_BIT_LUT):
    return (
        R6(0, REVERSE_BIT_LUT),
        R6(2, REVERSE_BIT_LUT),
        R6(1, REVERSE_BIT_LUT),
        R6(3, REVERSE_BIT_LUT),
    )


# LOOK_UP is the function to generate the REVERSE_BIT_LUT
REVERSE_BIT_LUT = list()
LOOK_UP(REVERSE_BIT_LUT)
I am working on providing code snippets of Sean's bit hacks in Python here.
The reverse-bits table is just one of many possible constants generated offline. People found an algorithm to define it using unrolling macros, but it won't be possible to find such an algorithm for every other constant, so you will have to maintain some generator infrastructure anyway.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#define BYTES_PER_LINE 16
#define BYTES_GLUE ", "
#define LINE_PREFIX " "
#define LINE_TERMINATOR ",\n"
#define PRINT(string) fwrite(string, 1, sizeof(string) - 1, stdout) /* - 1: do not write the terminating NUL */
static inline void print_reversed_byte(uint8_t byte) {
uint8_t reversed_byte = 0;
for (uint8_t bit_index = 0; bit_index < 8; bit_index++) {
uint8_t bit_value = (byte >> bit_index) & 1;
reversed_byte |= bit_value << (7 - bit_index);
}
printf("0x%02x", reversed_byte);
}
int main() {
uint8_t index = 0;
while (true) {
if (index != 0) {
if (index % BYTES_PER_LINE == 0) {
PRINT(LINE_TERMINATOR);
PRINT(LINE_PREFIX);
} else {
PRINT(BYTES_GLUE);
}
}
print_reversed_byte(index);
if (index == 255) {
break;
}
index++;
}
return 0;
}
Use it in generated_constants.c.in with cmake:
const uint8_t REVERSE_BITS_TABLE[256] = {
@CMAKE_REVERSE_BITS_TABLE@
};
You will get a pretty and compact table.
See its usage in LZWS for example.