CRC calculation reduction

I have a math- and programming-related question about CRC calculations: how to avoid recomputing the full CRC of a block when only a small portion of it changes.
My problem is the following: I have a 1K block of 4-byte structures, each one representing a data field. The block has a CRC16 at the end, computed over the full 1K. When I change only one 4-byte structure, I would normally have to recompute the CRC of the whole block, but I'm looking for a more efficient solution. Something where:
1. I take the current CRC16 of the full 1K block
2. I compute something on the old 4-byte structure
3. I "subtract" the something obtained at step 2 from the full 1K CRC16
4. I compute something on the new 4-byte structure
5. I "add" the something obtained at step 4 to the result obtained at step 3
To summarize, I am thinking about something like this:
CRC(new-full) = [CRC(old-full) - CRC(block-old) + CRC(block-new)]
But I'm missing the math behind it and what to do to obtain this result, ideally as a general formula.
Thanks in advance.

Take your initial 1024-byte block A and your new 1024-byte block B. Exclusive-or them to get block C. Since you only changed four bytes, C will be a bunch of zeros, four bytes which are the exclusive-or of the previous and new four bytes, and a bunch more zeros.
Now compute the CRC-16 of block C, but without any pre or post-processing. We will call that CRC-16'. (I would need to see the specific CRC-16 you're using to see what that processing is, if anything.) Exclusive-or the CRC-16 of block A with the CRC-16' of block C, and you now have the CRC-16 of block B.
At first glance, this may not seem like much of a gain compared to just computing the CRC of block B. However there are tricks to rapidly computing the CRC of a bunch of zeros. First off, the zeros preceding the four bytes that were changed give a CRC-16' of zero, regardless of how many zeros there are. So you just start computing the CRC-16' with the exclusive-or of the previous and new four bytes.
Now you continue to compute the CRC-16' on the remaining n zeros after the changed bytes. Normally it takes O(n) time to compute a CRC on n bytes. However if you know that they are all zeros (or all some constant value), then it can be computed in O(log n) time. You can see an example of how this is done in zlib's crc32_combine() routine, and apply that to your CRC.
Given your CRC-16/DNP parameters, the zeros() routine below will apply the requested number of zero bytes to the CRC in O(log n) time.
#include <stdint.h>
#include <stddef.h>

// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial, reflected. For speed, this requires that a not be zero.
uint16_t multmodp(uint16_t a, uint16_t b) {
    uint16_t m = (uint16_t)1 << 15;
    uint16_t p = 0;
    for (;;) {
        if (a & m) {
            p ^= b;
            if ((a & (m - 1)) == 0)
                break;
        }
        m >>= 1;
        b = b & 1 ? (b >> 1) ^ 0xa6bc : b >> 1;
    }
    return p;
}

// Table of x^2^n modulo p(x).
uint16_t const x2n_table[] = {
    0x4000, 0x2000, 0x0800, 0x0080, 0xa6bc, 0x55a7, 0xfc4f, 0x1f78,
    0xa31f, 0x78c1, 0xbe76, 0xac8f, 0xb26b, 0x3370, 0xb090
};

// Return x^(n*2^k) modulo p(x).
uint16_t x2nmodp(size_t n, unsigned k) {
    k %= 15;
    uint16_t p = (uint16_t)1 << 15;
    for (;;) {
        if (n & 1)
            p = multmodp(x2n_table[k], p);
        n >>= 1;
        if (n == 0)
            break;
        if (++k == 15)
            k = 0;
    }
    return p;
}

// Apply n zero bytes to crc.
uint16_t zeros(uint16_t crc, size_t n) {
    return multmodp(x2nmodp(n, 3), crc);
}
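As a sketch of how these pieces might fit together for the question's 1K block, assuming the usual CRC-16/DNP parameters (init 0x0000, final xor 0xffff, reflected); crc16dnp_raw and crc16_update4 are illustrative names built on the zeros() routine above:

// Raw, reflected CRC-16/DNP update: zero initial state, no final xor.
// (Bit-at-a-time for clarity; a table-driven version would be faster.)
uint16_t crc16dnp_raw(uint16_t crc, const unsigned char *buf, size_t len) {
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ 0xa6bc : crc >> 1;
    }
    return crc;
}

// Update the whole-block CRC when the 4 bytes at `offset` change.
// old_crc is the finished CRC-16/DNP of the old block, oldbytes/newbytes are
// the old and new contents of the changed field, blocklen is 1024 here.
uint16_t crc16_update4(uint16_t old_crc,
                       const unsigned char *oldbytes,
                       const unsigned char *newbytes,
                       size_t offset, size_t blocklen) {
    unsigned char diff[4];
    for (int i = 0; i < 4; i++)
        diff[i] = oldbytes[i] ^ newbytes[i];    // block C = A xor B, nonzero part only
    uint16_t d = crc16dnp_raw(0, diff, 4);      // CRC-16' of the changed bytes
    d = zeros(d, blocklen - offset - 4);        // run the trailing zero bytes in O(log n)
    return old_crc ^ d;                         // final xor cancels, so just xor it in
}

The final xor of the old and new CRCs cancels out, which is why the raw CRC of the difference block can be xored straight into the old CRC.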

CRC actually makes this an easy thing to do.
When looking into this, I'm sure you've started to read that CRCs are calculated with polynomials over GF(2), and probably skipped over that part to the immediately useful information. Well, it sounds like it's probably time for you to go back over that stuff and reread it a few times so you can really understand it.
But anyway...
Because of the way CRCs are calculated, they have the property that, given two blocks A and B, CRC(A xor B) = CRC(A) xor CRC(B) (this holds for the raw CRC, without the initial value and final xor).
So the first simplification you can make is that you just need to calculate the CRC of the changed bits. You could actually precalculate the CRCs of each bit in the block, so that when you change a bit you can just xor its CRC into the block's CRC.
CRCs also have the property that CRC(A * B) = CRC(A * CRC(B)), where that * is polynomial multiplication over GF(2). If you stuff the block with zeros at the end, then don't do that for CRC(B).
This lets you get away with a smaller precalculated table. "Polynomial multiplication over GF(2)" is binary convolution, so multiplying by 1000 is the same as shifting by 3 bits. With this rule, you can precalculate the CRC of the offset of each field. Then just multiply (convolve) the changed bits by the offset CRC (calculated without zero stuffing), calculate the CRC of those 8 bytes, and xor them into the block CRC.

The CRC is the remainder of the division of the long integer formed by the input stream by the short integer corresponding to the polynomial, say p.
If you change some bits in the middle, this amounts to perturbing the dividend by n·2^k, where n is the perturbation (a number the length of the changed section) and k is the number of bits that follow it.
Hence, you need to compute the perturbation of the remainder, (n·2^k) mod p. You can address this using
(n·2^k) mod p = ((n mod p) · (2^k mod p)) mod p
The first factor is just the CRC16 of n. The other factor can be obtained efficiently in O(log k) operations by exponentiation based on repeated squaring.

The CRC depends on the CRC calculated over all of the data before it.
So the only optimization is to logically split the data into N segments and store the computed CRC state for each segment.
Then, when e.g. modifying segment 6 (of 0..9), take the CRC state after segment 5 and continue calculating the CRC from segment 6 through segment 9.
Anyway, CRC calculations are very fast, so consider whether it is worth it.
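A minimal sketch of that idea, where crc16_update stands in for whatever incremental CRC-16 routine is in use (it is only declared here, and NSEG, SEGLEN and the other names are illustrative):

#include <stdint.h>
#include <stddef.h>

#define NSEG   10     /* illustrative: the block is split into 10 segments */
#define SEGLEN 100    /* illustrative segment length in bytes              */

/* Assumed incremental routine: feed n bytes into the running CRC state. */
uint16_t crc16_update(uint16_t state, const unsigned char *p, size_t n);

static uint16_t seg_state[NSEG];  /* CRC state at the start of each segment;
                                     seg_state[0] is the CRC initial value and
                                     a full pass over the block fills the rest */

/* Recompute the CRC after segment s was modified, reusing the saved state
   for everything before it and refreshing the saved states after it. */
uint16_t crc_after_modify(const unsigned char *block, int s)
{
    uint16_t crc = seg_state[s];
    for (int i = s; i < NSEG; i++) {
        seg_state[i] = crc;
        crc = crc16_update(crc, block + (size_t)i * SEGLEN, SEGLEN);
    }
    return crc;
}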

Related

Need help understanding bitmaps, bitwise operations, and C

Disclaimer: I am asking these questions in relation to an assignment. The assignment itself calls for implementing a bitmap and doing some operations with that, but that is not what I am asking about. I just want to understand the concepts so I can try the implementation for myself.
I need help understanding bitmaps/bit arrays and bitwise operations. I understand the basics of binary and how left/right shift work, but I don't know exactly how that use is beneficial.
Basically, I need to implement a bitmap to store the results of a prime sieve (of Eratosthenes.) This is a small part of a larger assignment focused on different IPC methods, but to get to that part I need to get the sieve completed first. I've never had to use bitwise operations nor have I ever learned about bitmaps, so I'm kind of on my own to learn this.
From what I can tell, bitmaps are arrays of bits of a certain size, right? By that I mean you could have an 8-bit array or a 32-bit array (in my case, I need to find the primes up to the range of a 32-bit unsigned int, so I'd need the 32-bit array). So if this is an array of bits, 32 of them to be specific, then we're basically talking about a string of 32 1s and 0s. How does this translate into a list of primes? I figure that one method would be to evaluate the binary number and save it to a new array as decimal, so all the decimal primes exist in one array, but that seems like it uses too much data.
Do I have the gist of bitmaps? Or is there something I'm missing? I've tried reading about this around the internet but I can't find a source that makes it clear enough for me...
Suppose you have a list of primes: {3, 5, 7}. You can store these numbers as a character array: char c[] = {3, 5, 7} and this requires 3 bytes.
Instead, let's use a single byte such that each set bit indicates that the number is in the set. For example, 01010100. If we can set the byte we want and later test it, we can use this to store the same information in a single byte. To set it:
char b = 0;
// want to set `3` so shift 1 twice to the left
b = b | (1 << 2);
// also set `5`
b = b | (1 << 4);
// and 7
b = b | (1 << 6);
And to test these numbers:
// is 3 in the map:
if (b & (1 << 2)) {
    // it is in...
}
You are going to need a lot more than 32 bits.
You want a sieve for up to 2^32 numbers, so you will need a bit for each one of those. Each bit will represent one number, and will be 0 if the number is prime and 1 if it is composite. (You can save a bit by noting that the first prime must be 2, as 1 is neither prime nor composite. It is easier to just waste that one bit.)
2^32 = 4,294,967,296
Divide by 8
536,870,912 bytes, or 1/2 GB.
So you will want an array of 2^29 bytes, or 2^27 4-byte words, or whatever you decide is best, and also a method for manipulating the individual bits stored in the chars (ints) in the array.
It sounds like eventually, you are going to have several threads or processes operating on this shared memory. You may need to store it all in a file if you can't allocate all that memory to yourself.
Say you want to find the bit for x. Then let a = x / 8 and b = x - 8 * a. Then the bit is at arr[a] & (1 << b). (Avoid the modulus operator % wherever possible.)
//mark composite
a = x / 8;
b = x - 8 * a;
arr[a] |= 1 << b;
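Putting the pieces together, a minimal sieve sketch using exactly that bit math (the limit and names are illustrative; the full 2^32 case is the same, just with the big array):

#include <string.h>

#define LIMIT 1000u                       /* illustrative; the real thing uses 2^32 */
static unsigned char arr[LIMIT / 8 + 1];  /* bit x set => x is composite            */

void sieve(void)
{
    memset(arr, 0, sizeof arr);
    for (unsigned int x = 2; x * x < LIMIT; x++) {
        unsigned int a = x / 8, b = x - 8 * a;
        if (arr[a] & (1u << b))           /* x already marked composite, skip it */
            continue;
        for (unsigned int m = x * x; m < LIMIT; m += x) {
            a = m / 8;
            b = m - 8 * a;
            arr[a] |= (unsigned char)(1u << b);   /* mark m composite */
        }
    }
}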
This sounds like a fun assignment!
A bitmap allows you to construct a large predicate function over the range of numbers you're interested in. If you just have a single 8-bit char, you can store Boolean values for each of the eight values. If you have 2 chars, it doubles your range.
So, say you have a bitmap that already has this information stored, your test function could look something like this:
#include <stdbool.h>
#include <stddef.h>

bool num_in_bitmap(int num, char *bitmap, size_t sz) {
    if ((size_t)(num / 8) >= sz) return false;
    return (bitmap[num / 8] >> (num % 8)) & 1;
}

Hash function for 64 bit to 10 bits

I want a hash function that takes a long number (64 bits) and produces a 10-bit result. What is the best hash function for this purpose? The inputs are basically addresses of variables (addresses are 64 bits, or 8 bytes, on Linux), so my hash function should be optimized for that case.
I would say something like this:
uint32_t hash(uint64_t x)
{
    x >>= 3;
    return (x ^ (x >> 10) ^ (x >> 20)) & 0x3FF;
}
The least significant 3 bits are not very useful, as most variables are 4-byte or 8-byte aligned, so we remove them.
Then we take the next 30 bits and mix them together (XOR) in blocks of 10 bits each.
Naturally, you could also take the (x>>30)^(x>>40)^(x>>50) but I'm not sure if they'll make any difference in practice.
I wrote a toy program to see some real addresses on the stack, data area, and heap. Basically I declared 4 globals, 4 locals and did 2 mallocs. I dropped the last two bits when printing the addresses. Here is an output from one of the runs:
20125e8
20125e6
20125e7
20125e4
3fef2131
3fef2130
3fef212f
3fef212c
25e4802
25e4806
What this tells me:
The LSB in this output (3rd bit of the address) is frequently 'on' and 'off'. So I wouldn't drop it when calculating the hash. Dropping 2 LSBs seems enough.
We also see that there is more entropy in the lower 8-10 bits. We must use that when calculating the hash.
We know that on a 64 bit machine, virtual addresses are never more than 48 bits wide.
What I would do next:
/* Drop two LSBs. */
a >>= 2;
/* Get rid of the MSBs. Keep 46 bits. */
a &= 0x3fffffffffff;
/* Get the 14 MSBs and fold them in to get a 32 bit integer.
The MSBs are mostly 0s anyway, so we don't lose much entropy. */
msbs = (a >> 32) << 18;
a ^= msbs;
Now we pass this through a decent 'half avalanche' hash function, instead of rolling our own. 'Half avalanche' means each bit of the input gets a chance to affect bits at the same position and higher:
uint32_t half_avalanche(uint32_t a)
{
    a = (a + 0x479ab41d) + (a << 8);
    a = (a ^ 0xe4aa10ce) ^ (a >> 5);
    a = (a + 0x9942f0a6) - (a << 14);
    a = (a ^ 0x5aedd67d) ^ (a >> 3);
    a = (a + 0x17bea992) + (a << 7);
    return a;
}
For a 10-bit hash, use the 10 MSBs of the uint32_t returned. The hash function continues to work fine if you pick N MSBs for an N-bit hash, effectively doubling the bucket count with each additional bit.
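Putting the folding step and half_avalanche together, the whole thing might look like this (hash_ptr10 is an illustrative name, not part of the original answer):

#include <stdint.h>

/* 64-bit address -> 10-bit hash, combining the steps above. */
uint32_t hash_ptr10(uint64_t a)
{
    a >>= 2;                         /* drop the two low, mostly-constant bits */
    a &= 0x3fffffffffffULL;          /* keep the 46 interesting bits           */
    a ^= (a >> 32) << 18;            /* fold the 14 MSBs into the low 32 bits  */
    uint32_t h = half_avalanche((uint32_t)a);
    return h >> 22;                  /* the 10 MSBs are the 10-bit hash        */
}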
I was a little bored, so I wrote a toy benchmark for this. Nothing fancy, it allocates a bunch of memory on the heap and tries out the hash I described above. The source can be had from here. An example result:
1024 buckets, 256 values generated, 29 collisions
1024 buckets, 512 values generated, 103 collisions
1024 buckets, 1024 values generated, 370 collisions
Next: I tried out the other two hashes answered here. They both have similar performance. Looks like: Just pick the fastest one ;)
Best for most distributions is reduction modulo a prime; 1021 is the largest 10-bit prime. There's no need to strip low bits.
#include <stdint.h>

static inline int hashaddress(void *v)
{
    return (uintptr_t)v % 1021;
}
If you think performance might be a concern, have a few alternates on hand and race them in your actual program. Microbenchmarks are a waste; a difference of a few cycles is almost certain to be swamped by cache effects, and size matters.

Efficient conditional for increasing size in bits

Suppose I have an increasing sequence of unsigned integers C[i]. As they increase, it's likely that they will occupy increasingly many bits. I'm looking for an efficient conditional, based purely on two consecutive elements of the sequence C[i] and C[i+1] (past and future ones are not observable), that will evaluate to true either exactly or approximately once for every time the number of bits required increases.
An obvious (but slow) choice of conditional is:
if (ceil(log2(C[i+1])) > ceil(log2(C[i]))) ...
and likewise anything that computes the number of leading zero bits using special cpu opcodes (much better but still not great).
I suspect there may be a nice solution involving an expression using just bitwise or and bitwise and on the values C[i+1] and C[i]. Any thoughts?
Suppose your two numbers are x and y. If they have the same high order bit, then x^y is less than both x and y. Otherwise, it is higher than one of the two.
So
v = x^y
if (v > x || v > y) { ...one more bit... }
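Wrapped up as a scan over the sequence, that might look like this (a sketch; the names are illustrative):

#include <stddef.h>

/* Visit every position where C[i+1] needs more bits than C[i], using the xor trick above. */
void scan_width_increases(const unsigned int *C, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        unsigned int v = C[i] ^ C[i + 1];
        if (v > C[i] || v > C[i + 1]) {
            /* ...one more bit... */
        }
    }
}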
I think you just need clz(C[i+1]) < clz(C[i]) where clz is a function which returns the number of leading zeroes ("count leading zeroes"). Some CPU families have an instruction for this (which may be available as an intrinsic). If not then you have to roll your own (it typically only takes a few instructions) - see Hacker's Delight.
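With GCC or Clang, for instance, clz is available as the __builtin_clz intrinsic (undefined for zero, so this sketch assumes the values are nonzero):

/* True when y needs more bits than x; assumes x and y are nonzero. */
static int width_grew(unsigned int x, unsigned int y)
{
    return __builtin_clz(y) < __builtin_clz(x);   /* fewer leading zeros => wider */
}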
Given (I believe this comes from Hacker's Delight):
int hibit(unsigned int n) {
    n |= (n >> 1);
    n |= (n >> 2);
    n |= (n >> 4);
    n |= (n >> 8);
    n |= (n >> 16);
    return n - (n >> 1);
}
Your conditional is simply hibit(C[i]) != hibit(C[i+1]).
BSR - Bit Scan Reverse (386+)
Usage: BSR dest,src
Modifies flags: ZF
Scans source operand for first bit set. Sets ZF if a bit is found
set and loads the destination with an index to first set bit. Clears
ZF if no bits are found set. BSF scans forward across bit pattern
(0-n) while BSR scans in reverse (n-0).
               Clocks                          Size
Operands       808x   286   386     486       Bytes
reg,reg         -      -    10+3n   6-103     3
reg,mem         -      -    10+3n   7-104     3-7
reg32,reg32     -      -    10+3n   6-103     3-7
reg32,mem32     -      -    10+3n   7-104     3-7
You need two of these (on C[i] and C[i+1]) and a compare.
Keith Randall's solution is good, but you can save one xor instruction by using the following code which processes the entire sequence in O(w + n) instructions, where w is the number of bits in a word, and n is the number of elements in the sequence. If the sequence is long, most iterations will only involve one comparison, avoiding one xor instruction.
This is accomplished by tracking the highest power of two that has been reached as follows:
t = 1; // original setting
if (c[i + 1] >= t) {
    do {
        t <<= 1;
    } while (c[i + 1] >= t); // watch for overflow
    ... // conditional code here
}
The number of bits goes up when the value is about to overflow a power of two. A simple test, then, is: is the value equal to a power of two minus 1? This can be accomplished by asking:
if ((C[i] & (C[i]+1))==0) ...
The number of bits goes up when the value is about to overflow a power of two.
A simple test is then:
while (C[i] >= (1 << number_of_bits)) number_of_bits++;
If you want it even faster:
int number_of_bits = 1;
int two_to_number_of_bits = 1 << number_of_bits;
... your code ....
while (C[i] >= two_to_number_of_bits) {
    number_of_bits++;
    two_to_number_of_bits = 1 << number_of_bits;
}

optimized byte array shifter

I'm sure this has been asked before, but I need to implement a shift operator on a byte array of variable length. I've looked around a bit but I have not found any standard way of doing it. I came up with an implementation which works, but I'm not sure how efficient it is. Does anyone know of a standard way to shift an array, or at least have any recommendations on how to boost the performance of my implementation?
#include <string.h>

char* baLeftShift(const char* array, size_t size, signed int displacement, char* result)
{
    memcpy(result, array, size);
    short shiftBuffer = 0;
    char carryFlag = 0;
    char* byte;
    if (displacement > 0)
    {
        for (; displacement--;)
        {
            /* shift left one bit per pass, carrying into the next lower index */
            for (byte = &(result[size - 1]); byte >= result; byte--)
            {
                shiftBuffer = *byte;
                shiftBuffer <<= 1;
                *byte = ((carryFlag) | ((char)(shiftBuffer)));
                carryFlag = ((char*)(&shiftBuffer))[1]; /* high byte: assumes little-endian */
            }
        }
    }
    else
    {
        char* end = result + size;
        displacement = -displacement;
        for (; displacement--;)
        {
            /* shift right one bit per pass, carrying into the next higher index */
            for (byte = result; byte < end; byte++)
            {
                shiftBuffer = *byte;
                shiftBuffer <<= 7;
                *byte = ((carryFlag) | ((char*)(&shiftBuffer))[1]);
                carryFlag = ((char)(shiftBuffer));
            }
        }
    }
    return result;
}
If I can just add to what @dwelch is saying, you could try this.
Just move the bytes to their final locations. Then you are left with a shift count such as 3, for example, if each byte still needs to be left-shifted 3 bits into the next higher byte. (This assumes in your mind's eye the bytes are laid out in ascending order from right to left.)
Then rotate each byte to the left by 3. A lookup table might be faster than individually doing an actual rotate. Then, in each byte, the 3 bits to be shifted are now in the right-hand end of the byte.
Now make a mask M, which is (1<<3)-1, which is simply the low order 3 bits turned on.
Now, in order, from high order byte to low order byte, do this:
c[i] ^= M & (c[i] ^ c[i-1])
That will copy bits to c[i] from c[i-1] under the mask M.
For the last byte, just use a 0 in place of c[i-1].
For right shifts, same idea.
My first suggestion would be to eliminate the for loops around the displacement. You should be able to do the necessary shifts without the for(;displacement--;) loops. For displacements of magnitude greater than 7, things get a little trickier because your inner loop bounds will change and your source offset is no longer 1. i.e. your input buffer offset becomes magnitude / 8 and your shift becomes magnitude % 8.
It does look inefficient and perhaps this is what Nathan was referring to.
Assuming a char is 8 bits where this code is running, there are two things to do. First move the whole bytes: for example, if your input array is 0x00,0x00,0x12,0x34 and you shift left 8 bits then you get 0x00,0x12,0x34,0x00; there is no reason to do that in a loop 8 times, one bit at a time. So start by shifting the whole chars in the array by (displacement>>3) positions and pad the holes created with zeros. Then deal with the remaining (displacement&7) bits by combining each byte with its neighbour, one shifted by (displacement&7) and the other by the complementary amount. A good compiler will precompute (displacement>>3), displacement&7 and 7-(displacement&7), and a good processor will have enough registers to keep all of those values. You might help the compiler by making separate variables for each of those items, but depending on the compiler and how you are using it, that could make it worse too.
The bottom line, though, is to time the code. Perform a thousand 1-bit shifts, then a thousand 2-bit shifts, etc., time the whole thing, then try a different algorithm, time it the same way, and see whether the optimizations make a difference for better or worse. If you know ahead of time this code will only ever be used for single-bit or less-than-8-bit shifts, adjust the timing test accordingly.
Your use of the carry flag implies that you are aware that many processors have instructions specifically for chaining arbitrarily long shifts using the standard register length, basically a rotate-through-carry one bit at a time, which the C language does not support directly. For chaining single-bit shifts you could consider assembler and likely outperform the C code; at the least, single-bit shifts are faster than what C code can do. A hybrid would be to move the bytes, then, if the number of bits to shift (displacement&7) is less than, say, 4, use the assembler, else use a C loop. Again, the timing tests will tell you where the optimizations are.
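A sketch of that move-the-whole-bytes-then-fix-up-the-bits idea for left shifts (the function name is illustrative; index 0 is the most significant byte, matching the 0x00,0x00,0x12,0x34 example above):

#include <stddef.h>
#include <string.h>

void baLeftShiftFast(const unsigned char *src, size_t size,
                     unsigned int displacement, unsigned char *dst)
{
    size_t bytes = displacement >> 3;        /* whole-byte part of the shift */
    unsigned int bits = displacement & 7;    /* residual bit part            */

    if (bytes >= size) {                     /* everything shifted out       */
        memset(dst, 0, size);
        return;
    }
    /* Move the whole bytes toward index 0 and zero-fill the tail. */
    memmove(dst, src + bytes, size - bytes);
    memset(dst + size - bytes, 0, bytes);

    if (bits == 0)
        return;
    /* Apply the residual shift by combining each byte with its lower neighbour. */
    for (size_t i = 0; i + 1 < size; i++)
        dst[i] = (unsigned char)((dst[i] << bits) | (dst[i + 1] >> (8 - bits)));
    dst[size - 1] = (unsigned char)(dst[size - 1] << bits);
}

Timing this against the one-bit-at-a-time version, as suggested above, would show whether it is worth it for the shift distances you actually use.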

Large bit arrays in C

Our OS professor mentioned that for assigning a process id to a new process, the kernel incrementally searches for the first zero bit in an array of size equal to the maximum number of processes (~32,768 by default), where an allocated process id has a 1 stored in it.
As far as I know, there is no bit data type in C. Obviously, there's something I'm missing here.
Is there any such special construct from which we can build up a bit array? How is this done exactly?
More importantly, what are the operations that can be performed on such an array?
Bit arrays are simply byte arrays where you use bitwise operators to read the individual bits.
Suppose you have a 1-byte char variable. This contains 8 bits. You can test if the lowest bit is true by performing a bitwise AND operation with the value 1, e.g.
char a = /*something*/;
if (a & 1) {
/* lowest bit is true */
}
Notice that this is a single ampersand. It is completely different from the logical AND operator &&. This works because a & 1 will "mask out" all bits except the first, and so a & 1 will be nonzero if and only if the lowest bit of a is 1. Similarly, you can check if the second lowest bit is true by ANDing it with 2, and the third by ANDing with 4, etc, for continuing powers of two.
So a 32,768-element bit array would be represented as a 4096-element byte array, where the first byte holds bits 0-7, the second byte holds bits 8-15, etc. To perform the check, the code would select the byte from the array containing the bit that it wanted to check, and then use a bitwise operation to read the bit value from the byte.
As far as what the operations are, like any other data type, you can read values and write values. I explained how to read values above, and I'll explain how to write values below, but if you're really interested in understanding bitwise operations, read the link I provided in the first sentence.
How you write a bit depends on if you want to write a 0 or a 1. To write a 1-bit into a byte a, you perform the opposite of an AND operation: an OR operation, e.g.
char a = /*something*/;
a = a | 1; /* or a |= 1 */
After this, the lowest bit of a will be set to 1 whether it was set before or not. Again, you could write this into the second position by replacing 1 with 2, or into the third with 4, and so on for powers of two.
Finally, to write a zero bit, you AND with the inverse of the position you want to write to, e.g.
char a = /*something*/;
a = a & ~1; /* or a &= ~1 */
Now, the lowest bit of a is set to 0, regardless of its previous value. This works because ~1 will have all bits other than the lowest set to 1, and the lowest set to zero. This "masks out" the lowest bit to zero, and leaves the remaining bits of a alone.
A struct can assign members bit-sizes, but that's the extent of a "bit-type" in 'C'.
struct int_sized_struct {
    int foo:4;
    int bar:4;
    int baz:24;
};
The rest of it is done with bitwise operations. For example, searching that PID bitmap can be done with:
extern uint32_t *process_bitmap;
uint32_t *p = process_bitmap;
uint32_t bit_offset = 0;
uint32_t bit_test;
uint32_t pid;

/* Scan the PID bitmap 32 entries per cycle. */
while ((*p & 0xffffffff) == 0xffffffff) {
    p++;
}
/* Scan the 32-bit int block that has an open slot for the free PID. */
bit_test = 0x80000000;
while ((*p & bit_test) == bit_test) {
    bit_test >>= 1;
    bit_offset++;
}
/* Each uint32_t element holds 32 PIDs. */
pid = (p - process_bitmap) * 32 + bit_offset;
This is roughly 32x faster than doing a simple for loop scanning an array with one byte per PID. (Actually, better than 32x, since more of the bitmap will stay in the CPU cache.)
see http://graphics.stanford.edu/~seander/bithacks.html
There is no bit type in C, but bit manipulation is fairly straightforward. Some processors have bit-specific instructions that the code below would optimize to nicely; even without them it should be pretty fast. It may or may not be faster to use an array of 32-bit words instead of bytes. Inlining instead of functions would also help performance.
If you have the memory to burn, just use a whole byte (or a whole 32-bit number, etc.) to store one bit; that greatly improves performance at the cost of the memory used.
unsigned char data[SIZE];

unsigned char get_bit ( unsigned int offset )
{
    //TODO: limit check offset
    if(data[offset>>3]&(1<<(offset&7))) return(1);
    else return(0);
}

void set_bit ( unsigned int offset, unsigned char bit )
{
    //TODO: limit check offset
    if(bit) data[offset>>3]|=1<<(offset&7);
    else data[offset>>3]&=~(1<<(offset&7));
}
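On top of these helpers, the PID search from the question might be sketched as follows (find_free_pid is an illustrative name, and SIZE is assumed to cover max_pids/8 bytes):

/* Find the first clear bit (a free PID), mark it allocated, and return it.
   Returns -1 if every PID is taken. */
int find_free_pid(unsigned int max_pids)
{
    for (unsigned int pid = 0; pid < max_pids; pid++) {
        if (!get_bit(pid)) {
            set_bit(pid, 1);
            return (int)pid;
        }
    }
    return -1;
}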
