Is this conversion right from the original?
uint8_t fletcher8( uint8_t *data, uint8_t len )
{
uint8_t sum1 = 0xff, sum2 = 0xff;
while (len) {
unsigned tlen = len > 360 ? 360 : len;
len -= tlen;
do {
sum1 += *data++;
sum2 += sum1;
tlen -= sizeof( uint8_t );
} while (tlen);
sum1 = (sum1 & 0xff) + (sum1 >> 4);
sum2 = (sum2 & 0xff) + (sum2 >> 4);
}
/* Second reduction step to reduce sums to 4 bits */
sum1 = (sum1 & 0xff) + (sum1 >> 4);
sum2 = (sum2 & 0xff) + (sum2 >> 4);
return sum2 << 4 | sum1;
}
Original:
uint32_t fletcher32( uint16_t *data, size_t len )
{
uint32_t sum1 = 0xffff, sum2 = 0xffff;
while (len) {
unsigned tlen = len > 360 ? 360 : len;
len -= tlen;
do {
sum1 += *data++;
sum2 += sum1;
tlen -= sizeof( uint16_t );
} while (tlen);
sum1 = (sum1 & 0xffff) + (sum1 >> 16);
sum2 = (sum2 & 0xffff) + (sum2 >> 16);
}
/* Second reduction step to reduce sums to 16 bits */
sum1 = (sum1 & 0xffff) + (sum1 >> 16);
sum2 = (sum2 & 0xffff) + (sum2 >> 16);
return sum2 << 16 | sum1;
}
len will be 8.
data will have an input of data[] (1 - 8)
Actually I don't know what to do with the line: unsigned tlen = len > 360 ? 360 : len;
Maybe -> int8_t tlen = len > 255 ? 255 : len;
How to compute this tlen value
Actually I don't know what to do with the line: unsigned tlen = len > 360 ? 360 : len;
That line appears to come from an old version of this Wikipedia section. It has been changed to 359 by now, with the rationale explained on the talk page. The number only applies to summing 16 bit entities, as it is the largest number n satisfying
n(n+5)/2 × (216−1) < 232
In other words, this is the larges number of times you can add blocks without performing a modulo reduction, and still avoid overflowing the uint32_t. For 4 bit data words and a 8 bit accumulator, the corresponding value will be 4, computed using
n(n+5)/2 × (24−1) < 28
So you have to modify that line if you change your data size. You could also change the code to use a larger data type to keep its sums, thus summing more blocks before the reduction. But in that case, you might require more than one reduction step inside the loop.
For example, if you were to use uint32_t for sum1 and sum2, then you could sum 23927 nibbles before danger of overflow, but after that you would require up to 7 reductions of the form sum1 = (sum1 & 0xf) + (sum1 >> 4) to boil this down to the range 1 through 0x1e, the way your original agorithm does it. It might be more efficient to write this as (sum1 - 1)%0xf + 1. In which case you might even change the range back from 1 through 15 to 0 through 14, initialize the sums to 0 and write the reduction as sum1 %= 0xf. Unless you require compatibility with an implementation that uses the other range.
I think you need 0xF masks throughout not 0xFF. The 32 bit uses a 16 bit mask, half of 32, your 8 bit uses an 8 bit mask which is not half of 8, 4 bits is half of 8.
uint8_t fletcher8( uint8_t *data, uint8_t len )
{
uint8_t sum1 = 0xf, sum2 = 0xf;
while (len) {
unsigned tlen = len > 360 ? 360 : len;
len -= tlen;
do {
sum1 += *data++;
sum2 += sum1;
tlen -= sizeof( uint8_t );
} while (tlen);
sum1 = (sum1 & 0xf) + (sum1 >> 4);
sum2 = (sum2 & 0xf) + (sum2 >> 4);
}
/* Second reduction step to reduce sums to 4 bits */
sum1 = (sum1 & 0xf) + (sum1 >> 4);
sum2 = (sum2 & 0xf) + (sum2 >> 4);
return sum2 << 4 | sum1;
}
Otherwise you are creating a different checksum, not Fletcher. sum1 for example is performing what I think is called a ones complement checksum. Basically it is a 16 bit in prior and 4 bit in your case, checksum where the carry bits are added back in. used in internet protocols makes it very easy to modify a packet without having to compute the checksum on the whole packet, you can add and subtract only the changes against the existing checksum.
The additional reduction step is for a corner case, if the result of sum1 += *data = 0x1F using a four bit scheme, then the adding of the carry bit is 0x01 + 0x0F = 0x10, you need to add that carry bit back in as well so outside the loop 0x01 + 0x00 = 0x01. Otherwise that outside the loop sum is adding in zero. Depending on your architecture you might execute faster with something like if(sum1&0x10) sum1=0x01; than the shifty adding thing that may take more instructions.
Where it becomes something more than just a checksum with the carry bits added in is that last step when the two are combined. And if you for example only use the 32 bit fletcher as a 16 bit checksum, you have wasted your time, the lower 16 bits of the result are just a stock checksum with the carry bit added back in, nothing special. sum2 is the interesting number as it is an accumulation of sum1 checksums (sum1 is an accumulation of the data, sum2 is an accumulation of checksums).
in the original version sum1,sum2 are 32-bit. that is why the bit shifting afterwards. In your case you declare sum1,sum2 to be 8 bit so bit shifting makes no sense.
Related
Recently, just for the heck of it, I've been playing around with an attempt at implementing Keccak, the cryptographic primitive behind SHA-3. I've run into some issues however, specifically with calculating the round constants used in the "Iota" step of the permutation.
Just to get it out of the way: Yes. I know they are round constants. I know I could hard code them as constants. But where's the fun in that?
I've specifically been referencing the FIPS 202 specification document on SHA-3 as well as the Keccak team's own Keccak reference. However, despite my efforts, I can't seem to end up with the correct constants. I've never dealt with bit manipulation before, so if I'm doing something the complete wrong way, feel free to let me know.
rc is a function defined in the FIPS 202 standard of Keccak that is a linear feedback shift register with a feedback polynomial of x^8 + x^6 + x^5 + x^4 + 1.
The values of t (specific to SHA-3) are defined as the set of integers that includes j + 7 * i_r, where i_r = {0, 1, ..., 22, 23} and j = {0, 1, ..., 4, 5}.
The expected outputs (the round constants) are defined as follows: 0x0000000000000001, 0x0000000000008082, 0x800000000000808a,
0x8000000080008000, 0x000000000000808b, 0x0000000080000001,
0x8000000080008081, 0x8000000000008009, 0x000000000000008a,
0x0000000000000088, 0x0000000080008009, 0x000000008000000a,
0x000000008000808b, 0x800000000000008b, 0x8000000000008089,
0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
0x000000000000800a, 0x800000008000000a, 0x8000000080008081,
0x8000000000008080, 0x0000000080000001, and 0x8000000080008008.
rc Function Implementation
uint64_t rc(int t)
{
if(t % 255 == 0)
{
return 0x1;
}
uint64_t R = 0x1;
for(int i = 1; i <= t % 255; i++)
{
R = R << 0x1;
R |= (((R >> 0x0) & 0x1) ^ ((R >> 0x8) & 0x1)) << 0x0;
R |= (((R >> 0x4) & 0x1) ^ ((R >> 0x8) & 0x1)) << 0x4;
R |= (((R >> 0x5) & 0x1) ^ ((R >> 0x8) & 0x1)) << 0x5;
R |= (((R >> 0x6) & 0x1) ^ ((R >> 0x8) & 0x1)) << 0x6;
R &= 0xFF;
}
return R & 0x1;
}
rc Function Call
for(int i_r = 0; i_r < 24; i_r++)
{
uint64_t RC = 0x0;
// TODO: Fix so the limit is not constant
for(int j = 0; j < 6; j++)
{
RC ^= (rc(j + 7 * i_r) << ((int) pow(2, j) - 1));
}
printf("%llu\n", RC);
}
Any help on this matter is much appreciated.
I made some random changes to the code and now it works. Here are the highlights:
The j loop needs to count from 0 to 6. That's because 2^6-1 = 63. So if j is never 6, then the output can never have the MSB set, i.e. an output of 0x8... is not possible.
Using the pow function is generally a bad idea for this type of application. double values have a nasty habit of being slightly lower than desired, e.g. 4 is actually 3.99999999999, which gets truncated to 3 when you convert it to an int. Doubtful that was happening in this case, but why risk it, since it's easy to just multiply variable shift by 2 on each pass through the loop.
The maximum value for t is 7*23+6 = 167, so the % 255 does nothing (at least with the value of i and t in this code). Also, there's no need to treat t == 0 as a special case. The loop won't run when t is 0, so the result is 0x1 by default.
Implementing a linear feedback shift register is quite simple in C. Each term in the polynomial corresponds to a single bit. x^8 is just 2^8 which is 0x100 and x^6 + x^5 + x^4 + 1 is 0x71. So whenever bit 0x100 is set, you XOR the result by 0x71.
Here's the updated code:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
uint64_t rc(int t)
{
uint64_t result = 0x1;
for (int i = 1; i <= t; i++)
{
result <<= 1;
if (result & 0x100)
result ^= 0x71;
}
return result & 0x1;
}
int main(void)
{
for (int i = 0; i < 24; i++)
{
uint64_t result = 0x0;
uint64_t shift = 1;
for (int j = 0; j < 7; j++)
{
uint64_t value = rc(7*i + j);
result |= value << (shift - 1);
shift *= 2;
}
printf("0x%016" PRIx64 "\n", result);
}
}
I'm having a hard time figuring out which implementation of the 32-bit variation of the Fletcher checksum algorithm is correct. Wikipedia provides the following optimized implementation:
uint32_t fletcher32( uint16_t const *data, size_t words ) {
uint32_t sum1 = 0xffff, sum2 = 0xffff;
size_t tlen;
while (words) {
tlen = words >= 359 ? 359 : words;
words -= tlen;
do {
sum2 += sum1 += *data++;
} while (--tlen);
sum1 = (sum1 & 0xffff) + (sum1 >> 16);
sum2 = (sum2 & 0xffff) + (sum2 >> 16);
}
/* Second reduction step to reduce sums to 16 bits */
sum1 = (sum1 & 0xffff) + (sum1 >> 16);
sum2 = (sum2 & 0xffff) + (sum2 >> 16);
return sum2 << 16 | sum1;
}
In addition, I've adapted the non-optimized 16-bit example from the Wikipedia article to compute a 32-bit checksum:
uint32_t naive_fletcher32(uint16_t *data, int words) {
uint32_t sum1 = 0;
uint32_t sum2 = 0;
int index;
for( index = 0; index < words; ++index ) {
sum1 = (sum1 + data[index]) % 0xffff;
sum2 = (sum2 + sum1) % 0xffff;
}
return (sum2 << 16) | sum1;
}
Both these implementations yield the same results, e.g. 0x56502d2a for the string abcdef. To verify that this is indeed correct, I tried to find other implementations of the algorithm:
An online checksum/hash generator
C++ implementation in the srecord project
There's also a JavaScript implementation
All of these seem to agree that the checksum for abcdef is 0x8180255 instead of the value given by the implementation on Wikipedia. I've narrowed this down to how the data buffer the implementation operates on. All the above non-wikipedia implementation operate one byte at a time, whereas the Wikipedia implementation computes the checksum using 16-bit words. If I modify the above "naive" Wikipedia implementation to operate per-byte instead, it reads like this:
uint32_t naive_fletcher32_per_byte(uint8_t *data, int words) {
uint32_t sum1 = 0;
uint32_t sum2 = 0;
int index;
for( index = 0; index < words; ++index ) {
sum1 = (sum1 + data[index]) % 0xffff;
sum2 = (sum2 + sum1) % 0xffff;
}
return (sum2 << 16) | sum1;
}
The only thing that changes is the signature, really. So this modified naive implementation and the above mentioned implementations (except Wikipedia) agree that the checksum of abcdef is indeed 0x8180255.
My problem now is: which one is correct?
According to the standard, the right method is the one that Wikipedia provides — except the name:
Note that the 8-bit Fletcher algorithm gives a 16-bit checksum and the 16-bit algorithm gives a 32-bit checksum.
In the standard quoted in the answer of HideFromKGB, the algorithm is trivial: the 8-bit version uses only 8 bit accumulators ("ints"), producing 8 bit results A and B, and the 16-bit version uses 16 bit "ints", producing 16 bit results A and B.
It should be noted that what Wikipedia calls the "32 bit Fletcher" is actually the "16 bit Fletcher". The number of bits in the name refers in the standard to the number of bits in each D[i] and in each of A and B, but on Wikipedia it refers to the number of bits in the "stacked result", i.e. in A<<16 | B for the 32 bit result.
I did not implement this, but maybe this can explain the difference. I am inclined to say that your interpretation (implementation) is correct.
N.b.: also note that it is necessary to pad data with zeroes to the appropriate number of bytes.
These are test vectors, which are cross checked with two different implementations for 16-bit and for 32-bit check sums:
8-bit implementation (16-bit checksum)
"abcde" -> 51440 (0xC8F0)
"abcdef" -> 8279 (0x2057)
"abcdefgh" -> 1575 (0x0627)
16-bit implementation (32-bit checksum)
"abcde" -> 4031760169 (0xF04FC729)
"abcdef" -> 1448095018 (0x56502D2A)
"abcdefgh" -> 3957429649 (0xEBE19591)
TCP Alternate Checksum Options describes the Fletcher checksum algorithm for use with TCP: RFC 1146 dated March 1990.
The 8-bit Fletcher algorithm which gives a 16-bit checksum and the 16-bit algorithm which gives a 32-bit checksum are discussed.
The 8-bit Fletcher Checksum Algorithm is calculated over a sequence
of data octets (call them D[1] through D[N]) by maintaining 2
unsigned 1's-complement 8-bit accumulators A and B whose contents are
initially zero, and performing the following loop where i ranges from
1 to N:
A := A + D[i]
B := B + A
The 16-bit Fletcher Checksum algorithm proceeds in precisely the same
manner as the 8-bit checksum algorithm, except that A, B and the
D[i] are 16-bit quantities. It is necessary (as it is with the
standard TCP checksum algorithm) to pad a datagram containing an odd
number of octets with a zero octet.
That agrees with Wikipedia algorithms. The simple testing program confirms quoted results:
#include <stdio.h>
#include <string.h>
#include <stdint.h> // for uint32_t
uint32_t fletcher32_1(const uint16_t *data, size_t len)
{
uint32_t c0, c1;
unsigned int i;
for (c0 = c1 = 0; len >= 360; len -= 360) {
for (i = 0; i < 360; ++i) {
c0 = c0 + *data++;
c1 = c1 + c0;
}
c0 = c0 % 65535;
c1 = c1 % 65535;
}
for (i = 0; i < len; ++i) {
c0 = c0 + *data++;
c1 = c1 + c0;
}
c0 = c0 % 65535;
c1 = c1 % 65535;
return (c1 << 16 | c0);
}
uint32_t fletcher32_2(const uint16_t *data, size_t l)
{
uint32_t sum1 = 0xffff, sum2 = 0xffff;
while (l) {
unsigned tlen = l > 359 ? 359 : l;
l -= tlen;
do {
sum2 += sum1 += *data++;
} while (--tlen);
sum1 = (sum1 & 0xffff) + (sum1 >> 16);
sum2 = (sum2 & 0xffff) + (sum2 >> 16);
}
/* Second reduction step to reduce sums to 16 bits */
sum1 = (sum1 & 0xffff) + (sum1 >> 16);
sum2 = (sum2 & 0xffff) + (sum2 >> 16);
return (sum2 << 16) | sum1;
}
int main()
{
char *str1 = "abcde";
char *str2 = "abcdef";
size_t len1 = (strlen(str1)+1) / 2; // '\0' will be used for padding
size_t len2 = (strlen(str2)+1) / 2; //
uint32_t f1 = fletcher32_1(str1, len1);
uint32_t f2 = fletcher32_2(str1, len1);
printf("%u %X \n", f1,f1);
printf("%u %X \n\n", f2,f2);
f1 = fletcher32_1(str2, len2);
f2 = fletcher32_2(str2, len2);
printf("%u %X \n",f1,f1);
printf("%u %X \n",f2,f2);
return 0;
}
Output:
4031760169 F04FC729
4031760169 F04FC729
1448095018 56502D2A
1448095018 56502D2A
My answer is focusses on the correctness of s = (s & 0xffff) + (s >> 16);.
This is obviously supposed to replace the modulo operation. Now the big issue with the modulo operation is the division that needs to be performed. The trick is to not do the division and to estimate floor(s / 65535). So instead of computing s - floor(s/65535)*65535, which would be the same as modulo, we compute s - floor(s/65536)*65535. This will obviously not be equivalent to doing modulo. But it's good enough to quickly reduce the size of s.
Now we have
s - floor(s / 65536) * 65535
= s - (s >> 16) * 65535
= s - (s >> 16) * (65536 - 1)
= s - (s >> 16) * 65536 + (s >> 16)
= (s & 0xffff) + (s >> 16)
Since the (s & 0xffff) + (s >> 16) is not equivalent to doing the modulo, it does not suffice to use this formula. If s == 65535 then s % 65535 would yield zero. However, the former formula yields 65535. So the optimized Wikipedia implementation posted here is obviously false! The last 3 lines need to be changed to
/* Second reduction step to reduce sums to 16 bits */
sum1 = (sum1 & 0xffff) + (sum1 >> 16);
sum2 = (sum2 & 0xffff) + (sum2 >> 16);
if (sum1 >= 65535) { sum1 -= 65535; }
if (sum2 >= 65535) { sum2 -= 65535; }
return (sum2 << 16) | sum1;
It is noteworthy, that I can't find the optimized implementation on the Wikipedia page anymore (February 2020).
Addendum:
Imagine s would be equal to the maximum unsigned 32 bit value, that is 0xFFFF_FFFF. Then the formula (s & 0xffff) + (s >> 16); yields 0x1FFFE. That is exactly two times 65535. So the correction step if (s >= 65535) { s -= 65535; } will not work since it subtracts 65535 at most once. So we want to keep sum1 and sum2 in the loops strictly smaller than 0xFFFF_FFFF. Then the formula yields at most 2*65535-1 and the correction step will work. The following simple python program determines, that sum2 would become too big after 360 iterations. So processing at most 359 16 bit words at a time is exactly right.
s1 = 0x1FFFD
s2 = 0x1FFFD
for i in range(1,1000):
s1 += 0xFFFF
s2 += s1
if s2 >= 0xFFFFFFFF:
print(i)
break
I am trying to use a 16-bit Fletcher checksum here. Basically, my program simulates traffic over the physical layer by "sending" and "receiving" packets between two virtual entities. I am printing out the packets at both sides and they do match, but I am getting a different checksum calculated at the receiving end.
Packet structure:
#define MESSAGE_LENGTH 20
struct pkt {
int seqnum;
int acknum;
int checksum;
char payload[MESSAGE_LENGTH];
};
This is the code I'm using to compute the checksum of each packet:
/*
* Computes the fletcher checksum of the input packet
*/
uint16_t calcChecksum(struct pkt *packet) {
/* the data for the checksum needs to be continuous, so here I am making
a temporary block of memory to hold everything except the packet checksum */
size_t sizeint = sizeof(int);
size_t size = sizeof(struct pkt) - sizeint;
uint8_t *temp = malloc(size);
memcpy(temp, packet, sizeint * 2); // copy the seqnum and acknum
memcpy(temp + (2*sizeint), &packet->payload, MESSAGE_LENGTH); // copy data
// calculate checksum
uint16_t checksum = fletcher16((uint8_t const *) &temp, size);
free(temp);
return checksum;
}
/*
* This is a checksum algorithm that I shamelessly copied off a wikipedia page.
*/
uint16_t fletcher16( uint8_t const *data, size_t bytes ) {
uint16_t sum1 = 0xff, sum2 = 0xff;
size_t tlen;
while (bytes) {
tlen = bytes >= 20 ? 20 : bytes;
bytes -= tlen;
do {
sum2 += sum1 += *data++;
} while (--tlen);
sum1 = (sum1 & 0xff) + (sum1 >> 8);
sum2 = (sum2 & 0xff) + (sum2 >> 8);
}
/* Second reduction step to reduce sums to 8 bits */
sum1 = (sum1 & 0xff) + (sum1 >> 8);
sum2 = (sum2 & 0xff) + (sum2 >> 8);
return sum2 << 8 | sum1;
}
I don't know much about checksums and I copied that algorithm off a page I found, so if anyone can understand why the checksums for two otherwise identical packets are different I would greatly appreciate it. Thank you!
The problem occurs because you are not passing the address of temp data to the checksum function but rather the address to where the variable temp is stored on stack.
You should change
uint16_t checksum = fletcher16((uint8_t const *) &temp, size);
to
uint16_t checksum = fletcher16((uint8_t const *) temp, size);
^ no & operator
So I have a design which incorporates CRC32C checksums to ensure data hasn't been damaged. I decided to use CRC32C because I can have both a software version and a hardware-accelerated version if the computer the software runs on supports SSE 4.2
I'm going by Intel's developer manual (vol 2A), which seems to provide the algorithm behind the crc32 instruction. However, I'm having little luck. Intel's developer guide says the following:
BIT_REFLECT32: DEST[31-0] = SRC[0-31]
MOD2: Remainder from Polynomial division modulus 2
TEMP1[31-0] <- BIT_REFLECT(SRC[31-0])
TEMP2[31-0] <- BIT_REFLECT(DEST[31-0])
TEMP3[63-0] <- TEMP1[31-0] << 32
TEMP4[63-0] <- TEMP2[31-0] << 32
TEMP5[63-0] <- TEMP3[63-0] XOR TEMP4[63-0]
TEMP6[31-0] <- TEMP5[63-0] MOD2 0x11EDC6F41
DEST[31-0] <- BIT_REFLECT(TEMP6[31-0])
Now, as far as I can tell, I've done everything up to the line starting TEMP6 correctly, but I think I may be either misunderstanding the polynomial division, or implementing it incorrectly. If my understanding is correct, 1 / 1 mod 2 = 1, 0 / 1 mod 2 = 0, and both divides-by-zero are undefined.
What I don't understand is how binary division with 64-bit and 33-bit operands will work. If SRC is 0x00000000, and DEST is 0xFFFFFFFF, TEMP5[63-32] will be all set bits, while TEMP5[31-0] will be all unset bits.
If I was to use the bits from TEMP5 as the numerator, there would be 30 divisions by zero as the polynomial 11EDC6F41 is only 33 bits long (and so converting it to a 64-bit unsigned integer leaves the top 30 bits unset), and so the denominator is unset for 30 bits.
However, if I was to use the polynomial as the numerator, the bottom 32 bits of TEMP5 are unset, resulting in divides by zero there, and the top 30 bits of the result would be zero, as the top 30 bits of the numerator would be zero, as 0 / 1 mod 2 = 0.
Am I misunderstanding how this works? Just plain missing something? Or has Intel left out some crucial step in their documentation?
The reason I went to Intel's developer guide for what appeared to be the algorithm they used is because they used a 33-bit polynomial, and I wanted to make outputs identical, which didn't happen when I used the 32-bit polynomial 1EDC6F41 (show below).
uint32_t poly = 0x1EDC6F41, sres, crcTable[256], data = 0x00000000;
for (n = 0; n < 256; n++) {
sres = n;
for (k = 0; k < 8; k++)
sres = (sres & 1) == 1 ? poly ^ (sres >> 1) : (sres >> 1);
crcTable[n] = sres;
}
sres = 0xFFFFFFFF;
for (n = 0; n < 4; n++) {
sres = crcTable[(sres ^ data) & 0xFF] ^ (sres >> 8);
}
The above code produces 4138093821 as an output, and the crc32 opcode produces 2346497208 using the input 0x00000000.
Sorry if this is badly written or incomprehensible in places, it is rather late for me.
Here are both software and hardware versions of CRC-32C. The software version is optimized to process eight bytes at a time. The hardware version is optimized to run three crc32q instructions effectively in parallel on a single core, since the throughput of that instruction is one cycle, but the latency is three cycles.
crc32c.c:
/* crc32c.c -- compute CRC-32C using the Intel crc32 instruction
* Copyright (C) 2013, 2021 Mark Adler
* Version 1.2 5 Jun 2021 Mark Adler
*/
/*
This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler#alumni.caltech.edu
*/
/* Version History:
1.0 10 Feb 2013 First version
1.1 31 May 2021 Correct register constraints on assembly instructions
Include pre-computed tables to avoid use of pthreads
Return zero for the CRC when buf is NULL, as initial value
1.2 5 Jun 2021 Make tables constant
*/
// Use hardware CRC instruction on Intel SSE 4.2 processors. This computes a
// CRC-32C, *not* the CRC-32 used by Ethernet and zip, gzip, etc. A software
// version is provided as a fall-back, as well as for speed comparisons.
#include <stddef.h>
#include <stdint.h>
// Tables for CRC word-wise calculation, definitions of LONG and SHORT, and CRC
// shifts by LONG and SHORT bytes.
#include "crc32c.h"
// Table-driven software version as a fall-back. This is about 15 times slower
// than using the hardware instructions. This assumes little-endian integers,
// as is the case on Intel processors that the assembler code here is for.
static uint32_t crc32c_sw(uint32_t crc, void const *buf, size_t len) {
if (buf == NULL)
return 0;
unsigned char const *data = buf;
while (len && ((uintptr_t)data & 7) != 0) {
crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
len--;
}
size_t n = len >> 3;
for (size_t i = 0; i < n; i++) {
uint64_t word = crc ^ ((uint64_t const *)data)[i];
crc = crc32c_table[7][word & 0xff] ^
crc32c_table[6][(word >> 8) & 0xff] ^
crc32c_table[5][(word >> 16) & 0xff] ^
crc32c_table[4][(word >> 24) & 0xff] ^
crc32c_table[3][(word >> 32) & 0xff] ^
crc32c_table[2][(word >> 40) & 0xff] ^
crc32c_table[1][(word >> 48) & 0xff] ^
crc32c_table[0][word >> 56];
}
data += n << 3;
len &= 7;
while (len) {
len--;
crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
}
return crc;
}
// Apply the zeros operator table to crc.
static uint32_t crc32c_shift(uint32_t const zeros[][256], uint32_t crc) {
return zeros[0][crc & 0xff] ^ zeros[1][(crc >> 8) & 0xff] ^
zeros[2][(crc >> 16) & 0xff] ^ zeros[3][crc >> 24];
}
// Compute CRC-32C using the Intel hardware instruction. Three crc32q
// instructions are run in parallel on a single core. This gives a
// factor-of-three speedup over a single crc32q instruction, since the
// throughput of that instruction is one cycle, but the latency is three
// cycles.
static uint32_t crc32c_hw(uint32_t crc, void const *buf, size_t len) {
if (buf == NULL)
return 0;
// Pre-process the crc.
uint64_t crc0 = crc ^ 0xffffffff;
// Compute the crc for up to seven leading bytes, bringing the data pointer
// to an eight-byte boundary.
unsigned char const *next = buf;
while (len && ((uintptr_t)next & 7) != 0) {
__asm__("crc32b\t" "(%1), %0"
: "+r"(crc0)
: "r"(next), "m"(*next));
next++;
len--;
}
// Compute the crc on sets of LONG*3 bytes, making use of three ALUs in
// parallel on a single core.
while (len >= LONG*3) {
uint64_t crc1 = 0;
uint64_t crc2 = 0;
unsigned char const *end = next + LONG;
do {
__asm__("crc32q\t" "(%3), %0\n\t"
"crc32q\t" LONGx1 "(%3), %1\n\t"
"crc32q\t" LONGx2 "(%3), %2"
: "+r"(crc0), "+r"(crc1), "+r"(crc2)
: "r"(next), "m"(*next));
next += 8;
} while (next < end);
crc0 = crc32c_shift(crc32c_long, crc0) ^ crc1;
crc0 = crc32c_shift(crc32c_long, crc0) ^ crc2;
next += LONG*2;
len -= LONG*3;
}
// Do the same thing, but now on SHORT*3 blocks for the remaining data less
// than a LONG*3 block.
while (len >= SHORT*3) {
uint64_t crc1 = 0;
uint64_t crc2 = 0;
unsigned char const *end = next + SHORT;
do {
__asm__("crc32q\t" "(%3), %0\n\t"
"crc32q\t" SHORTx1 "(%3), %1\n\t"
"crc32q\t" SHORTx2 "(%3), %2"
: "+r"(crc0), "+r"(crc1), "+r"(crc2)
: "r"(next), "m"(*next));
next += 8;
} while (next < end);
crc0 = crc32c_shift(crc32c_short, crc0) ^ crc1;
crc0 = crc32c_shift(crc32c_short, crc0) ^ crc2;
next += SHORT*2;
len -= SHORT*3;
}
// Compute the crc on the remaining eight-byte units less than a SHORT*3
// block.
unsigned char const *end = next + (len - (len & 7));
while (next < end) {
__asm__("crc32q\t" "(%1), %0"
: "+r"(crc0)
: "r"(next), "m"(*next));
next += 8;
}
len &= 7;
// Compute the crc for up to seven trailing bytes.
while (len) {
__asm__("crc32b\t" "(%1), %0"
: "+r"(crc0)
: "r"(next), "m"(*next));
next++;
len--;
}
// Return the crc, post-processed.
return ~(uint32_t)crc0;
}
// Check for SSE 4.2. SSE 4.2 was first supported in Nehalem processors
// introduced in November, 2008. This does not check for the existence of the
// cpuid instruction itself, which was introduced on the 486SL in 1992, so this
// will fail on earlier x86 processors. cpuid works on all Pentium and later
// processors.
#define SSE42(have) \
do { \
uint32_t eax, ecx; \
eax = 1; \
__asm__("cpuid" \
: "=c"(ecx) \
: "a"(eax) \
: "%ebx", "%edx"); \
(have) = (ecx >> 20) & 1; \
} while (0)
// Compute a CRC-32C. If the crc32 instruction is available, use the hardware
// version. Otherwise, use the software version.
uint32_t crc32c(uint32_t crc, void const *buf, size_t len) {
int sse42;
SSE42(sse42);
return sse42 ? crc32c_hw(crc, buf, len) : crc32c_sw(crc, buf, len);
}
Code to generate crc32c.h (stackoverflow won't let me post the tables themselves, due to a 30,000 character limit in an answer):
// Generate crc32c.h for crc32c.c.
#include <stdio.h>
#include <stdint.h>
#define LONG 8192
#define SHORT 256
// Print a 2-D table of four-byte constants in hex.
static void print_table(uint32_t *tab, size_t rows, size_t cols, char *name) {
printf("static uint32_t const %s[][%zu] = {\n", name, cols);
size_t end = rows * cols;
size_t k = 0;
for (;;) {
fputs(" {", stdout);
size_t n = 0, j = 0;
for (;;) {
printf("0x%08x", tab[k + n]);
if (++n == cols)
break;
putchar(',');
if (++j == 6) {
fputs("\n ", stdout);
j = 0;
}
putchar(' ');
}
k += cols;
if (k == end)
break;
puts("},");
}
puts("}\n};");
}
/* CRC-32C (iSCSI) polynomial in reversed bit order. */
#define POLY 0x82f63b78
static void crc32c_word_table(void) {
uint32_t table[8][256];
// Generate byte-wise table.
for (unsigned n = 0; n < 256; n++) {
uint32_t crc = ~n;
for (unsigned k = 0; k < 8; k++)
crc = crc & 1 ? (crc >> 1) ^ POLY : crc >> 1;
table[0][n] = ~crc;
}
// Use byte-wise table to generate word-wise table.
for (unsigned n = 0; n < 256; n++) {
uint32_t crc = ~table[0][n];
for (unsigned k = 1; k < 8; k++) {
crc = table[0][crc & 0xff] ^ (crc >> 8);
table[k][n] = ~crc;
}
}
// Print table.
print_table(table[0], 8, 256, "crc32c_table");
}
// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial. For speed, this requires that a not be zero.
static uint32_t multmodp(uint32_t a, uint32_t b) {
uint32_t prod = 0;
for (;;) {
if (a & 0x80000000) {
prod ^= b;
if ((a & 0x7fffffff) == 0)
break;
}
a <<= 1;
b = b & 1 ? (b >> 1) ^ POLY : b >> 1;
}
return prod;
}
/* Take a length and build four lookup tables for applying the zeros operator
for that length, byte-by-byte, on the operand. */
static void crc32c_zero_table(size_t len, char *name) {
// Generate operator for len zeros.
uint32_t op = 0x80000000; // 1 (x^0)
uint32_t sq = op >> 4; // x^4
while (len) {
sq = multmodp(sq, sq); // x^2^(k+3), k == len bit position
if (len & 1)
op = multmodp(sq, op);
len >>= 1;
}
// Generate table to update each byte of a CRC using op.
uint32_t table[4][256];
for (unsigned n = 0; n < 256; n++) {
table[0][n] = multmodp(op, n);
table[1][n] = multmodp(op, n << 8);
table[2][n] = multmodp(op, n << 16);
table[3][n] = multmodp(op, n << 24);
}
// Print the table to stdout.
print_table(table[0], 4, 256, name);
}
int main(void) {
puts(
"// crc32c.h\n"
"// Tables and constants for crc32c.c software and hardware calculations.\n"
"\n"
"// Table for a 64-bits-at-a-time software CRC-32C calculation. This table\n"
"// has built into it the pre and post bit inversion of the CRC."
);
crc32c_word_table();
puts(
"\n// Block sizes for three-way parallel crc computation. LONG and SHORT\n"
"// must both be powers of two. The associated string constants must be set\n"
"// accordingly, for use in constructing the assembler instructions."
);
printf("#define LONG %d\n", LONG);
printf("#define LONGx1 \"%d\"\n", LONG);
printf("#define LONGx2 \"%d\"\n", 2 * LONG);
printf("#define SHORT %d\n", SHORT);
printf("#define SHORTx1 \"%d\"\n", SHORT);
printf("#define SHORTx2 \"%d\"\n", 2 * SHORT);
puts(
"\n// Table to shift a CRC-32C by LONG bytes."
);
crc32c_zero_table(8192, "crc32c_long");
puts(
"\n// Table to shift a CRC-32C by SHORT bytes."
);
crc32c_zero_table(256, "crc32c_short");
return 0;
}
Mark Adler's answer is correct and complete, but those seeking quick and easy way to integrate CRC-32C in their application might find it a little difficult to adapt the code, especially if they are using Windows and .NET.
I've created a library that implements CRC-32C using either hardware or software method depending on available hardware. It's available as a NuGet package for C++ and .NET. It's opensource of course.
Besides packaging Mark Adler's code above, I've found a simple way to improve throughput of the software fallback by 50%. On my computer, the library now achieves 2 GB/s in software and over 20 GB/s in hardware. For those curious, here's the optimized software implementation:
static uint32_t append_table(uint32_t crci, buffer input, size_t length)
{
buffer next = input;
#ifdef _M_X64
uint64_t crc;
#else
uint32_t crc;
#endif
crc = crci ^ 0xffffffff;
#ifdef _M_X64
while (length && ((uintptr_t)next & 7) != 0)
{
crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
--length;
}
while (length >= 16)
{
crc ^= *(uint64_t *)next;
uint64_t high = *(uint64_t *)(next + 8);
crc = table[15][crc & 0xff]
^ table[14][(crc >> 8) & 0xff]
^ table[13][(crc >> 16) & 0xff]
^ table[12][(crc >> 24) & 0xff]
^ table[11][(crc >> 32) & 0xff]
^ table[10][(crc >> 40) & 0xff]
^ table[9][(crc >> 48) & 0xff]
^ table[8][crc >> 56]
^ table[7][high & 0xff]
^ table[6][(high >> 8) & 0xff]
^ table[5][(high >> 16) & 0xff]
^ table[4][(high >> 24) & 0xff]
^ table[3][(high >> 32) & 0xff]
^ table[2][(high >> 40) & 0xff]
^ table[1][(high >> 48) & 0xff]
^ table[0][high >> 56];
next += 16;
length -= 16;
}
#else
while (length && ((uintptr_t)next & 3) != 0)
{
crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
--length;
}
while (length >= 12)
{
crc ^= *(uint32_t *)next;
uint32_t high = *(uint32_t *)(next + 4);
uint32_t high2 = *(uint32_t *)(next + 8);
crc = table[11][crc & 0xff]
^ table[10][(crc >> 8) & 0xff]
^ table[9][(crc >> 16) & 0xff]
^ table[8][crc >> 24]
^ table[7][high & 0xff]
^ table[6][(high >> 8) & 0xff]
^ table[5][(high >> 16) & 0xff]
^ table[4][high >> 24]
^ table[3][high2 & 0xff]
^ table[2][(high2 >> 8) & 0xff]
^ table[1][(high2 >> 16) & 0xff]
^ table[0][high2 >> 24];
next += 12;
length -= 12;
}
#endif
while (length)
{
crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
--length;
}
return (uint32_t)crc ^ 0xffffffff;
}
As you can see, it merely crunches larger block at a time. It needs larger lookup table, but it's still cache-friendly. The table is generated the same way, only with more rows.
One extra thing I explored is the use of PCLMULQDQ instruction to get hardware acceleration on AMD processors. I've managed to port Intel's CRC patch for zlib (also available on GitHub) to CRC-32C polynomial except the magic constant 0x9db42487. If anyone is able to decipher that one, please let me know. After supersaw7's excellent explanation on reddit, I have ported also the elusive 0x9db42487 constant and I just need to find some time to polish and test it.
First of all the Intel's CRC32 instruction serves to calculate CRC-32C (that is uses a different polynomial that regular CRC32. Look at the Wikipedia CRC32 entry)
To use Intel's hardware acceleration for CRC32C using gcc you can:
Inline assembly language in C code via the asm statement
Use intrinsics _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32 or _mm_crc32_u64. See Intel Intrinsics Guide for a description of those for the Intel's compiler icc but gcc also implements them.
This is how you would do it with __mm_crc32_u8 that takes one byte at a time, using __mm_crc32_u64 would give further performance improvement since it takes 8 bytes at a time.
uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
uint32_t hash = 0;
size_t i = 0;
for (i=0;i<len;i++) {
hash = _mm_crc32_u8(hash, bytes[i]);
}
return hash;
}
To compile this you need to pass -msse4.2 in CFLAGS. Like gcc -g -msse4.2 test.c otherwise it will complain about undefined reference to _mm_crc32_u8.
If you want to revert to a plain C implementation if the instruction is not available in the platform where the executable is running you can use GCC's ifunc attribute. Like
uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
/* use _mm_crc32_u* here */
}
uint32_t default_crc32(const uint8_t *bytes, size_t len)
{
/* pure C implementation */
}
/* this will be called at load time to decide which function really use */
/* sse42_crc32 if SSE 4.2 is supported */
/* default_crc32 if not */
static void * resolve_crc32(void) {
__builtin_cpu_init();
if (__builtin_cpu_supports("sse4.2")) return sse42_crc32;
return default_crc32;
}
/* crc32() implementation will be resolved at load time to either */
/* sse42_crc32() or default_crc32() */
uint32_t crc32(const uint8_t *bytes, size_t len) __attribute__ ((ifunc ("resolve_crc32")));
I compare various algorithms here: https://github.com/htot/crc32c
The fastest algorithm has been taken from Intels crc_iscsi_v_pcl.asm assembly code (which is available in a modified form in the linux kernel) and using a C wrapper (crcintelasm.cc) included into this project.
To be able to run this code on 32 bit platforms first it has been ported to C (crc32intelc) where possible, a small amount of inline assembly is required. Certain parts of the code depend on the bitness, crc32q is not available on 32 bits and neither is movq, these are put in macro's (crc32intel.h) with alternative code for 32 bit platforms.
static uint32_t get_num(void);
uint32_t get_count(unsigned int mask)
{
uint8_t i = 0;
uint32_t count = 0;
while (i < get_num())
{
if (mask & (1 << i++))
count++;
}
return count;
}
In this code, what would be more safe (1L << i++) or (1UL << i++) ?
An unsigned operand is a bit safer because only then is the behavior of all the shifts defined when get_num() returns the number of bits in that operand's type. If unsigned long is wider than unsigned int then UL is slightly safer than just U, but only for accommodating get_num() results that aren't valid anyway.
Safer yet, however, is this:
uint32_t get_count(uint32_t mask)
{
uint8_t num = get_num();
if (num == 0) return 0;
/* mask off the bits we don't want to count */
mask &= ~((uint32_t) 0) >> ((num < 32) ? (32 - num) : 0);
/* count the remaining 1 bits in mask, leaving the result in mask */
mask = (mask & 0x55555555) + ((mask & 0xaaaaaaaa) >> 1);
mask = (mask & 0x33333333) + ((mask & 0xcccccccc) >> 2);
mask = (mask & 0x0f0f0f0f) + ((mask & 0xf0f0f0f0) >> 4);
mask = (mask & 0x00ff00ff) + ((mask & 0xff00ff00) >> 8);
mask = (mask & 0x0000ffff) + ((mask & 0xffff0000) >> 16);
return mask;
}
If you just want to count the 1-bits in an uint and use gcc, you should have a look at the builtin functions (here: int __builtin_popcount (unsigned int x)). These can be expected to be highly optimized and even use special instructions of the CPU where available. (one could very wenn test for gcc).
However, not sure what get_num() would yield - it just seems not to depend on mask, so its output can be used to limit the result of popcount.
The following uses a loop and might be faster than a parallel-add tree on some architectures (one should profile both versions if timing is essential).
unsigned popcount(uint32_t value, unsigned width)
{
unsigned cnt = 0; // actual size intentionally by arch
if ( width < 32 )
value &= (1UL << width) - 1; // limit to actual width
for ( ; value ; value >>= 1 ) {
cnt += value & 1U; // avoids a branch
}
return cnt;
}
Note the width is passed to the function.
On architectures with < 32 bits (PIC, AVR, MSP430, etc.) specialized versions will gain much better results than a single version.