Checksum algorithm based on J.G. Fletcher

I have been tasked with implementing a checksum algorithm that is based on the J.G. Fletcher checksum and ISO 8473-1:1998 and is described like so:
They then list four test vectors that can be used to check whether the algorithm is correct, but my version fails on the last two.
0000 gives a checksum of FFFF
0000'00 gives a checksum of FFFF
ABCDEF'01 gives a checksum of 9CF8
1456'F89A'0001 gives a checksum of 24DC
I've been working on this for hours now and can't find what I did wrong, a new set of eyes could help tremendously.
Here is my function:
uint16 Crc_CalculateISOChecksum(uint8 *pt_start_address, uint32 length)
{
uint8 C0, C1;
uint8 data;
uint32 i;
uint8 ck1, ck2;
/* Initial value */
C0 = 0;
C1 = 0;
/* memories - 32bits wide*/
for (i=0; i<length; i++) /* nb_bytes has been verified */
{
data = pt_start_address[i];
C0 = (C0 + data)%255;
C1 = (C1 + C0)%255;
}
/* Calculate the intermediate ISO checksum value */
ck1 = (unsigned char)(255-((C0+C1)%255));
ck2 = (unsigned char)(C1%255);
if (ck1 == 0)
{
ck1 = MASK_BYTE_LSB;
}
if (ck2 == 0)
{
ck2 = MASK_BYTE_LSB;
}
return ((((uint16)ck1)<<8) | ((uint16)ck2));
}

Your intermediate sums should be uint16_t (or uint16 in your lingo).
uint16_t C0, C1; // Not uint8_t.
Depending on what char and int on your system are (e.g. do not assume that int has more bits than char) your intermediate sums may be overflowing. Your implementation relies on uint8_t being promoted.
To illustrate:
  0xFF                   0xFF
 +0xFF                  +0xFF
 =====                  =====
 0x1FE % 255 = 0         0xFE % 255 = 254
   ^Retain              ^Drop
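A minimal sketch of the routine with 16-bit intermediate sums. It assumes MASK_BYTE_LSB is 0xFF (the question does not show its definition) and uses the standard <stdint.h> names instead of the project's uint8/uint16 typedefs; the name iso_checksum16 is made up for illustration. When the bytes are fed in the order they are written in the question, it gives the four listed values.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: MASK_BYTE_LSB in the question is 0xFF. */
#define MASK_BYTE_LSB 0xFFu

uint16_t iso_checksum16(const uint8_t *data, size_t length)
{
    /* 16-bit sums: the largest intermediate value is 254 + 255, which fits. */
    uint16_t c0 = 0, c1 = 0;

    for (size_t i = 0; i < length; i++) {
        c0 = (uint16_t)((c0 + data[i]) % 255u);
        c1 = (uint16_t)((c1 + c0) % 255u);
    }

    uint8_t ck1 = (uint8_t)(255u - ((c0 + c1) % 255u));
    uint8_t ck2 = (uint8_t)c1;

    /* A zero check byte is replaced, as in the original function. */
    if (ck1 == 0) ck1 = MASK_BYTE_LSB;
    if (ck2 == 0) ck2 = MASK_BYTE_LSB;

    return (uint16_t)(((uint16_t)ck1 << 8) | ck2);
}
```

If the data really sits in 32-bit-wide memory, the order in which the bytes reach the function may differ from the order in which the vectors are written, so it is worth checking that separately.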

Just stumbled upon this. If someone is still interested: You iterate in the wrong direction.
Do NOT iterate from 0 to length-1 but from length-1 to 0, then it will work.
for (i = length-1; i >= 0; i--) // and change i to 'signed'

Related

How to sum values in a sequence of bytes in C

I am trying to figure out how to add sequential bytes in a data block, starting at a given offset (sequenceOffset) and running for sequenceLength bytes, by typecasting them to signed 16-bit integers (int16_t). The numbers can be negative or positive. I feel like I am not incrementing the offset properly but cannot figure out how it is meant to be done.
For example:
Summing sequence of 8 bytes at offset 53:
57 AB 2A 2C 4E A4 7A 64
-21673 11306 -23474 25722
You said the sum is: 22848
Should be: -8119
int16_t sumSequence16(const uint8_t* const blockAddress, uint32_t blockLength, uint32_t sequenceOffset,
uint8_t sequenceLength) {
int count = 0;
for (int i = 0; i < blockLength; i++) {
if (*(blockAddress + i) == sequenceOffset) {
count += (int16_t*)(&sequenceOffset);
sequenceOffset++;
}
}
return count;
}

There are some serious problems with your code.
Start by noticing that your code doesn't use sequenceLength at all - that's strange.
Then there is no need to loop over the whole block - you only need to look at the bytes inside the relevant sequence.
This line is very strange:
if (*(blockAddress + i) == sequenceOffset)
^^^^^^^^^^^^^^^^^^^
Reads the data at index i
It compares a data value inside the data block with sequenceOffset - that doesn't seem correct.
And this part:
(int16_t*)(&sequenceOffset);
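takes the address of the sequenceOffset parameter itself rather than of any data in the block, and adding that pointer to count is not valid; dereferencing it as an int16_t would also violate the strict aliasing rule.
Finally, you never mention which endianness the data is stored with. From your example it seems to be little endian so I'll use little endian in the code below: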
int16_t sumSequence16(const uint8_t* const blockAddress,
                      const uint32_t sequenceOffset,
                      const uint8_t sequenceLength)
{
    const uint8_t* p = blockAddress + sequenceOffset; // Point to first byte in sequence (keeps const)
    int sum = 0;
    // unsigned index: a uint8_t index would wrap around for lengths of 254 or more
    for (unsigned i = 0; i + 1u < sequenceLength; i += 2)
    {
        uint16_t u = p[i+1];          // Read MSB
        u = (uint16_t)(u << 8);       // Shift MSB 8 bits to the left
        u = u | p[i];                 // Add LSB
        sum = sum + (int16_t)u;       // Reinterpret as signed and update the running sum
    }
    return (int16_t)sum;
}
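As a quick check of the little-endian reading, here is a self-contained sketch that can be run against the example from the question. The names read_le16 and sum_le16 are hypothetical, and the sketch converts to the signed range arithmetically instead of relying on a cast.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine two bytes stored little-endian into a signed 16-bit value. */
static int16_t read_le16(const uint8_t *p)
{
    uint16_t u = (uint16_t)(p[0] | (p[1] << 8));
    /* Map 0x8000..0xFFFF to the negative range without an implementation-defined cast. */
    return (u & 0x8000u) ? (int16_t)((int32_t)u - 65536) : (int16_t)u;
}

/* Sums the 16-bit values in block[offset .. offset + length - 1]; an odd trailing byte is ignored. */
int16_t sum_le16(const uint8_t *block, uint32_t offset, uint8_t length)
{
    int32_t sum = 0;
    for (uint32_t i = 0; i + 1u < length; i += 2)
        sum += read_le16(block + offset + i);
    /* The narrowing below is only meaningful when the total fits in 16 bits. */
    return (int16_t)sum;
}
```

With the bytes 57 AB 2A 2C 4E A4 7A 64 placed at offset 53, the four values are -21673, 11306, -23474 and 25722, and the sum is -8119.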

Converting a checksum algorithm from Python to C

There is a checksum algorithm for the networks in some Honda vehicles that computes an integer between 0-15 for the provided data. I'm trying to convert it to plain C, but I think I'm missing something, as I get different results in my implementation.
While the Python algorithm computes 6 for "ABC", mine computes -10, which is weird. Am I messing something up with the bit shifting?
The Python algorithm:
def can_cksum(mm):
    s = 0
    for c in mm:
        c = ord(c)
        s += (c>>4)
        s += c & 0xF
    s = 8-s
    s %= 0x10
    return s
My version, in C:
int can_cksum(unsigned char * data, unsigned int len) {
int result = 0;
for (int i = 0; i < len; i++) {
result += data[i] >> 4;
result += data[i] & 0xF;
}
result = 8 - result;
result %= 0x10;
return result;
}

No, the problem is the modulus. Python follows the sign of the right operand, and C follows the sign of the left. Mask with 0x0f instead to avoid this.
result = 8 - result;
result &= 0x0f;
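Put together, a sketch of the whole function with the mask applied might look like this. The name can_cksum_fixed is made up, and the sum is kept unsigned so the subtraction wraps instead of going negative.

```c
#include <stddef.h>

/* Hypothetical corrected version of can_cksum from the question. */
unsigned can_cksum_fixed(const unsigned char *data, size_t len)
{
    unsigned sum = 0;

    for (size_t i = 0; i < len; i++) {
        sum += data[i] >> 4;    /* high nibble */
        sum += data[i] & 0xFu;  /* low nibble */
    }

    /* Unsigned arithmetic wraps modulo a power of two that is a multiple
       of 16, so masking yields the same non-negative value as Python's %. */
    return (8u - sum) & 0xFu;
}
```

For "ABC" the nibble sum is 18, so the result is (8 - 18) masked to four bits, which is 6, the same value the Python version reports.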

CRC32 calculation with CRC hash at the beginning of the message in C

I need to calculate the CRC of a message and put it at the beginning of that message, so that the final CRC of the message with the 'prepended' patch bytes equals 0. I was able to do this very easily with the help of a few articles, but not for my specific parameters. The thing is that I have to use a given CRC32 algorithm which calculates the CRC of a memory block, but I don't have the 'reverse' algorithm that calculates those 4 patch bytes/'kind of CRC'. The parameters of the given CRC32 algorithm are:
Polynomial: 0x04C11DB7
Endianness: big-endian
Initial value: 0xFFFFFFFF
Reflected: false
XOR out with: 0L
Test stream: 0x0123, 0x4567, 0x89AB, 0xCDEF results in CRC = 0x612793C3
The code to calculate the CRC (half-byte, table-driven, I hope data type definitions are self-explanatory):
uint32 crc32tab(uint16* data, uint32 len, uint32 crc)
{
uint8 nibble;
int i;
while(len--)
{
for(i = 3; i >= 0; i--)
{
nibble = (*data >> i*4) & 0x0F;
crc = ((crc << 4) | nibble) ^ tab[crc >> 28];
}
data++;
}
return crc;
}
The table needed is (I thought the short [16] table should contain every 16th element of the large [256] table, but it actually contains the first 16 elements; that's how it was provided to me):
static const uint32 tab[16]=
{
0x00000000, 0x04C11DB7, 0x09823B6E, 0x0D4326D9,
0x130476DC, 0x17C56B6B, 0x1A864DB2, 0x1E475005,
0x2608EDB8, 0x22C9F00F, 0x2F8AD6D6, 0x2B4BCB61,
0x350C9B64, 0x31CD86D3, 0x3C8EA00A, 0x384FBDBD
};
I modified the code so it's not so long, but the functionality stays the same. The problem is that this forward CRC calculation looks more like backward/reverse CRC calc.
I've spent almost a week trying to find out the correct polynomial/algorithm/table combination, but with no luck. If it helps, I came up with bit-wise algorithm that corresponds to table-driven code above, although that was not so hard after all:
uint32 crc32(uint16* data, uint32 len, uint32 crc)
{
uint32 i;
while(len--)
{
for(i = 0; i < 16; i++)
{
// #define POLY 0x04C11DB7
crc = (crc << 1) ^ (((crc ^ *data) & 0x80000000) ? POLY : 0);
}
crc ^= *data++;
}
return crc;
}
Here are expected results - first 2 16-bit words make the needed unknown CRC and the rest is the known data itself (by feeding these examples to provided algorithm, the result is 0).
{0x3288, 0xD244, 0xCDEF, 0x89AB, 0x4567, 0x0123}
{0xC704, 0xDD7B, 0x0000} - append as many zeros as you like, the result is the same
{0xCEBD, 0x1ADD, 0xFFFF}
{0x81AB, 0xB932, 0xFFFF, 0xFFFF}
{0x0857, 0x0465, 0x0000, 0x0123}
{0x1583, 0xD959, 0x0123}
^ ^
| |
unknown bytes that I need to calculate
I think testing this on 0xFFFF or 0x0000 words is convenient because the direction of calculation and endianness are not important (I hope :D). So be careful when using other test bytes, because the direction of calculation is quite devious :D. Also, you can see that feeding only zeros to the algorithm (both forward and backward) gives the so-called residue (0xC704DD7B), which may be helpful.
So...I wrote at least 10 different functions (bit-wise, tables, combinations of polynomials etc.) trying to solve this, but with no luck. Here is the function I put my hopes in. It's the 'reversed' algorithm of the table-driven one above, with a different table of course. The problem is that the only correct CRC I get from it is for an all-0s message, and that's not so unexpected. I have also written a reversed implementation of the bit-wise algorithm (reversed shifts, etc.), but that one returns only the first byte correctly.
Here is the table-driven one, pointer to data should point to the last element of the message and crc input should be the requested crc (0s for the whole message or you can maybe take another approach - that the last 4 bytes of message are the CRC you are looking for: Calculating CRC initial value instead of appending the CRC to payload) :
uint32 crc32tabrev(uint16* data, uint32 len, uint32 crc)
{
uint8 nibble;
int i;
while(len--)
{
for(i = 0; i < 4; i++)
{
nibble = (*data >> i*4) & 0x0F;
crc = (crc >> 4) ^ revtab[((crc ^ nibble) & 0x0F)];
}
data--;
}
return reverse(crc); //reverse() flips all bits around center (MSB <-> LSB ...)
}
The table, which I hope is 'the chosen one':
static const uint32 revtab[16]=
{
0x00000000, 0x1DB71064, 0x3B6E20C8, 0x26D930AC,
0x76DC4190, 0x6B6B51F4, 0x4DB26158, 0x5005713C,
0xEDB88320, 0xF00F9344, 0xD6D6A3E8, 0xCB61B38C,
0x9B64C2B0, 0x86D3D2D4, 0xA00AE278, 0xBDBDF21C
};
As you can see, this algorithm has some quirks which make me run in circles; I think I'm maybe on the right track, but I'm missing something. I hope an extra pair of eyes will see what I cannot. I'm sorry for the long post (no potato :D), but I think all of that explanation was necessary. Thank you in advance for any insight or advice.

I will answer for your CRC specification, that of a CRC-32/MPEG-2. I will have to ignore your attempts at calculating that CRC, since they are incorrect.
Anyway, to answer your question, I happen to have written a program that solves this problem. It is called spoof.c. It very rapidly computes what bits to change in a message to get a desired CRC. It does this in order log(n) time, where n is the length of the message. Here is an example:
Let's take the nine-byte message 123456789 (those digits represented in ASCII). We will prepend it with four zero bytes, which we will change to get the desired CRC at the end. The message in hex is then: 00 00 00 00 31 32 33 34 35 36 37 38 39. Now we compute the CRC-32/MPEG-2 for that message. We get 373c5870.
Now we run spoof with this input, which is the CRC length in bits, the fact that it is not reflected, the polynomial, the CRC we just computed, the length of the message in bytes, and all 32 bit locations in the first four bytes (which is what we are allowing spoof to change):
32 0 04C11DB7
373c5870 13
0 0 1 2 3 4 5 6 7
1 0 1 2 3 4 5 6 7
2 0 1 2 3 4 5 6 7
3 0 1 2 3 4 5 6 7
It gives this output with what bits in those first four bytes to set:
invert these bits in the sequence:
offset bit
0 1
0 2
0 4
0 5
0 6
1 0
1 2
1 5
1 7
2 0
2 2
2 5
2 6
2 7
3 0
3 1
3 2
3 4
3 5
3 7
We then set the first four bytes to: 76 a5 e5 b7. We then test by computing the CRC-32/MPEG-2 of the message 76 a5 e5 b7 31 32 33 34 35 36 37 38 39 and we get 00000000, the desired result.
You can adapt spoof.c to your application.
Here is an example that correctly computes the CRC-32/MPEG-2 on a stream of bytes using a bit-wise algorithm:
uint32_t crc32m(uint32_t crc, const unsigned char *buf, size_t len)
{
int k;
while (len--) {
crc ^= (uint32_t)(*buf++) << 24;
for (k = 0; k < 8; k++)
crc = crc & 0x80000000 ? (crc << 1) ^ 0x04c11db7 : crc << 1;
}
return crc;
}
and with a nybble-wise algorithm using the table in the question (which is correct):
uint32_t crc_table[] = {
0x00000000, 0x04C11DB7, 0x09823B6E, 0x0D4326D9,
0x130476DC, 0x17C56B6B, 0x1A864DB2, 0x1E475005,
0x2608EDB8, 0x22C9F00F, 0x2F8AD6D6, 0x2B4BCB61,
0x350C9B64, 0x31CD86D3, 0x3C8EA00A, 0x384FBDBD
};
uint32_t crc32m_nyb(uint32_t crc, const unsigned char *buf, size_t len)
{
while (len--) {
crc ^= (uint32_t)(*buf++) << 24;
crc = (crc << 4) ^ crc_table[crc >> 28];
crc = (crc << 4) ^ crc_table[crc >> 28];
}
return crc;
}
In both cases, the initial CRC must be 0xffffffff.
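A small self-check can confirm that both routines implement the same CRC. The sketch below assumes the catalogued check value for CRC-32/MPEG-2, 0x0376E6E7 for the ASCII string 123456789; that value does not come from this thread. The two routines are repeated so the snippet compiles on its own, and crc32m_selfcheck is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const uint32_t crc_table[16] = {
    0x00000000, 0x04C11DB7, 0x09823B6E, 0x0D4326D9,
    0x130476DC, 0x17C56B6B, 0x1A864DB2, 0x1E475005,
    0x2608EDB8, 0x22C9F00F, 0x2F8AD6D6, 0x2B4BCB61,
    0x350C9B64, 0x31CD86D3, 0x3C8EA00A, 0x384FBDBD
};

/* Bit-wise CRC-32/MPEG-2, one bit per inner iteration. */
static uint32_t crc32m(uint32_t crc, const unsigned char *buf, size_t len)
{
    while (len--) {
        crc ^= (uint32_t)(*buf++) << 24;
        for (int k = 0; k < 8; k++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
    return crc;
}

/* Nybble-wise variant, two table lookups per byte. */
static uint32_t crc32m_nyb(uint32_t crc, const unsigned char *buf, size_t len)
{
    while (len--) {
        crc ^= (uint32_t)(*buf++) << 24;
        crc = (crc << 4) ^ crc_table[crc >> 28];
        crc = (crc << 4) ^ crc_table[crc >> 28];
    }
    return crc;
}

/* Returns 1 when both routines reproduce the assumed check value. */
int crc32m_selfcheck(void)
{
    static const unsigned char msg[] = "123456789";
    size_t len = strlen((const char *)msg);
    uint32_t bitwise = crc32m(0xFFFFFFFFu, msg, len);
    uint32_t nybble  = crc32m_nyb(0xFFFFFFFFu, msg, len);
    return bitwise == 0x0376E6E7u && nybble == 0x0376E6E7u;
}
```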

Alternate approach. It assumes xorout = 0; if not, calculate the normal CRC and then do crc ^= xorout to remove it. The method multiplies the normal CRC by (1/2) % (CRC polynomial) raised to the power (message size in bits), again modulo the CRC polynomial, which is equivalent to cycling the CRC backwards by that many bits. If the message size is fixed, the multiplier is fixed and the time complexity is O(1). Otherwise, it's O(log(n)).
This example code uses Visual Studio and an intrinsic for carryless multiply (PCLMULQDQ), which uses XMM (128 bit) registers. Visual Studio uses __m128i type to represent integer XMM values.
#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;
#define POLY (0x104c11db7ull)
#define POLYM ( 0x04c11db7u)
static uint32_t crctbl[256];
static __m128i poly; /* poly */
static __m128i invpoly; /* 2^64 / POLY */
void GenMPoly(void) /* generate __m128i poly info */
{
uint64_t N = 0x100000000ull;
uint64_t Q = 0;
for(size_t i = 0; i < 33; i++){
Q <<= 1;
if(N&0x100000000ull){
Q |= 1;
N ^= POLY;
}
N <<= 1;
}
poly.m128i_u64[0] = POLY;
invpoly.m128i_u64[0] = Q;
}
void GenTbl(void) /* generate crc table */
{
uint32_t crc;
uint32_t c;
uint32_t i;
for(c = 0; c < 0x100; c++){
crc = c<<24;
for(i = 0; i < 8; i++)
/* assumes twos complement */
crc = (crc<<1)^((0-(crc>>31))&POLYM);
crctbl[c] = crc;
}
}
uint32_t GenCrc(uint8_t * bfr, size_t size) /* generate crc */
{
uint32_t crc = 0xffffffffu;
while(size--)
crc = (crc<<8)^crctbl[(crc>>24)^*bfr++];
return(crc);
}
/* carryless multiply modulo poly */
uint32_t MpyModPoly(uint32_t a, uint32_t b) /* (a*b)%poly */
{
__m128i ma, mb, mp, mt;
ma.m128i_u64[0] = a;
mb.m128i_u64[0] = b;
mp = _mm_clmulepi64_si128(ma, mb, 0x00); /* p[0] = a*b */
mt = _mm_clmulepi64_si128(mp, invpoly, 0x00); /* t[1] = (p[0]*((2^64)/POLY))>>64 */
mt = _mm_clmulepi64_si128(mt, poly, 0x01); /* t[0] = t[1]*POLY */
return mp.m128i_u32[0] ^ mt.m128i_u32[0]; /* ret = p[0] ^ t[0] */
}
/* exponentiate by repeated squaring modulo poly */
uint32_t PowModPoly(uint32_t a, uint32_t b) /* pow(a,b)%poly */
{
uint32_t prd = 0x1u; /* current product */
uint32_t sqr = a; /* current square */
while(b){
if(b&1)
prd = MpyModPoly(prd, sqr);
sqr = MpyModPoly(sqr, sqr);
b >>= 1;
}
return prd;
}
int main()
{
uint32_t inv; /* 1/2 % poly, constant */
uint32_t fix; /* fix value, constant if msg size fixed */
uint32_t crc; /* crc at end of msg */
uint32_t pre; /* prefix for msg */
uint8_t msg[13] = {0x00,0x00,0x00,0x00,0x31,0x32,0x33,0x34,0x35,0x36,0x37,0x38,0x39};
GenMPoly(); /* generate __m128i polys */
GenTbl(); /* generate crc table */
inv = PowModPoly(2, 0xfffffffeu); /* inv = 2^(2^32-2) % Poly = 1/2 % poly */
fix = PowModPoly(inv, 8*sizeof(msg)); /* fix value */
crc = GenCrc(msg, sizeof(msg)); /* calculate normal crc */
pre = MpyModPoly(fix, crc); /* convert to prefix */
printf("crc = %08x pre = %08x ", crc, pre);
msg[0] = (uint8_t)(pre>>24); /* store prefix in msg */
msg[1] = (uint8_t)(pre>>16);
msg[2] = (uint8_t)(pre>> 8);
msg[3] = (uint8_t)(pre>> 0);
crc = GenCrc(msg, sizeof(msg)); /* check result */
if(crc == 0)
printf("passed\n");
else
printf("failed\n");
return 0;
}
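For readers without PCLMULQDQ, the same "cycle the CRC backwards" idea can be sketched in portable C by undoing one shift at a time. This is a simplified O(n) variant, not the O(log(n)) method above. The names crc_forward, crc_step_back and crc_make_zero_prefix are made up, and the sketch assumes a non-reflected CRC with xorout = 0 and a message whose first four bytes are reserved for the prefix.

```c
#include <stddef.h>
#include <stdint.h>

#define POLYM 0x04C11DB7u

/* Bit-wise CRC-32/MPEG-2 (non-reflected, no final XOR). */
static uint32_t crc_forward(uint32_t crc, const uint8_t *buf, size_t len)
{
    while (len--) {
        crc ^= (uint32_t)(*buf++) << 24;
        for (int k = 0; k < 8; k++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ POLYM : crc << 1;
    }
    return crc;
}

/* Undo one forward shift, i.e. multiply by 1/x modulo the polynomial. */
static uint32_t crc_step_back(uint32_t crc)
{
    /* The polynomial is odd, so a set low bit means the forward step XORed it in. */
    return (crc & 1u) ? ((crc ^ POLYM) >> 1) | 0x80000000u : crc >> 1;
}

/* msg[0..3] are placeholder bytes and are overwritten; len counts them too.
   Returns the prefix that makes the CRC of the whole message zero. */
uint32_t crc_make_zero_prefix(uint8_t *msg, size_t len)
{
    msg[0] = msg[1] = msg[2] = msg[3] = 0;

    /* CRC with a zero prefix, then cycled back by the message length in bits. */
    uint32_t pre = crc_forward(0xFFFFFFFFu, msg, len);
    for (size_t bit = 0; bit < 8 * len; bit++)
        pre = crc_step_back(pre);

    msg[0] = (uint8_t)(pre >> 24);
    msg[1] = (uint8_t)(pre >> 16);
    msg[2] = (uint8_t)(pre >> 8);
    msg[3] = (uint8_t)(pre >> 0);
    return pre;
}
```

Because the CRC is linear in the prefix, XORing a 32-bit prefix into the first four bytes changes the final CRC by that prefix multiplied by x^(message size in bits), so cycling the zero-prefix CRC back by the same number of bits gives the prefix that cancels it.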

Well, a few hours after I asked, someone whose name I don't remember posted an answer which turned out to be correct. Somehow that answer got completely deleted; I don't know why or by whom, but I'd like to thank this person, and if you see this, please post your answer again and I'll delete this one. For other users, here's the answer that worked for me. Thank you again, mysterious one (unfortunately, I can't replicate the notes and suggestions well enough, just the code itself):
Edit: The original answer came from user samgak, so this stays here until he'll post his answer.
The reverse CRC algorithm:
uint32 revcrc32(uint16* data, uint32 len, uint32 crc)
{
uint32 i;
data += len - 1;
while(len--)
{
crc ^= *data--;
for(i = 0; i < 16; i++)
{
uint32 crc1 = ((crc ^ POLY) >> 1) | 0x80000000;
uint32 crc2 = crc >> 1;
if(((crc1 << 1) ^ (((crc1 ^ *data) & 0x80000000) ? POLY : 0)) == crc)
crc = crc1;
else if(((crc2 << 1) ^ (((crc2 ^ *data) & 0x80000000) ? POLY : 0)) == crc)
crc = crc2;
}
}
return crc;
}
Find patch bytes:
#define CRC_OF_ZERO 0xb7647d
void bruteforcecrc32(uint32 targetcrc)
{
// compute prefixes:
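uint32 n; // 32-bit counter: a uint16 counter can never exceed 0xffff, so the loop could not end without a match
for(n = 0; n <= 0xffff; n++)
{
uint16 j = (uint16)n;
uint32 crc = revcrc32(&j, 1, targetcrc);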
if((crc >> 16) == (CRC_OF_ZERO >> 16))
{
printf("prefixes: %04lX %04lX\n", (crc ^ CRC_OF_ZERO) & 0xffff, (uint32)j);
return;
}
}
}
Usage:
uint16 test[] = {0x0123, 0x4567, 0x89AB, 0xCDEF}; // prefix should be 0x0CD8236A
bruteforcecrc32(revcrc32(test, 4, 0L));

Efficient algorithm for finding a byte in a bit array

Given a bytearray uint8_t data[N] what is an efficient method to find a byte uint8_t search within it even if search is not octet aligned? i.e. the first three bits of search could be in data[i] and the next 5 bits in data[i+1].
My current method involves creating a bool get_bit(const uint8_t* src, struct internal_state* state) function (struct internal_state contains a mask that is bitshifted right, &ed with src and returned, maintaining size_t src_index < size_t src_len) , leftshifting the returned bits into a uint8_t my_register and comparing it with search every time, and using state->src_index and state->src_mask to get the position of the matched byte.
Is there a better method for this?

If you're searching an eight bit pattern within a large array you can implement a sliding window over 16 bit values to check if the searched pattern is part of the two bytes forming that 16 bit value.
To be portable you have to take care of endianness issues which is done by my implementation by building the 16 bit value to search for the pattern manually. The high byte is always the currently iterated byte and the low byte is the following byte. If you do a simple conversion like value = *((unsigned short *)pData) you will run into trouble on x86 processors...
Once value, cmp and mask are set up, cmp and mask are shifted one bit at a time. If the pattern is not found starting within the high byte, the loop continues with the next byte as the start byte.
Here is my implementation including some debug printouts (the function returns the bit position or -1 if pattern was not found):
int findPattern(unsigned char *data, int size, unsigned char pattern)
{
int result = -1;
unsigned char *pData;
unsigned char *pEnd;
unsigned short value;
unsigned short mask;
unsigned short cmp;
int tmpResult;
if ((data != NULL) && (size > 0))
{
pData = data;
pEnd = data + size;
while ((pData < pEnd) && (result == -1))
{
if ((pData + 1) < pEnd) /* still at least two bytes to check? */
{
printf("\n\npData = {%02x, %02x, ...};\n", pData[0], pData[1]); /* moved inside the check so pData[1] is never read past the end */
tmpResult = (int)(pData - data) * 8; /* calculate bit offset according to current byte */
/* avoid endianness troubles by "manually" building value! */
value = *pData << 8;
pData++;
value += *pData;
/* create a sliding window to check if the search pattern is within value */
cmp = pattern << 8;
mask = 0xFF00;
while (mask > 0x00FF) /* the low byte is checked within next iteration! */
{
printf("cmp = %04x, mask = %04x, tmpResult = %d\n", cmp, mask, tmpResult);
if ((value & mask) == cmp)
{
result = tmpResult;
break;
}
tmpResult++; /* count bits! */
mask >>= 1;
cmp >>= 1;
}
}
else
{
/* only one chance left if there is only one byte left to check! */
if (*pData == pattern)
{
result = (int)(pData - data) * 8;
}
pData++;
}
}
}
return (result);
}

I don't think you can do much better than this in C:
/*
* Searches for the 8-bit pattern represented by 'needle' in the bit array
* represented by 'haystack'.
*
* Returns the index *in bits* of the first appearance of 'needle', or
* -1 if 'needle' is not found.
*/
int search(uint8_t needle, int num_bytes, uint8_t haystack[num_bytes]) {
if (num_bytes > 0) {
uint16_t window = haystack[0];
if (window == needle) return 0;
for (int i = 1; i < num_bytes; i += 1) {
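window = (uint16_t)((window << 8) | haystack[i]); /* parenthesized: '+' binds tighter than '<<' */
/* Candidate for unrolling: */
for (int j = 7; j >= 0; j -= 1) {
if (((window >> j) & 0xff) == needle) { /* parenthesized: '==' binds tighter than '&' */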
return 8 * i - j;
}
}
}
}
return -1;
}
The main idea is to handle the 87.5% of cases that cross the boundary between consecutive bytes by pairing bytes in a wider data type (uint16_t in this case). You could adjust it to use an even wider data type, but I'm not sure that would gain anything.
What you cannot safely or easily do is anything involving casting part or all of your array to a wider integer type via a pointer (i.e. (uint16_t *)&haystack[i]). You cannot be assured of proper alignment for such a cast, nor of the byte order with which the result might be interpreted.

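I don't know if it would be better, but I would use a sliding window.
// Bits are numbered LSB-first within each byte in this version.
// Returns index of first bit of first sequence occurrence or -1 if sequence is not found
int find_lsb_first(const uint8_t *data, size_t length, uint8_t search)
{
    if (length == 0) return -1;
    unsigned counter = 0, feeder = 8; // feeder = number of valid bits in window
    unsigned window = data[0];
    while (search ^ (window & 0xff)) {
        window >>= 1;
        feeder--;
        if (feeder < 8) {
            counter++;
            if (counter >= length) {
                feeder = 0;
                break;
            }
            window |= (unsigned)data[counter] << feeder;
            feeder += 8;
        }
    }
    return (feeder > 0) ? (int)((counter + 1) * 8 - feeder) : -1;
}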
Also with some alterations you can use this method to search for arbitrary length (1 to 64-array_element_size_in_bits) bits sequence.

If AVX2 is acceptable (with earlier versions it didn't work out so well, but you can still do something there), you can search in a lot of places at the same time. I couldn't test this on my machine (only compile) so the following is more to give to you an idea of how it could be approached than copy&paste code, so I'll try to explain it rather than just code-dump.
The main idea is to read an uint64_t, shift it right by all values that make sense (0 through 7), then for each of those 8 new uint64_t's, test whether the byte is in there. Small complication: for the uint64_t's shifted by more than 0, the highest position should not be counted since it has zeroes shifted into it that might not be in the actual data. Once this is done, the next uint64_t should be read at an offset of 7 from the current one, otherwise there is a boundary that is not checked across. That's fine though, unaligned loads aren't so bad anymore, especially if they're not wide.
So now for some (untested, and incomplete, see below) code,
__m256i needle = _mm256_set1_epi8(find);
size_t i;
for (i = 0; i < n - 6; i += 7) {
// unaligned load here, but that's OK
uint64_t d = *(uint64_t*)(data + i);
__m256i x = _mm256_set1_epi64x(d);
__m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
__m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
low = _mm256_cmpeq_epi8(low, needle);
high = _mm256_cmpeq_epi8(high, needle);
// in the qword right-shifted by 0, all positions are valid
// otherwise, the top position corresponds to an incomplete byte
uint32_t lowmask = 0x7f7f7fffu & _mm256_movemask_epi8(low);
uint32_t highmask = 0x7f7f7f7fu & _mm256_movemask_epi8(high);
uint64_t mask = lowmask | ((uint64_t)highmask << 32);
if (mask) {
int bitindex = __builtin_ctzll(mask); // zero-based index of the lowest set bit (ffsl would be one-based)
// the bit-index and byte-index are swapped
return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
}
}
The funny "bit-index and byte-index are swapped" thing is because searching within a qword is done byte by byte and the results of those comparisons end up in 8 adjacent bits, while the search for "shifted by 1" ends up in the next 8 bits and so on. So in the resulting masks, the index of the byte that contains the 1 is a bit-offset, but the bit-index within that byte is actually the byte-offset, for example 0x8000 would correspond to finding the byte at the 7th byte of the qword that was right-shifted by 1, so the actual index is 8*7+1.
There is also the issue of the "tail", the part of the data left over when all blocks of 7 bytes have been processed. It can be done much the same way, but now more positions contain bogus bytes. Now n - i bytes are left over, so the mask has to have n - i bits set in the lowest byte, and one fewer for all other bytes (for the same reason as earlier, the other positions have zeroes shifted in). Also, if there is exactly 1 byte "left", it isn't really left because it would have been tested already, but that doesn't really matter. I'll assume the data is sufficiently padded that accessing out of bounds doesn't matter. Here it is, untested:
if (i < n - 1) {
// make n-i-1 bits, then copy them to every byte
uint32_t validh = ((1u << (n - i - 1)) - 1) * 0x01010101;
// the lowest position has an extra valid bit, set lowest zero
uint32_t validl = (validh + 1) | validh;
uint64_t d = *(uint64_t*)(data + i);
__m256i x = _mm256_set1_epi64x(d);
__m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
__m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
low = _mm256_cmpeq_epi8(low, needle);
high = _mm256_cmpeq_epi8(high, needle);
uint32_t lowmask = validl & _mm256_movemask_epi8(low);
uint32_t highmask = validh & _mm256_movemask_epi8(high);
uint64_t mask = lowmask | ((uint64_t)highmask << 32);
if (mask) {
int bitindex = __builtin_ctzll(mask); // zero-based index of the lowest set bit (ffsl would be one-based)
return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
}
}

If you are searching a large amount of memory and can afford an expensive setup, another approach is to use a 64K lookup table. For each possible 16-bit value, the table stores a byte containing the bit shift offset at which the matching octet occurs (+1, so 0 can indicate no match). You can initialize it like this:
static uint8_t g_pLookupTable[65536]; // static storage: a file-scope malloc() initializer is not valid C
void initLUT(uint8_t octet)
{
memset(g_pLookupTable, 0, 65536); // zero out
for(int i = 0; i < 65536; i++)
{
for(int j = 7; j >= 0; j--)
{
if(((i >> j) & 255) == octet)
{
g_pLookupTable[i] = j + 1;
break;
}
}
}
}
Note that the case where the value is shifted 8 bits is not included (the reason will be obvious in a minute).
Then you can scan through your array of bytes like this:
int findByteMatch(uint8_t* pArray, uint8_t octet, int length)
{
if(length > 0) // pArray[0] is read below, so at least one byte is required
{
uint16_t index = (uint16_t)pArray[0];
if(index == octet)
return 0;
for(int bit, i = 1; i < length; i++)
{
index = (index << 8) | pArray[i];
if(bit = g_pLookupTable[index])
return (i * 8) - (bit - 1);
}
}
return -1;
}
Further optimization:
Read 32 or however many bits at a time from pArray into a uint32_t and then shift and AND each to get byte one at a time, OR with index and test, before reading another 4.
Pack the LUT into 32K by storing a nybble for each index. This might help it squeeze into the cache on some systems.
It will depend on your memory architecture whether this is faster than an unrolled loop that doesn't use a lookup table.
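The nybble-packed table mentioned above can be sketched as follows. The stored values run from 0 to 8, so each fits in four bits. The names packed_lut, lut_set, lut_get, init_packed_lut and find_byte_match_packed are made up for illustration, and the table has to be rebuilt whenever the octet being searched for changes.

```c
#include <stdint.h>
#include <string.h>

static uint8_t packed_lut[32768]; /* two 4-bit entries per byte */

static void lut_set(uint32_t idx, uint8_t v)
{
    uint8_t *p = &packed_lut[idx >> 1];
    if (idx & 1u) *p = (uint8_t)((*p & 0x0F) | (v << 4)); /* odd index: high nybble */
    else          *p = (uint8_t)((*p & 0xF0) | v);        /* even index: low nybble */
}

static uint8_t lut_get(uint32_t idx)
{
    uint8_t b = packed_lut[idx >> 1];
    return (idx & 1u) ? (uint8_t)(b >> 4) : (uint8_t)(b & 0x0F);
}

/* For every 16-bit value, store shift + 1 of the earliest match, or 0 for no match. */
void init_packed_lut(uint8_t octet)
{
    memset(packed_lut, 0, sizeof packed_lut);
    for (uint32_t i = 0; i < 65536; i++) {
        for (int j = 7; j >= 0; j--) {
            if (((i >> j) & 0xFF) == octet) {
                lut_set(i, (uint8_t)(j + 1));
                break;
            }
        }
    }
}

/* Returns the bit offset (MSB-first) of the first match, or -1. */
int find_byte_match_packed(const uint8_t *a, int length, uint8_t octet)
{
    if (length <= 0) return -1;
    uint16_t index = a[0];
    if (index == octet) return 0;
    for (int i = 1; i < length; i++) {
        index = (uint16_t)((index << 8) | a[i]);
        uint8_t bit = lut_get(index);
        if (bit) return i * 8 - (bit - 1);
    }
    return -1;
}
```

The extra shift and mask in lut_get cost a few instructions per lookup, so whether the smaller table pays off depends on cache behaviour on the target system.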

Strip parity bits in C from 8 bits of data followed by 1 parity bit

I have a buffer of bits with 8 bits of data followed by 1 parity bit. This pattern repeats itself. The buffer is currently stored as an array of octets.
Example (p are parity bits):
0001 0001 p000 0100 0p00 0001 00p0 1110 0...
should become
0001 0001 0000 1000 0000 0100 0111 00 ...
Basically, I need to strip off every ninth bit to obtain just the data bits. How can I achieve this?
This is related to another question asked here sometime back.
This is on a 32 bit machine so the solution to the related question may not be applicable. The maximum possible number of bits is 45 i.e. 5 data octets
This is what I have tried so far. I have created a "boolean" array and added the bits into the array based on the bits set in each octet. I then look at every ninth index of the array and throw it away, then move the remaining array down one index. Then I've got only the data bits left. I was thinking there may be better ways of doing this.

Your idea of having an array of bits is good. Just implement the array of bits by a 32-bit number (buffer).
To remove a bit from the middle of the buffer:
void remove_bit(uint32_t* buffer, int* occupancy, int pos)
{
assert(*occupancy > 0);
uint32_t high_half = (*buffer >> pos >> 1) << pos; // bits above pos, moved down by one place
uint32_t low_half = *buffer & ((1u << pos) - 1); // bits below pos, unchanged
*buffer = high_half | low_half;
--*occupancy;
}
To add a byte to the buffer:
void add_byte(uint32_t* buffer, int* occupancy, uint8_t byte)
{
assert(*occupancy <= 24);
*buffer = (*buffer << 8) | byte;
*occupancy += 8;
}
To remove a byte from the buffer:
uint8_t remove_byte(uint32_t* buffer, int* occupancy)
{
assert(*occupancy >= 8); // check before shifting by (*occupancy - 8)
uint8_t result = *buffer >> (*occupancy - 8);
*occupancy -= 8;
return result;
}
You will have to arrange the calls so that the buffer never overflows. For example:
uint32_t buffer = 0;
int occupancy = 0;
add_byte(&buffer, &occupancy, *input++);
add_byte(&buffer, &occupancy, *input++);
remove_bit(&buffer, &occupancy, 7);
*output++ = remove_byte(&buffer, &occupancy);
add_byte(&buffer, &occupancy, *input++);
remove_bit(&buffer, &occupancy, 6);
*output++ = remove_byte(&buffer, &occupancy);
... (there are only 6 input bytes, so this should be easy)

In pseudo-code (since you're not providing any proof you've tried something), I would probably do it like this, for simplicity:
View the data (with parity bits included) as a stream of bits
While there are bits left to read:
Read the next 8 bits
Write to the output
Read one more bit, and discard it
This "lifts you up" from worrying about reading bytes, which no longer is a useful operation since your bytes are interleaved with bits you want to discard.
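The pseudo-code above can be sketched in C as follows. It assumes bits are read MSB-first within each octet, which is how the example in the question is written, and that only complete 9-bit groups are converted. The name strip_parity is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies the data octets of each complete 9-bit group (8 data bits followed
   by 1 parity bit) from 'in' to 'out'. Returns the number of octets written.
   'out' needs room for (in_len * 8) / 9 octets. */
size_t strip_parity(const uint8_t *in, size_t in_len, uint8_t *out)
{
    size_t total_bits = in_len * 8;
    size_t bitpos = 0;  /* index of the next bit to read */
    size_t n = 0;       /* number of octets written so far */

    while (bitpos + 9 <= total_bits) {
        uint8_t v = 0;
        /* Read the next 8 bits, most significant bit first. */
        for (int k = 0; k < 8; k++, bitpos++) {
            uint8_t bit = (uint8_t)((in[bitpos >> 3] >> (7 - (bitpos & 7))) & 1u);
            v = (uint8_t)((v << 1) | bit);
        }
        out[n++] = v;
        bitpos++;       /* read one more bit (the parity) and discard it */
    }
    return n;
}
```

Reading one bit at a time is not the fastest option, but with at most 45 input bits, as stated in the question, the cost is negligible.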

I have written helper functions to read unaligned bit buffers (this was for AVC streams, see original source here). The code itself is GPL, I'm pasting interesting (modified) bits here.
typedef struct bit_buffer_ {
uint8_t * start;
size_t size;
uint8_t * current;
uint8_t read_bits;
} bit_buffer;
/* reads one bit and returns its value as a 8-bit integer */
uint8_t get_bit(bit_buffer * bb) {
uint8_t ret;
ret = (*(bb->current) >> (7 - bb->read_bits)) & 0x1;
if (bb->read_bits == 7) {
bb->read_bits = 0;
bb->current++;
}
else {
bb->read_bits++;
}
return ret;
}
/* reads up to 32 bits and returns the value as a 32-bit integer */
uint32_t get_bits(bit_buffer * bb, size_t nbits) {
uint32_t i, ret;
ret = 0;
for (i = 0; i < nbits; i++) {
ret = (ret << 1) + get_bit(bb);
}
return ret;
}
You can use the structure like this:
uint8_t * buffer;
size_t buffer_size;
/* assumes buffer points to your data */
bit_buffer bb;
bb.start = buffer;
bb.size = buffer_size;
bb.current = buffer;
bb.read_bits = 0;
uint32_t value = get_bits(&bb, 8);
uint8_t parity = get_bit(&bb);
uint32_t value2 = get_bits(&bb, 8);
uint8_t parity2 = get_bit(&bb);
/* etc */
I must stress that this code can be improved considerably: proper bounds checking must be implemented, but it works fine in my use case.
I leave it as an exercise to you to implement a proper bit buffer reader using this for inspiration.
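One possible shape for a bounds-checked reader is sketched below. It is not the original AVC code: the type checked_reader and the functions checked_get_bit and checked_get_bits are hypothetical, and the sketch tracks a single bit position instead of a byte pointer plus a bit counter.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *data;
    size_t size_bits;  /* total number of readable bits */
    size_t pos_bits;   /* index of the next bit to read */
} checked_reader;

/* Reads one bit MSB-first; returns 0 or 1, or -1 when the buffer is exhausted. */
int checked_get_bit(checked_reader *r)
{
    if (r->pos_bits >= r->size_bits)
        return -1;
    size_t p = r->pos_bits++;
    return (r->data[p >> 3] >> (7 - (p & 7))) & 1;
}

/* Reads nbits (at most 32) into *out. Returns 0 on success, or -1 without
   consuming anything when fewer than nbits remain. */
int checked_get_bits(checked_reader *r, unsigned nbits, uint32_t *out)
{
    if (nbits > 32 || nbits > r->size_bits - r->pos_bits)
        return -1;
    uint32_t v = 0;
    for (unsigned i = 0; i < nbits; i++)
        v = (v << 1) | (uint32_t)checked_get_bit(r); /* cannot fail: length checked above */
    *out = v;
    return 0;
}
```

For the parity-stripping problem, the caller would alternate checked_get_bits(&r, 8, &value) and checked_get_bit(&r) until either call reports -1.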

This also works
void RemoveParity(unsigned char buffer[], int size)
{
int offset = 0;
int j = 0;
for(int i = 1; i + j < size; i++)
{
if (offset == 0)
{
printf("%u\n", buffer[i + j - 1]);
}
else
{
unsigned char left = buffer[i + j - 1] << offset;
unsigned char right = buffer[i + j] >> (8 - offset);
printf("%u\n", (unsigned char)(left | right));
}
offset++;
if (offset == 8)
{
offset = 0;
j++; // advance buffer (8 parity bit consumed)
}
}
}