Broken CRC32 Combine - zlib
I have a problem with a certain CRC method when trying to combine CRCs.
I have been using the combine-CRC method, and even adapted it a while ago to work with CRC-16, etc., and I (hope I?) understand how the filling with zeros works, according to the answer here https://stackoverflow.com/a/23126768/5495036 with zlib's code.
The thing is, I've been racking my brain over how to get an operator that I can use, as I am also working with a non-standard CRC calculation. The original CRC calculation is based on the CRC-32 0xEDB88320 polynomial for the lookup table, but the calculation itself was broken, so instead of the 256-entry lookup table looking like
0x00000000,0x77073096,0xee0e612c,0x990951ba,0x076dc419,0x706af48f,0xe963a535,... it now looks like this
0x00000000,0x96000000,0x30960000,0x07309600,0x77073096,0x2c770730,0x612c7707,... and the combine CRC uses the operator to be able to zero out the rest of the bits.
Unfortunately I can't change the calculation, so that idea is out :P
Any ideas?
EDIT:
The calculation is standard; it just uses a table that was built from the standard polynomial 0xEDB88320, but built incorrectly. The table is still 256 ints, and the starting CRC is 0xFFFFFFFF.
Complete class code
public static byte[] ByteLookupTable =
{
0x00,0x00,0x00,0x00,0x96,0x30,0x07,0x77,0x2C,0x61,0x0E,0xEE,0xBA,0x51,
0x09,0x99,0x19,0xC4,0x6D,0x07,0x8F,0xF4,0x6A,0x70,0x35,0xA5,0x63,
0xE9,0xA3,0x95,0x64,0x9E,0x32,0x88,0xDB,0x0E,0xA4,0xB8,0xDC,0x79,0x1E,
0xE9,0xD5,0xE0,0x88,0xD9,0xD2,0x97,0x2B,0x4C,0xB6,0x09,0xBD,0x7C,
0xB1,0x7E,0x07,0x2D,0xB8,0xE7,0x91,0x1D,0xBF,0x90,0x64,0x10,0xB7,0x1D,
0xF2,0x20,0xB0,0x6A,0x48,0x71,0xB9,0xF3,0xDE,0x41,0xBE,0x84,0x7D,
0xD4,0xDA,0x1A,0xEB,0xE4,0xDD,0x6D,0x51,0xB5,0xD4,0xF4,0xC7,0x85,0xD3,
0x83,0x56,0x98,0x6C,0x13,0xC0,0xA8,0x6B,0x64,0x7A,0xF9,0x62,0xFD,
0xEC,0xC9,0x65,0x8A,0x4F,0x5C,0x01,0x14,0xD9,0x6C,0x06,0x63,0x63,0x3D,
0x0F,0xFA,0xF5,0x0D,0x08,0x8D,0xC8,0x20,0x6E,0x3B,0x5E,0x10,0x69,
0x4C,0xE4,0x41,0x60,0xD5,0x72,0x71,0x67,0xA2,0xD1,0xE4,0x03,0x3C,0x47,
0xD4,0x04,0x4B,0xFD,0x85,0x0D,0xD2,0x6B,0xB5,0x0A,0xA5,0xFA,0xA8,
0xB5,0x35,0x6C,0x98,0xB2,0x42,0xD6,0xC9,0xBB,0xDB,0x40,0xF9,0xBC,0xAC,
0xE3,0x6C,0xD8,0x32,0x75,0x5C,0xDF,0x45,0xCF,0x0D,0xD6,0xDC,0x59,
0x3D,0xD1,0xAB,0xAC,0x30,0xD9,0x26,0x3A,0x00,0xDE,0x51,0x80,0x51,0xD7,
0xC8,0x16,0x61,0xD0,0xBF,0xB5,0xF4,0xB4,0x21,0x23,0xC4,0xB3,0x56,
0x99,0x95,0xBA,0xCF,0x0F,0xA5,0xBD,0xB8,0x9E,0xB8,0x02,0x28,0x08,0x88,
0x05,0x5F,0xB2,0xD9,0x0C,0xC6,0x24,0xE9,0x0B,0xB1,0x87,0x7C,0x6F,
0x2F,0x11,0x4C,0x68,0x58,0xAB,0x1D,0x61,0xC1,0x3D,0x2D,0x66,0xB6,0x90,
0x41,0xDC,0x76,0x06,0x71,0xDB,0x01,0xBC,0x20,0xD2,0x98,0x2A,0x10,
0xD5,0xEF,0x89,0x85,0xB1,0x71,0x1F,0xB5,0xB6,0x06,0xA5,0xE4,0xBF,0x9F,
0x33,0xD4,0xB8,0xE8,0xA2,0xC9,0x07,0x78,0x34,0xF9,0x00,0x0F,0x8E,
0xA8,0x09,0x96,0x18,0x98,0x0E,0xE1,0xBB,0x0D,0x6A,0x7F,0x2D,0x3D,0x6D,
0x08,0x97,0x6C,0x64,0x91,0x01,0x5C,0x63,0xE6,0xF4,0x51,0x6B,0x6B,
0x62,0x61,0x6C,0x1C,0xD8,0x30,0x65,0x85,0x4E,0x00,0x62,0xF2,0xED,0x95,
0x06,0x6C,0x7B,0xA5,0x01,0x1B,0xC1,0xF4,0x08,0x82,0x57,0xC4,0x0F,
0xF5,0xC6,0xD9,0xB0,0x65,0x50,0xE9,0xB7,0x12,0xEA,0xB8,0xBE,0x8B,0x7C,
0x88,0xB9,0xFC,0xDF,0x1D,0xDD,0x62,0x49,0x2D,0xDA,0x15,0xF3,0x7C,
0xD3,0x8C,0x65,0x4C,0xD4,0xFB,0x58,0x61,0xB2,0x4D,0xCE,0x51,0xB5,0x3A,
0x74,0x00,0xBC,0xA3,0xE2,0x30,0xBB,0xD4,0x41,0xA5,0xDF,0x4A,0xD7,
0x95,0xD8,0x3D,0x6D,0xC4,0xD1,0xA4,0xFB,0xF4,0xD6,0xD3,0x6A,0xE9,0x69,
0x43,0xFC,0xD9,0x6E,0x34,0x46,0x88,0x67,0xAD,0xD0,0xB8,0x60,0xDA,
0x73,0x2D,0x04,0x44,0xE5,0x1D,0x03,0x33,0x5F,0x4C,0x0A,0xAA,0xC9,0x7C,
0x0D,0xDD,0x3C,0x71,0x05,0x50,0xAA,0x41,0x02,0x27,0x10,0x10,0x0B,
0xBE,0x86,0x20,0x0C,0xC9,0x25,0xB5,0x68,0x57,0xB3,0x85,0x6F,0x20,0x09,
0xD4,0x66,0xB9,0x9F,0xE4,0x61,0xCE,0x0E,0xF9,0xDE,0x5E,0x98,0xC9,
0xD9,0x29,0x22,0x98,0xD0,0xB0,0xB4,0xA8,0xD7,0xC7,0x17,0x3D,0xB3,0x59,
0x81,0x0D,0xB4,0x2E,0x3B,0x5C,0xBD,0xB7,0xAD,0x6C,0xBA,0xC0,0x20,
0x83,0xB8,0xED,0xB6,0xB3,0xBF,0x9A,0x0C,0xE2,0xB6,0x03,0x9A,0xD2,0xB1,
0x74,0x39,0x47,0xD5,0xEA,0xAF,0x77,0xD2,0x9D,0x15,0x26,0xDB,0x04,
0x83,0x16,0xDC,0x73,0x12,0x0B,0x63,0xE3,0x84,0x3B,0x64,0x94,0x3E,0x6A,
0x6D,0x0D,0xA8,0x5A,0x6A,0x7A,0x0B,0xCF,0x0E,0xE4,0x9D,0xFF,0x09,
0x93,0x27,0xAE,0x00,0x0A,0xB1,0x9E,0x07,0x7D,0x44,0x93,0x0F,0xF0,0xD2,
0xA3,0x08,0x87,0x68,0xF2,0x01,0x1E,0xFE,0xC2,0x06,0x69,0x5D,0x57,
0x62,0xF7,0xCB,0x67,0x65,0x80,0x71,0x36,0x6C,0x19,0xE7,0x06,0x6B,0x6E,
0x76,0x1B,0xD4,0xFE,0xE0,0x2B,0xD3,0x89,0x5A,0x7A,0xDA,0x10,0xCC,
0x4A,0xDD,0x67,0x6F,0xDF,0xB9,0xF9,0xF9,0xEF,0xBE,0x8E,0x43,0xBE,0xB7,
0x17,0xD5,0x8E,0xB0,0x60,0xE8,0xA3,0xD6,0xD6,0x7E,0x93,0xD1,0xA1,
0xC4,0xC2,0xD8,0x38,0x52,0xF2,0xDF,0x4F,0xF1,0x67,0xBB,0xD1,0x67,0x57,
0xBC,0xA6,0xDD,0x06,0xB5,0x3F,0x4B,0x36,0xB2,0x48,0xDA,0x2B,0x0D,
0xD8,0x4C,0x1B,0x0A,0xAF,0xF6,0x4A,0x03,0x36,0x60,0x7A,0x04,0x41,0xC3,
0xEF,0x60,0xDF,0x55,0xDF,0x67,0xA8,0xEF,0x8E,0x6E,0x31,0x79,0xBE,
0x69,0x46,0x8C,0xB3,0x61,0xCB,0x1A,0x83,0x66,0xBC,0xA0,0xD2,0x6F,0x25,
0x36,0xE2,0x68,0x52,0x95,0x77,0x0C,0xCC,0x03,0x47,0x0B,0xBB,0xB9,
0x16,0x02,0x22,0x2F,0x26,0x05,0x55,0xBE,0x3B,0xBA,0xC5,0x28,0x0B,0xBD,
0xB2,0x92,0x5A,0xB4,0x2B,0x04,0x6A,0xB3,0x5C,0xA7,0xFF,0xD7,0xC2,
0x31,0xCF,0xD0,0xB5,0x8B,0x9E,0xD9,0x2C,0x1D,0xAE,0xDE,0x5B,0xB0,0xC2,
0x64,0x9B,0x26,0xF2,0x63,0xEC,0x9C,0xA3,0x6A,0x75,0x0A,0x93,0x6D,
0x02,0xA9,0x06,0x09,0x9C,0x3F,0x36,0x0E,0xEB,0x85,0x67,0x07,0x72,0x13,
0x57,0x00,0x05,0x82,0x4A,0xBF,0x95,0x14,0x7A,0xB8,0xE2,0xAE,0x2B,
0xB1,0x7B,0x38,0x1B,0xB6,0x0C,0x9B,0x8E,0xD2,0x92,0x0D,0xBE,0xD5,0xE5,
0xB7,0xEF,0xDC,0x7C,0x21,0xDF,0xDB,0x0B,0xD4,0xD2,0xD3,0x86,0x42,
0xE2,0xD4,0xF1,0xF8,0xB3,0xDD,0x68,0x6E,0x83,0xDA,0x1F,0xCD,0x16,0xBE,
0x81,0x5B,0x26,0xB9,0xF6,0xE1,0x77,0xB0,0x6F,0x77,0x47,0xB7,0x18,
0xE6,0x5A,0x08,0x88,0x70,0x6A,0x0F,0xFF,0xCA,0x3B,0x06,0x66,0x5C,0x0B,
0x01,0x11,0xFF,0x9E,0x65,0x8F,0x69,0xAE,0x62,0xF8,0xD3,0xFF,0x6B,
0x61,0x45,0xCF,0x6C,0x16,0x78,0xE2,0x0A,0xA0,0xEE,0xD2,0x0D,0xD7,0x54,
0x83,0x04,0x4E,0xC2,0xB3,0x03,0x39,0x61,0x26,0x67,0xA7,0xF7,0x16,
0x60,0xD0,0x4D,0x47,0x69,0x49,0xDB,0x77,0x6E,0x3E,0x4A,0x6A,0xD1,0xAE,
0xDC,0x5A,0xD6,0xD9,0x66,0x0B,0xDF,0x40,0xF0,0x3B,0xD8,0x37,0x53,
0xAE,0xBC,0xA9,0xC5,0x9E,0xBB,0xDE,0x7F,0xCF,0xB2,0x47,0xE9,0xFF,0xB5,
0x30,0x1C,0xF2,0xBD,0xBD,0x8A,0xC2,0xBA,0xCA,0x30,0x93,0xB3,0x53,
0xA6,0xA3,0xB4,0x24,0x05,0x36,0xD0,0xBA,0x93,0x06,0xD7,0xCD,0x29,0x57,
0xDE,0x54,0xBF,0x67,0xD9,0x23,0x2E,0x7A,0x66,0xB3,0xB8,0x4A,0x61,
0xC4,0x02,0x1B,0x68,0x5D,0x94,0x2B,0x6F,0x2A,0x37,0xBE,0x0B,0xB4,0xA1,
0x8E,0x0C,0xC3,0x1B,0xDF,0x05,0x5A,0x8D,0xEF,0x02,0x2D,0xC3,0x8D,0x40,0x00
};
protected void BuildLookupTable()
{
    if (LookupTable == null)
    {
        LookupTable = new uint[256];
        for (int i = 0; i < LookupTable.Length; i++)
        {
            LookupTable[i] = BitConverter.ToUInt32(ByteLookupTable, i);
        }
    }
}

protected override uint CalculateBuffer(byte[] buffer, uint crc, int startPos, int endPos)
{
    for (int i = 0; i < endPos; i++)
    {
        crc = LookupTable[(crc ^ buffer[i]) & 0xff] ^ (crc >> 8);
    }
    return crc;
}
The lookup table constants are correct for that polynomial, but apparently the conversion done by BuildLookupTable() is totally messed up. It would have been easy to rewrite the table to avoid needing BuildLookupTable().
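For illustration, a correct byte-to-word conversion would read four little-endian bytes per entry, stepping the offset by 4 per entry rather than 1 (sketched here in C; in the C# class above, the equivalent fix would be passing i * 4 instead of i to BitConverter.ToUInt32 — the function name below is made up):

```c
#include <stdint.h>

/* Assemble n 32-bit table entries from 4*n little-endian bytes.
 * The bug in the question's BuildLookupTable() is that it advances
 * the byte offset by 1 per entry instead of 4, so consecutive
 * entries are built from overlapping byte windows -- which is
 * exactly why the broken table reads 0x00000000, 0x96000000,
 * 0x30960000, ... instead of 0x00000000, 0x77073096, ... */
void build_lookup_table(uint32_t *table, const uint8_t *bytes, int n)
{
    for (int i = 0; i < n; i++) {
        table[i] = (uint32_t)bytes[4 * i]
                 | ((uint32_t)bytes[4 * i + 1] << 8)
                 | ((uint32_t)bytes[4 * i + 2] << 16)
                 | ((uint32_t)bytes[4 * i + 3] << 24);
    }
}
```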
What follows is from the original answer, which assumed that the table was converted correctly. Which it isn't.
As it is, this isn't a CRC, so the combination approach for a CRC does not apply here.
The one thing missing from your definition is what the initial value for the CRC is, and possibly if there is an exclusive-or done on the final CRC. That polynomial is the same as used by zlib, PKZIP, etc., where the initial CRC value is 0xffffffff and the final exclusive-or is with 0xffffffff. That CRC is referred to as CRC-32/ISO-HDLC.
Whatever your initial and the final exclusive-or values are, if they are equal, then you can use the crc32_combine() function from zlib as is. If they are not, you can still use crc32_combine(), but you need to exclusive-or the input and output CRC values of that function with the exclusive-or of the initial and final exclusive-or values.
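For reference, crc32_combine() works by applying the "advance crc1 over len2 zero bytes" operator as a GF(2) matrix, squaring the matrix as it walks the bits of len2. A self-contained sketch of the same technique (plus a bitwise CRC-32 to test it against; this mirrors zlib's algorithm but is not zlib's code) might look like:

```c
#include <stdint.h>
#include <stddef.h>

#define CRC32_POLY 0xedb88320UL  /* reflected CRC-32 polynomial */

/* Bitwise CRC-32 with initial value and final xor both 0xffffffff
 * (the zlib/PKZIP convention described above). */
uint32_t crc32_buf(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xffffffffUL;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ CRC32_POLY : crc >> 1;
    }
    return crc ^ 0xffffffffUL;
}

/* Multiply the GF(2) 32x32 matrix mat by the vector vec. */
static uint32_t gf2_matrix_times(const uint32_t *mat, uint32_t vec)
{
    uint32_t sum = 0;
    while (vec) {
        if (vec & 1)
            sum ^= *mat;
        vec >>= 1;
        mat++;
    }
    return sum;
}

/* square[i] = mat * mat[i], i.e. the matrix squared. */
static void gf2_matrix_square(uint32_t *square, const uint32_t *mat)
{
    for (int n = 0; n < 32; n++)
        square[n] = gf2_matrix_times(mat, mat[n]);
}

/* Combine CRCs of two concatenated blocks: given crc1 = CRC(A) and
 * crc2 = CRC(B), return CRC(A||B), where len2 is the length of B. */
uint32_t crc32_combine_sketch(uint32_t crc1, uint32_t crc2, size_t len2)
{
    uint32_t even[32], odd[32], row = 1;

    if (len2 == 0)
        return crc1;

    odd[0] = CRC32_POLY;            /* operator for one zero bit   */
    for (int n = 1; n < 32; n++) {
        odd[n] = row;
        row <<= 1;
    }
    gf2_matrix_square(even, odd);   /* operator for two zero bits  */
    gf2_matrix_square(odd, even);   /* operator for four zero bits */

    /* Apply len2 zero bytes to crc1, squaring the operator as we
     * consume the bits of len2. */
    do {
        gf2_matrix_square(even, odd);
        if (len2 & 1)
            crc1 = gf2_matrix_times(even, crc1);
        len2 >>= 1;
        if (len2 == 0)
            break;
        gf2_matrix_square(odd, even);
        if (len2 & 1)
            crc1 = gf2_matrix_times(odd, crc1);
        len2 >>= 1;
    } while (len2);

    return crc1 ^ crc2;
}
```

Because the initial value and the final exclusive-or are equal here, the combine takes the ordinary CRC outputs directly; with unequal values you would exclusive-or the inputs and output as described above.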
Related
One hot encoding of the states of a C FSM
Basically, I would just like to know whether it is a good idea to manually one-hot encode the states of a C FSM. I implemented it to write an easy state-transition validator:

typedef enum {
    FSM_State1     = (1 << 0),
    FSM_State2     = (1 << 1),
    FSM_State3     = (1 << 2),
    FSM_StateError = (1 << 3)
} states_t;

Then the validation:

states_t nextState, requestedState;
uint32_t validDestStates = 0;

// Compute requested state
requestedState = FSM_State1;

// Define valid transitions
validDestStates |= FSM_State2;
validDestStates |= FSM_State3;

// Check transition
if (validDestStates & requestedState)
{
    // Valid transition
    nextState = requestedState;
}
else
{
    // Illegal transition
    nextState = FSM_StateError;
}

I know that I am limited to the maximum size of integer that I can use, but I don't have that many states, so it is not an issue. Is there something better than this encoding? Are there some drawbacks I don't see yet? Thanks for your help!

Edit: changed the validation test according to user3386109's comment.

Final thoughts

So here is what I finally did:

1/ The state enum is a "classical" enum:

typedef enum {
    FSM_State1,
    FSM_State2,
    FSM_State3,
    FSM_StateError
} states_t;

2/ Bit fields for the valid transitions:

struct s_fsm_stateValidation
{
    bool State1IsValid: 1;
    bool State2IsValid: 1;
    bool State3IsValid: 1;
    bool StateErrorIsValid: 1;
    /// Reserved space to pad the union out to 32 bits
    uint32_t reserved: 28;
};

3/ A union for the validation:

typedef union FSM_stateValidation_u
{
    /// The bit field of critical system errors
    struct s_fsm_stateValidation state;
    /// Access the bit field as a whole
    uint32_t all;
} u_FSM_stateValidation;

4/ I changed the validation:

u_FSM_stateValidation validDestStates;

// Set valid states
validDestStates.state.State1IsValid = true;

// Compute requestedState
requestedState = FSM_State2;

if (validDestStates.all & ((uint32_t)(1 << requestedState)))
{
    // Next state is legal
    return requestedState;
}
else
{
    return FSM_StateError;
}
From a quick Google, "one-hot encoded" means that every valid code has precisely one bit set, which seems to be what you're doing. The search results suggest this is originally a hardware design pattern. Drawbacks I can think of are: As you suggest, you're dramatically limiting the number of valid codes: for 32 bits you have a maximum of 32 codes/states instead of more than 4 billion. It's not ideal for lookup tables, which are a common implementation for switch statements. There is usually an intrinsic available to determine the lowest bit set, but I wouldn't bet on compilers using it automatically. Those aren't big issues, though, provided the number of states is small. The question, IMO, is then whether there's an advantage to justify that cost. It doesn't need to be a huge advantage, but there has to be some kind of point. The best I can come up with is that you can use bitwise tricks to specify sets of states, so you can test efficiently whether the current state is in a given set: if you have some action that needs to be done in states (1<<0) and (1<<3), for example, you can test if (state & 0x9).
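That set-membership trick looks like this in practice (a minimal sketch with made-up state names):

```c
#include <stdint.h>

/* One-hot states: exactly one bit set per state. */
enum state {
    ST_IDLE  = 1u << 0,
    ST_RUN   = 1u << 1,
    ST_PAUSE = 1u << 2,
    ST_STOP  = 1u << 3
};

/* Set of states in which some hypothetical action is allowed;
 * ST_IDLE | ST_STOP == 0x9, matching the example in the text. */
#define NEEDS_ACTION (ST_IDLE | ST_STOP)

/* One AND tests membership in the whole set at once. */
int action_needed(enum state s)
{
    return (s & NEEDS_ACTION) != 0;
}
```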
Identifying a trend in C - Micro controller sampling
I'm working on an MC68HC11 microcontroller and have an analogue voltage signal going in that I have sampled. The scenario is a weighing machine: the large peaks are when the object hits the sensor, then it stabilises (which are the samples I want), and then it peaks again before the object rolls off. The problem I'm having is figuring out a way for the program to detect this stable point and average it to produce an overall weight, but I can't figure out how. One way I have thought of is comparing previous values to see if there is not a large difference between them, but I haven't had any success. Below is the C code that I am using:

#include <stdio.h>
#include <stdarg.h>
#include <iof1.h>

void main(void)
{
    /* PORTA, DDRA, DDRG etc... are LEDs and switch ports */
    unsigned char *paddr, *adctl, *adr1;
    unsigned short i = 0;
    unsigned short k = 0;
    unsigned char switched = 1; /* is char the smallest data type? */
    unsigned char data[2000];

    DDRA = 0x00; /* All in */
    DDRG = 0xff;

    adctl = (unsigned char*) 0x30;
    adr1 = (unsigned char*) 0x31;
    *adctl = 0x20; /* single continuous scan */

    while (1)
    {
        if (*adr1 > 40)
        {
            if (PORTA == 128) /* Debugging switch */
            {
                PORTG = 1;
            }
            else
            {
                PORTG = 0;
            }
            if (i < 2000)
            {
                while (((*adctl) & 0x80) == 0x00);
                {
                    data[i] = *adr1;
                }
                /* if(i > 10 && (data[(i-10)] - data[i]) < 20) */
                i++;
            }
            if (PORTA == switched)
            {
                PORTG = 31;
                /* Print a delimiter so teemtalk can send to excel */
                for (k = 0; k < 2000; k++)
                {
                    printf("%d,", data[k]);
                }
                if (switched == 1) /* bitwise manipulation more efficient? */
                {
                    switched = 0;
                }
                else
                {
                    switched = 1;
                }
                PORTG = 0;
            }
            if (i >= 2000)
            {
                i = 0;
            }
        }
    }
}

I look forward to hearing any suggestions :) (The graph shows how these values look; the red box is the area I would like to identify.)
As your sample sequence has glitches (short-lived transients), try to improve the hardware first, i.e. change the layout, add decoupling, add filtering, etc. If that approach fails, then use a median filter [1] of, say, five places, which takes the last five samples, sorts them and outputs the middle one, so two samples of a transient have no effect on its output (seven places tolerate three transient samples). Then apply a computationally efficient exponential-averaging lowpass filter [2]: y(n) = y(n-1) + alpha*[x(n) - y(n-1)], choosing alpha = 1/2^n (so the division becomes right shifts) to yield a time constant [3] shorter than the underlying response (~50 samples) while still filtering out the noise. Increasing the effective number of fractional bits will avoid quantization issues. With this improved sample sequence, thresholds and cycle counts can be applied to detect quiescent durations. Additionally, if the end of the quiescent period is always followed by a large, abrupt change, then a sample-delay "array" enables detecting the abrupt change while still having the last of the quiescent samples available for logging. [1] http://en.wikipedia.org/wiki/Median_filter [2] http://www.dsprelated.com/showarticle/72.php [3] http://en.wikipedia.org/wiki/Time_constant Note: adding code for the above filtering operations will lower the maximum possible sample rate, but printf can be substituted with something faster.
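A sketch of that filter chain (median of five followed by exponential averaging; the 8-bit sample type, alpha = 1/8, and the 8.8 fixed-point scaling are assumptions, not from the answer):

```c
#include <stdint.h>
#include <string.h>

/* Median-of-5: copy the window, sort the copy, take the middle element. */
uint8_t median5(const uint8_t w[5])
{
    uint8_t s[5];
    memcpy(s, w, sizeof s);
    for (int i = 1; i < 5; i++) {        /* insertion sort, 5 elements */
        uint8_t v = s[i];
        int j = i - 1;
        while (j >= 0 && s[j] > v) {
            s[j + 1] = s[j];
            j--;
        }
        s[j + 1] = v;
    }
    return s[2];
}

/* Exponential average y += (x - y)/8, with y kept in 8.8 fixed point
 * so the fractional part is not lost to integer truncation. */
uint16_t smooth(uint16_t y, uint8_t x)
{
    int32_t delta = ((int32_t)x << 8) - y;
    return (uint16_t)(y + (delta >> 3));   /* alpha = 1/2^3 */
}
```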
Continuously store the current value and the delta from the previous value. Note when the delta is decreasing, as the start of weight application to the scale. Note when the delta is increasing, as the end of weight application to the scale. Take the X values with a small delta and average them. BTW, I'm sure this has been done a million times before; I'm thinking a search for "scale PID" or "weight PID" would find a lot of information.
Don't forget to use a ___delay_ms(XX) function somewhere between the readings if you will compare with the previous value. The difference at each step will obviously be small if the code loops continuously.
Looking at your nice graphs, I would say you should look only for the falling edge; it is much more consistent than the leading edge. In other words: let the samples accumulate, calculate the running average all the time with a predefined window size, remember the deviation of the previous values just for reference, and check for a large negative bump in your values (like an absolute value ten times smaller than the current running average); your running average is your value. You could go back a little bit (disregarding the last few values in your average and recalculating) to compensate for the small positive bump visible in your picture before each negative bump. No need for heavy math here; you cannot model reality better than your picture has shown, just make sure your code detects the end of each and every sample. You have to sample fast enough to make sure no negative bump is missed (or you will have a big timing error in your data averaging). And you don't need such large arrays: a running average is better based on a smaller window size, with smaller residual error in your case when you detect the negative bump.
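The running-average-plus-falling-edge idea could be sketched like this (the window size of 16 and the 16-bit sample type are assumptions):

```c
#include <stdint.h>

#define WINDOW 16u   /* assumed running-average window size */

/* Circular buffer with a running sum, so the mean costs O(1) per sample. */
struct avg {
    uint32_t sum;
    uint16_t buf[WINDOW];
    uint8_t  pos;
};

/* Push a sample, return the current running average. */
uint16_t avg_update(struct avg *a, uint16_t x)
{
    a->sum += x - a->buf[a->pos];   /* replace oldest sample in the sum */
    a->buf[a->pos] = x;
    a->pos = (uint8_t)((a->pos + 1) % WINDOW);
    return (uint16_t)(a->sum / WINDOW);
}

/* End-of-weight detection: the sample drops to roughly a tenth of
 * the running average, per the heuristic in the answer above. */
int falling_edge(uint16_t sample, uint16_t mean)
{
    return sample < mean / 10;
}
```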
CRC-32 on MicroController (Atmel)
I am currently trying to implement a CRC-32 for an incoming data stream (serial communication) on an ATMEGA1280, and I am a little lost as to how to do this on the embedded side in C. If anyone could point me in the proper direction and/or help in any way, I would greatly appreciate it.
There are plenty of CRC-32 implementations in C. The ATmega1280 has 128 KB of code space; it shouldn't have any problems running any off-the-shelf implementation. Here is pretty much the first one I found.
You should know what polynomial you are dealing with. It is not enough to know that you are using a CRC; you should also know the polynomial. You are looking for a function with this kind of prototype:

uint32_t crc(uint8_t *data, int len, uint32_t polynomial)

Going further, such a function can also support updating, if you are getting your data asynchronously, via an extra parameter for resuming the computation. With CRC-32 you stream bits through the CRC function and you get a 32-bit number that is used for data-corruption checking. I am sure you will be able to find C code for a CRC online. EDIT: It looks like the CRC-32 polynomial is a sort of standard and is usually unified. That means a CRC-32 implementation will employ the correct polynomial. http://en.wikipedia.org/wiki/Computation_of_CRC
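A bit-at-a-time version of such a resumable function might look like this (a sketch; the reflected polynomial form and the zlib-style init/final-xor convention are assumptions):

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time CRC-32 over a reflected polynomial. Pass 0xffffffff
 * as `crc` for the first chunk, feed the previous return value back
 * in to resume on later chunks, and xor the final result with
 * 0xffffffff when all data has been seen. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len,
                      uint32_t polynomial)
{
    while (len--) {
        crc ^= *data++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ polynomial : crc >> 1;
    }
    return crc;
}
```

With the standard polynomial 0xEDB88320 this reproduces the usual CRC-32 check value 0xCBF43926 for the string "123456789", whether fed in one chunk or several.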
Copy-paste solution for AVR, ripped from here: /* crc32.c -- compute the CRC-32 of a data stream * Copyright (C) 1995-1996 Mark Adler * For conditions of distribution and use, see copyright notice in zlib.h */ /* $Id: crc32.c,v 1.1 2007/10/07 20:47:35 matthias Exp $ */ /* ======================================================================== * Table of CRC-32's of all single-byte values (made by make_crc_table) */ static unsigned long crc_table[256] = { 0x00000000L, 0x77073096L, 0xee0e612cL, 0x990951baL, 0x076dc419L, 0x706af48fL, 0xe963a535L, 0x9e6495a3L, 0x0edb8832L, 0x79dcb8a4L, 0xe0d5e91eL, 0x97d2d988L, 0x09b64c2bL, 0x7eb17cbdL, 0xe7b82d07L, 0x90bf1d91L, 0x1db71064L, 0x6ab020f2L, 0xf3b97148L, 0x84be41deL, 0x1adad47dL, 0x6ddde4ebL, 0xf4d4b551L, 0x83d385c7L, 0x136c9856L, 0x646ba8c0L, 0xfd62f97aL, 0x8a65c9ecL, 0x14015c4fL, 0x63066cd9L, 0xfa0f3d63L, 0x8d080df5L, 0x3b6e20c8L, 0x4c69105eL, 0xd56041e4L, 0xa2677172L, 0x3c03e4d1L, 0x4b04d447L, 0xd20d85fdL, 0xa50ab56bL, 0x35b5a8faL, 0x42b2986cL, 0xdbbbc9d6L, 0xacbcf940L, 0x32d86ce3L, 0x45df5c75L, 0xdcd60dcfL, 0xabd13d59L, 0x26d930acL, 0x51de003aL, 0xc8d75180L, 0xbfd06116L, 0x21b4f4b5L, 0x56b3c423L, 0xcfba9599L, 0xb8bda50fL, 0x2802b89eL, 0x5f058808L, 0xc60cd9b2L, 0xb10be924L, 0x2f6f7c87L, 0x58684c11L, 0xc1611dabL, 0xb6662d3dL, 0x76dc4190L, 0x01db7106L, 0x98d220bcL, 0xefd5102aL, 0x71b18589L, 0x06b6b51fL, 0x9fbfe4a5L, 0xe8b8d433L, 0x7807c9a2L, 0x0f00f934L, 0x9609a88eL, 0xe10e9818L, 0x7f6a0dbbL, 0x086d3d2dL, 0x91646c97L, 0xe6635c01L, 0x6b6b51f4L, 0x1c6c6162L, 0x856530d8L, 0xf262004eL, 0x6c0695edL, 0x1b01a57bL, 0x8208f4c1L, 0xf50fc457L, 0x65b0d9c6L, 0x12b7e950L, 0x8bbeb8eaL, 0xfcb9887cL, 0x62dd1ddfL, 0x15da2d49L, 0x8cd37cf3L, 0xfbd44c65L, 0x4db26158L, 0x3ab551ceL, 0xa3bc0074L, 0xd4bb30e2L, 0x4adfa541L, 0x3dd895d7L, 0xa4d1c46dL, 0xd3d6f4fbL, 0x4369e96aL, 0x346ed9fcL, 0xad678846L, 0xda60b8d0L, 0x44042d73L, 0x33031de5L, 0xaa0a4c5fL, 0xdd0d7cc9L, 0x5005713cL, 0x270241aaL, 0xbe0b1010L, 0xc90c2086L, 0x5768b525L, 0x206f85b3L, 
0xb966d409L, 0xce61e49fL, 0x5edef90eL, 0x29d9c998L, 0xb0d09822L, 0xc7d7a8b4L, 0x59b33d17L, 0x2eb40d81L, 0xb7bd5c3bL, 0xc0ba6cadL, 0xedb88320L, 0x9abfb3b6L, 0x03b6e20cL, 0x74b1d29aL, 0xead54739L, 0x9dd277afL, 0x04db2615L, 0x73dc1683L, 0xe3630b12L, 0x94643b84L, 0x0d6d6a3eL, 0x7a6a5aa8L, 0xe40ecf0bL, 0x9309ff9dL, 0x0a00ae27L, 0x7d079eb1L, 0xf00f9344L, 0x8708a3d2L, 0x1e01f268L, 0x6906c2feL, 0xf762575dL, 0x806567cbL, 0x196c3671L, 0x6e6b06e7L, 0xfed41b76L, 0x89d32be0L, 0x10da7a5aL, 0x67dd4accL, 0xf9b9df6fL, 0x8ebeeff9L, 0x17b7be43L, 0x60b08ed5L, 0xd6d6a3e8L, 0xa1d1937eL, 0x38d8c2c4L, 0x4fdff252L, 0xd1bb67f1L, 0xa6bc5767L, 0x3fb506ddL, 0x48b2364bL, 0xd80d2bdaL, 0xaf0a1b4cL, 0x36034af6L, 0x41047a60L, 0xdf60efc3L, 0xa867df55L, 0x316e8eefL, 0x4669be79L, 0xcb61b38cL, 0xbc66831aL, 0x256fd2a0L, 0x5268e236L, 0xcc0c7795L, 0xbb0b4703L, 0x220216b9L, 0x5505262fL, 0xc5ba3bbeL, 0xb2bd0b28L, 0x2bb45a92L, 0x5cb36a04L, 0xc2d7ffa7L, 0xb5d0cf31L, 0x2cd99e8bL, 0x5bdeae1dL, 0x9b64c2b0L, 0xec63f226L, 0x756aa39cL, 0x026d930aL, 0x9c0906a9L, 0xeb0e363fL, 0x72076785L, 0x05005713L, 0x95bf4a82L, 0xe2b87a14L, 0x7bb12baeL, 0x0cb61b38L, 0x92d28e9bL, 0xe5d5be0dL, 0x7cdcefb7L, 0x0bdbdf21L, 0x86d3d2d4L, 0xf1d4e242L, 0x68ddb3f8L, 0x1fda836eL, 0x81be16cdL, 0xf6b9265bL, 0x6fb077e1L, 0x18b74777L, 0x88085ae6L, 0xff0f6a70L, 0x66063bcaL, 0x11010b5cL, 0x8f659effL, 0xf862ae69L, 0x616bffd3L, 0x166ccf45L, 0xa00ae278L, 0xd70dd2eeL, 0x4e048354L, 0x3903b3c2L, 0xa7672661L, 0xd06016f7L, 0x4969474dL, 0x3e6e77dbL, 0xaed16a4aL, 0xd9d65adcL, 0x40df0b66L, 0x37d83bf0L, 0xa9bcae53L, 0xdebb9ec5L, 0x47b2cf7fL, 0x30b5ffe9L, 0xbdbdf21cL, 0xcabac28aL, 0x53b39330L, 0x24b4a3a6L, 0xbad03605L, 0xcdd70693L, 0x54de5729L, 0x23d967bfL, 0xb3667a2eL, 0xc4614ab8L, 0x5d681b02L, 0x2a6f2b94L, 0xb40bbe37L, 0xc30c8ea1L, 0x5a05df1bL, 0x2d02ef8dL }; #define DO1(buf) crc = crc_table[((int)crc ^ (*buf++)) & 0xff] ^ (crc >> 8); #define DO2(buf) DO1(buf); DO1(buf); #define DO4(buf) DO2(buf); DO2(buf); #define DO8(buf) DO4(buf); DO4(buf); unsigned long 
crc32(crc, buf, len)
    unsigned long crc;
    const unsigned char *buf;
    unsigned int len;
{
    if (!buf) return (0L);
    crc = crc ^ 0xffffffffL;
    while (len >= 8)
    {
        DO8(buf);
        len -= 8;
    }
    if (len) do
    {
        DO1(buf);
    } while (--len);
    return (crc ^ 0xffffffffL);
}
Hash table implementation
I just bought a book "C Interfaces and Implementations". in chapter one , it has implemented a "Atom" structure, sample code as follow: #define NELEMS(x) ((sizeof (x))/(sizeof ((x)[0]))) static struct atom { struct atom *link; int len; char *str; } *buckets[2048]; static unsigned long scatter[] = { 2078917053, 143302914, 1027100827, 1953210302, 755253631, 2002600785, 1405390230, 45248011, 1099951567, 433832350, 2018585307, 438263339, 813528929, 1703199216, 618906479, 573714703, 766270699, 275680090, 1510320440, 1583583926, 1723401032, 1965443329, 1098183682, 1636505764, 980071615, 1011597961, 643279273, 1315461275, 157584038, 1069844923, 471560540, 89017443, 1213147837, 1498661368, 2042227746, 1968401469, 1353778505, 1300134328, 2013649480, 306246424, 1733966678, 1884751139, 744509763, 400011959, 1440466707, 1363416242, 973726663, 59253759, 1639096332, 336563455, 1642837685, 1215013716, 154523136, 593537720, 704035832, 1134594751, 1605135681, 1347315106, 302572379, 1762719719, 269676381, 774132919, 1851737163, 1482824219, 125310639, 1746481261, 1303742040, 1479089144, 899131941, 1169907872, 1785335569, 485614972, 907175364, 382361684, 885626931, 200158423, 1745777927, 1859353594, 259412182, 1237390611, 48433401, 1902249868, 304920680, 202956538, 348303940, 1008956512, 1337551289, 1953439621, 208787970, 1640123668, 1568675693, 478464352, 266772940, 1272929208, 1961288571, 392083579, 871926821, 1117546963, 1871172724, 1771058762, 139971187, 1509024645, 109190086, 1047146551, 1891386329, 994817018, 1247304975, 1489680608, 706686964, 1506717157, 579587572, 755120366, 1261483377, 884508252, 958076904, 1609787317, 1893464764, 148144545, 1415743291, 2102252735, 1788268214, 836935336, 433233439, 2055041154, 2109864544, 247038362, 299641085, 834307717, 1364585325, 23330161, 457882831, 1504556512, 1532354806, 567072918, 404219416, 1276257488, 1561889936, 1651524391, 618454448, 121093252, 1010757900, 1198042020, 876213618, 124757630, 2082550272, 1834290522, 1734544947, 
1828531389, 1982435068, 1002804590, 1783300476, 1623219634, 1839739926, 69050267, 1530777140, 1802120822, 316088629, 1830418225, 488944891, 1680673954, 1853748387, 946827723, 1037746818, 1238619545, 1513900641, 1441966234, 367393385, 928306929, 946006977, 985847834, 1049400181, 1956764878, 36406206, 1925613800, 2081522508, 2118956479, 1612420674, 1668583807, 1800004220, 1447372094, 523904750, 1435821048, 923108080, 216161028, 1504871315, 306401572, 2018281851, 1820959944, 2136819798, 359743094, 1354150250, 1843084537, 1306570817, 244413420, 934220434, 672987810, 1686379655, 1301613820, 1601294739, 484902984, 139978006, 503211273, 294184214, 176384212, 281341425, 228223074, 147857043, 1893762099, 1896806882, 1947861263, 1193650546, 273227984, 1236198663, 2116758626, 489389012, 593586330, 275676551, 360187215, 267062626, 265012701, 719930310, 1621212876, 2108097238, 2026501127, 1865626297, 894834024, 552005290, 1404522304, 48964196, 5816381, 1889425288, 188942202, 509027654, 36125855, 365326415, 790369079, 264348929, 513183458, 536647531, 13672163, 313561074, 1730298077, 286900147, 1549759737, 1699573055, 776289160, 2143346068, 1975249606, 1136476375, 262925046, 92778659, 1856406685, 1884137923, 53392249, 1735424165, 1602280572 }; const char *Atom_new(const char *str, int len) { unsigned long h; int i; struct atom *p; assert(str); assert(len >= 0); for (h = 0, i = 0; i < len; i++) h = (h<<1) + scatter[(unsigned char)str[i]]; h &= NELEMS(buckets)-1; for (p = buckets[h]; p; p = p->link) if (len == p->len) { for (i = 0; i < len && p->str[i] == str[i]; ) i++; if (i == len) return p->str; } p = ALLOC(sizeof (*p) + len + 1); p->len = len; p->str = (char *)(p + 1); if (len > 0) memcpy(p->str, str, len); p->str[len] = '\0'; p->link = buckets[h]; buckets[h] = p;//insert atom in front of list return p->str; } at end of chapter , in exercises 3.1, the book's author said "Most texts recommend using a prime number for the size of buckets. 
Using a prime and a good hash function usually gives a better distribution of the lengths of the lists hanging off of buckets. Atom uses a power of two, which is sometimes explicitly cited as a bad choice. Write a program to generate or read, say, 10,000 typical strings and measure Atom_new's speed and the distribution of the lengths of the lists. Then change buckets so that it has 2,039 entries (the largest prime less than 2,048), and repeat the measurements. Does using a prime help? How much does your conclusion depend on your specific machine?" So I changed the hash table size to 2039, but it seems a prime number actually made a worse distribution of the lengths of the lists. I also tried 64 and 61; 61 made a worse distribution too. I just want to know why a prime table size makes a bad distribution here. Is it because the hash function used with Atom_new is a bad hash function? I am using this function to print out the lengths of the atom lists:

#define B_SIZE 2048

void Atom_print(void)
{
    int i, t;
    struct atom *atom;
    for (i = 0; i < B_SIZE; i++)
    {
        t = 0;
        for (atom = buckets[i]; atom; atom = atom->link)
        {
            ++t;
        }
        printf("%d ", t);
    }
}
Well, a long time ago I had to implement a hash table (in driver development), and I wondered about the same thing. Why the heck should I use a prime number? OTOH, a power of 2 seemed even better: instead of calculating the modulus, for a power of 2 you can use a bitwise AND. So I implemented such a hash table. The key was a pointer (returned by some 3rd-party function). Then, eventually, I noticed that only 1/4 of all the entries in my hash table were filled. That was because the hash function I used was the identity function, and it just so happened that all the returned pointers were multiples of 4. The idea of using prime numbers for the hash table size is the following: real-world hash functions do not produce equally distributed values; usually there is (or at least there may be) some dependency. So, in order to diffuse this distribution, it's recommended to use prime numbers. BTW, theoretically it may happen that occasionally the hash function produces numbers that are multiples of your chosen prime number, but the probability of this is lower than if it were not a prime number.
I think it's the code that selects the bucket. In the code you pasted it says:

h &= NELEMS(buckets)-1;

That works fine for sizes which are powers of two, since its final effect is choosing the low bits of h. For other sizes, NELEMS(buckets)-1 will have some bits equal to 0, and the bitwise & operator will discard those bits, effectively leaving "holes" in the bucket list. The general formula for bucket selection is:

h = h % NELEMS(buckets);
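To make the "holes" concrete, compare the two reductions for a non-power-of-two size such as 2039 (a small illustrative sketch; the function names are made up):

```c
/* With 2048 buckets, size-1 = 0x7ff, so the AND keeps the low 11 bits
 * of h and every bucket index is reachable. With 2039 buckets,
 * size-1 = 2038 = 0b11111110110: bits 0 and 3 are zero in the mask,
 * so h & (size-1) can never produce an index with either of those
 * bits set -- all such buckets stay permanently empty. */
unsigned pick_bucket_and(unsigned long h, unsigned size)
{
    return (unsigned)(h & (size - 1));   /* only valid for powers of two */
}

unsigned pick_bucket_mod(unsigned long h, unsigned size)
{
    return (unsigned)(h % size);         /* works for any size */
}
```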
This is what Julienne Walker from Eternally Confuzzled has to say about hash table sizes: When it comes to hash tables, the most recommended table size is any prime number. This recommendation is made because hashing in general is misunderstood, and poor hash functions require an extra mixing step of division by a prime to resemble a uniform distribution. Another reason that a prime table size is recommended is because several of the collision resolution methods require it to work. In reality, this is a generalization and is actually false (a power of two with odd step sizes will typically work just as well for most collision resolution strategies), but not many people consider the alternatives and in the world of hash tables, prime rules.
There's another factor at work here: the constant hashing values should all be odd/prime and widely dispersed. If you have an even number of units (characters, for instance) in the key to be hashed, then all-odd constants will give you an even initial hash value; for an odd number of units you'd get an odd number. I've done some experimenting with this, and just that 50/50% split was worth a lot in evening out the distribution. Of course, if all keys are equally long, this doesn't matter. The hashing also needs to ensure that you won't get the same initial hash value for "AAB" as for "ABA" or "BAA".
C Library for compressing sequential positive integers
I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:

uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };

This says that the first string is at position 0, the second at position 20, the third at position 500 and the nth at position 103434. The positions are always non-negative 64-bit integers in sequential order. Although the numbers could vary by any difference, in practice I expect the typical difference to be inside the range from 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions to be accessed randomly (assume a uniform distribution).

I was thinking about writing my own code for doing some sort of block delta encoding or another more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point, and maybe even settle for something without any customizations. Any hints? A C library would be ideal, but a C++ one would also allow me to run some initial benchmarks.

A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the library cmph (http://cmph.sf.net). In short, it is for a large disk-based read-only associative map with a small index in memory. Since it is a library, I don't have control over the input, but the typical use case that I want to optimize has hundreds of millions of values, an average value size in the few-kilobytes range, and a maximum value of 2^31.

For the record, if I don't find a library ready to use, I intend to implement delta encoding in blocks of 64 integers, with the initial bytes specifying the block offset so far. The blocks themselves would be indexed with a tree, giving me O(log (n/64)) access time.
There are way too many other options, and I would prefer not to discuss them. I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share with everyone what I did once I have it working. I appreciate your help, and let me know if you have any doubts.
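For reference, the per-integer part of the plan sketched above could be a LEB128-style varint over the deltas (this is only the inner encoding, not the 64-integer block layout or the tree index; all names are made up):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode v as a varint: 7 payload bits per byte, high bit = "more". */
size_t varint_put(uint8_t *out, uint64_t v)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Decode one varint, return the number of bytes consumed. */
size_t varint_get(const uint8_t *in, uint64_t *v)
{
    size_t n = 0;
    int shift = 0;
    *v = 0;
    do {
        *v |= (uint64_t)(in[n] & 0x7f) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    return n;
}

/* Delta-encode a sorted position array; the first position is stored
 * as a delta from zero. Returns the number of bytes written. */
size_t delta_encode(uint8_t *out, const uint64_t *pos, size_t count)
{
    size_t n = 0;
    uint64_t prev = 0;
    for (size_t i = 0; i < count; i++) {
        n += varint_put(out + n, pos[i] - prev);
        prev = pos[i];
    }
    return n;
}
```

With typical deltas between 2^8 and 2^20, each position costs two to three bytes instead of eight.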
I use FastBit (Kesheng Wu, LBL.GOV). It seems you need something good, fast and now, and FastBit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, BerkeleyDB). It's easy to set up and generally very good. However, given more time, you may want to look at a Gray-code solution; it seems optimal for your purposes. Daniel Lemire has a number of libraries for C/C++/Java released on code.google; I've read over some of his papers and they are quite nice: several advancements on FastBit, and alternative approaches for column re-ordering with permuted Gray codes. Almost forgot, I also came across Tokyo Cabinet. Though I do not think it is well suited for my current project, I might have considered it more had I known about it before ;). It has a large degree of interoperability: Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX. As you referred to CDB, the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) where it surpassed CDB by 10 times for read performance and 2 times for write. With respect to your delta-encoding requirement, I am quite confident in bsdiff and its ability to outperform any file.exe content-patching system. It may also have some fundamental interfaces for your general needs. Google's new binary compression application, Courgette, may be worth checking out, in case you missed the press release: 10x smaller diffs than bsdiff in the one test case I have seen published.
You have two conflicting requirements: you want to compress very small items (8 bytes each), and you need efficient random access to each item. The second requirement is very likely to impose a fixed length for each item.
What exactly are you trying to compress? If you are thinking about the total space of the index, is it really worth the effort to save the space? If so, one thing you could try is to chop the 64-bit space in half and store it in two tables: the first stores (upper uint, start index, length, pointer to second table) and the second stores (index, lower uint). For fast searching, the indices would be implemented using something like a B+ tree.
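The idea above exploits the fact that many sequential positions share the same upper 32 bits, so the upper half can be stored once per group instead of once per entry. A minimal sketch of that layout (the struct and field names are my assumptions, not an existing API):

```c
#include <stdint.h>
#include <stddef.h>

/* First table: one entry per distinct upper half. The group's entries
 * occupy lower[start .. start + count - 1] in the second table. */
typedef struct {
    uint32_t upper;  /* shared upper 32 bits of the positions */
    size_t   start;  /* first entry of this group in `lower` */
    size_t   count;  /* number of entries in the group */
} Group;

/* Reconstruct the full 64-bit position of entry i, given its group. */
static uint64_t position(const Group *g, const uint32_t *lower, size_t i)
{
    return ((uint64_t)g->upper << 32) | lower[i];
}
```

Since the positions are sorted, the group table is small and sorted too, so finding the right group is a binary search (or the B+ tree suggested above) before the O(1) reconstruction.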
I did something similar years ago for a full-text search engine. In my case, each indexed word generated a record consisting of a record number (document id) and a word number (it could just as easily have stored word offsets), which needed to be compressed as much as possible. I used a delta-compression technique which took advantage of the fact that there would be a number of occurrences of the same word within a document, so the record number often did not need to be repeated at all, and the word offset delta would often fit within one or two bytes.

Here is the code I used. Since it's in C++, it may not be useful to you as-is, but it can be a good starting point for writing compression routines. Please excuse the Hungarian notation and the magic numbers strewn within the code. Like I said, I wrote this many years ago :-)

IndexCompressor.h

```cpp
//
// index compressor class
//

#pragma once

#include "File.h"

const int IC_BUFFER_SIZE = 8192;

//
// index compressor
//
class IndexCompressor
{
private :
    File *m_pFile;
    WA_DWORD m_dwRecNo;
    WA_DWORD m_dwWordNo;
    WA_DWORD m_dwRecordCount;
    WA_DWORD m_dwHitCount;

    WA_BYTE m_byBuffer[IC_BUFFER_SIZE];
    WA_DWORD m_dwBytes;

    bool m_bDebugDump;

    void FlushBuffer(void);

public :
    IndexCompressor(void) { m_pFile = 0; m_bDebugDump = false; }
    ~IndexCompressor(void) {}

    void Attach(File& File) { m_pFile = &File; }

    void Begin(void);
    void Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo);
    void End(void);

    WA_DWORD GetRecordCount(void) { return m_dwRecordCount; }
    WA_DWORD GetHitCount(void) { return m_dwHitCount; }

    void DebugDump(void) { m_bDebugDump = true; }
};
```

IndexCompressor.cpp

```cpp
//
// index compressor class
//

#include "stdafx.h"
#include "IndexCompressor.h"

void IndexCompressor::FlushBuffer(void)
{
    ASSERT(m_pFile != 0);
    if (m_dwBytes > 0)
    {
        m_pFile->Write(m_byBuffer, m_dwBytes);
        m_dwBytes = 0;
    }
}

void IndexCompressor::Begin(void)
{
    ASSERT(m_pFile != 0);
    m_dwRecNo = m_dwWordNo = m_dwRecordCount = m_dwHitCount = 0;
    m_dwBytes = 0;
}

void IndexCompressor::Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo)
{
    ASSERT(m_pFile != 0);
    WA_BYTE buffer[16];
    int nbytes = 1;

    ASSERT(dwRecNo >= m_dwRecNo);

    if (dwRecNo != m_dwRecNo) m_dwWordNo = 0;
    if (m_dwRecordCount == 0 || dwRecNo != m_dwRecNo) ++m_dwRecordCount;
    ++m_dwHitCount;

    WA_DWORD dwRecNoDelta = dwRecNo - m_dwRecNo;
    WA_DWORD dwWordNoDelta = dwWordNo - m_dwWordNo;

    if (m_bDebugDump)
    {
        TRACE("%8X[%8X] %8X[%8X] : ", dwRecNo, dwRecNoDelta, dwWordNo, dwWordNoDelta);
    }

    // 1WWWWWWW
    if (dwRecNoDelta == 0 && dwWordNoDelta < 128)
    {
        buffer[0] = 0x80 | WA_BYTE(dwWordNoDelta);
    }
    // 01WWWWWW WWWWWWWW
    else if (dwRecNoDelta == 0 && dwWordNoDelta < 16384)
    {
        buffer[0] = 0x40 | WA_BYTE(dwWordNoDelta >> 8);
        buffer[1] = WA_BYTE(dwWordNoDelta & 0x00ff);
        nbytes += sizeof(WA_BYTE);
    }
    // 001RRRRR WWWWWWWW WWWWWWWW
    else if (dwRecNoDelta < 32 && dwWordNoDelta < 65536)
    {
        buffer[0] = 0x20 | WA_BYTE(dwRecNoDelta);
        WA_WORD *p = (WA_WORD *) (buffer+1);
        *p = WA_WORD(dwWordNoDelta);
        nbytes += sizeof(WA_WORD);
    }
    else
    {
        // 0001rrww
        buffer[0] = 0x10;

        // encode recno
        if (dwRecNoDelta < 256)
        {
            buffer[nbytes] = WA_BYTE(dwRecNoDelta);
            nbytes += sizeof(WA_BYTE);
        }
        else if (dwRecNoDelta < 65536)
        {
            buffer[0] |= 0x04;
            WA_WORD *p = (WA_WORD *) (buffer+nbytes);
            *p = WA_WORD(dwRecNoDelta);
            nbytes += sizeof(WA_WORD);
        }
        else
        {
            buffer[0] |= 0x08;
            WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
            *p = dwRecNoDelta;
            nbytes += sizeof(WA_DWORD);
        }

        // encode wordno
        if (dwWordNoDelta < 256)
        {
            buffer[nbytes] = WA_BYTE(dwWordNoDelta);
            nbytes += sizeof(WA_BYTE);
        }
        else if (dwWordNoDelta < 65536)
        {
            buffer[0] |= 0x01;
            WA_WORD *p = (WA_WORD *) (buffer+nbytes);
            *p = WA_WORD(dwWordNoDelta);
            nbytes += sizeof(WA_WORD);
        }
        else
        {
            buffer[0] |= 0x02;
            WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
            *p = dwWordNoDelta;
            nbytes += sizeof(WA_DWORD);
        }
    }

    // update current setting
    m_dwRecNo = dwRecNo;
    m_dwWordNo = dwWordNo;

    // add compressed data to buffer
    ASSERT(buffer[0] != 0);
    ASSERT(nbytes > 0 && nbytes < 10);
    if (m_dwBytes + nbytes > IC_BUFFER_SIZE) FlushBuffer();
    CopyMemory(m_byBuffer + m_dwBytes, buffer, nbytes);
    m_dwBytes += nbytes;

    if (m_bDebugDump)
    {
        for (int i = 0; i < nbytes; ++i) TRACE("%02X ", buffer[i]);
        TRACE("\n");
    }
}

void IndexCompressor::End(void)
{
    FlushBuffer();
    m_pFile->Write(WA_BYTE(0));
}
```
You've omitted critical information about the number of strings you intend to index. But given that you say you expect the minimum length of an indexed string to be 256, storing the indices as 64-bit integers incurs at most about 3% overhead. If the total length of the string file is less than 4GB, you could use 32-bit indices and incur about 1.5% overhead. These numbers suggest to me that if compression matters, you're better off compressing the strings, not the indices. For that problem a variation on LZ77 seems in order.

If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how you can do with zziplib. This probably won't be great, but it's nearly zero work on your part.

More data on the problem would be welcome:

- Number of strings
- Average length of a string
- Maximum length of a string
- Median length of strings
- Degree to which the strings file compresses with gzip
- Whether you are allowed to change the order of strings to improve compression

EDIT

The comment and revised question make the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding, group the deltas, and use a variable-length code within each group. I wouldn't wire in 64 as the group size; I think you will probably want to determine that empirically.

You asked for existing libraries. For the grouping and delta encoding I doubt you will find much. For variable-length integer codes, I'm not seeing much in the way of C libraries, but you can find variable-length codings in Perl and Python. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try: it can code unsigned integers up to 32 bits, and you can grab C code from Plan 9 and I'm sure many other sources.
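As one example of the simple variable-length codes mentioned above, here is a minimal LEB128-style varint (the base-128 scheme used by, e.g., Protocol Buffers; a UTF-8-style code would be structured similarly): seven payload bits per byte, with the high bit set on every byte except the last. The function names are illustrative, not from an existing library.

```c
#include <stdint.h>
#include <stddef.h>

/* Encode v into out (little-endian base 128); returns bytes written.
 * out must have room for up to 10 bytes for a 64-bit value. */
static size_t varint_encode(uint64_t v, uint8_t *out)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);  /* low 7 bits, continuation bit set */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;               /* final byte, high bit clear */
    return n;
}

/* Decode a varint from in into *v; returns bytes consumed. */
static size_t varint_decode(const uint8_t *in, uint64_t *v)
{
    size_t n = 0;
    int shift = 0;
    *v = 0;
    do {
        *v |= (uint64_t)(in[n] & 0x7F) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    return n;
}
```

With the gap distribution stated in the question (2^8 to 2^20), most deltas would occupy two or three bytes instead of eight.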
Are you running on Windows? If so, I recommend creating the mmap'ed file using the naive solution you originally proposed, and then compressing the file using NTFS compression. Your application code never knows the file is compressed, and the OS does the decompression for you. You might not think this would be very performant or get good compression, but I think you'll be surprised if you try it.