Decoding TIFF LZW codes not yet in the dictionary

I made a decoder for LZW-compressed TIFF images, and all the parts work: it can decode large images at various bit depths, with or without horizontal prediction, except in one case. It decodes files written by most programs (such as Photoshop and Krita with various encoding options) fine, but there's something very strange about files created by ImageMagick's convert: it produces LZW codes that aren't yet in the dictionary, and I don't know how to handle that.
Most of the time the 9- to 12-bit code in the LZW stream that isn't yet in the dictionary is the next one that my decoding algorithm will try to put in the dictionary (which I'm not sure should be a problem, although my algorithm fails on an image that contains such cases), but at times it can even be hundreds of codes into the future. In one case the first code after the clear code (256) is 364, which seems quite impossible given that the clear code clears my dictionary of all codes 258 and above; in another case the code is 501 when my dictionary only goes up to 317!
I have no idea how to deal with this, yet I seem to be the only one with the problem: the decoders in other programs load such images fine. So how do they do it?
Here's the core of my decoding algorithm. Obviously, given how much code is involved, I can't provide complete compilable code in a compact manner, but since this is a matter of algorithmic logic this should be enough. It closely follows the algorithm described in the official TIFF specification (page 61); in fact most of the spec's pseudocode is in the comments.
void tiff_lzw_decode(uint8_t *coded, buffer_t *dec)
{
    buffer_t word={0}, outstring={0};
    size_t coded_pos;   // position in bits
    int i, new_index, code, maxcode, bpc;
    buffer_t *dict={0};
    size_t dict_as=0;

    bpc = 9;            // starts with 9 bits per code, increases later
    tiff_lzw_calc_maxcode(bpc, &maxcode);
    new_index = 258;    // index at which new dict entries begin
    coded_pos = 0;      // bit position
    lzw_dict_init(&dict, &dict_as);

    while ((code = get_bits_in_stream(coded, coded_pos, bpc)) != 257)  // while ((Code = GetNextCode()) != EoiCode)
    {
        coded_pos += bpc;

        if (code >= new_index)
            printf("Out of range code %d (new_index %d)\n", code, new_index);

        if (code == 256)    // if (Code == ClearCode)
        {
            lzw_dict_init(&dict, &dict_as);   // InitializeTable();
            bpc = 9;
            tiff_lzw_calc_maxcode(bpc, &maxcode);
            new_index = 258;

            code = get_bits_in_stream(coded, coded_pos, bpc);  // Code = GetNextCode();
            coded_pos += bpc;

            if (code == 257)    // if (Code == EoiCode)
                break;

            append_buf(dec, &dict[code]);     // WriteString(StringFromCode(Code));
            clear_buf(&word);
            append_buf(&word, &dict[code]);   // OldCode = Code;
        }
        else if (code < 4096)
        {
            if (dict[code].len)    // if (IsInTable(Code))
            {
                append_buf(dec, &dict[code]);    // WriteString(StringFromCode(Code));

                lzw_add_to_dict(&dict, &dict_as, new_index, 0, word.buf, word.len, &bpc);
                lzw_add_to_dict(&dict, &dict_as, new_index, 1, dict[code].buf, 1, &bpc);  // AddStringToTable
                new_index++;
                tiff_lzw_calc_bpc(new_index, &bpc, &maxcode);

                clear_buf(&word);
                append_buf(&word, &dict[code]);  // OldCode = Code;
            }
            else
            {
                clear_buf(&outstring);
                append_buf(&outstring, &word);
                bufwrite(&outstring, word.buf, 1);  // OutString = StringFromCode(OldCode) + FirstChar(StringFromCode(OldCode));
                append_buf(dec, &outstring);        // WriteString(OutString);

                lzw_add_to_dict(&dict, &dict_as, new_index, 0, outstring.buf, outstring.len, &bpc);  // AddStringToTable
                new_index++;
                tiff_lzw_calc_bpc(new_index, &bpc, &maxcode);

                clear_buf(&word);
                append_buf(&word, &dict[code]);  // OldCode = Code;
            }
        }
    }

    free_buf(&word);
    free_buf(&outstring);
    for (i=0; i < dict_as; i++)
        free_buf(&dict[i]);
    free(dict);
}
As for the results my code produces in such situations, it's quite clear from the output that only those few codes are badly decoded: everything before and after is decoded properly, but in most cases the rest of the image after one of these mystery future codes is ruined, because the remaining decoded bytes end up shifted by a few places. That means my reading of the 9- to 12-bit code stream is correct, so I really do see a 364 code right after a 256 dictionary-clearing code.
Edit: Here's an example file that contains such weird codes. I've also found a small TIFF LZW loading library that suffers from the same problem: it crashes where my loader finds the first weird code in this image (code 3073 when the dictionary only goes up to 2051). The good thing is that since it's a small library you can test it with the following code:
#include "loadtiff.h"
#include "loadtiff.c"
void loadtiff_test(char *path)
{
int width, height, format;
floadtiff(fopen(path, "rb"), &width, &height, &format);
}
And if anyone insists on diving into my code (which should be unnecessary, and it's a big library) here's where to start.

The bogus codes come from trying to decode more than we're supposed to. The problem is that an LZW strip may sometimes not end with an End-of-Information (257) code, so the decoding loop also has to stop once a certain number of decoded bytes has been output. That number of bytes per strip is given by the TIFF tags as ROWSPERSTRIP * IMAGEWIDTH * BITSPERSAMPLE / 8, and if PLANARCONFIG is 1 (which means interleaved channels as opposed to planar), it is multiplied by SAMPLESPERPIXEL. So on top of stopping the decoding loop when a 257 code is encountered, the loop must also stop once that count of decoded bytes has been reached.
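A minimal sketch of that stopping condition, reusing names from the question's code where possible; rows_per_strip, image_width, bits_per_sample, planar_config and samples_per_pixel are assumed to hold the values of the corresponding TIFF tags, and dec->len is assumed to track the number of bytes decoded so far:

size_t expected = (size_t)rows_per_strip * image_width * bits_per_sample / 8;
if (planar_config == 1)        /* chunky configuration: channels interleaved */
    expected *= samples_per_pixel;
/* note: the last strip of an image may cover fewer rows than ROWSPERSTRIP,
   so the expected count for that strip has to be reduced accordingly */

while (dec->len < expected &&
       (code = get_bits_in_stream(coded, coded_pos, bpc)) != 257)
{
    /* ... decode exactly as before ... */
}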

Related

Reading serial port faster

I have computer software that sends RGB color codes to an Arduino over USB. It works fine when they are sent slowly, but when tens of them are sent every second it freaks out. What I think happens is that the Arduino serial buffer fills up so quickly that the processor can't handle it the way I'm reading it.
#define INPUT_SIZE 11

void loop() {
    if (Serial.available()) {
        char input[INPUT_SIZE + 1];
        byte size = Serial.readBytes(input, INPUT_SIZE);
        input[size] = 0;

        int channelNumber = 0;
        char* channel = strtok(input, " ");
        while (channel != 0) {
            color[channelNumber] = atoi(channel);
            channel = strtok(0, " ");
            channelNumber++;
        }
        setColor(color);
    }
}
For example the computer might send 255 0 123, where the numbers are separated by spaces. This works fine when the sending interval is slow enough or the buffer always contains exactly one color code, for example 255 255 255, which is 11 bytes (INPUT_SIZE). However, if a color code is shorter than 11 bytes and a second code is sent immediately, the code still reads 11 bytes from the serial buffer, combining parts of the two colors and messing them up. How do I avoid this but keep it as efficient as possible?
It is not a matter of reading the serial port faster; it is a matter of not reading a fixed block of 11 characters when the input data has variable length.
You are telling it to read until 11 characters are received or the timeout occurs, but if the first group is fewer than 11 characters and a second group follows immediately, there will be no timeout, and you will partially read the second group. You seem to understand that, so I am not sure how you conclude that "reading faster" will help.
Using your existing data encoding of ASCII decimal, space-delimited triplets, one solution would be to read the input one character at a time until the entire triplet is read; however, you could more simply use the Arduino readBytesUntil() function:
#define INPUT_SIZE 3

void loop()
{
    if (Serial.available())
    {
        char rgb_str[3][INPUT_SIZE+1] = {{0},{0},{0}};

        Serial.readBytesUntil( ' ', rgb_str[0], INPUT_SIZE );
        Serial.readBytesUntil( ' ', rgb_str[1], INPUT_SIZE );
        Serial.readBytesUntil( ' ', rgb_str[2], INPUT_SIZE );

        for( int channelNumber = 0; channelNumber < 3; channelNumber++)
        {
            color[channelNumber] = atoi(rgb_str[channelNumber]);
        }
        setColor(color);
    }
}
Note that this solution does not require the somewhat heavyweight strtok() processing since the Stream class has done the delimiting work for you.
However, there is a simpler and even more efficient solution. In your scheme you are sending ASCII decimal strings and then requiring the Arduino to spend CPU cycles needlessly extracting the fields and converting them to integer values, when you could simply send the byte values directly, leaving the vastly more powerful PC to do any processing needed to pack the data that way. Then the code might simply be:
void loop()
{
    if( Serial.available() )
    {
        for( int channelNumber = 0; channelNumber < 3; channelNumber++)
        {
            color[channelNumber] = Serial.read();
        }
        setColor(color);
    }
}
Note that I have not tested any of the above code, and the Arduino documentation is lacking in some cases, for example with respect to descriptions of return values. You may need to tweak the code somewhat.
Neither of the above solves the synchronisation problem, i.e. when the colour values are streaming, how do you know which byte is the start of an RGB triplet? You have to rely on getting the first field value and maintaining count and sync thereafter, which is fine until, say, the Arduino is started after the data stream starts, or is reset, or the PC process is terminated and restarted asynchronously. However, that was a problem with your original implementation too, so perhaps it is a problem to be dealt with elsewhere.
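One common way to regain sync, not part of either answer above but a frequently used pattern, is to reserve a start-marker value that can never appear as channel data and re-align on it. A rough Arduino-style sketch; the 0xFF marker is an assumption, and the color/setColor names are carried over from the question:

// Hypothetical framing: the PC sends 0xFF followed by three channel bytes,
// each clamped to 0..254 so that 0xFF never appears as data.
void loop()
{
    if (Serial.available() >= 4)        // marker + three channel bytes buffered
    {
        if (Serial.read() != 0xFF)      // not a marker: drop one byte and re-align
            return;

        for (int channelNumber = 0; channelNumber < 3; channelNumber++)
        {
            color[channelNumber] = Serial.read();
        }
        setColor(color);
    }
}

If the stream ever gets out of step, the marker check discards bytes one at a time until a frame boundary is found again.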
First of all, I agree with @Thomas Padron-McCarthy: sending a character string instead of a byte array (11 bytes instead of 3 bytes, plus the parsing work) would simply be a waste of resources. On the other hand, the approach you should follow depends on your sender:
Is it periodic or not
Is it fixed size or not
If it's periodic, you can check for new messages at the message period. If not, you need to check for messages before the buffer fills up.
If you think printable encoding is somehow not suitable for you: in any case I would add a checksum to the message. Let's say you have a fixed-size message structure:
typedef struct MyMessage
{
    // unsigned char id;     // id of a message maybe?
    unsigned char colors[3]; // or unsigned char r,g,b; maybe
    unsigned char checksum;  // more than one byte could be a more powerful checksum
} MyMessage;
unsigned char calcCheckSum(struct MyMessage msg)
{
    //...
}

unsigned int validateCheckSum(struct MyMessage msg)
{
    //...
    if (valid)
        return 1;
    else
        return 0;
}
Now you should check every 4 bytes (the size of MyMessage), in a sliding-window fashion, to see whether it forms a valid message or not:
void findMessages( )
{
    struct MyMessage *msg;
    byte size = Serial.readBytes(input, INPUT_SIZE); // input[] as in the question
    byte msgSize = sizeof(struct MyMessage);

    for (int i = 0; i + msgSize <= size; i++)
    {
        msg = (struct MyMessage *) &input[i];
        if (validateCheckSum(*msg))
        {   // found a message
            processMessage(msg);
        }
        else
        {
            // discard this byte, it's part of a corrupted msg (you may be too late to process that one)
        }
    }
}
If it's not a fixed size, it gets complicated. But I'm guessing you don't need to hear about that for this case.
EDIT (2)
I've struck out this edit following the comments.
One last thing: I would use a circular buffer. First add the received bytes into the buffer, then check the bytes in that buffer.
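A minimal ring-buffer sketch of that idea; the size and names are illustrative, and the consumer side would run the sliding-window check shown above over the buffered bytes:

#define RB_SIZE 64                      // must be a power of two for the mask trick

static unsigned char rb[RB_SIZE];
static unsigned int rb_head = 0;        // next write position (free-running counter)
static unsigned int rb_tail = 0;        // next read position

static void rb_put(unsigned char c)
{
    rb[rb_head & (RB_SIZE - 1)] = c;
    rb_head++;
    if (rb_head - rb_tail > RB_SIZE)    // overflow: drop the oldest byte
        rb_tail = rb_head - RB_SIZE;
}

static int rb_get(unsigned char *c)     // returns 1 if a byte was available
{
    if (rb_head == rb_tail)
        return 0;
    *c = rb[rb_tail & (RB_SIZE - 1)];
    rb_tail++;
    return 1;
}

// Producer side: drain the serial port into the ring buffer as bytes arrive.
void pollSerial()
{
    while (Serial.available())
        rb_put((unsigned char) Serial.read());
}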
EDIT (3)
I gave some thought to the comments, and I see the point of printable encoded messages. I guess my bias comes from working in a military company: we don't have printable-encoded "fire" arguments here :) There are a lot of messages coming and going all the time, and decoding/encoding printable messages would be a waste of time. Also, we use hardware which usually has very small messages with bitfields. I accept that a printable message can be easier to examine and understand.
Hope it helps,
Gokhan.
If faster is really what you want... this is a little far-fetched.
The fastest way I can think of to meet your needs and provide synchronization is to send one byte per color and change the parity bit in a defined way, assuming you can read the parity and the byte value of a character received with the wrong parity.
You will have to deal with the changing parity, and most of the characters will not be human-readable, but it's got to be one of the fastest ways to send three bytes of data.

GNU Fortran unformatted file record markers stored 64 bits wide?

I have a legacy code and some unformatted data files that it reads, and it worked with gnu-4.1.2. I don't have access to the method that originally generated these data files. When I compile this code with a newer gnu compiler (gnu-4.7.2) and attempt to load the old data files on a different computer, it is having difficulty reading them. I start by opening the file and reading in the first record which consists of three 32-bit integers:
open(unit, file='data.bin', form='unformatted', status='old')
read(unit) x,y,z
I am expecting these three integers here to describe x,y,z spans so that next it can load a 3D matrix of float values with those same dimensions. However, instead it's loading a 0 for the first value, then the next two are offset.
Expecting:
x=26, y=127, z=97 (1A, 7F, 61 in hex)
Loaded:
x=0, y=26, z=127 (0, 1A, 7F in hex)
When I checked the data file in a hex editor, I think I figured out what was happening.
The first record marker in this case has a value of 12 (0C in hex), since the record holds three integers at 4 bytes each. The marker is stored both before and after the record. However, I notice that the 32 bits immediately after each record marker are 00000000. So either the record markers are treated as 64-bit integers (little-endian) or there is 32 bits of zero padding after each record marker. Either way, the code generated with the new compiler reads the record markers as 32-bit integers and does not expect any padding, which effectively corrupts the data being read in.
Is there an easy way to fix this non-portable issue? The old and new hardware are 64 bit architecture and so is the executable I compiled. If I try to use the older compiler version again will it solve the problem, or is it hardware dependent? I'd prefer to use the newer compilers because they are more efficient, and I really don't want to edit the source code to open all the files as access='stream' and manually read in a trailing 0 integer after each record marker, both before and after each record.
P.S. I could probably write a C++ code to alter the data files and remove these zero paddings if there is no easier alternative.
See the -frecord-marker= option in the gfortran manual. With -frecord-marker=8 you can read the old style unformatted sequential files produced by older versions of gfortran.
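For example, recompiling the unchanged reader with something along the lines of gfortran -frecord-marker=8 legacy_reader.f90 -o legacy_reader (source and output names here are just placeholders) makes the generated unformatted I/O expect the old 8-byte record markers, so the read(unit) x,y,z statement works against the old files as-is.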
Seeing as Fortran doesn't standardize this, I opted to convert the data files to a new format that uses 32-bit-wide record lengths instead of 64-bit-wide ones. In case anyone needs to do this in the future, I've included some Visual C++ code here that worked for me and should be easily modifiable to C or another language. I have also uploaded a Windows executable (fortrec.zip) here.
CFile OldFortFile, OutFile;
const int BUFLEN = 1024*20;
char pbuf[BUFLEN];
int i, iIn, iRecLen, iRecLen2, iLen, iRead, iError = 0;
CString strInDir = "C:\\folder\\";
CString strIn = "file.dat";
CString strOutDir = "C:\\folder\\fortnew\\";
CString strOut;

system("mkdir \"" + strOutDir + "\""); // create a subdir to hold the output files
strOut = strOutDir + strIn;
strIn = strInDir + strIn;

if (OldFortFile.Open(strIn, CFile::modeRead|CFile::typeBinary)) {
    if (OutFile.Open(strOut, CFile::modeCreate|CFile::modeWrite|CFile::typeBinary)) {
        while (true) {
            iRead = OldFortFile.Read(&iRecLen, sizeof(iRecLen)); // read the leading record marker
            if (iRead < sizeof(iRecLen)) // end of file reached
                break;
            OutFile.Write(&iRecLen, sizeof(iRecLen)); // write the record marker to the output
            OldFortFile.Read(&iIn, sizeof(iIn));
            if (iIn != 0) { // this is the padding we need to drop; ensure it's always zero
                // Padding not found
                iError++;
                break;
            }
            i = iRecLen;
            while (i > 0) { // copy the record's payload through a bounded buffer
                iLen = (i > BUFLEN) ? BUFLEN : i;
                OldFortFile.Read(&pbuf[0], iLen);
                OutFile.Write(&pbuf[0], iLen);
                i -= iLen;
            }
            if (i != 0) { // buffer length mismatch
                iError++;
                break;
            }
            OldFortFile.Read(&iRecLen2, sizeof(iRecLen2));
            if (iRecLen != iRecLen2) { // ensure we have reached the end of the record properly
                // Record length mismatch
                iError++;
                break;
            }
            OutFile.Write(&iRecLen2, sizeof(iRecLen2));
            OldFortFile.Read(&iIn, sizeof(iIn));
            if (iIn != 0) { // this is the padding we need to drop; ensure it's always zero
                // Padding not found
                iError++;
                break;
            }
        }
        OutFile.Close();
        OldFortFile.Close();
    }
    else { // could not create the output file
        OldFortFile.Close();
        return;
    }
}
else { // could not open the input file
}

if (iError == 0) {
    // File successfully converted
}
else {
    // Encountered an error
}

Create an array of values from different text files in C

I'm working in C on 64-bit Ubuntu 14.04.
I have a number of .txt files, each containing lines of floating point values (1 value per line). The lines represent parts of a complex sample, and they're stored as real(a1) \n imag(a1) \n real(a2) \n imag(a2), if that makes sense.
In a specific scenario there are 4 text files each containing 32768 samples (thus 65536 values), but I need to make the final version dynamic to accommodate up to 32 files (the maximum samples per file would not exceed 32768 though). I'll only be reading the first 19800 samples (depending on other things) though, since the entire signal is contained in those 39600 points (19800 samples).
A common abstraction is to represent the files / samples as a matrix, where columns represent return signals and rows represent the value of each signal at a sampling instant, up until the maximum duration.
What I'm trying to do is take the first sample from each return signal and move it into an array of double-precision floating point values to do some work on, move on to the second sample for each signal (which will overwrite the previous array) and do some work on them, and so forth, until the last row of samples have been processed.
Is there a way I can dynamically open a file for each signal (depending on the number of pulses I'm using in that particular instance), read the first sample from each file into a buffer and ship that off to be processed? On the next iteration the file pointers would all be aligned to the second sample; it would then move those into an array and ship it off again, until the desired number of samples (19800 in our hypothetical case) has been reached.
I can read samples just fine from the files using fscanf:
rx_length = 19800;
int x;
float buf;
double *range_samples = calloc(num_pulses, 2 * sizeof(*range_samples));

for (i = 0; i < 2 * rx_length; i++){
    x = fscanf(pulse_file, "%f", &buf);
    *(range_samples) = buf;
}
All that needs to happen (in my mind) is that I need to cycle through both sample # and pulse # (in that order), so when finished with one pulse it would move on to the next set of samples for the next pulse, and so forth. What I don't know how to do is declare file pointers for all the return-signal files when their number can vary between calls (e.g. do the whole thing for 4 pulses, and on the next call it could be 16 or 64).
If there are any ideas / comments / suggestions I would love to hear them.
Thanks.
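Since the core difficulty is holding a variable number of open files at once, here is a minimal sketch of just that part; num_pulses and the file-name pattern are illustrative assumptions, not taken from the question:

#include <stdio.h>
#include <stdlib.h>

/* Open one file per pulse; num_pulses is only known at run time. */
FILE **open_pulse_files(int num_pulses)
{
    FILE **fp = malloc(num_pulses * sizeof *fp);
    char name[64];

    for (int p = 0; p < num_pulses; p++) {
        snprintf(name, sizeof name, "pulse_%02d.txt", p); /* hypothetical naming scheme */
        fp[p] = fopen(name, "r");
        if (fp[p] == NULL) {
            /* handle the error: close what was opened so far, free, report, ... */
        }
    }
    return fp;
}

/* For each sample index, read one complex value from every pulse file;
 * the file positions advance together, which gives the row-at-a-time
 * access described in the question. */
int read_sample_row(FILE **fp, int num_pulses, double *row /* 2*num_pulses values */)
{
    for (int p = 0; p < num_pulses; p++) {
        if (fscanf(fp[p], "%lf %lf", &row[2*p], &row[2*p + 1]) != 2)
            return -1; /* short file or parse error */
    }
    return 0;
}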
I would make the code you posted a function that takes an array of file names as an argument:
void doPulse( const char **file_names, const int size )
{
    FILE *file = 0;
    // declare your other variables

    for ( int i = 0; i < size; ++i )
    {
        file = fopen( file_names[i], "r" );
        // make sure the file is open
        // do the work on that file
        fclose( file );
        file = 0;
    }
}
What you need is a generator. It would be reasonably easy in C++, but as you tagged C, imagine a function taking a custom struct (the state of the object) as a parameter. It could be something like this (pseudo-code):
typedef struct {
    char **files;
    int filesIndex;
    FILE *currentFile;
} GtorState;

void gtorInit(GtorState *state, char **files) {
    // load the array of files into state, set the index to 0, and open the first file
}

int nextValue(GtorState *state, double *real, double *imag) {
    // read 2 values from currentFile and assign them to real and imag
    // if eof, close currentFile and open files[++filesIndex]
    // return 0 if real and imag were found, 1 on eof of the last file, 2 on error
}
Then your main program could contain:
GtorState state;
// initialize the list of files to process
gtorInit(&state, files);

double real, imag;
int cr;

while (0 == (cr = nextValue(&state, &real, &imag))) {
    // process (real, imag)
}
if (cr == 2) {
    // process (at least display) the error
}
Alternatively, your main program could iterate over the values of the different files itself and call a processing function carrying a state struct analogous to the generator above, then at the end use the state of that processing function to get the results.
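A rough sketch of that inverted variant, where the processing function keeps the running state; the struct fields here are just an example:

/* Hypothetical accumulator state for the inverted approach. */
typedef struct {
    double sum_power;   /* e.g. running sum of |sample|^2 */
    long   count;       /* number of samples folded in so far */
} ProcState;

void processValue(ProcState *st, double real, double imag)
{
    st->sum_power += real * real + imag * imag;
    st->count++;
}

/* The main loop owns the file iteration and just feeds values in:
 *
 *   ProcState st = {0};
 *   for each file, for each (real, imag) pair read with fscanf:
 *       processValue(&st, real, imag);
 *   // afterwards, inspect st.sum_power / st.count, etc.
 */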
I tried a slightly different approach and it's working really well.
Instead of reading from the different files each time I want to do something, I read the entire contents of each file into a 2D array range_phase_data[sample_number][pulse_number], and then access different parts of the array depending on which range bin I'm currently working on.
Here's an excerpt:
#define REAL(z,i) ((z)[2*(i)])
#define IMAG(z,i) ((z)[2*(i)+1])

for (i = 0; i < rx_length; i++){
    printf("\t[%s] Range bin %i. Samples %i to %i.\n", __FUNCTION__, i, 2*i, 2*i+1);
    for (j = 0; j < num_pulses; j++){
        REAL(fft_buf, j) = range_phase_data[2*i][j];
        IMAG(fft_buf, j) = range_phase_data[2*i+1][j];
    }
    printf("\t[%s] Range bin %i done, ready to FFT.\n", __FUNCTION__, i);
    // do stuff with the data
}
This removes the need to dynamically allocate file pointers and instead just opens the files one at a time, writing the data to the corresponding column in the matrix.
Cheers.

C on embedded system w/ linux kernel - mysterious adc read issue

I'm developing on an AD Blackfin BF537 DSP running uClinux. I have a total of 32MB SD-RAM available. I have an ADC attached, which I can access using a simple, blocking call to read().
The most interesting part of my code is below. Running the program seems to work just fine: I get a nice data package that I can fetch from the SD card and plot. However, if I comment out the float-calculation part (as noted in the code), I get only zeroes in the ft_all.raw file. The same occurs if I change the optimization level from -O3 to -O0.
I've tried countless combinations of all sorts of things, and sometimes it works and sometimes it does not; earlier (with minor modifications to the code below), it would only work when optimization was disabled. It may also break if I add something else further down in the file.
My suspicion is that the data transferred by the read() call may not have been transferred fully (is that possible, even though it returns the correct number of bytes?). This is also the first time I initialize pointers using direct memory addresses, and I have no idea how the compiler reacts to that; perhaps I missed something here?
I've spent days on this issue now, and I'm getting desperate - I would really appreciate some help on this one! Thanks in advance.
// Clear the top 16M memory for data processing
memset((int *)0x01000000, 0x0000, (size_t)SIZE_16M);

/* Prep some pointers for data processing */
int16_t *buffer;
int16_t *buf16I, *buf16Q;
buffer = (int16_t *)(0x1000000);
buf16I = (int16_t *)(0x1600000);
buf16Q = (int16_t *)(0x1680000);

/* Read data from ADC */
int rbytes = read(Sportfd, (int16_t*)buffer, 0x200000);
if (rbytes != 0x200000) {
    printf("could not sample data! %X\n", rbytes);
    goto end;
} else {
    printf("Read %X bytes\n", rbytes);
}

FILE *outfd;
int wbytes;

/* Commenting this region results in all zeroes in ft_all.raw */
float a, b;
int c;
b = 0;
for (c = 0; c < 1000; c++) {
    a = c;
    b = b + pow(a, 3);
}
printf("b is %.2f\n", b);

/* Only 12 LSBs of each 32-bit word is actual data.
 * First 20 bits of nothing, then 12 bits I, then 20 bits
 * nothing, then 12 bits Q, etc...
 * Below, the I and Q parts are scaled with a factor of 16
 * and extracted to buf16I and buf16Q.
 * */
int32_t *buf32;
buf32 = (int32_t *)buffer;
uint32_t i = 0;
uint32_t n = 0;
while (n < 0x80000) {
    buf16I[i] = buf32[n] << 4;
    n++;
    buf16Q[i] = buf32[n] << 4;
    i++;
    n++;
}

printf("Saving to /mnt/sd/d/ft_all.raw...");
outfd = fopen("/mnt/sd/d/ft_all.raw", "w+");
if (outfd == NULL) {
    printf("Could not open file.\n");
}
wbytes = fwrite((int*)0x1600000, 1, 0x100000, outfd);
fclose(outfd);
if (wbytes < 0x100000) {
    printf("wbytes not correct (= %d) \n", (int)wbytes);
}
printf(" done.\n");
Edit: The code seems to work perfectly well if I use read() to read data from a simple file rather than the ADC. This leads me to believe that the rather hacky-looking code that extracts the I and Q parts of the input is working as intended. Inspecting the assembly generated by the compiler confirms this.
I'm trying to get in touch with the developer of the ADC driver to see if he has an explanation of this behaviour.
The ADC is connected through a SPORT, and is opened as such:
sportfd = open("/dev/sport1", O_RDWR);
ioctl(sportfd, SPORT_IOC_CONFIG, spconf);
And here are the options used when configuring the SPORT:
spconf->int_clk = 1;
spconf->word_len = 32;
spconf->serial_clk = SPORT_CLK;
spconf->fsync_clk = SPORT_CLK/34;
spconf->fsync = 1;
spconf->late_fsync = 1;
spconf->act_low = 1;
spconf->dma_enabled = 1;
spconf->tckfe = 0;
spconf->rckfe = 1;
spconf->txse = 0;
spconf->rxse = 1;
A bfin_sport.h file from Analog Devices is also included: https://gist.github.com/tausen/5516954
Update
After a long night of debugging with the previous developer on the project, it turned out the issue was not related to the code shown above at all. As Chris suggested, it was indeed an issue with the SPORT driver and the ADC configuration.
While debugging, this error message appeared whenever the data was "broken": bfin_sport: sport ffc00900 status error: TUVF. While that doesn't mean much in the application itself, it was clear from printing the data that something was out of sync: whenever the status error was shown, the data in buffer was of the form 0x12000000, 0x34000000, ... rather than 0x00000012, 0x00000034, ... It is then clear why buf16I and buf16Q contained only zeroes (since I am extracting the 12 LSBs).
Putting in a few calls to usleep() between stages of ADC initialization and configuration seems to have fixed the issue - I'm hoping it stays that way!
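For reference, the workaround amounts to something like the following sketch; the delay lengths and their placement are just what happened to work, not values taken from any documentation:

#include <unistd.h>   /* usleep */

/* same open/ioctl sequence as in the question, with settling delays added */
sportfd = open("/dev/sport1", O_RDWR);
usleep(10000);                              /* let the SPORT settle before configuring */

ioctl(sportfd, SPORT_IOC_CONFIG, spconf);
usleep(10000);                              /* and again before the first read() */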

C Library for compressing sequential positive integers

I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:
uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };
Which says that the first string is at position 0, the second at position 20, the third at position 500 and the nth at position 103434.
The positions are always non-negative 64 bits integers in sequential order. Although the numbers could vary by any difference, in practice I expect the typical difference to be inside the range from 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions will be accessed randomly (assume uniform distribution).
I was thinking about writing my own code for doing some sort of block delta encoding or other more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point and maybe even settle for something without any customizations.
Any hints? A C library would be ideal, but a C++ one would also allow me to run some initial benchmarks.
A few more details if you are still following: this will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the library cmph (http://cmph.sf.net). In short, it is for a large disk-based, read-only associative map with a small index in memory.
Since it is a library, I don't have control over the input, but the typical use case that I want to optimize has hundreds of millions of values, an average value size in the few-kilobytes range, and a maximum value of 2^31.
For the record, if I don't find a library ready to use I intend to implement delta encoding in blocks of 64 integers, with the initial bytes specifying the block offset so far. The blocks themselves would be indexed with a tree, giving me O(log (n/64)) access time. There are way too many other options and I would prefer not to discuss them. I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share with everyone what I did once I have it working.
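To make that plan concrete, here is a rough sketch of the block layout it describes; the 64-entry block size is the question's own choice, while the struct names and the fixed-width deltas are illustrative simplifications:

#include <stdint.h>

#define BLOCK_ENTRIES 64

/* One block: an absolute 64-bit base offset followed by the deltas.
 * The deltas are stored as fixed 32-bit values here for simplicity;
 * a real implementation would use a variable-length code for them. */
typedef struct {
    uint64_t base;                  /* offset of the first string in the block */
    uint32_t delta[BLOCK_ENTRIES];  /* delta[k] = offset(k+1) - offset(k) */
} IndexBlock;

/* Recover the absolute offset of entry i from its block. */
static uint64_t entry_offset(const IndexBlock *blocks, uint64_t i)
{
    const IndexBlock *b = &blocks[i / BLOCK_ENTRIES];
    uint64_t off = b->base;
    for (unsigned k = 0; k < i % BLOCK_ENTRIES; k++)
        off += b->delta[k];
    return off;
}

With fixed-size blocks like this, the block for entry i is simply i / 64; the tree index mentioned above only becomes necessary once the encoded deltas, and hence the blocks, are variable-length.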
I appreciate your help and let me know if you have any doubts.
I use FastBit (Kesheng Wu, LBL.GOV); it seems you need something good, fast and now, and FastBit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, BerkeleyDB). It's easy to set up and generally very good.
However, given more time, you may want to look at a Gray-code solution; it seems optimal for your purposes.
Daniel Lemire has a number of libraries for C/C++/Java released on code.google; I've read over some of his papers and they are quite nice: several advancements on FastBit, plus alternative approaches for column re-ordering with permuted Gray codes.
Almost forgot: I also came across Tokyo Cabinet. Though I do not think it will be well suited for my current project, I might have considered it more if I had known about it before ;). It has a large degree of interoperability:
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX.
As you referred to CDB, the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) where it surpassed CDB by 10 times for read performance and 2 times for write performance.
With respect to your delta-encoding requirement, I am quite confident in bsdiff and its ability to out-perform any file.exe content-patching system; it may also have some fundamental interfaces for your general needs.
Google's new binary compression application, Courgette, may be worth checking out; in case you missed the press release, it produced diffs 10x smaller than bsdiff's in the one test case I have seen published.
You have two conflicting requirements:
You want to compress very small items (8 bytes each).
You need efficient random access for each item.
The second requirement is very likely to impose a fixed length for each item.
What exactly are you trying to compress? If you are thinking about the total space of the index, is it really worth the effort to save that space?
If so, one thing you could try is to chop the space in half and store it in two tables: the first stores (upper uint, start index, length, pointer to second table) and the second stores (index, lower uint).
For fast searching, the indices would be implemented using something like a B+ tree.
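A rough sketch of the data layout that suggestion describes; the field names are illustrative and the lookup logic is only hinted at:

#include <stdint.h>

/* Upper-half table: one entry per distinct upper 32 bits. */
typedef struct {
    uint32_t upper;   /* shared upper 32 bits of a run of positions */
    uint32_t start;   /* index of the first matching entry in the lower table */
    uint32_t count;   /* how many consecutive lower-table entries share it */
} UpperEntry;

/* Lower-half table: the lower 32 bits of every position, in order. */
typedef struct {
    uint32_t lower;
} LowerEntry;

/* Position i is reconstructed by finding the UpperEntry whose
 * [start, start + count) range contains i (e.g. with a binary search
 * or a B+ tree over start) and combining the two halves. */
static uint64_t position(const UpperEntry *u, const LowerEntry *lo, uint64_t i)
{
    /* u is assumed to already be the matching upper entry for i */
    return ((uint64_t)u->upper << 32) | lo[i].lower;
}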
I did something similar years ago for a full-text search engine. In my case, each indexed word generated a record which consisted of a record number (document id) and a word number (it could just as easily have stored word offsets) which needed to be compressed as much as possible. I used a delta-compression technique which took advantage of the fact that there would be a number of occurrences of the same word within a document, so the record number often did not need to be repeated at all. And the word offset delta would often fit within one or two bytes. Here is the code I used.
Since it's in C++, the code is not going to be useful to you as is, but it can be a good starting point for writing compression routines.
Please excuse the Hungarian notation and the magic numbers strewn within the code. Like I said, I wrote this many years ago :-)
IndexCompressor.h
//
// index compressor class
//
#pragma once

#include "File.h"

const int IC_BUFFER_SIZE = 8192;

//
// index compressor
//
class IndexCompressor
{
private :
    File *m_pFile;
    WA_DWORD m_dwRecNo;
    WA_DWORD m_dwWordNo;
    WA_DWORD m_dwRecordCount;
    WA_DWORD m_dwHitCount;

    WA_BYTE m_byBuffer[IC_BUFFER_SIZE];
    WA_DWORD m_dwBytes;

    bool m_bDebugDump;

    void FlushBuffer(void);

public :
    IndexCompressor(void) { m_pFile = 0; m_bDebugDump = false; }
    ~IndexCompressor(void) {}

    void Attach(File& File) { m_pFile = &File; }

    void Begin(void);
    void Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo);
    void End(void);

    WA_DWORD GetRecordCount(void) { return m_dwRecordCount; }
    WA_DWORD GetHitCount(void) { return m_dwHitCount; }

    void DebugDump(void) { m_bDebugDump = true; }
};
IndexCompressor.cpp
//
// index compressor class
//
#include "stdafx.h"
#include "IndexCompressor.h"

void IndexCompressor::FlushBuffer(void)
{
    ASSERT(m_pFile != 0);

    if (m_dwBytes > 0)
    {
        m_pFile->Write(m_byBuffer, m_dwBytes);
        m_dwBytes = 0;
    }
}

void IndexCompressor::Begin(void)
{
    ASSERT(m_pFile != 0);
    m_dwRecNo = m_dwWordNo = m_dwRecordCount = m_dwHitCount = 0;
    m_dwBytes = 0;
}

void IndexCompressor::Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo)
{
    ASSERT(m_pFile != 0);
    WA_BYTE buffer[16];
    int nbytes = 1;

    ASSERT(dwRecNo >= m_dwRecNo);

    if (dwRecNo != m_dwRecNo)
        m_dwWordNo = 0;
    if (m_dwRecordCount == 0 || dwRecNo != m_dwRecNo)
        ++m_dwRecordCount;
    ++m_dwHitCount;

    WA_DWORD dwRecNoDelta = dwRecNo - m_dwRecNo;
    WA_DWORD dwWordNoDelta = dwWordNo - m_dwWordNo;

    if (m_bDebugDump)
    {
        TRACE("%8X[%8X] %8X[%8X] : ", dwRecNo, dwRecNoDelta, dwWordNo, dwWordNoDelta);
    }

    // 1WWWWWWW
    if (dwRecNoDelta == 0 && dwWordNoDelta < 128)
    {
        buffer[0] = 0x80 | WA_BYTE(dwWordNoDelta);
    }
    // 01WWWWWW WWWWWWWW
    else if (dwRecNoDelta == 0 && dwWordNoDelta < 16384)
    {
        buffer[0] = 0x40 | WA_BYTE(dwWordNoDelta >> 8);
        buffer[1] = WA_BYTE(dwWordNoDelta & 0x00ff);
        nbytes += sizeof(WA_BYTE);
    }
    // 001RRRRR WWWWWWWW WWWWWWWW
    else if (dwRecNoDelta < 32 && dwWordNoDelta < 65536)
    {
        buffer[0] = 0x20 | WA_BYTE(dwRecNoDelta);
        WA_WORD *p = (WA_WORD *) (buffer+1);
        *p = WA_WORD(dwWordNoDelta);
        nbytes += sizeof(WA_WORD);
    }
    else
    {
        // 0001rrww
        buffer[0] = 0x10;

        // encode recno
        if (dwRecNoDelta < 256)
        {
            buffer[nbytes] = WA_BYTE(dwRecNoDelta);
            nbytes += sizeof(WA_BYTE);
        }
        else if (dwRecNoDelta < 65536)
        {
            buffer[0] |= 0x04;
            WA_WORD *p = (WA_WORD *) (buffer+nbytes);
            *p = WA_WORD(dwRecNoDelta);
            nbytes += sizeof(WA_WORD);
        }
        else
        {
            buffer[0] |= 0x08;
            WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
            *p = dwRecNoDelta;
            nbytes += sizeof(WA_DWORD);
        }

        // encode wordno
        if (dwWordNoDelta < 256)
        {
            buffer[nbytes] = WA_BYTE(dwWordNoDelta);
            nbytes += sizeof(WA_BYTE);
        }
        else if (dwWordNoDelta < 65536)
        {
            buffer[0] |= 0x01;
            WA_WORD *p = (WA_WORD *) (buffer+nbytes);
            *p = WA_WORD(dwWordNoDelta);
            nbytes += sizeof(WA_WORD);
        }
        else
        {
            buffer[0] |= 0x02;
            WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
            *p = dwWordNoDelta;
            nbytes += sizeof(WA_DWORD);
        }
    }

    // update current setting
    m_dwRecNo = dwRecNo;
    m_dwWordNo = dwWordNo;

    // add compressed data to buffer
    ASSERT(buffer[0] != 0);
    ASSERT(nbytes > 0 && nbytes < 10);
    if (m_dwBytes + nbytes > IC_BUFFER_SIZE)
        FlushBuffer();
    CopyMemory(m_byBuffer + m_dwBytes, buffer, nbytes);
    m_dwBytes += nbytes;

    if (m_bDebugDump)
    {
        for (int i = 0; i < nbytes; ++i)
            TRACE("%02X ", buffer[i]);
        TRACE("\n");
    }
}

void IndexCompressor::End(void)
{
    FlushBuffer();
    m_pFile->Write(WA_BYTE(0));
}
You've omitted critical information about the number of strings you intend to index.
But given that you say you expect the minimum length of an indexed string to be 256, storing the indices as 64-bit values incurs at most 3% overhead. If the total length of the string file is less than 4 GB, you could use 32-bit indices and incur 1.5% overhead. These numbers suggest to me that if compression matters, you're better off compressing the strings, not the indices. For that problem a variation on LZ77 seems in order.
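To spell out that arithmetic: 8 bytes of index per string against a minimum string length of 256 bytes is 8/256 ≈ 3.1% overhead, and a 4-byte index gives 4/256 ≈ 1.6%.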
If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how you can do with zziplib. This probably won't be great, but it's nearly zero work on your part.
More data on the problem would be welcome:
Number of strings
Average length of a string
Maximum length of a string
Median length of strings
Degree to which the strings file compresses with gzip
Whether you are allowed to change the order of strings to improve compression
EDIT
The comment and revised question make the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding: group the deltas and use a variable-length code within each group. I wouldn't wire in 64 as the group size; I think you will probably want to determine that empirically.
You asked for existing libraries. For the grouping and delta encoding I doubt you will find much. For variable-length integer codes I'm not seeing much in the way of C libraries, but you can find variable-length codings in Perl and Python. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try; it can code unsigned integers up to 32 bits, and you can grab C code from Plan 9 and I'm sure many other sources.
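If rolling your own turns out to be necessary, the usual starting point is a base-128 varint (LEB128-style) rather than UTF-8 proper; a minimal sketch, not taken from the answer above:

#include <stdint.h>
#include <stddef.h>

/* Encode v as 7 bits per byte, high bit set on all but the last byte.
 * Returns the number of bytes written (at most 10 for a 64-bit value). */
size_t varint_encode(uint64_t v, uint8_t *out)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);   /* low 7 bits, continuation bit set */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;                /* final byte, continuation bit clear */
    return n;
}

/* Decode a value written by varint_encode; returns the bytes consumed. */
size_t varint_decode(const uint8_t *in, uint64_t *v)
{
    uint64_t result = 0;
    size_t n = 0;
    int shift = 0;
    do {
        result |= (uint64_t)(in[n] & 0x7F) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    *v = result;
    return n;
}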
Are you running on Windows? If so, I recommend creating the mmap file using the naive layout you originally proposed, and then compressing the file using NTFS compression. Your application code never knows the file is compressed, and the OS does the file compression for you. You might not think this would be very performant or get good compression, but I think you'll be surprised if you try it.
