Techniques for handling short reads/writes with scatter-gather? - c

Scatter-gather - readv()/writev()/preadv()/pwritev() - reads/writes a variable number of iovec structs in a single system call. Basically it reads/write each buffer sequentially from the 0th iovec to the Nth. However according to the documentation it can also return less on the readv/writev calls than was requested. I was wondering if there is a standard/best practice/elegant way to handle that situation.
If we are just handling a bunch of character buffers or similar this isn't a big deal. But one of the niceties is using scatter-gather for structs and/or discrete variables as the individual iovec items. How do you handle the situation where the readv/writev only reads/writes a portion of a struct or half of a long or something like that.
Below is some contrived code of what I am getting at:
int fd;
struct iovec iov[3];
long aLong = 74775767;
int aInt = 949;
char aBuff[100]; //filled from where ever
ssize_t bytesWritten = 0;
ssize_t bytesToWrite = 0;
iov[0].iov_base = &aLong;
iov[0].iov_len = sizeof(aLong);
bytesToWrite += iov[0].iov_len;
iov[1].iov_base = &aInt;
iov[1].iov_len = sizeof(aInt);
bytesToWrite += iov[1].iov_len;
iov[2].iov_base = &aBuff;
iov[2].iov_len = sizeof(aBuff);
bytesToWrite += iov[2].iov_len;
bytesWritten = writev(fd, iov, 3);
if (bytesWritten == -1)
{
//handle error
}
if (bytesWritten < bytesToWrite)
//how to gracefully continue?.........

Use a loop like the following to advance the partially-processed iov:
for (;;) {
written = writev(fd, iov+cur, count-cur);
if (written < 0) goto error;
while (cur < count && written >= iov[cur].iov_len)
written -= iov[cur++].iov_len;
if (cur == count) break;
iov[cur].iov_base = (char *)iov[cur].iov_base + written;
iov[cur].iov_len -= written;
}
Note that if you don't check for cur < count you will read past the end of iov which might contain zero.

AFAICS the vectored read/write functions work the same wrt short reads/writes as the normal ones. That is, you get back the number of bytes read/written, but this might well point into the middle of a struct, just like with read()/write(). There is no guarantee that the possible "interruption points" (for lack of a better term) coincide with the vector boundaries. So unfortunately the vectored IO functions offer no more help for dealing with short reads/writes than the normal IO functions. In fact, it's more complicated since you need to map the byte count into an IO vector element and offset within the element.
Also note that the idea of using vectored IO for individual structs or data items might not work that well; the max allowed value for the iovcnt argument (IOV_MAX) is usually quite small, something like 1024 or so. So if you data is contiguous in memory, just pass it as a single element rather than artificially splitting it up.

Vectored write will write all the data you have provided with one call to "writev" function. So byteswritten will be always be equal to total number of bytes provided as input. this is what my understanding is.
Please correct me if I am wrong

Related

AF_XDP: Relationship between `FRAME_SIZE` and actual size of packet

My AF-XDP userspace program is based on this tutorial: https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP
I am currently trying to parse ~360.000 RTP-packets per second (checking for continuous sequence numbers) but I loose around 25 per second (this means that for 25 packets the statement previous_rtp_sqnz_nmbr + 1 == current_rtp_sqnz_nmbr doesn't hold true).
So I tried to increase the number of allocated packets NUM_FRAMES from 228.000 to 328.000. With default FRAME_SIZE of XSK_UMEM__DEFAULT_FRAME_SIZE = 4096 this results in 1281Mbyte being allocated (no problem because I have 32GB of RAM) but for whatever reason, this function call:
static struct xsk_umem_info *configure_xsk_umem(void *buffer, uint64_t size)
{
printf("Try to allocate %lu\n", size);
struct xsk_umem_info *umem;
int ret;
umem = calloc(1, sizeof(*umem));
if (!umem)
return NULL;
ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
NULL);
if (ret) {
errno = -ret;
return NULL;
}
umem->buffer = buffer;
return umem;
}
fails with
Try to allocate 1343488000
ERROR: Can't create umem "Cannot allocate memory"
I don't know why? But because I know that my RTP-packets are not larger than 1500bytes, I set FRAME_SIZE 3072 so I am now at around 960Mbyte (which works without an error).
However, I am now loosing half of the received packets (this means that for 180.000 packets the previous sequence number doesn't line up with the current sequence number).
Because of this I ask the question: What is the relationship between FRAME_SIZE and the actual size of a packet? Because obviously it can not be the same.
Edit: I am using 5.4.0-4-amd64 #1 SMP Debian 5.4.19-1 (2020-02-13) x86_64 GNU/Linux and just copied the libbpf-repository from here into my code-base: https://github.com/libbpf/libbpf
So I don't know whether the error mentioned here: https://github.com/xdp-project/xdp-tutorial/issues/76 is still valid?

Reading serial port faster

I have a computer software that sends RGB color codes to Arduino using USB. It works fine when they are sent slowly but when tens of them are sent every second it freaks out. What I think happens is that the Arduino serial buffer fills out so quickly that the processor can't handle it the way I'm reading it.
#define INPUT_SIZE 11
void loop() {
if(Serial.available()) {
char input[INPUT_SIZE + 1];
byte size = Serial.readBytes(input, INPUT_SIZE);
input[size] = 0;
int channelNumber = 0;
char* channel = strtok(input, " ");
while(channel != 0) {
color[channelNumber] = atoi(channel);
channel = strtok(0, " ");
channelNumber++;
}
setColor(color);
}
}
For example the computer might send 255 0 123 where the numbers are separated by space. This works fine when the sending interval is slow enough or the buffer is always filled with only one color code, for example 255 255 255 which is 11 bytes (INPUT_SIZE). However if a color code is not 11 bytes long and a second code is sent immediately, the code still reads 11 bytes from the serial buffer and starts combining the colors and messes them up. How do I avoid this but keep it as efficient as possible?
It is not a matter of reading the serial port faster, it is a matter of not reading a fixed block of 11 characters when the input data has variable length.
You are telling it to read until 11 characters are received or the timeout occurs, but if the first group is fewer than 11 characters, and a second group follows immediately there will be no timeout, and you will partially read the second group. You seem to understand that, so I am not sure how you conclude that "reading faster" will help.
Using your existing data encoding of ASCII decimal space delimited triplets, one solution would be to read the input one character at a time until the entire triplet were read, however you could more simply use the Arduino ReadBytesUntil() function:
#define INPUT_SIZE 3
void loop()
{
if (Serial.available())
{
char rgb_str[3][INPUT_SIZE+1] = {{0},{0},{0}};
Serial.readBytesUntil( " ", rgb_str[0], INPUT_SIZE );
Serial.readBytesUntil( " ", rgb_str[1], INPUT_SIZE );
Serial.readBytesUntil( " ", rgb_str[2], INPUT_SIZE );
for( int channelNumber = 0; channelNumber < 3; channelNumber++)
{
color[channelNumber] = atoi(channel);
}
setColor(color);
}
}
Note that this solution does not require the somewhat heavyweight strtok() processing since the Stream class has done the delimiting work for you.
However there is a simpler and even more efficient solution. In your solution you are sending ASCII decimal strings then requiring the Arduino to spend CPU cycles needlessly extracting the fields and converting to integer values, when you could simply send the byte values directly - leaving if necessary the vastly more powerful PC to do any necessary processing to pack the data thus. Then the code might be simply:
void loop()
{
if( Serial.available() )
{
for( int channelNumber = 0; channelNumber < 3; channelNumber++)
{
color[channelNumber] = Serial.Read() ;
}
setColor(color);
}
}
Note that I have not tested any of above code, and the Arduino documentation is lacking in some cases with respect to descriptions of return values for example. You may need to tweak the code somewhat.
Neither of the above solve the synchronisation problem - i.e. when the colour values are streaming, how do you know which is the start of an RGB triplet? You have to rely on getting the first field value and maintaining count and sync thereafter - which is fine until perhaps the Arduino is started after data stream starts, or is reset, or the PC process is terminated and restarted asynchronously. However that was a problem too with your original implementation, so perhaps a problem to be dealt with elsewhere.
First of all, I agree with #Thomas Padron-McCarthy. Sending character string instead of a byte array(11 bytes instead of 3 bytes, and the parsing process) is wouldsimply be waste of resources. On the other hand, the approach you should follow depends on your sender:
Is it periodic or not
Is is fixed size or not
If it's periodic you can check in the time period of the messages. If not, you need to check the messages before the buffer is full.
If you think printable encoding is not suitable for you somehow; In any case i would add an checksum to the message. Let's say you have fixed size message structure:
typedef struct MyMessage
{
// unsigned char id; // id of a message maybe?
unsigned char colors[3]; // or unsigned char r,g,b; //maybe
unsigned char checksum; // more than one byte could be a more powerful checksum
};
unsigned char calcCheckSum(struct MyMessage msg)
{
//...
}
unsigned int validateCheckSum(struct MyMessage msg)
{
//...
if(valid)
return 1;
else
return 0;
}
Now, you should check every 4 byte (the size of MyMessage) in a sliding window fashion if it is valid or not:
void findMessages( )
{
struct MyMessage* msg;
byte size = Serial.readBytes(input, INPUT_SIZE);
byte msgSize = sizeof(struct MyMessage);
for(int i = 0; i+msgSize <= size; i++)
{
msg = (struct MyMessage*) input[i];
if(validateCheckSum(msg))
{// found a message
processMessage(msg);
}
else
{
//discard this byte, it's a part of a corrupted msg (you are too late to process this one maybe)
}
}
}
If It's not a fixed size, it gets complicated. But i'm guessing you don't need to hear that for this case.
EDIT (2)
I've striked out this edit upon comments.
One last thing, i would use a circular buffer. First add the received bytes into the buffer, then check the bytes in that buffer.
EDIT (3)
I gave thought on comments. I see the point of printable encoded messages. I guess my problem is working in a military company. We don't have printable encoded "fire" arguments here :) There are a lot of messages come and go all the time and decoding/encoding printable encoded messages would be waste of time. Also we use hardwares which usually has very small messages with bitfields. I accept that it could be more easy to examine/understand a printable message.
Hope it helps,
Gokhan.
If faster is really what you want....this is little far fetched.
The fastest way I can think of to meet your needs and provide synchronization is by sending a byte for each color and changing the parity bit in a defined way assuming you can read the parity and bytes value of the character with wrong parity.
You will have to deal with the changing parity and most of the characters will not be human readable, but it's gotta be one of the fastest ways to send three bytes of data.

How can I use a dynamic array for UDP receipt in D?

I would like to create a simple UDP server that can receive messages of varying length. However, it seems as if D's Socket.receiveFrom() expects a static length buffer array. When the following code runs:
void main() {
UdpSocket server_s;
Address client_addr;
ubyte[] in_buf;
ptrdiff_t bytesin;
server_s = new UdpSocket();
server_s.bind(new InternetAddress(InternetAddress.ADDR_ANY, PORT_NUM));
bytesin = server_s.receiveFrom(in_buf, client_addr);
if (bytesin == 0 || bytesin == Socket.ERROR) {
writeln("Error receiving, bytesin: ", bytesin);
return;
}
// Do stuff
}
receiveFrom() immediately falls through with bytesin == 0. Why is this? Can I even use dynamic arrays for receiving over UDP?
receive and receiveFrom don't do the allocation themselves. You can pass an array of some fixed size big enough to hold any packet you expect, then slice it based on how many bytes you received.
If you preallocated 64 KB, that ought to fit everything you could conceivably get. I tend to just use 4 KB buffers though.
ubyte[4096] in_buf;
bytesin = server_s.receiveFrom(in_buf, client_addr);
// check for error first then
auto message_received = in_buf[0 .. bytesin];
// process it
// keep looping, reusing the buffer, to get more stuff

Using bzip2 low-level routines to compress chunks of data

The Overview
I am using the low-level calls in the libbzip2 library: BZ2_bzCompressInit(), BZ2_bzCompress() and BZ2_bzCompressEnd() to compress chunks of data to standard output.
I am migrating working code from higher-level calls, because I have a stream of bytes coming in and I want to compress those bytes in sets of discrete chunks (a discrete chunk is a set of bytes that contains a group of tokens of interest — my input is logically divided into groups of these chunks).
A complete group of chunks might contain, say, 500 chunks, which I want to compress to one bzip2 stream and write to standard output.
Within a set, using the pseudocode I outline below, if my example buffer is able to hold 101 chunks at a time, I would open a new stream, compress 500 chunks in runs of 101, 101, 101, 101, and one final run of 96 chunks that closes the stream.
The Problem
The issue is that my bz_stream structure instance, which keeps tracks of the number of compressed bytes in a single pass of the BZ2_bzCompress() routine, seems to claim to be writing more compressed bytes than the total bytes in the final, compressed file.
For example, the compressed output could be a file with a true size of 1234 bytes, while the number of reported compressed bytes (which I track while debugging) is somewhat higher than 1234 bytes (say 2345 bytes).
My rough pseudocode is in two parts.
The first part is a rough sketch of what I do to compress a subset of chunks (and I know that I have another subset coming after this one):
bz_stream bzStream;
unsigned char bzBuffer[BZIP2_BUFFER_MAX_LENGTH] = {0};
unsigned long bzBytesWritten = 0UL;
unsigned long long cumulativeBytesWritten = 0ULL;
unsigned char myBuffer[UNCOMPRESSED_MAX_LENGTH] = {0};
size_t myBufferLength = 0;
/* initialize bzStream */
bzStream.next_in = NULL;
bzStream.avail_in = 0U;
bzStream.avail_out = 0U;
bzStream.bzalloc = NULL;
bzStream.bzfree = NULL;
bzStream.opaque = NULL;
int bzError = BZ2_bzCompressInit(&bzStream, 9, 0, 0);
/* bzError checking... */
do
{
/* read some bytes into myBuffer... */
/* compress bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do
{
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
/* error checking... */
bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
cumulativeBytesWritten += bzBytesWritten;
/* write compressed data in bzBuffer to standard output */
fwrite(bzBuffer, 1, bzBytesWritten, stdout);
fflush(stdout);
}
while (bzError == BZ_OK);
}
while (/* while there is a non-final myBuffer full of discrete chunks left to compress... */);
Now we wrap up the output:
/* read in the final batch of bytes into myBuffer (with a total byte size of `myBufferLength`... */
/* compress remaining myBufferLength bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do
{
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
/* bzError error checking... */
/* increment cumulativeBytesWritten by `bz_stream` struct `total_out_*` members */
bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
cumulativeBytesWritten += bzBytesWritten;
/* write compressed data in bzBuffer to standard output */
fwrite(bzBuffer, 1, bzBytesWritten, stdout);
fflush(stdout);
}
while (bzError != BZ_STREAM_END);
/* close stream */
bzError = BZ2_bzCompressEnd(&bzStream);
/* bzError checking... */
The Questions
Am I calculating cumulativeBytesWritten (or, specifically, bzBytesWritten) incorrectly, and how would I fix that?
I have been tracking these values in a debug build, and I do not seem to be "double counting" the bzBytesWritten value. This value is counted and used once to increment cumulativeBytesWritten after each successful BZ2_bzCompress() pass.
Alternatively, am I not understanding the correct use of the bz_stream state flags?
For example, does the following compress and keep the bzip2 stream open, so long as I keep sending some bytes?
bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
Likewise, can the following statement compress data, so long as there are at least some bytes are available to access from the bzStream.next_in pointer (BZ_RUN), and then the stream is wrapped up when there are no more bytes available (BZ_FINISH)?
bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
Or, am I not using these low-level calls correctly at all? Should I go back to using the higher-level calls to continuously append a grouping of compressed chunks of data to one main file?
There's probably a simple solution to this, but I've been banging my head on the table for a couple days in the course of debugging what could be wrong, and I'm not making much progress. Thank you for any advice.
In answer to my own question, it appears I am miscalculating the number of bytes written. I should not use the total_out_* members. The following correction works properly:
bzBytesWritten = sizeof(bzBuffer) - bzStream.avail_out;
The rest of the calculations follow.

C Library for compressing sequential positive integers

I have the very common problem of creating an index for an in-disk array of strings. In short, I need to store the position of each string in the in-disk representation. For example, a very naive solution would be an index array as follows:
uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };
Which says that the first string is at position 0, the second at position 20, the third at position 500 and the nth at position 103434.
The positions are always non-negative 64 bits integers in sequential order. Although the numbers could vary by any difference, in practice I expect the typical difference to be inside the range from 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions will be accessed randomly (assume uniform distribution).
I was thinking about writing my own code for doing some sort of block delta encoding or other more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point and maybe even settle for something without any customizations.
Any hints? A c library would be ideal, but a c++ one would also allow me to run some initial benchmarks.
A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top the library cmph (http://cmph.sf.net). In short, it is for a large disk based read only associative map with a small index in memory.
Since it is a library, I don't have control over input, but the typical use case that I want to optimize have millions of hundreds of values, average value size in the few kilobytes ranges and maximum value at 2^31.
For the record, if I don't find a library ready to use I intend to implement delta encoding in blocks of 64 integers with the initial bytes specifying the block offset so far. The blocks themselves would be indexed with a tree, giving me O(log (n/64)) access time. There are way too many other options and I would prefer to not discuss them. I am really looking forward ready to use code rather than ideas on how to implement the encoding. I will be glad to share with everyone what I did once I have it working.
I appreciate your help and let me know if you have any doubts.
I use fastbit (Kesheng Wu LBL.GOV), it seems you need something good, fast and NOW, so fastbit is a highly competient improvement on Oracle's BBC (byte aligned bitmap code, berkeleydb). It's easy to setup and very good gernally.
However, given more time, you may want to look at a gray code solution, it seems optimal for your purposes.
Daniel Lemire has a number of libraries for C/++/Java released on code.google, I've read over some of his papers and they are quite nice, several advancements on fastbit and alternative approaches for column re-ordering with permutated grey codes's.
Almost forgot, I also came across Tokyo Cabinet, though I do not think it will be well suited for my current project, I may of considered it more if I had known about it before ;), it has a large degree of interoperability,
Tokyo Cabinet is written in the C
language, and provided as API of C,
Perl, Ruby, Java, and Lua. Tokyo
Cabinet is available on platforms
which have API conforming to C99 and
POSIX.
As you referred to CDB, the TC benchmark has a TC mode (TC support's several operational constraint's for varying perf) where it surpassed CDB by 10 times for read performance and 2 times for write.
With respect to your delta encoding requirement, I am quite confident in bsdiff and it's ability to out-perform any file.exe content patching system, it may also have some fundimental interfaces for your general needs.
Google's new binary compression application, courgette may be worth checking out, in case you missed the press release, 10x smaller diff's than bsdiff in the one test case I have seen published.
You have two conflicting requirements:
You want to compress very small items (8 bytes each).
You need efficient random access for each item.
The second requirement is very likely to impose a fixed length for each item.
What exactly are you trying to compress? If you are thinking about the total space of index, is it really worth the effort to save the space?
If so one thing you could try is to chop the space into half and store it into two tables. First stores (upper uint, start index, length, pointer to second table) and the second would store (index, lower uint).
For fast searching, indices would be implemented using something like B+ Tree.
I did something similar years ago for a full-text search engine. In my case, each indexed word generated a record which consisted of a record number (document id) and a word number (it could just as easily have stored word offsets) which needed to be compressed as much as possible. I used a delta-compression technique which took advantage of the fact that there would be a number of occurrences of the same word within a document, so the record number often did not need to be repeated at all. And the word offset delta would often fit within one or two bytes. Here is the code I used.
Since it's in C++, the code may is not going to be useful to you as is, but can be a good starting point for writing compressions routines.
Please excuse the hungarian notation and the magic numbers strewn within the code. Like I said, I wrote this many years ago :-)
IndexCompressor.h
//
// index compressor class
//
#pragma once
#include "File.h"
const int IC_BUFFER_SIZE = 8192;
//
// index compressor
//
class IndexCompressor
{
private :
File *m_pFile;
WA_DWORD m_dwRecNo;
WA_DWORD m_dwWordNo;
WA_DWORD m_dwRecordCount;
WA_DWORD m_dwHitCount;
WA_BYTE m_byBuffer[IC_BUFFER_SIZE];
WA_DWORD m_dwBytes;
bool m_bDebugDump;
void FlushBuffer(void);
public :
IndexCompressor(void) { m_pFile = 0; m_bDebugDump = false; }
~IndexCompressor(void) {}
void Attach(File& File) { m_pFile = &File; }
void Begin(void);
void Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo);
void End(void);
WA_DWORD GetRecordCount(void) { return m_dwRecordCount; }
WA_DWORD GetHitCount(void) { return m_dwHitCount; }
void DebugDump(void) { m_bDebugDump = true; }
};
IndexCompressor.cpp
//
// index compressor class
//
#include "stdafx.h"
#include "IndexCompressor.h"
void IndexCompressor::FlushBuffer(void)
{
ASSERT(m_pFile != 0);
if (m_dwBytes > 0)
{
m_pFile->Write(m_byBuffer, m_dwBytes);
m_dwBytes = 0;
}
}
void IndexCompressor::Begin(void)
{
ASSERT(m_pFile != 0);
m_dwRecNo = m_dwWordNo = m_dwRecordCount = m_dwHitCount = 0;
m_dwBytes = 0;
}
void IndexCompressor::Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo)
{
ASSERT(m_pFile != 0);
WA_BYTE buffer[16];
int nbytes = 1;
ASSERT(dwRecNo >= m_dwRecNo);
if (dwRecNo != m_dwRecNo)
m_dwWordNo = 0;
if (m_dwRecordCount == 0 || dwRecNo != m_dwRecNo)
++m_dwRecordCount;
++m_dwHitCount;
WA_DWORD dwRecNoDelta = dwRecNo - m_dwRecNo;
WA_DWORD dwWordNoDelta = dwWordNo - m_dwWordNo;
if (m_bDebugDump)
{
TRACE("%8X[%8X] %8X[%8X] : ", dwRecNo, dwRecNoDelta, dwWordNo, dwWordNoDelta);
}
// 1WWWWWWW
if (dwRecNoDelta == 0 && dwWordNoDelta < 128)
{
buffer[0] = 0x80 | WA_BYTE(dwWordNoDelta);
}
// 01WWWWWW WWWWWWWW
else if (dwRecNoDelta == 0 && dwWordNoDelta < 16384)
{
buffer[0] = 0x40 | WA_BYTE(dwWordNoDelta >> 8);
buffer[1] = WA_BYTE(dwWordNoDelta & 0x00ff);
nbytes += sizeof(WA_BYTE);
}
// 001RRRRR WWWWWWWW WWWWWWWW
else if (dwRecNoDelta < 32 && dwWordNoDelta < 65536)
{
buffer[0] = 0x20 | WA_BYTE(dwRecNoDelta);
WA_WORD *p = (WA_WORD *) (buffer+1);
*p = WA_WORD(dwWordNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
// 0001rrww
buffer[0] = 0x10;
// encode recno
if (dwRecNoDelta < 256)
{
buffer[nbytes] = WA_BYTE(dwRecNoDelta);
nbytes += sizeof(WA_BYTE);
}
else if (dwRecNoDelta < 65536)
{
buffer[0] |= 0x04;
WA_WORD *p = (WA_WORD *) (buffer+nbytes);
*p = WA_WORD(dwRecNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
buffer[0] |= 0x08;
WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
*p = dwRecNoDelta;
nbytes += sizeof(WA_DWORD);
}
// encode wordno
if (dwWordNoDelta < 256)
{
buffer[nbytes] = WA_BYTE(dwWordNoDelta);
nbytes += sizeof(WA_BYTE);
}
else if (dwWordNoDelta < 65536)
{
buffer[0] |= 0x01;
WA_WORD *p = (WA_WORD *) (buffer+nbytes);
*p = WA_WORD(dwWordNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
buffer[0] |= 0x02;
WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
*p = dwWordNoDelta;
nbytes += sizeof(WA_DWORD);
}
}
// update current setting
m_dwRecNo = dwRecNo;
m_dwWordNo = dwWordNo;
// add compressed data to buffer
ASSERT(buffer[0] != 0);
ASSERT(nbytes > 0 && nbytes < 10);
if (m_dwBytes + nbytes > IC_BUFFER_SIZE)
FlushBuffer();
CopyMemory(m_byBuffer + m_dwBytes, buffer, nbytes);
m_dwBytes += nbytes;
if (m_bDebugDump)
{
for (int i = 0; i < nbytes; ++i)
TRACE("%02X ", buffer[i]);
TRACE("\n");
}
}
void IndexCompressor::End(void)
{
FlushBuffer();
m_pFile->Write(WA_BYTE(0));
}
You've omitted critical information about the number of strings you intend to index.
But given that you say you expect the minimum length of an indexed string to be 256, storing the indices as 64% incurs at most 3% overhead. If the total length of the string file is less than 4GB, you could use 32-bit indices and incur 1.5% overhead. These numbers suggest to me that if compression matters, you're better off compressing the strings, not the indices. For that problem a variation on LZ77 seems in order.
If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how you can do with zziplib. This probably won't be great, but it's nearly zero work on your part.
More data on the problem would be welcome:
Number of strings
Average length of a string
Maximum length of a string
Median length of strings
Degree to which the strings file compresses with gzip
Whether you are allowed to change the order of strings to improve compression
EDIT
The comment and revised question makes the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding, group the deltas, and use a variable-length code within each group. I wouldn't wire in 64 as the group size–I think you will probably want to determine that empirically.
You asked for existing libraries. For the grouping and delta encoding I doubt you will find much. For variable-length integer codes, I'm not seeing much in the way of C libraries, but you can find variable-length codings in Perl and Python. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try—it can code unsigned integers up to 32 bits, and you can grab C code from Plan 9 and I'm sure many other sources.
Are you running on Windows? If so, I recommend creating the mmap file using naive solution your originally proposed, and then compressing the file using NTLM compression. Your application code never knows the file is compressed, and the OS does the file compression for you. You might not think this would be very performant or get good compression, but I think you'll be surprised if you try it.

Resources