Convert two 16 Bit Registers to 32 Bit float value flutter - arrays

I am using a Modbus flutter lib, reading 2 registers I obtain:
[16177, 4660] I need a function to convert it to a 32 bit float value: 0.7
I found this function
ByteBuffer buffer = new Uint16List.fromList(fileBytes).buffer;
ByteData byteData = new ByteData.view(buffer);
double x = byteData.getFloat32(0);
It says 0.000045747037802357227 swap byte value
Can you help

I always find it easiest to start with the byte array and insert stuff into that. What's happening is that your code is defaulting to the native endianness (apparently little) but you need to be doing this in big endianness.
Here's an overly verbose solution that you can cut out the print statements (and hex codec)
import 'dart:typed_data';
import 'package:convert/convert.dart'; // only needed for debugging
void main() {
final registers = [16177, 4660];
final bytes = Uint8List(4); // start with the 4 bytes = 32 bits
var byteData = bytes.buffer.asByteData(); // ByteData lets you choose the endianness
byteData.setInt16(0, registers[0], Endian.big); // Note big is the default here, but shown for completeness
byteData.setInt16(2, registers[1], Endian.big);
print(bytes); // just for debugging - does my byte order look right?
print(hex.encode(bytes)); // ditto
final f32 = byteData.getFloat32(0, Endian.big);
print(f32);
}

Related

Auto-Allocate to a specific RAM Area in GCC/C

Sorry for my english, its a bit hard for me to explain what exactly i would need.
I'm making some extra code into existing binarys using the GCC compiler.
In this case, its PowerPC, but it should not really matter.
I know, where in the existing binary i have free ram available (i dumped the full RAM to make sure) but i need to define each RAM address manually, currently i am doing it like this:
// #ram.h
//8bit ram
uint8_t* xx1 = (uint8_t*) 0x807F00;
uint8_t* xx2 = (uint8_t*) 0x807F01;
//...and so on
// 16bit ram
uint16_t* xxx1 = (uint16_t*) 0x807F40;
uint16_t* xxx2 = (uint16_t*) 0x807F42;
//...and so on
// 32bit ram
uint32_t* xxxx1 = (uint32_t*) 0x807FA0;
uint32_t* xxxx2 = (uint32_t*) 0x807FA4;
//...and so on
And im accessing my variables like this:
void __attribute__ ((noinline)) silly_demo_function() {
#include "ram.h"
if (*xxx2>*xx1) {
*xxx3 = *xxx3 + *xx1;
}
return;
}
But this gets really boring, if i want to patch my code into another existing binary, where the location of available/free/unused ram can be fully different, or if im replacing/removing some value in the middle. I am using 8, 16 and 32bit variables.
Is there a way, i can define an area like 0x807F00 to 0x00808FFF, and allocate my variables on the fly, and the compiler will allocate it inside my specific location?
I suspect that the big problem here is that those addresses are memory mapped IO (devices) and not RAM; and should not be treated as RAM.
Further, I'd say that you probably should be hiding the "devices that aren't RAM" behind an abstract layer, a little bit like a device driver; partly so that you can make sure that the compiler complies with any constraints caused by it being IO and not RAM (e.g. treated as volatile, possibly honoring any access size restrictions, possibly taking care of any cache coherency management); partly so that you/programmers know what is normal/fast/cached RAM and what isn't; partly so that you can replace the "device" with fake code for testing; and partly so that it's all kept in a single well defined area.
For example; you might have a header file called "src/devices.h" that contains:
#define xx1_address 0x807F00
..and the wrapper code might be a file called "src/devices/xx1.c" that contains something like:
#include "src/devices.h"
static volatile uint8_t * xx1 = (uint8_t*) xx1_address;
uint8_t get_xx1(void) {
return *xx1;
}
void set_xx1(uint8_t x) {
*xx1 = x;
}
However; depending on what these devices actually are, you might need/want some higher level code. For example, maybe xx1 is a temperature sensor and it doesn't make any sense to try to set it, and you want it to scale that raw value so it's "degrees celsius", and the highest bit of the raw value is used to indicate an error condition (and the actual temperature is only 7 bits), so the wrapper might be more like:
#include "src/devices.h"
#define xx1_offset -12.34
#define xx1_scale 1.234
static volatile uint8_t * xx1 = (uint8_t*) xx1_address;
float get_xx1_temperature(void) {
uint8_t raw_temp = *xx1;
if(raw_temp * 0x80 != 0) {
/* Error flag set */
return NAN;
}
/* No error */
return (raw_temp + xx1_offset) * xx1_scale;
}
In the meanwhile, i figured it out.
Its just as easy as defining .data, .bsss and .sbss in the linker directives.
6 Lines of code and its working like a charm.

Good way to access mixed 8/16/32-bit words

I have a big lump of binary data in memory and I need to read/write from randomly accessed, byte-aligned addresses. However, sometimes I need to read/write 8-bit words, sometimes (big-endian) 16-bit words, and sometimes (big-endian) 32-bit ones.
There's the naïve solution of representing the data as a ByteArray and implementing 16/32-bit reads/writes by hand:
class Blob (val image: ByteArray, var ptr: Int = 0) {
fun readWord8(): Byte = image[ptr++]
fun readWord16(): Short {
val hi = readWord8().toInt() and 0xff
val lo = readWord8().toInt() and 0xff
return ((hi shl 8) or lo).toShort()
}
fun readWord32(): Int {
val hi = readWord16().toLong() and 0xffff
val lo = readWord16().toLong() and 0xffff
return ((hi shl 16) or lo).toInt()
}
}
(and similarly for writeWord8/writeWord16/writeWord32).
Is there a better way to do this? It just seems so inefficient doing all this byte-shuffling when Java itself already uses big-endian representation inside...
To reiterate, I need both read and write access, random seeks, and 8/16/32-bit access to big-endian words.
You can use Java NIO ByteBuffer:
val array = ByteArray(100)
val buffer = ByteBuffer.wrap(array)
val b = buffer.get()
val s = buffer.getShort()
val i = buffer.getInt()
buffer.put(0.toByte())
buffer.putShort(0.toShort())
buffer.putInt(0)
buffer.position(10)
The byte order of a newly created ByteBuffer is BIG_ENDIAN, but it can still be changed with the order(ByteOrder) function.
Also, use ByteBuffer.allocate(size) and buffer.array() if you want to avoid creating a ByteArray explicily.
More about ByteBuffer usage: see this question.

Correct way to unpack a 32 bit vector in Perl to read a uint32 written in C

I am parsing a Photoshop raw, 16 bit/channel, RGB file in C and trying to keep a log of exceptional data points. I need a very fast C analysis of up to 36 MPix images with 16 bit quanta or 216 MB Photoshop .RAW files.
<1% of the points have weird skin tones and I want to graph them with PerlMagick or Perl GD to see where they are coming from.
The first 4 bytes of the C data file contain the unsigned image width as a uint32_t. In Perl, I read the whole file in binary mode and extract the first 32 bits:
Xres=1779105792l = 0x6a0b0000
It looks a lot like the C log file:
DA: Color anomalies=14177=0.229%:
DA: II=1) raw PIDX=0x10000b25, XCols=[0]=0x00000b6a
Dec(0x00000b6a) = 2922, the Exact X_Columns_Width of a small test file.
Clearly a case of intel's 1972 8008 NUXI architecture. How hard could it possibly be to translate 0x6a0b0000 to 0x6a0b0000; swap 2 bytes and 2 nibbles and you're done. Slicing the 8 characters and rearranging them could be done but that is the kind of ugly hack I am trying to avoid.
Grab the same 32 bit vector from file offset zero and unpack it as "VAX" unsigned long.
$xres = vec($bdat, 0, 32); # vec EXPR,OFFSET,BITS
$vul = unpack("V", vec($bdat, 0, 32));
printf("Length (\$bdat)=%d, xres=0x%08x, Vax ulong=%ul=0x%08x\n",
length($bdat), $xres, $vul, $vul);
Length ($bdat) = 56712, xres=0x6a0b0000, Vax ulong=959919921l=0x39373731
Every single hex character is mangled. Obviously wrong Endian, it is not VAX. The "Other" one is Network Big-endian
http://perldoc.perl.org/functions/pack.html
N An unsigned long (32-bit) in "network" (big-endian) order.
V An unsigned long (32-bit) in "VAX" (little-endian) order.
$nul = unpack("N", vec($bdat, 0, 32)); # Network Unsigned Long 32b
printf("Xres=0x%08x, NET ulong=%ul=0x%08x\n", $xres, $nul, $nul);
Xres=0x6a0b0000, NET ulong=825702201l=0x31373739
The $XRES still shows the right hex in the wrong order. The "NETWORK" long 32 bit uint extracted from the same bits is unrecognizable. Try Binary
$bits = unpack("b*", vec($bdat, 0, 32));
printf("bits=$bits, len=%d\n", length $bits);
bits=10001100111011001110110010011100100011000000110010101100111011001001110001001100, len=80
I clearly asked for 32 bits and got 80 bits. What gives?
Try for 4, unsigned, 8bit bytes which can NOT be swapped:
for($ii = 0; $ii < 4; $ii++) {
$bit_off=$ii*8; # Bit offset
$uc = unpack("C", vec($bdat, $bit_off, 8)); # C An unsigned char
printf("II $ii, bo $bit_off, d=%d, u=%u, x=0x%x\n",
$uc,$uc, $uc);
}
II 0, bo 0, d=49, u=49, x=0x31
II 1, bo 8, d=51, u=51, x=0x33
II 2, bo 16, d=49, u=49, x=0x31
II 3, bo 24, d=49, u=49, x=0x31
I am looking for hex 0, 6, a or b. There are no "3"s or "1"s in the right answer. Try pirating from a C file:
http://cpansearch.perl.org/src/MHX/Convert-Binary-C-0.76/tests/include/include/bits/byteswap.h
$x = $xres;
$x= (((($x) & 0xff000000) >> 24) | ((($x) & 0x00ff0000) >> 8) | ((($x) & 0x0000ff00) << 8) | ((($x) & 0x000000ff) << 24));
printf("\$xres=0x%08x -> \$x=0x%08x = %u\n", $xres, $x, $x);
$xres=0x6a0b0000 -> $x=0x00000b6a = 2922
It WORKS! But, this is uglier than converting the original, wrong order hex number to a string to untangle it:
$stupid_str = sprintf("%08x", $xres);
$stupid_num = join('', reverse ($stupid_str =~ m/../g));
printf("Stupid_num '%s'->0x%08x=%d\n", $stupid_num, $dec=hex $stupid_num, $dec);
Stupid_num '00000b6a'->0x00000b6a=2922
It's like judging the Ugliest Dog contest, but I would still rather have to maintain the text version than the even more abominable C version.
I know there are ways to do this in Java/Python/Go/Ruby/.....
I know there are command line utilities that do exactly this.
I must figure out how I am misusing either VEC or Unpack, both of which I have used a zillion times. It is the Brain Teasing aspect which is driving me nuts! EndianNess == EndianMess!!!
TYVM!
=================================================
Borodin,
Thanks for lookin' at this.
My intel processor is little-endian. When I read it back, it was trans-mutilated by vec to the "correct" big-endian, network format.
I just tried reading it VERBATIM from a BINARY file read and it works fine:
($b4 = $bdat) =~ s/^(....).*$/$1/msg; # Give me my 4 bytes back without mutilation!
printf("B4='%s'=>0x%08x=<0x%08x\n", $b4, unpack("L>", $b4), unpack("L<", $b4));
B4='j...' = >0x6a0b0000 = <0x00000b6a <<< THE RIGHT ANSWER!!!
If you try unpack 'V', $bdat then you will find that it works
That was my first attempt:
$vul = unpack("V", vec($bdat, 0, 32)); # UNPACK V!
printf("Length (\$bdat)=%d, xres=0x%08x, Vax ulong=%ul=0x%08x\n",
length($bdat), $xres, $vul, $vul);
Length ($bdat) = 56712, xres=0x6a0b0000, Vax ulong=959919921l=0x39373731 <<<< TOTALLY WRONG!
I had already verified that the $BDAT info was the right data in the wrong format. It just needed some rearrangement.
I just used vec() to generate 1 bit and 4 bit graphics files and it worked faithfully, returning the exact bits I wrote. It must have mistaken my Intel i7 for my IBM System/370. I7/37??? Easy mistake to make. :)
I read the [confusing] part about "converted to a number as with pack ...". That's why my number was backward. The >>unpack("V", vec($bdat"<< ... was my ill-fated attempt to byte-swap the backward number in $BDAT from the WRONG VEC()-preferred FORMAT to the native format supported by my architecture.
Now I understand why I saw so many examples of people extracting by the byte, to avoid Big Brother's helping hand!
Data::BitStream::Vec "uses a Perl vec to store the data. The vector is accessed in 1-bit units"
Thanks 1E6,
B
You are confusing things by combining vec with unpack
The correct way is simply
unpack 'V', $bdat
which returns a value of 0x00000B6A as you expect
vec($bdat, 0, 32) is equivalent to unpack 'N', $bdat as you can see from the value of $xres in your first code block, and the documentation for vec confirms this with
If BITS is 16 or more, bytes of the input string are grouped into chunks of size BITS/8, and each group is converted to a number as with pack()/unpack() with big-endian formats n/N
The line
$vul = unpack("V", vec($bdat, 0, 32))
is very wrong, because the decimal value of vec($bdat, 0, 32) is 1779105792, so you are then calling unpack on the string "1779105792" which doesn't do anything useful at all

How can I deal with given situtaion related to Hardware change

I am maintaining a Production code related to FPGA device .Earlier resisters on FPGA are of 32 bits and read/write to these registers are working fine.But Hardware is changed and so did the FPGA device and with latest version of FPGA device we have trouble in read and write to FPGA register .After some R&D we came to know FPGA registers are no longer 32 bit ,it is now 31 bit registers and same has been claimed by FPGA device vendor.
So there is need to change small code as well.Earlier we were checking that address of registers are 4 byte aligned or not(because registers are of 32 bits)now with current scenario we have to check address are 31 bit aligned.So for the same we are going to check
if the most significant bit of the address is set (which means it is not a valid 31 bit).
I guess we are ok here.
Now second scenario is bit tricky for me.
if read/write for multiple registers that is going to go over the 0x7fff-fffc (which is the maximum address in 31 bit scheme) boundary, then have to handle request carefully.
Reading and Writing for multiple register takes length as an argument which is nothing but number of register to be read or write.
For example, if the read starts with 0x7fff-fff8, and length for the read is 5. Then actually, we can only read 2 registers (which is 0x7fff-fff8, and 0x7fff-fffc).
Now could somebody suggest me some kind of pseudo code to handle this scenario
Some think like below
while(lenght>1)
{
if(!(address<<(lenght*31) <= 0x7fff-fffc))
{
length--;
}
}
I know it is not good enough but something in same line which I can use.
EDIT
I have come up with a piece of code which may fulfill my requirement
int count;
Index_addr=addr;
while(Index_add <= 7ffffffc)
{
/*Wanted to move register address to next register address,each register is 31 bit wide and are at consecutive location. like 0x0,0x4 and 0x8 etc.*/
Index_add=addr<<1; // Guess I am doing wrong here ,would anyone correct it.
count++;
}
length=count;
The root problem seems to be that the program is not properly treating the FPGA registers.
Data encapsulation would help, and, instead of treating the 31-bit FPGA registers as memory locations, they should be abstracted.
The FPGA should be treated as a vector (a one-dimensional array) of registers.
The vector of N FPGA registers should be addressable by an register index in the range of 0x0000 through N-1.
The FPGA registers are memory mapped at base addr.
So the memory address = 4 * FPGA register index + base addr.
Access to the FPGA registers should be encapsulated by read and write procedures:
int read_fpga_reg(int reg_index, uint32_t *reg_valp)
{
if (reg_index < 0 || reg_index >= MAX_REG_INDEX)
return -1; /* error return */
*reg_valp = *(uint32_t *)(reg_index << 2 + fpga_base_addr);
return 0;
}
As long as MAX_REG_INDEX and fpga_base_addr are properly defined, then this code will never generate an invalid memory access.
I'm not absolutely sure I'm interpreting the given scenario correctly. But here's a shot at it:
// Assuming "address" starts 4-byte aligned and is just defined as an integer
unsigned uint32_t address; // (Assuming 32-bit unsigned longs)
while ( length > 0 ) // length is in bytes
{
// READ 4-byte value at "address"
// Mask the read value with 0x7FFFFFFF since there are 31 valid bits
// 32 bits (4 bytes) have been read
if ( (--length > 0) && (address < 0x7ffffffc) )
address += 4;
}

C Library for compressing sequential positive integers

I have the very common problem of creating an index for an in-disk array of strings. In short, I need to store the position of each string in the in-disk representation. For example, a very naive solution would be an index array as follows:
uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };
Which says that the first string is at position 0, the second at position 20, the third at position 500 and the nth at position 103434.
The positions are always non-negative 64 bits integers in sequential order. Although the numbers could vary by any difference, in practice I expect the typical difference to be inside the range from 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions will be accessed randomly (assume uniform distribution).
I was thinking about writing my own code for doing some sort of block delta encoding or other more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point and maybe even settle for something without any customizations.
Any hints? A c library would be ideal, but a c++ one would also allow me to run some initial benchmarks.
A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top the library cmph (http://cmph.sf.net). In short, it is for a large disk based read only associative map with a small index in memory.
Since it is a library, I don't have control over input, but the typical use case that I want to optimize have millions of hundreds of values, average value size in the few kilobytes ranges and maximum value at 2^31.
For the record, if I don't find a library ready to use I intend to implement delta encoding in blocks of 64 integers with the initial bytes specifying the block offset so far. The blocks themselves would be indexed with a tree, giving me O(log (n/64)) access time. There are way too many other options and I would prefer to not discuss them. I am really looking forward ready to use code rather than ideas on how to implement the encoding. I will be glad to share with everyone what I did once I have it working.
I appreciate your help and let me know if you have any doubts.
I use fastbit (Kesheng Wu LBL.GOV), it seems you need something good, fast and NOW, so fastbit is a highly competient improvement on Oracle's BBC (byte aligned bitmap code, berkeleydb). It's easy to setup and very good gernally.
However, given more time, you may want to look at a gray code solution, it seems optimal for your purposes.
Daniel Lemire has a number of libraries for C/++/Java released on code.google, I've read over some of his papers and they are quite nice, several advancements on fastbit and alternative approaches for column re-ordering with permutated grey codes's.
Almost forgot, I also came across Tokyo Cabinet, though I do not think it will be well suited for my current project, I may of considered it more if I had known about it before ;), it has a large degree of interoperability,
Tokyo Cabinet is written in the C
language, and provided as API of C,
Perl, Ruby, Java, and Lua. Tokyo
Cabinet is available on platforms
which have API conforming to C99 and
POSIX.
As you referred to CDB, the TC benchmark has a TC mode (TC support's several operational constraint's for varying perf) where it surpassed CDB by 10 times for read performance and 2 times for write.
With respect to your delta encoding requirement, I am quite confident in bsdiff and it's ability to out-perform any file.exe content patching system, it may also have some fundimental interfaces for your general needs.
Google's new binary compression application, courgette may be worth checking out, in case you missed the press release, 10x smaller diff's than bsdiff in the one test case I have seen published.
You have two conflicting requirements:
You want to compress very small items (8 bytes each).
You need efficient random access for each item.
The second requirement is very likely to impose a fixed length for each item.
What exactly are you trying to compress? If you are thinking about the total space of index, is it really worth the effort to save the space?
If so one thing you could try is to chop the space into half and store it into two tables. First stores (upper uint, start index, length, pointer to second table) and the second would store (index, lower uint).
For fast searching, indices would be implemented using something like B+ Tree.
I did something similar years ago for a full-text search engine. In my case, each indexed word generated a record which consisted of a record number (document id) and a word number (it could just as easily have stored word offsets) which needed to be compressed as much as possible. I used a delta-compression technique which took advantage of the fact that there would be a number of occurrences of the same word within a document, so the record number often did not need to be repeated at all. And the word offset delta would often fit within one or two bytes. Here is the code I used.
Since it's in C++, the code may is not going to be useful to you as is, but can be a good starting point for writing compressions routines.
Please excuse the hungarian notation and the magic numbers strewn within the code. Like I said, I wrote this many years ago :-)
IndexCompressor.h
//
// index compressor class
//
#pragma once
#include "File.h"
const int IC_BUFFER_SIZE = 8192;
//
// index compressor
//
class IndexCompressor
{
private :
File *m_pFile;
WA_DWORD m_dwRecNo;
WA_DWORD m_dwWordNo;
WA_DWORD m_dwRecordCount;
WA_DWORD m_dwHitCount;
WA_BYTE m_byBuffer[IC_BUFFER_SIZE];
WA_DWORD m_dwBytes;
bool m_bDebugDump;
void FlushBuffer(void);
public :
IndexCompressor(void) { m_pFile = 0; m_bDebugDump = false; }
~IndexCompressor(void) {}
void Attach(File& File) { m_pFile = &File; }
void Begin(void);
void Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo);
void End(void);
WA_DWORD GetRecordCount(void) { return m_dwRecordCount; }
WA_DWORD GetHitCount(void) { return m_dwHitCount; }
void DebugDump(void) { m_bDebugDump = true; }
};
IndexCompressor.cpp
//
// index compressor class
//
#include "stdafx.h"
#include "IndexCompressor.h"
void IndexCompressor::FlushBuffer(void)
{
ASSERT(m_pFile != 0);
if (m_dwBytes > 0)
{
m_pFile->Write(m_byBuffer, m_dwBytes);
m_dwBytes = 0;
}
}
void IndexCompressor::Begin(void)
{
ASSERT(m_pFile != 0);
m_dwRecNo = m_dwWordNo = m_dwRecordCount = m_dwHitCount = 0;
m_dwBytes = 0;
}
void IndexCompressor::Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo)
{
ASSERT(m_pFile != 0);
WA_BYTE buffer[16];
int nbytes = 1;
ASSERT(dwRecNo >= m_dwRecNo);
if (dwRecNo != m_dwRecNo)
m_dwWordNo = 0;
if (m_dwRecordCount == 0 || dwRecNo != m_dwRecNo)
++m_dwRecordCount;
++m_dwHitCount;
WA_DWORD dwRecNoDelta = dwRecNo - m_dwRecNo;
WA_DWORD dwWordNoDelta = dwWordNo - m_dwWordNo;
if (m_bDebugDump)
{
TRACE("%8X[%8X] %8X[%8X] : ", dwRecNo, dwRecNoDelta, dwWordNo, dwWordNoDelta);
}
// 1WWWWWWW
if (dwRecNoDelta == 0 && dwWordNoDelta < 128)
{
buffer[0] = 0x80 | WA_BYTE(dwWordNoDelta);
}
// 01WWWWWW WWWWWWWW
else if (dwRecNoDelta == 0 && dwWordNoDelta < 16384)
{
buffer[0] = 0x40 | WA_BYTE(dwWordNoDelta >> 8);
buffer[1] = WA_BYTE(dwWordNoDelta & 0x00ff);
nbytes += sizeof(WA_BYTE);
}
// 001RRRRR WWWWWWWW WWWWWWWW
else if (dwRecNoDelta < 32 && dwWordNoDelta < 65536)
{
buffer[0] = 0x20 | WA_BYTE(dwRecNoDelta);
WA_WORD *p = (WA_WORD *) (buffer+1);
*p = WA_WORD(dwWordNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
// 0001rrww
buffer[0] = 0x10;
// encode recno
if (dwRecNoDelta < 256)
{
buffer[nbytes] = WA_BYTE(dwRecNoDelta);
nbytes += sizeof(WA_BYTE);
}
else if (dwRecNoDelta < 65536)
{
buffer[0] |= 0x04;
WA_WORD *p = (WA_WORD *) (buffer+nbytes);
*p = WA_WORD(dwRecNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
buffer[0] |= 0x08;
WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
*p = dwRecNoDelta;
nbytes += sizeof(WA_DWORD);
}
// encode wordno
if (dwWordNoDelta < 256)
{
buffer[nbytes] = WA_BYTE(dwWordNoDelta);
nbytes += sizeof(WA_BYTE);
}
else if (dwWordNoDelta < 65536)
{
buffer[0] |= 0x01;
WA_WORD *p = (WA_WORD *) (buffer+nbytes);
*p = WA_WORD(dwWordNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
buffer[0] |= 0x02;
WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
*p = dwWordNoDelta;
nbytes += sizeof(WA_DWORD);
}
}
// update current setting
m_dwRecNo = dwRecNo;
m_dwWordNo = dwWordNo;
// add compressed data to buffer
ASSERT(buffer[0] != 0);
ASSERT(nbytes > 0 && nbytes < 10);
if (m_dwBytes + nbytes > IC_BUFFER_SIZE)
FlushBuffer();
CopyMemory(m_byBuffer + m_dwBytes, buffer, nbytes);
m_dwBytes += nbytes;
if (m_bDebugDump)
{
for (int i = 0; i < nbytes; ++i)
TRACE("%02X ", buffer[i]);
TRACE("\n");
}
}
void IndexCompressor::End(void)
{
FlushBuffer();
m_pFile->Write(WA_BYTE(0));
}
You've omitted critical information about the number of strings you intend to index.
But given that you say you expect the minimum length of an indexed string to be 256, storing the indices as 64% incurs at most 3% overhead. If the total length of the string file is less than 4GB, you could use 32-bit indices and incur 1.5% overhead. These numbers suggest to me that if compression matters, you're better off compressing the strings, not the indices. For that problem a variation on LZ77 seems in order.
If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how you can do with zziplib. This probably won't be great, but it's nearly zero work on your part.
More data on the problem would be welcome:
Number of strings
Average length of a string
Maximum length of a string
Median length of strings
Degree to which the strings file compresses with gzip
Whether you are allowed to change the order of strings to improve compression
EDIT
The comment and revised question makes the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding, group the deltas, and use a variable-length code within each group. I wouldn't wire in 64 as the group size–I think you will probably want to determine that empirically.
You asked for existing libraries. For the grouping and delta encoding I doubt you will find much. For variable-length integer codes, I'm not seeing much in the way of C libraries, but you can find variable-length codings in Perl and Python. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try—it can code unsigned integers up to 32 bits, and you can grab C code from Plan 9 and I'm sure many other sources.
Are you running on Windows? If so, I recommend creating the mmap file using naive solution your originally proposed, and then compressing the file using NTLM compression. Your application code never knows the file is compressed, and the OS does the file compression for you. You might not think this would be very performant or get good compression, but I think you'll be surprised if you try it.

Resources