Programmatically determining file "size on disk" in advance - C

I need to know how big a given in-memory buffer will be as an on-disk (USB stick) file before I write it. I know that unless the size falls on a block size boundary, it's likely to get rounded up, e.g. a 1 byte file takes up 4096 bytes on disk. I'm currently doing this using GetDiskFreeSpace() to work out the disk block size, then using that to calculate the on-disk size like this:
GetDiskFreeSpace(szDrive, &dwSectorsPerCluster,
                 &dwBytesPerSector, NULL, NULL);
dwBlockSize = dwSectorsPerCluster * dwBytesPerSector;
if (dwInMemorySize % dwBlockSize != 0)
{
    dwSizeOnDisk = ((dwInMemorySize / dwBlockSize) * dwBlockSize) + dwBlockSize;
}
else
{
    dwSizeOnDisk = dwInMemorySize;
}
This seems to work fine, BUT GetDiskFreeSpace() only works on disks up to 2GB according to MSDN. GetDiskFreeSpaceEx() doesn't return the same information, so my question is: how else can I calculate this information for drives larger than 2GB? Is there an API call I've missed? Can I assume some hard values depending on the overall disk size?

MSDN only states that the GetDiskFreeSpace() function cannot report volume sizes greater than 2GB. It works fine for retrieving sectors per cluster and bytes per sector; I've used it myself for very similar-looking code ;-)
But if you want disk capacity too, you'll need an additional call to GetDiskFreeSpaceEx().
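In other words, something like this (a minimal sketch with error handling omitted; szDrive is the root path from the question):
DWORD dwSectorsPerCluster = 0, dwBytesPerSector = 0;
GetDiskFreeSpace(szDrive, &dwSectorsPerCluster, &dwBytesPerSector, NULL, NULL);
DWORD dwClusterSize = dwSectorsPerCluster * dwBytesPerSector; // fine on any volume size

ULARGE_INTEGER ulFreeToCaller, ulTotal, ulTotalFree;
GetDiskFreeSpaceEx(szDrive, &ulFreeToCaller, &ulTotal, &ulTotalFree);
// ulTotal.QuadPart is the full volume size, with no 2GB cap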

The size of a file on disk is a fuzzy concept. In NTFS, a file consists of a set of data elements. You're primarily thinking of the "unnamed data stream". That's an attribute of a file that, if small, can be packed with the other attributes in the directory entry. Apparently, you can store a data stream of up to 700-800 bytes in the directory entry itself. Hence, your hypothetical 1 byte file would be as big as a 0 byte or 700 byte file.
Another influence is file compression. This will make the on-disk size potentially smaller than the in-memory size.
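If an after-the-fact answer is acceptable, you can also write the file and then ask the file system what it actually occupies. A sketch (szFileName is a placeholder for the written file's path; note that for files that are neither compressed nor sparse, GetCompressedFileSize simply returns the file size, so you would still round up to the cluster size yourself):
DWORD dwSizeHigh = 0;
DWORD dwSizeLow = GetCompressedFileSize(szFileName, &dwSizeHigh);
if (dwSizeLow != INVALID_FILE_SIZE || GetLastError() == NO_ERROR)
{
    // actual storage used for compressed/sparse files
    ULONGLONG ullOnDisk = ((ULONGLONG)dwSizeHigh << 32) | dwSizeLow;
}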

You should be able to obtain this information using the DeviceIoControl function and IOCTL_DISK_GET_DRIVE_GEOMETRY_EX. It returns a DISK_GEOMETRY_EX structure that contains the information you are looking for, I think:
http://msdn.microsoft.com/en-us/library/aa363216(VS.85).aspx
http://msdn.microsoft.com/en-us/library/ms809010.aspx
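For instance (a sketch with error handling omitted; requires winioctl.h; note this reports the drive's physical sector geometry and total size, not the file system's cluster size, so it answers a slightly different question):
HANDLE hDevice = CreateFile("\\\\.\\PhysicalDrive0", 0,
                            FILE_SHARE_READ | FILE_SHARE_WRITE,
                            NULL, OPEN_EXISTING, 0, NULL);
BYTE outBuf[512]; // DISK_GEOMETRY_EX is variable-length, so use a roomy buffer
DWORD dwReturned = 0;
if (DeviceIoControl(hDevice, IOCTL_DISK_GET_DRIVE_GEOMETRY_EX,
                    NULL, 0, outBuf, sizeof(outBuf), &dwReturned, NULL))
{
    DISK_GEOMETRY_EX *pGeom = (DISK_GEOMETRY_EX *)outBuf;
    // pGeom->Geometry.BytesPerSector, pGeom->DiskSize.QuadPart
}
CloseHandle(hDevice);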

In ActionScript!
var size:Number = 19912;
var sizeOnDisk:Number = size;
var remainder:Number = size % (1024 * 4);
if (remainder > 0) {
    sizeOnDisk = size + ((1024 * 4) - remainder);
}
trace(size);
trace(sizeOnDisk);

Related

Is there any way to crop a JPG image captured by an ESP cam?

I am trying to crop an image captured by an ESP cam. The image is in JPG format and is stored as a single-dimensional array; I tried to rearrange the elements in the array, but no changes occurred.
I have cropped the image in RGB565, but I am struggling to understand the single-dimensional array (image buffer).
camera_config_t config;
config.ledc_channel = LEDC_CHANNEL_0;
config.ledc_timer = LEDC_TIMER_0;
config.pin_d0 = Y2_GPIO_NUM;
config.pin_d1 = Y3_GPIO_NUM;
config.pin_d2 = Y4_GPIO_NUM;
config.pin_d3 = Y5_GPIO_NUM;
config.pin_d4 = Y6_GPIO_NUM;
config.pin_d5 = Y7_GPIO_NUM;
config.pin_d6 = Y8_GPIO_NUM;
config.pin_d7 = Y9_GPIO_NUM;
config.pin_xclk = XCLK_GPIO_NUM;
config.pin_pclk = PCLK_GPIO_NUM;
config.pin_vsync = VSYNC_GPIO_NUM;
config.pin_href = HREF_GPIO_NUM;
config.pin_sscb_sda = SIOD_GPIO_NUM;
config.pin_sscb_scl = SIOC_GPIO_NUM;
config.pin_pwdn = PWDN_GPIO_NUM;
config.pin_reset = RESET_GPIO_NUM;
config.xclk_freq_hz = 20000000;
config.pixel_format = PIXFORMAT_RGB565;
config.frame_size = FRAMESIZE_SVGA;
// config.jpeg_quality = 10;
config.fb_count = 2;
esp_err_t result = esp_camera_init(&config);
if (result != ESP_OK) {
    return false;
}
camera_fb_t * fb = NULL;
fb = esp_camera_fb_get();
if (!fb)
{
    Serial.println("Camera capture failed");
}
The fb buffer is a single-dimensional array; I want to extract each individual RGB value.
JPG is a compressed format, meaning that its rows and columns do not correspond to what you would see by displaying a 1:1 grid on the screen. You need to convert it to a plain RGB (or equivalent) format and then copy from that.
JPG achieves compression by splitting the image into YCbCr components, applying a mathematical transformation and then filtering. For additional information I refer to this page.
Luckily you can follow this tutorial to do the inverse JPEG transformation on an Arduino (tip: forget about doing this in real time, unless your time constraints are very relaxed).
The idea is to use a library that converts the JPEG image into an array of data:
Using the library is fairly simple: we give it the JPEG file, and the library will start generating arrays of pixels – so called Minimum Coded Units, or MCUs for short. The MCU is a block of 16 by 8 pixels. The functions in the library will return the color value for each pixel as 16-bit color value. The upper 5 bits are the red value, the middle 6 are green and the lower 5 are blue. Now we can send these values by any sort of communication channel we like.
For your use case you won't send the data through the communication channel, but rather store it in a local array by pushing the blocks into adjacent tiles, then do the crop.
That depends on what kind of hardware (camera and board) you are using.
I'm basing this on the OV2640 camera module because it's the one I've been working with. It delivers the image to the frame buffer already encoded, so I'm guessing this might be what you are facing.
Trying to crop the image after it has been encoded can be tricky, but you might be able to instruct the camera chip to only deliver a certain part of the sensor output in the first place using a window function.
The easiest way to access this setting is to define a small helper function:
void setWindow(int resolution, int xOffset, int yOffset, int xLength, int yLength) {
    sensor_t * s = esp_camera_sensor_get();
    resolution = 0; // note: forces mode 0, the full 1600x1200 sensor window
    s->set_res_raw(s, resolution, 0, 0, 0, xOffset, yOffset, xLength, yLength, xLength, yLength, true, true);
}
/*
 * resolution = 0 // 1600 x 1200
 * resolution = 1 // 800 x 600
 * resolution = 2 // 400 x 296
 */
where (xOffset,yOffset) is the origin of the window in pixels and (xLength,yLength) is the size of the window in pixels. Be aware that changing the resolution will effectively overwrite these settings. Otherwise this works great for me, although for some reason only if the aspect ratio of 4:3 is preserved in the window size.
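For example (illustrative values), to grab a centered 800x600 window out of the full 1600x1200 sensor area while keeping the 4:3 aspect ratio:
setWindow(0, 400, 300, 800, 600); // offset (400,300), window 800x600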
Looking at the output format table for the ESP32 Camera Driver one can see that most output formats are non-JPEG. If you can handle a RAW format instead (it will be slower to save/transfer, and MUCH larger), that would allow you to crop the image more easily by making a copy with a couple of loops. JPEG is compressed and not easily cropped. The page linked also mentions this:
Using YUV or RGB puts a lot of strain on the chip because writing to PSRAM is not particularly fast. The result is that image data might be missing. This is particularly true if WiFi is enabled. If you need RGB data, it is recommended that JPEG is captured and then turned into RGB using fmt2rgb888 or fmt2bmp/frame2bmp
If you are using PIXFORMAT_RGB565 (which means each pixel value will be kept in TWO bytes, and the image is not JPEG compressed) and FRAMESIZE_SVGA (800x600 pixels), you should be able to access the framebuffer as a two-dimensional array if you want:
uint16_t *buffer = (uint16_t *)fb->buf; // fb->buf is a uint8_t*, so cast it
uint16_t pxl = buffer[row * 800 + column]; // 800 is the SVGA width
// pxl now contains 5 R-bits, 6 G-bits, 5 B-bits
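The "copy with a couple of loops" crop then looks like this. A minimal sketch of my own (assumes the crop window lies fully inside the 800x600 frame; dst must hold w*h pixels):
void cropRGB565(const uint16_t *src, int srcWidth,
                uint16_t *dst, int x0, int y0, int w, int h)
{
    for (int row = 0; row < h; row++)
    {
        for (int col = 0; col < w; col++)
        {
            // copy one pixel from the full frame into the cropped image
            dst[row * w + col] = src[(y0 + row) * srcWidth + (x0 + col)];
        }
    }
}
// e.g. a 320x240 window starting at (100, 50):
// cropRGB565((const uint16_t *)fb->buf, 800, out, 100, 50, 320, 240);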

Receiving large data over TCP - is socket.MSG_WAITALL directly into a numpy buffer a bad idea?

Summary: I am receiving big image data over a TCP connection. What is the best way to put it into a numpy array without using too much space and copying?
I am getting image data from a hyperspectral camera (having way more than 3 wavelength bands) over TCP. The camera is kind of a black box to me. I want to receive the data and have it in a numpy array without too much data copying. I started with the usual examples (from the socket module docs and Stack Overflow), but I was not sure that was the best way, so I wanted to use a numpy array as a buffer. To do that I had to use socket.MSG_WAITALL, which I found on GitHub (because approach 3 - see below - was not working). But it is rarely used, appears in no example, and I have learned that TCP delivers data in unpredictable chunks. So I wanted to know what is behind that, why it should not be used (or should it?), and in general what you think is the best way to achieve this.
So here are my 3 tries (1 and 2 both run fine, in nearly the same time):
import socket
import numpy as np
def receiveImage1(self):
    # typical example - should have best performance they said
    MSGLEN = 480 * 252 * 640
    chunks = []
    bytes_recd = 0
    while bytes_recd < MSGLEN:
        chunk = self.connection.recv(min(MSGLEN - bytes_recd, 2048))
        bytes_recd = bytes_recd + len(chunk)
        chunks.append(chunk)
    # is this a copy?
    b = b''.join(chunks)
    n = np.frombuffer(b, np.dtype('<H')).reshape((480, 252, 320))
    return n
def receiveImage2(self):
    # create buffer and write into it
    MSGLEN = 480 * 252 * 640
    npbuffer = np.ones((480, 252, 320), np.dtype('<H'))
    self.connection.recv_into(npbuffer, MSGLEN, socket.MSG_WAITALL)
    return npbuffer
def receiveImage3(self):
    # create buffer and write into it in chunks (not working)
    MSGLEN = 480 * 252 * 640
    npbuffer = np.ones((480, 252, 320), np.dtype('<H'))
    bytes_recd = 0
    while bytes_recd < MSGLEN:
        lgt = self.connection.recv_into(npbuffer, min(MSGLEN - bytes_recd, 2048))
        bytes_recd = bytes_recd + lgt
    return npbuffer
So 1 and 2 run in nearly the same time. I am wondering about advantages and disadvantages, and about the copies and space used. What would you prefer?
When trying 3, I thought the pointer into the buffer would advance with the writing, so it would fill the complete buffer. (Of course it does not; instead it overwrites the first bytes over and over.) So I was trying to find a way to give the buffer some offset, but I could not find one. (I was hoping for some behaviour like a C pointer.) Is there some clever way to do this?
So in general what is the best approach to do this?
Thank you very much for your answers!

DirectShow data copy is TOO slow

I have a USB 3.0 HDMI capture device. It uses the YUY2 format (2 bytes per pixel) and 1920x1080 resolution.
The video capture output pin connects directly to the video renderer's input pin.
And everything works fine. It shows me 1920x1080 without any freezes.
But I need to take a screenshot every second. So this is what I do:
void CaptureInterface::ScreenShoot() {
    IMemInputPin* p_MemoryInputPin = nullptr;
    hr = p_RenderInputPin->QueryInterface(IID_IMemInputPin, (void**)&p_MemoryInputPin);

    IMemAllocator* p_MemoryAllocator = nullptr;
    hr = p_MemoryInputPin->GetAllocator(&p_MemoryAllocator);

    IMediaSample* p_MediaSample = nullptr;
    hr = p_MemoryAllocator->GetBuffer(&p_MediaSample, 0, 0, 0);

    long buff_size = p_MediaSample->GetSize(); // buff_size = 4147200 bytes
    BYTE* buff = nullptr;
    hr = p_MediaSample->GetPointer(&buff);

    // BYTE CaptureInterface::ScreenBuff[1920*1080*2]; defined in header
    //--------- TOO SLOW (1.5 seconds for 4 MBytes) ----------
    std::memcpy(ScreenBuff, buff, buff_size);
    //--------------------------------------------------------

    p_MediaSample->Release();
    p_MemoryAllocator->Release();
    p_MemoryInputPin->Release();
    return;
}
Any other operation on this buffer is very slow too.
But if I use memcpy on other data (for example, two arrays of the same 4MB size in my class), it is very fast: <0.01 sec.
Video memory is (or can be) slow to read back by its nature (e.g. VMR9 IBasicVideo->GetCurrentImage is very slow, and you can find other references). You normally want to grab the data before it actually reaches video memory.
Additionally, the way you read the data is not quite reliable. You don't know which frame you are actually copying, and it might happen that you read blackness or garbage, or, vice versa, that your acquiring access to the buffer freezes the main video streaming. This is because you are grabbing an unused buffer from the pool of available buffers rather than a buffer that corresponds to a specific video frame. Getting an image from such a buffer rests on the fragile assumption that leftover data from a previously streamed frame was initialized and has not yet been overwritten by anything else.
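A common way to grab the data earlier is to insert a Sample Grabber filter between the capture output pin and the renderer and let it keep a copy of the most recent sample. A sketch using the classic ISampleGrabber interface from qedit.h (which may not ship with newer SDKs); graph building and error handling omitted:
// While building the graph, add the grabber between capture and renderer:
IBaseFilter *pGrabberF = nullptr;
CoCreateInstance(CLSID_SampleGrabber, nullptr, CLSCTX_INPROC_SERVER,
                 IID_IBaseFilter, (void**)&pGrabberF);
ISampleGrabber *pGrabber = nullptr;
pGrabberF->QueryInterface(IID_ISampleGrabber, (void**)&pGrabber);

AM_MEDIA_TYPE mt = {};
mt.majortype = MEDIATYPE_Video;
mt.subtype = MEDIASUBTYPE_YUY2;
pGrabber->SetMediaType(&mt);
pGrabber->SetBufferSamples(TRUE); // keep a copy of the latest frame

// Once the graph is running, each screenshot is just:
long cb = 0;
pGrabber->GetCurrentBuffer(&cb, NULL);               // ask for the needed size
pGrabber->GetCurrentBuffer(&cb, (long *)ScreenBuff); // copy out of the grabber
The copy here comes from system memory owned by the grabber, so it avoids both the video-memory read-back penalty and the unused-buffer problem described above.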

C - Store global variables in flash?

As the title may suggest, I'm currently short on SRAM in my program and I can't find a way to reduce my global variables. Is it possible to move global variables over to flash memory? Since these variables are frequently read and written, would it be bad for the NAND flash because it has a limited number of read/write cycles?
If the flash cannot handle this, would EEPROM be a good alternative?
EDIT:
Sorry for the ambiguity guys. I'm working with Atmel AVR ATmega32HVB which has:
2K bytes of SRAM,
1K bytes of EEPROM
32K bytes of FLASH
Compiler: AVR C/C++
Platform: IAR Embedded AVR
The global variables that I want to get rid of are:
uint32_t capacityInCCAccumulated[TOTAL_CELL];
and
int32_t AccumulatedCCADCvalue[TOTAL_CELL];
Code snippets:
int32_t AccumulatedCCADCvalue[TOTAL_CELL];
void CCGASG_AccumulateCCADCMeasurements(int32_t ccadcMeasurement, uint16_t slowRCperiod)
{
    uint8_t cellIndex;
    // Sampling period dependent on configuration of CCADC sampling..
    int32_t temp = ccadcMeasurement * (int32_t)slowRCperiod;
    bool polChange = false;
    if(temp < 0) {
        temp = -temp;
        polChange = true;
    }
    // Add 0.5*divisor to get proper rounding
    temp += (1<<(CCGASG_ACC_SCALING-1));
    temp >>= CCGASG_ACC_SCALING;
    if(polChange) {
        temp = -temp;
    }
    for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++)
    {
        AccumulatedCCADCvalue[cellIndex] += temp;
    }
    // If it was a charge, update the charge cycle counter
    if(ccadcMeasurement <= 0) {
        // If it was a discharge, AccumulatedCCADCvalue can be negative, and that
        // is "impossible", so set it to zero
        for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++)
        {
            if(AccumulatedCCADCvalue[cellIndex] < 0)
            {
                AccumulatedCCADCvalue[cellIndex] = 0;
            }
        }
    }
}
And this
uint32_t capacityInCCAccumulated[TOTAL_CELL];
void BATTPARAM_InitSramParameters() {
    uint8_t cellIndex;
    // Active current threshold in ticks
    battParams_sram.activeCurrentThresholdInTicks = (uint16_t) BATTCUR_mA2Ticks(battParams.activeCurrentThreshold);
    for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++)
    {
        // Full charge capacity in CC accumulated
        battParams_sram.capacityInCCAccumulated[cellIndex] = (uint32_t) CCGASG_mAh2Acc(battParams.fullChargeCapacity);
    }
    // Terminate discharge limit in CC accumulated
    battParams_sram.terminateDischargeLimit = CCGASG_mAh2Acc(battParams.terminateDischargeLimit);
    // Values for remaining capacity calibration
    GASG_CalculateRemainingCapacityValues();
}
"would it be bad for the NAND flash because it has a limited number of read/write cycles?"
Yes, it's not a good idea to use flash for frequent modification of data.
Reading from flash does not reduce the lifetime of the flash; erasing and writing do.
Reading from and writing to flash is substantially slower than conventional memory.
To write a single byte, a whole block has to be erased and rewritten in flash.
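What flash is good for is data that never changes at runtime. Moving constant tables into program flash is a standard way to free SRAM. A minimal sketch in avr-gcc syntax (IAR uses the __flash memory qualifier for the same purpose); note this only helps for constants, not for the frequently written accumulator arrays in the question:
#include <stdint.h>
#include <avr/pgmspace.h>

// The table is placed in program flash and occupies no SRAM.
static const uint32_t lookupTable[4] PROGMEM = { 1000UL, 2000UL, 3000UL, 4000UL };

uint32_t readEntry(uint8_t i)
{
    // Flash must be read explicitly; a plain lookupTable[i] would read garbage.
    return pgm_read_dword(&lookupTable[i]);
}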
Any kind of Flash is a bad idea to be used for frequently changing values:
limited number of erase/write cycles, see datasheet.
very slow erase/write (erase can be ~1s), see datasheet.
You need a special sequence to erase then write (no language support).
While erasing or writing, accesses to Flash are blocked at best; some parts require not accessing the Flash at all (undefined behaviour).
Flash cells cannot be freely written per byte/word. Most have to be written per page (e.g. 64 bytes) and erased in much larger units (segments/blocks/sectors).
For NAND Flash, endurance is even more reduced compared to NOR Flash and the cells are less reliable (bits might flip occasionally or are defective), so you have to add error detection and correction. This is very likely a direction you should not go.
True EEPROM shares most issues, but they might be written byte/word-wise (internal erase).
Note that modern MCU-integrated "EEPROM" is most times also Flash. Some implementations just use slightly more reliable cells (about one decade more erase/write cycles than the program flash) and additional hardware allowing arbitrary byte/word write (automatic erase). But that is still not sufficient for frequent changes.
However, you first should verify whether your application can tolerate the lengthy write/erase times. Can you accept a process blocking that long, or can you rewrite your program accordingly? If the answer is "no", you should stop further investigation in that direction. Otherwise you should calculate the number of updates over the expected lifetime and compare it to the information in the datasheet. There are also methods to reduce the number of erase cycles, but that leads too far here.
If an external device (I2C/SPI) is an option, you could use a serial SRAM. Although the better (and likely cheaper) approach would be a larger MCU or think about a more efficient (i.e. less RAM, more code) way to store the data in SRAM.

C Library for compressing sequential positive integers

I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:
uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };
Which says that the first string is at position 0, the second at position 20, the third at position 500 and the nth at position 103434.
The positions are always non-negative 64-bit integers in sequential order. Although the numbers could vary by any difference, in practice I expect the typical difference to be inside the range from 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions to be accessed randomly (assume a uniform distribution).
I was thinking about writing my own code for doing some sort of block delta encoding or other more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point and maybe even settle for something without any customizations.
Any hints? A c library would be ideal, but a c++ one would also allow me to run some initial benchmarks.
A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the library cmph (http://cmph.sf.net). In short, it is for a large disk-based read-only associative map with a small index in memory.
Since it is a library, I don't have control over the input, but the typical use case that I want to optimize has hundreds of millions of values, an average value size in the few-kilobytes range, and a maximum value of 2^31.
For the record, if I don't find a library ready to use, I intend to implement delta encoding in blocks of 64 integers, with the initial bytes specifying the block offset so far. The blocks themselves would be indexed with a tree, giving me O(log (n/64)) access time. There are way too many other options and I would prefer not to discuss them. I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share with everyone what I did once I have it working.
I appreciate your help, and please let me know if you have any questions.
I use FastBit (Kesheng Wu, LBL.GOV); it seems you need something good, fast and NOW, and FastBit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, BerkeleyDB). It's easy to set up and very good generally.
However, given more time, you may want to look at a Gray code solution; it seems optimal for your purposes.
Daniel Lemire has a number of libraries for C/C++/Java released on code.google.com. I've read over some of his papers and they are quite nice: several advancements on FastBit and alternative approaches for column re-ordering with permuted Gray codes.
Almost forgot, I also came across Tokyo Cabinet. Though I do not think it is well suited for my current project, I might have considered it more if I had known about it before ;). It has a large degree of interoperability:
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX.
As you referred to CDB, the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) where it surpassed CDB by 10 times for read performance and 2 times for write performance.
With respect to your delta encoding requirement, I am quite confident in bsdiff and its ability to outperform any file.exe content patching system; it may also have some fundamental interfaces for your general needs.
Google's new binary compression application, Courgette, may be worth checking out, in case you missed the press release; 10x smaller diffs than bsdiff in the one test case I have seen published.
You have two conflicting requirements:
You want to compress very small items (8 bytes each).
You need efficient random access for each item.
The second requirement is very likely to impose a fixed length for each item.
What exactly are you trying to compress? If you are thinking about the total space of the index, is it really worth the effort to save the space?
If so, one thing you could try is to chop the 64-bit space in half and store it in two tables: the first stores (upper uint, start index, length, pointer to second table) and the second stores (index, lower uint).
For fast searching, indices would be implemented using something like a B+ tree.
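One way the idea could be realized (my own illustration; names and layout are hypothetical, and a linear scan stands in for the B+ tree):
#include <stdint.h>
#include <stddef.h>

struct UpperEntry {
    uint32_t upper;  // shared upper 32 bits of the positions in this run
    uint32_t start;  // index of the run's first entry in the lower table
    uint32_t count;  // number of entries in the run
};

// Reconstruct the i-th 64-bit position from the two tables.
// Assumes the runs are sorted and cover indices 0..N-1 contiguously.
uint64_t lookup(const struct UpperEntry *uppers, size_t nUppers,
                const uint32_t *lowers, size_t i)
{
    for (size_t u = 0; u < nUppers; u++)  // a B+ tree would replace this scan
        if (i < (size_t)uppers[u].start + uppers[u].count)
            return ((uint64_t)uppers[u].upper << 32) | lowers[i];
    return 0;  // i out of range
}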
I did something similar years ago for a full-text search engine. In my case, each indexed word generated a record which consisted of a record number (document id) and a word number (it could just as easily have stored word offsets) which needed to be compressed as much as possible. I used a delta-compression technique which took advantage of the fact that there would be a number of occurrences of the same word within a document, so the record number often did not need to be repeated at all. And the word offset delta would often fit within one or two bytes. Here is the code I used.
Since it's in C++, the code is not going to be useful to you as is, but it can be a good starting point for writing compression routines.
Please excuse the hungarian notation and the magic numbers strewn within the code. Like I said, I wrote this many years ago :-)
IndexCompressor.h
//
// index compressor class
//
#pragma once

#include "File.h"

const int IC_BUFFER_SIZE = 8192;

//
// index compressor
//
class IndexCompressor
{
private:
    File *m_pFile;
    WA_DWORD m_dwRecNo;
    WA_DWORD m_dwWordNo;
    WA_DWORD m_dwRecordCount;
    WA_DWORD m_dwHitCount;
    WA_BYTE m_byBuffer[IC_BUFFER_SIZE];
    WA_DWORD m_dwBytes;
    bool m_bDebugDump;

    void FlushBuffer(void);

public:
    IndexCompressor(void) { m_pFile = 0; m_bDebugDump = false; }
    ~IndexCompressor(void) {}

    void Attach(File& File) { m_pFile = &File; }

    void Begin(void);
    void Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo);
    void End(void);

    WA_DWORD GetRecordCount(void) { return m_dwRecordCount; }
    WA_DWORD GetHitCount(void) { return m_dwHitCount; }

    void DebugDump(void) { m_bDebugDump = true; }
};
IndexCompressor.cpp
//
// index compressor class
//
#include "stdafx.h"
#include "IndexCompressor.h"

void IndexCompressor::FlushBuffer(void)
{
    ASSERT(m_pFile != 0);
    if (m_dwBytes > 0)
    {
        m_pFile->Write(m_byBuffer, m_dwBytes);
        m_dwBytes = 0;
    }
}

void IndexCompressor::Begin(void)
{
    ASSERT(m_pFile != 0);
    m_dwRecNo = m_dwWordNo = m_dwRecordCount = m_dwHitCount = 0;
    m_dwBytes = 0;
}

void IndexCompressor::Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo)
{
    ASSERT(m_pFile != 0);
    WA_BYTE buffer[16];
    int nbytes = 1;

    ASSERT(dwRecNo >= m_dwRecNo);
    if (dwRecNo != m_dwRecNo)
        m_dwWordNo = 0;
    if (m_dwRecordCount == 0 || dwRecNo != m_dwRecNo)
        ++m_dwRecordCount;
    ++m_dwHitCount;

    WA_DWORD dwRecNoDelta = dwRecNo - m_dwRecNo;
    WA_DWORD dwWordNoDelta = dwWordNo - m_dwWordNo;

    if (m_bDebugDump)
    {
        TRACE("%8X[%8X] %8X[%8X] : ", dwRecNo, dwRecNoDelta, dwWordNo, dwWordNoDelta);
    }

    // 1WWWWWWW
    if (dwRecNoDelta == 0 && dwWordNoDelta < 128)
    {
        buffer[0] = 0x80 | WA_BYTE(dwWordNoDelta);
    }
    // 01WWWWWW WWWWWWWW
    else if (dwRecNoDelta == 0 && dwWordNoDelta < 16384)
    {
        buffer[0] = 0x40 | WA_BYTE(dwWordNoDelta >> 8);
        buffer[1] = WA_BYTE(dwWordNoDelta & 0x00ff);
        nbytes += sizeof(WA_BYTE);
    }
    // 001RRRRR WWWWWWWW WWWWWWWW
    else if (dwRecNoDelta < 32 && dwWordNoDelta < 65536)
    {
        buffer[0] = 0x20 | WA_BYTE(dwRecNoDelta);
        WA_WORD *p = (WA_WORD *) (buffer+1);
        *p = WA_WORD(dwWordNoDelta);
        nbytes += sizeof(WA_WORD);
    }
    else
    {
        // 0001rrww
        buffer[0] = 0x10;

        // encode recno
        if (dwRecNoDelta < 256)
        {
            buffer[nbytes] = WA_BYTE(dwRecNoDelta);
            nbytes += sizeof(WA_BYTE);
        }
        else if (dwRecNoDelta < 65536)
        {
            buffer[0] |= 0x04;
            WA_WORD *p = (WA_WORD *) (buffer+nbytes);
            *p = WA_WORD(dwRecNoDelta);
            nbytes += sizeof(WA_WORD);
        }
        else
        {
            buffer[0] |= 0x08;
            WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
            *p = dwRecNoDelta;
            nbytes += sizeof(WA_DWORD);
        }

        // encode wordno
        if (dwWordNoDelta < 256)
        {
            buffer[nbytes] = WA_BYTE(dwWordNoDelta);
            nbytes += sizeof(WA_BYTE);
        }
        else if (dwWordNoDelta < 65536)
        {
            buffer[0] |= 0x01;
            WA_WORD *p = (WA_WORD *) (buffer+nbytes);
            *p = WA_WORD(dwWordNoDelta);
            nbytes += sizeof(WA_WORD);
        }
        else
        {
            buffer[0] |= 0x02;
            WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
            *p = dwWordNoDelta;
            nbytes += sizeof(WA_DWORD);
        }
    }

    // update current setting
    m_dwRecNo = dwRecNo;
    m_dwWordNo = dwWordNo;

    // add compressed data to buffer
    ASSERT(buffer[0] != 0);
    ASSERT(nbytes > 0 && nbytes < 10);
    if (m_dwBytes + nbytes > IC_BUFFER_SIZE)
        FlushBuffer();
    CopyMemory(m_byBuffer + m_dwBytes, buffer, nbytes);
    m_dwBytes += nbytes;

    if (m_bDebugDump)
    {
        for (int i = 0; i < nbytes; ++i)
            TRACE("%02X ", buffer[i]);
        TRACE("\n");
    }
}

void IndexCompressor::End(void)
{
    FlushBuffer();
    m_pFile->Write(WA_BYTE(0));
}
You've omitted critical information about the number of strings you intend to index.
But given that you say you expect the minimum length of an indexed string to be 256, storing the indices as 64-bit integers incurs at most about 3% overhead. If the total length of the string file is less than 4GB, you could use 32-bit indices and incur about 1.5% overhead. These numbers suggest to me that if compression matters, you're better off compressing the strings, not the indices. For that problem a variation on LZ77 seems in order.
If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how well you can do with zziplib. This probably won't be great, but it's nearly zero work on your part.
More data on the problem would be welcome:
Number of strings
Average length of a string
Maximum length of a string
Median length of strings
Degree to which the strings file compresses with gzip
Whether you are allowed to change the order of strings to improve compression
EDIT
The comment and revised question make the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding: group the deltas and use a variable-length code within each group. I wouldn't wire in 64 as the group size; I think you will probably want to determine that empirically.
You asked for existing libraries. For the grouping and delta encoding I doubt you will find much. For variable-length integer codes, I'm not seeing much in the way of C libraries, but you can find variable-length codings in Perl and Python. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try—it can code unsigned integers up to 32 bits, and you can grab C code from Plan 9 and I'm sure many other sources.
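If you do end up rolling your own, the core of such a code is small. A minimal sketch of my own (LEB128-style rather than strict UTF-8: 7 payload bits per byte, high bit as continuation flag), which you would apply to the deltas rather than the absolute positions:
#include <stdint.h>
#include <stddef.h>

// Encode v; returns the number of bytes written (at most 10 for 64-bit values).
size_t varint_encode(uint64_t v, uint8_t *out)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);  // low 7 bits, continuation bit set
        v >>= 7;
    }
    out[n++] = (uint8_t)v;               // final byte, continuation bit clear
    return n;
}

// Decode one value into *v; returns the number of bytes consumed.
size_t varint_decode(const uint8_t *in, uint64_t *v)
{
    uint64_t result = 0;
    size_t n = 0;
    int shift = 0;
    do {
        result |= (uint64_t)(in[n] & 0x7F) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    *v = result;
    return n;
}
With the 2^8 to 2^20 gaps you describe, most deltas would fit in one to three bytes.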
Are you running on Windows? If so, I recommend creating the mmap'ed file using the naive solution you originally proposed, and then compressing the file using NTFS compression. Your application code never knows the file is compressed, and the OS does the file compression for you. You might not think this would be very performant or get good compression, but I think you'll be surprised if you try it.
