Using bzip2 low-level routines to compress chunks of data

Using bzip2 low-level routines to compress chunks of data - c

The Overview
I am using the low-level calls in the libbzip2 library: BZ2_bzCompressInit(), BZ2_bzCompress() and BZ2_bzCompressEnd() to compress chunks of data to standard output.
I am migrating working code from higher-level calls, because I have a stream of bytes coming in and I want to compress those bytes in sets of discrete chunks (a discrete chunk is a set of bytes that contains a group of tokens of interest — my input is logically divided into groups of these chunks).
A complete group of chunks might contain, say, 500 chunks, which I want to compress to one bzip2 stream and write to standard output.
Within a set, using the pseudocode I outline below, if my example buffer is able to hold 101 chunks at a time, I would open a new stream, compress 500 chunks in runs of 101, 101, 101, 101, and one final run of 96 chunks that closes the stream.
The Problem
The issue is that my bz_stream structure instance, which keeps tracks of the number of compressed bytes in a single pass of the BZ2_bzCompress() routine, seems to claim to be writing more compressed bytes than the total bytes in the final, compressed file.
For example, the compressed output could be a file with a true size of 1234 bytes, while the number of reported compressed bytes (which I track while debugging) is somewhat higher than 1234 bytes (say 2345 bytes).
My rough pseudocode is in two parts.
The first part is a rough sketch of what I do to compress a subset of chunks (and I know that I have another subset coming after this one):
bz_stream bzStream;
unsigned char bzBuffer[BZIP2_BUFFER_MAX_LENGTH] = {0};
unsigned long bzBytesWritten = 0UL;
unsigned long long cumulativeBytesWritten = 0ULL;
unsigned char myBuffer[UNCOMPRESSED_MAX_LENGTH] = {0};
size_t myBufferLength = 0;
/* initialize bzStream */
bzStream.next_in = NULL;
bzStream.avail_in = 0U;
bzStream.avail_out = 0U;
bzStream.bzalloc = NULL;
bzStream.bzfree = NULL;
bzStream.opaque = NULL;
int bzError = BZ2_bzCompressInit(&bzStream, 9, 0, 0);
/* bzError checking... */
do
{
/* read some bytes into myBuffer... */
/* compress bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do
{
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
/* error checking... */
bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
cumulativeBytesWritten += bzBytesWritten;
/* write compressed data in bzBuffer to standard output */
fwrite(bzBuffer, 1, bzBytesWritten, stdout);
fflush(stdout);
}
while (bzError == BZ_OK);
}
while (/* while there is a non-final myBuffer full of discrete chunks left to compress... */);
Now we wrap up the output:
/* read in the final batch of bytes into myBuffer (with a total byte size of `myBufferLength`... */
/* compress remaining myBufferLength bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do
{
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
/* bzError error checking... */
/* increment cumulativeBytesWritten by `bz_stream` struct `total_out_*` members */
bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
cumulativeBytesWritten += bzBytesWritten;
/* write compressed data in bzBuffer to standard output */
fwrite(bzBuffer, 1, bzBytesWritten, stdout);
fflush(stdout);
}
while (bzError != BZ_STREAM_END);
/* close stream */
bzError = BZ2_bzCompressEnd(&bzStream);
/* bzError checking... */
The Questions
Am I calculating cumulativeBytesWritten (or, specifically, bzBytesWritten) incorrectly, and how would I fix that?
I have been tracking these values in a debug build, and I do not seem to be "double counting" the bzBytesWritten value. This value is counted and used once to increment cumulativeBytesWritten after each successful BZ2_bzCompress() pass.
Alternatively, am I not understanding the correct use of the bz_stream state flags?
For example, does the following compress and keep the bzip2 stream open, so long as I keep sending some bytes?
bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
Likewise, can the following statement compress data, so long as there are at least some bytes are available to access from the bzStream.next_in pointer (BZ_RUN), and then the stream is wrapped up when there are no more bytes available (BZ_FINISH)?
bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
Or, am I not using these low-level calls correctly at all? Should I go back to using the higher-level calls to continuously append a grouping of compressed chunks of data to one main file?
There's probably a simple solution to this, but I've been banging my head on the table for a couple days in the course of debugging what could be wrong, and I'm not making much progress. Thank you for any advice.

In answer to my own question, it appears I am miscalculating the number of bytes written. I should not use the total_out_* members. The following correction works properly:
bzBytesWritten = sizeof(bzBuffer) - bzStream.avail_out;
The rest of the calculations follow.

Related

C. Loop compression + send (gzip) ZLIB

I'm currently building an HTTP server in C.
Please mind this piece of code :
#define CHUNK 0x4000
z_stream strm;
unsigned char out[CHUNK];
int ret;
char buff[200];
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL;
int windowsBits = 15;
int GZIP_ENCODING = 16;
ret = deflateInit2(&strm, Z_BEST_SPEED, Z_DEFLATED, windowsBits | GZIP_ENCODING, 1,
Z_DEFAULT_STRATEGY);
fill(buff); //fill buff with infos
do {
strm.next_in = (z_const unsigned char *)buff;
strm.avail_in = strlen(buff);
do {
strm.avail_out = CHUNK;
strm.next_out = out;
ret = deflate(&strm, Z_FINISH);
} while (strm.avail_out == 0);
send_to_client(out); //sending a part of the gzip encoded string
fill(buff);
}while(strlen(buff)!=0);
The idea is : I'm trying to send gzip'ed buffers, one by one, that (when they're concatened) is a whole body request.
BUT : for now, my client (a browser) only get the infos of the first buffer. No errors at all though.
How do I achieve this job, how to gzip some buffers inside a loop so I can send them everytime (in the loop) ?

First off, you need to do something with the generated deflate data after each deflate() call. Your code discards the compressed data generated in the inner loop. From this example, after the deflate() you would need something like:
have = CHUNK - strm.avail_out;
if (fwrite(out, 1, have, dest) != have || ferror(dest)) {
(void)deflateEnd(&strm);
return Z_ERRNO;
}
That's where your send_to_client needs to be, sending have bytes.
In your case, your CHUNK is so much larger than your buff, that loop is always executing only once, so you are not discarding any data. However that is only happening because of the Z_FINISH, so when you make the next fix, you will be discarding data with your current code.
Second, you are finishing the deflate() stream each time after no more than 199 bytes of input. This will greatly limit how much compression you can get. Furthermore, you are sending individual gzip streams, for which most browsers will only interpret the first one. (This is actually a bug in those browsers, but I don't imagine they will be fixed.)
You need to give the compressor at least 10's to 100's of Kbytes to work with in order get decent compression. You need to use Z_NO_FLUSH instead of Z_FINISH until you get to your last buff you want to send. Then you use Z_FINISH. Again, take a look at the example and read the comments.

Get the character dominant from a string

Okay.. according to the title i am trying to figure out a way - function that returns the character that dominates in a string. I might be able to figure it out.. but it seems something is wrong with my logic and i failed on this. IF someome can come up with this without problems i will be extremelly glad thank you.
I say "in a string" to make it more simplified. I am actually doing that from a buffered data containing a BMP image. Trying to output the base color (the dominant pixel).
What i have for now is that unfinished function i started:
RGB
bitfox_get_primecolor_direct
(char *FILE_NAME)
{
dword size = bmp_dgets(FILE_NAME, byte);
FILE* fp = fopen(convert(FILE_NAME), "r");
BYTE *PIX_ARRAY = malloc(size-54+1), *PIX_CUR = calloc(sizeof(RGB), sizeof(BYTE));
dword readed, i, l;
RGB color, prime_color;
fseek(fp, 54, SEEK_SET); readed = fread(PIX_ARRAY, 1, size-54, fp);
for(i = 54; i<size-54; i+=3)
{
color = bitfox_pixel_init(PIXEL_ARRAY[i], PIXEL_ARRAY[i+1], PIXEL_ARRAY[i+2);
memmove(PIX_CUR, color, sizeof(RGB));
for(l = 54; l<size-54; l+=3)
{
if (PIX_CUR[2] == PIXEL_ARRAY[l] && PIX_CUR[1] == PIXEL_ARRAY[l+1] &&
PIX_CUR[0] == PIXEL_ARRAY[l+2])
{
}
Note that RGB is a struct containing 3 bytes (R, G and B).
I know thats nothing but.. thats all i have for now.
Is there any way i can finish this?

If you want this done fast throw a stack of RAM at it (if available, of course). You can use a large direct-lookup table with the RGB trio to manufacture a sequence of 24bit indexes into a contiguous array of counters. In partial-pseudo, partial code, something like this:
// create a zero-filled 2^24 array of unsigned counters.
uint32_t *counts = calloc(256*256*256, sizeof(*counts));
uint32_t max_count = 0
// enumerate your buffer of RGB values, three bytes at a time:
unsigned char rgb[3];
while (getNextRGB(src, rgb)) // returns false when no more data.
{
uint32_t idx = (((uint32_t)rgb[0]) << 16) | (((uint32_t)rgb[1]) << 8) | (uint32_t)rgb[2];
if (++counts[idx] > max_count)
max_count = idx;
}
R = (max_count >> 16) & 0xFF;
G = (max_count >> 8) & 0xFF;
B = max_count & 0xFF;
// free when you have no more images to process. for each new
// image you can memset the buffer to zero and reset the max
// for a fresh start.
free(counts);
Thats it. If you can afford to throw a big hulk of memory at this a (it would be 64MB in this case, at 4 bytes per entry at 16.7M entries), then performing this becomes O(N). If you have a succession of images to process you can simply memset() the array back to zeros, clear max_count, and repeat for each additional file. Finally, don't forget to free your memory when finished.
Best of luck.

Block ram disk fails to read/write with offset

I'm creating a very very simple block RAM disk based on sbull.
So far it works fine if I read/write blocks of data using dd, but whenever I try mounting a filesystem on it (and sometimes creating a file system) my driver crashes.
After long weeks of debugging, I finally found out what is wrong, even though I can't really find a way to solve the problem. Hence my question here :)
Whenever a user space application creates a request to the device WITH AN OFFSET, the driver won't work! Let me show you the source code in order to clarify:
First of all, I'm handling requests using mk_request (not using a request_queue):
static void escsi_mk_request(struct request_queue *q, struct bio *bio)
{
struct block_device *bdev = bio->bi_bdev;
struct escsi_dev *esd = bdev->bd_disk->private_data;
int rw;
struct bio_vec *bvec;
sector_t sector;
int i;
int err = -EIO;
printk("request received nr. sectors = %lu\n",bio_sectors(bio));
sector = bio->bi_sector;
if (bio_end_sector(bio) > get_capacity(bdev->bd_disk))
goto out;
if (unlikely(bio->bi_rw & REQ_DISCARD)) {
err = 0;
goto out;
}
rw = bio_rw(bio);
if (rw == READA)
rw = READ;
bio_for_each_segment(bvec, bio, i) {
unsigned int len = bvec->bv_len;
err = esd_do_bvec(esd, bvec->bv_page, len, bvec->bv_offset, rw, sector);
if (err) {
printk("err!\n");
break;
}
sector += len >> SECTOR_SHIFT;
}
out:
bio_endio(bio, err);
}
The esd_do_bvec function:
static int esd_do_bvec(struct escsi_dev *esd, struct page *page,
unsigned int len, unsigned int off, int rw,
sector_t sector)
{
void *mem;
int err = 0;
unsigned int offset;
int i;
offset = off + sector * 512;
printk("ESD RW=%d, len=%d, off=%d, offset=%d, sector=%lu\n",rw,len,off,offset,sector);
mem = kmap_atomic(page);
if (rw == READ) {
memcpy(mem,esd->data+offset,len);
} else {
memcpy(esd->data+offset,mem,len);
}
kunmap_atomic(mem);
out:
return err;
}
OK, so basically when I read or write data using dd, the variable "off" in esd_do_bvec() is always 0, regardless of where and how many bytes I want to write. The file system obviously always performs I/O in 4KB chunks and will write a full block even when only one byte needs to be replaced.
I am sure that reads and writes are working correctly when there's no offset because I created a file that is the same size as my block RAM disk and dumped the entire file into my device using dd, then got the output of the device (also using dd), and the input and output files are exactly the same. I also wrote the same file into a brd (Linux kernel original block RAM disk driver) and the outputs are the same comparing my device and the brd device.
BUT -- in some specific situations I try to mount or create a new file system on my device and somehow it gets I/O requests with an offset, and at that point my driver fails. I assume that I'm not handling the offset properly. For example, when I try "mount -t ext2 /dev/esda":
linux-xjwl:/home/phil/escsi # mount /dev/esda -t ext2 /mnt/esda1/
mount: wrong fs type, bad option, bad superblock on /dev/esda,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
linux-xjwl:/home/phil/escsi # dmesg|tail -n 10
[ 2239.275901] ESD RW=0, len=4096, off=0, offset=16384, sector=32
[ 2239.275947] request received nr. sectors = 8
[ 2239.275959] ESD RW=0, len=4096, off=0, offset=4096, sector=8
[ 2239.276516] request received nr. sectors = 8
[ 2239.276537] ESD RW=0, len=4096, off=0, offset=2097152, sector=4096
[ 2239.276606] request received nr. sectors = 8
[ 2239.276626] ESD RW=0, len=4096, off=0, offset=28672, sector=56
[ 2239.277535] request received nr. sectors = 2
[ 2239.277535] ESD RW=0, len=1024, off=1024, offset=2048, sector=2
[ 2239.277535] EXT4-fs (esda): VFS: Can't find ext4 filesystem
(p.s.: the output shows "EXT4" but I am running with "-t ext2")
I have checked the contents of sector n. 2 in my device and it does contain the ext2 metadata (since I ran mkfs.ext2 prior to trying to mount, of course). So I believe there's a problem with the offset. So far I can't really debug my driver because I wasn't able to come up with a request which would cause an I/O request with an offset (e.g., if I try writing a single byte into my device, Linux will read the whole block and rewrite it with only one different byte).
Hope it's not a too simple question for you.
Thanks in advance,
Phil
Please see the answer provided by Peter below.
If you're wondering what the esd_do_bvec() function looks like now, here it comes:
static int esd_do_bvec(struct escsi_dev *esd, char *buf,
unsigned int len, int rw, sector_t sector)
{
int err = 0;
unsigned int offset;
// Please notice that we STILL have an offset to deal with, but
// this offset comes in sectors and needs to be converted to a
// a byte offset.
offset = sector << SECTOR_SHIFT; // or multiply by 512
//printk("ESD RW=%d, len=%d, off=%d, offset=%d, sector=%lu\n",rw,len,off,offset,sector);
if (rw == READ) {
memcpy(buf,esd->data+offset,len);
} else {
memcpy(esd->data+offset,buf,len);
}
return err;
}

The offset per segment does not refer to an offset from the block device location, but rather an offset into the page. To cause this to be nonzero, you'll probably need to write your own C program that runs read() and write(). Allocate a page-aligned buffer, then read/write to/from different locations in that buffer, and those should show up as offsets in the bvec.
That said, LWN warns of managing this page offset manually, and recommends instead the macro bio_kmap_irq(), which is called on the bio_for_each_segment() variable bio, and takes care of the atomic kmap AND manages the offset entry as well. Source: http://lwn.net/Articles/26404/
Your code will look something like:
bio_for_each_segment(bvec, bio, i) {
unsigned int len = bvec->bv_len;
unsigned long flags;
char *buf = bio_kmap_irq(bio, &flags);
err = esd_do_bvec(esd, buf, len, rw, sector);
bio_kunmap_irq(buf, &flags);
if (err) {
printk("err!\n");
break;
}
sector += len >> SECTOR_SHIFT;
}
Of course this changes the signature of esd_do_bvec to accept the memory buffer directly rather than page/offset.

Uncompress a gzip File in Memory Using zlib Version 1.1.3

I have a gzip file that is in memory, and I would like to uncompress it using zlib, version 1.1.3. Uncompress() is returning -3, Z_DATA_ERROR, indicating the source data is corrupt.
I know that my in memory buffer is correct - if I write the buffer out to a file, it is the same as my source gzip file.
The gzip file format indicates that there is a 10 byte header, optional headers, the data, and a footer. Is it possible to determine where the data starts, and strip that portion out? I performed a search on this topic, and a couple people have suggested using inflateInit2(). However, in my version of zlib, that function is oddly commented out. Is there any other options?

I came across the same problem, other zlib version (1.2.7)
I don't know why inflateInit2() is commented out.
Without calling inflateInit2 you can do the following:
err = inflateInit(&d_stream);
err = inflateReset2(&d_stream, 31);
the inflateReset2 is also called by inflateInit. Inside of inflateInit the WindowBits are set to 15 (1111 binary). But you have to set them to 31 (11111) to get gzip working.
The reason is here:
inside of inflateReset2 the following is done:
wrap = (windowBits >> 4) + 1;
which leads to 1 if window bits are set 15 (1111 binary) and to 2 if window bits are set 31 (11111)
Now if you call inflate() the following line in the HEAD state checks the state->wrap value along with the magic number for gzip
if ((state->wrap & 2) && hold == 0x8b1f) { /* gzip header */
So with the following code I was able to do in-memory gzip decompression:
(Note: this code presumes that the complete data to be decompressed is in memory and that the buffer for decompressed data is large enough)
int err;
z_stream d_stream; // decompression stream
d_stream.zalloc = (alloc_func)0;
d_stream.zfree = (free_func)0;
d_stream.opaque = (voidpf)0;
d_stream.next_in = deflated; // where deflated is a pointer the the compressed data buffer
d_stream.avail_in = deflatedLen; // where deflatedLen is the length of the compressed data
d_stream.next_out = inflated; // where inflated is a pointer to the resulting uncompressed data buffer
d_stream.avail_out = inflatedLen; // where inflatedLen is the size of the uncompressed data buffer
err = inflateInit(&d_stream);
err = inflateReset2(&d_stream, 31);
err = inflateEnd(&d_stream);
Just commenting in inflateInit2() is the oder solution. Here you can set WindowBits directly

Is it possible to determine where the data starts, and strip that portion out?
Gzip has the following magic number:
static const unsigned char gzipMagicBytes[] = { 0x1f, 0x8b, 0x08, 0x00 };
You can read through a file stream and look for these bytes:
static const int testElemSize = sizeof(unsigned char);
static const int testElemCount = sizeof(gzipMagicBytes);
const char *fn = "foo.bar";
FILE *fp = fopen(fn, "rbR");
char testMagicBuffer[testElemCount] = {0};
unsigned long long testMagicOffset = 0ULL;
if (fp != NULL) {
do {
if (memcmp(testMagicBuffer, gzipMagicBytes, sizeof(gzipMagicBytes)) == 0) {
/* we found gzip magic bytes, do stuff here... */
fprintf(stdout, "gzip stream found at byte offset: %llu\n", testMagicOffset);
break;
}
testMagicOffset += testElemSize * testElemCount;
fseek(fp, testMagicOffset - testElemCount + 1, SEEK_SET);
testMagicOffset -= testElemCount + 1;
} while (fread(testMagicBuffer, testElemSize, testElemCount, fp));
}
fclose(fp);
Once you have the offset, you could do copy and paste operations, or overwrite other bytes, etc.

Techniques for handling short reads/writes with scatter-gather?

Scatter-gather - readv()/writev()/preadv()/pwritev() - reads/writes a variable number of iovec structs in a single system call. Basically it reads/write each buffer sequentially from the 0th iovec to the Nth. However according to the documentation it can also return less on the readv/writev calls than was requested. I was wondering if there is a standard/best practice/elegant way to handle that situation.
If we are just handling a bunch of character buffers or similar this isn't a big deal. But one of the niceties is using scatter-gather for structs and/or discrete variables as the individual iovec items. How do you handle the situation where the readv/writev only reads/writes a portion of a struct or half of a long or something like that.
Below is some contrived code of what I am getting at:
int fd;
struct iovec iov[3];
long aLong = 74775767;
int aInt = 949;
char aBuff[100]; //filled from where ever
ssize_t bytesWritten = 0;
ssize_t bytesToWrite = 0;
iov[0].iov_base = &aLong;
iov[0].iov_len = sizeof(aLong);
bytesToWrite += iov[0].iov_len;
iov[1].iov_base = &aInt;
iov[1].iov_len = sizeof(aInt);
bytesToWrite += iov[1].iov_len;
iov[2].iov_base = &aBuff;
iov[2].iov_len = sizeof(aBuff);
bytesToWrite += iov[2].iov_len;
bytesWritten = writev(fd, iov, 3);
if (bytesWritten == -1)
{
//handle error
}
if (bytesWritten < bytesToWrite)
//how to gracefully continue?.........

Use a loop like the following to advance the partially-processed iov:
for (;;) {
written = writev(fd, iov+cur, count-cur);
if (written < 0) goto error;
while (cur < count && written >= iov[cur].iov_len)
written -= iov[cur++].iov_len;
if (cur == count) break;
iov[cur].iov_base = (char *)iov[cur].iov_base + written;
iov[cur].iov_len -= written;
}
Note that if you don't check for cur < count you will read past the end of iov which might contain zero.

AFAICS the vectored read/write functions work the same wrt short reads/writes as the normal ones. That is, you get back the number of bytes read/written, but this might well point into the middle of a struct, just like with read()/write(). There is no guarantee that the possible "interruption points" (for lack of a better term) coincide with the vector boundaries. So unfortunately the vectored IO functions offer no more help for dealing with short reads/writes than the normal IO functions. In fact, it's more complicated since you need to map the byte count into an IO vector element and offset within the element.
Also note that the idea of using vectored IO for individual structs or data items might not work that well; the max allowed value for the iovcnt argument (IOV_MAX) is usually quite small, something like 1024 or so. So if you data is contiguous in memory, just pass it as a single element rather than artificially splitting it up.

Vectored write will write all the data you have provided with one call to "writev" function. So byteswritten will be always be equal to total number of bytes provided as input. this is what my understanding is.
Please correct me if I am wrong

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Using bzip2 low-level routines to compress chunks of data - c

In answer to my own question, it appears I am miscalculating the number of bytes written. I should not use the total_out_* members. The following correction works properly: bzBytesWritten = sizeof(bzBuffer) - bzStream.avail_out; The rest of the calculations follow.

Related

C. Loop compression + send (gzip) ZLIB

Get the character dominant from a string

Block ram disk fails to read/write with offset

Uncompress a gzip File in Memory Using zlib Version 1.1.3

Techniques for handling short reads/writes with scatter-gather?

Categories

Resources