C. Loop compression + send (gzip) ZLIB - c

I'm currently building an HTTP server in C.
Please consider this piece of code:
#define CHUNK 0x4000

z_stream strm;
unsigned char out[CHUNK];
int ret;
char buff[200];

strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL;

int windowsBits = 15;
int GZIP_ENCODING = 16;

ret = deflateInit2(&strm, Z_BEST_SPEED, Z_DEFLATED, windowsBits | GZIP_ENCODING, 1,
                   Z_DEFAULT_STRATEGY);

fill(buff); // fill buff with infos
do {
    strm.next_in = (z_const unsigned char *)buff;
    strm.avail_in = strlen(buff);
    do {
        strm.avail_out = CHUNK;
        strm.next_out = out;
        ret = deflate(&strm, Z_FINISH);
    } while (strm.avail_out == 0);
    send_to_client(out); // sending a part of the gzip encoded string
    fill(buff);
} while (strlen(buff) != 0);
The idea is: I'm sending gzip'ed buffers one by one, so that (when they're concatenated) they form the whole HTTP body.
BUT: for now, my client (a browser) only gets the contents of the first buffer. No errors at all, though.
How do I achieve this? How can I gzip buffers inside a loop so I can send each one as the loop runs?

First off, you need to do something with the generated deflate data after each deflate() call. Your code discards the compressed data generated in the inner loop. From this example, after the deflate() you would need something like:
have = CHUNK - strm.avail_out;
if (fwrite(out, 1, have, dest) != have || ferror(dest)) {
    (void)deflateEnd(&strm);
    return Z_ERRNO;
}
That's where your send_to_client needs to be, sending have bytes.
In your case, your CHUNK is so much larger than your buff that the inner loop always executes only once, so you are not actually discarding any data. However, that is only true because of the Z_FINISH, so when you make the next fix, your current code will start discarding data.
Second, you are finishing the deflate() stream each time after no more than 199 bytes of input. This will greatly limit how much compression you can get. Furthermore, you are sending individual gzip streams, for which most browsers will only interpret the first one. (This is actually a bug in those browsers, but I don't imagine they will be fixed.)
You need to give the compressor at least tens to hundreds of kilobytes to work with in order to get decent compression. Use Z_NO_FLUSH instead of Z_FINISH until you get to the last buff you want to send; only then use Z_FINISH. Again, take a look at the example and read the comments.
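Putting both fixes together, a minimal sketch of the loop could look like the following. It continues from the deflateInit2() call in the question and assumes two small changes that are not in the original code: send_to_client() is given a length argument, and fill() produces an empty string once the input is exhausted.

int flush;
unsigned have;

do {
    fill(buff); /* empty buff signals the end of the input */
    strm.next_in = (z_const unsigned char *)buff;
    strm.avail_in = strlen(buff);
    flush = (strm.avail_in == 0) ? Z_FINISH : Z_NO_FLUSH;
    do {
        strm.avail_out = CHUNK;
        strm.next_out = out;
        ret = deflate(&strm, flush);       /* Z_FINISH only on the final pass */
        have = CHUNK - strm.avail_out;     /* bytes actually produced this pass */
        if (have > 0)
            send_to_client(out, have);     /* send exactly 'have' bytes of out */
    } while (strm.avail_out == 0);
} while (flush != Z_FINISH);
(void)deflateEnd(&strm);

This produces one continuous gzip stream across all the buffers, which is what the browser expects.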

Related

When is it valid to call inflateSetDictionary() when trying to decompress raw deflate data that was compressed with a dictionary?

The Problem
When is it valid to call inflateSetDictionary() when trying to decompress raw deflate data that was compressed with a compression dictionary?
The zlib manual states that inflateSetDictionary() can be called "at any time". However, it is unclear to me what "at any time" actually means. If we are allowed to call inflateSetDictionary() "at any time", then I interpret it as being valid to call inflateSetDictionary() after calling inflate(). Doing so, however, results in inflate() returning an "invalid distance too far back" error.
My Code
I created a simple application to compress the string "hello" using raw deflate, with a compression dictionary that also consists of the byte sequence "hello":
#define BUF_SIZE 16384
#define WINDOW_BITS -15 // Negative for raw.
#define MEM_LEVEL 8
const unsigned char dictionary[] = "hello";
unsigned char uncompressed[BUF_SIZE] = "hello";
unsigned char compressed[BUF_SIZE];
z_stream deflate_stream;
deflate_stream.zalloc = Z_NULL;
deflate_stream.zfree = Z_NULL;
deflate_stream.opaque = Z_NULL;
deflateInit2(&deflate_stream,
             Z_DEFAULT_COMPRESSION,
             Z_DEFLATED,
             WINDOW_BITS,
             MEM_LEVEL,
             Z_DEFAULT_STRATEGY);
deflateSetDictionary(&deflate_stream, dictionary, sizeof(dictionary));
deflate_stream.avail_in = (uInt)strlen(uncompressed) + 1;
deflate_stream.next_in = (Bytef *)uncompressed;
deflate_stream.avail_out = BUF_SIZE;
deflate_stream.next_out = (Bytef *)compressed;
deflate(&deflate_stream, Z_FINISH);
deflateEnd(&deflate_stream);
This produced 4 bytes of raw deflate data into the compressed buffer:
uLong compressed_size = deflate_stream.total_out;
printf("Compressed size is: %lu\n", compressed_size); // prints Compressed size is: 4
I then attempted to decompress this data back into the string "hello". The zlib manual states that I would need to use raw inflate to decompress raw deflate data:
unsigned char decompressed[BUF_SIZE];
z_stream inflate_stream;
inflate_stream.zalloc = Z_NULL;
inflate_stream.zfree = Z_NULL;
inflate_stream.opaque = Z_NULL;
inflateInit2(&inflate_stream, WINDOW_BITS);
inflate_stream.avail_in = (uInt)compressed_size;
inflate_stream.next_in = (Bytef *)compressed;
inflate_stream.avail_out = BUF_SIZE;
inflate_stream.next_out = (Bytef *)decompressed;
int r = inflate(&inflate_stream, Z_FINISH);
According to the zlib manual, I would expect that inflate() should return Z_NEED_DICT, and I would then call inflateSetDictionary() with a subsequent call to inflate():
// Must be called immediately after a call of inflate, if that call returned Z_NEED_DICT.
if (r == Z_NEED_DICT) {
    inflateSetDictionary(&inflate_stream, dictionary, sizeof(dictionary));
    r = inflate(&inflate_stream, Z_FINISH);
}
if (r != Z_STREAM_END) {
    printf("inflate: %s\n", inflate_stream.msg);
    return 1;
}
inflateEnd(&inflate_stream);
printf("Decompressed size is: %lu\n", strlen(decompressed));
printf("Decompressed string is: %s\n", decompressed);
However, what ends up happening is that inflate() does not return Z_NEED_DICT; instead it returns Z_DATA_ERROR, with inflate_stream.msg set to "invalid distance too far back".
Even if I were to adjust my code so that inflateSetDictionary() is called regardless of the return value of inflate(), the subsequent inflate() call will still fail with Z_DATA_ERROR due to "invalid distance too far back".
My Question
So far, my code works correctly if I were to use the default zlib encoding by setting WINDOW_BITS to 15, as opposed to -15 for the raw encoding.
My code also works correctly if I were to move the call for inflateSetDictionary() before the call to inflate().
However, it's not clear to me why my existing code does not allow inflate() to return Z_NEED_DICT, so that I can make a subsequent call to inflateSetDictionary().
Is there a mistake in my code somewhere that is preventing inflate() from returning Z_NEED_DICT? Or can inflateSetDictionary() only be called prior to inflate() for the raw encoding, contrary to what the zlib manual states?
inflate() will only return Z_NEED_DICT for a zlib stream, where the need for a dictionary is indicated by a bit in the zlib header, followed by the Adler-32 of the dictionary that was used for compression to verify or select the dictionary. There is no such indication in a raw deflate stream. There is no way for inflate() to know from a raw deflate stream whether or not the data was compressed with a dictionary. It is up to you to know what is needed for decompression, since you made the raw deflate stream in the first place.
Since you did a deflateSetDictionary() before compressing anything, it is up to you to do an inflateSetDictionary() at the same place, before you decompress, after the inflateInit(). As you have found, you need to insert:
inflateSetDictionary(&inflate_stream, dictionary, sizeof(dictionary));
right after the inflateInit(). Then decompression will be successful.
Yes, you can do inflateSetDictionary() at any point during a raw deflate decompression. However it will only work if you are doing it at the same point at which you did the corresponding deflateSetDictionary() when compressing.
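For reference, a minimal sketch of the corrected raw-inflate order, using the same buffers and constants as the question's code; the only change is where inflateSetDictionary() is called:

unsigned char decompressed[BUF_SIZE];
z_stream inflate_stream;
inflate_stream.zalloc = Z_NULL;
inflate_stream.zfree = Z_NULL;
inflate_stream.opaque = Z_NULL;
inflateInit2(&inflate_stream, WINDOW_BITS); // raw inflate (negative windowBits)
// A raw deflate stream carries no dictionary indication, so set the dictionary
// here, at the same point deflateSetDictionary() was used on the compression side.
inflateSetDictionary(&inflate_stream, dictionary, sizeof(dictionary));
inflate_stream.avail_in = (uInt)compressed_size;
inflate_stream.next_in = (Bytef *)compressed;
inflate_stream.avail_out = BUF_SIZE;
inflate_stream.next_out = (Bytef *)decompressed;
int r = inflate(&inflate_stream, Z_FINISH); // now returns Z_STREAM_END
inflateEnd(&inflate_stream);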

Adding header to zlib compressed file

I am trying to use zlib to deflate (compress) data from a text file. It seems to work when I compress a file, but I am trying to prepend a custom header to the zlib-compressed output. Both the header and the file should be compressed. However, when I add the header, the length of the compressed (deflated) output is much shorter than expected and comes out as an invalid zlib object.
The code works great until I add the header block of code between the XXX comments below.
The "FILE *source" variable is a sample file (I typically use /etc/passwd) and the "char *header" is "blob 2172\0".
Without the header block, the output is 904 bytes and inflates (decompresses) correctly, but with the header it comes out to only 30 bytes. It also comes out as an invalid zlib object with the header block of code.
Any ideas where I am making a mistake, specifically why the output is invalid and shorter with the header?
If it's relevant, I am writing this on FreeBSD.
#define Z_CHUNK 16384
#define HEX_DIGEST_LENGTH 257

int
zcompress_and_header(FILE *source, char *header)
{
    int ret, flush;
    z_stream strm;
    unsigned int have;
    unsigned char in[Z_CHUNK];
    unsigned char out[Z_CHUNK];
    FILE *dest = stdout; // This is a temporary test

    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    ret = deflateInit(&strm, Z_BEST_SPEED);
    //ret = deflateInit2(&strm, Z_BEST_SPEED, Z_DEFLATED, 15 | 16, 8, Z_DEFAULT_STRATEGY);
    if (ret != Z_OK)
        return ret;

    /* XXX Beginning of writing the header */
    strm.next_in = (unsigned char *) header;
    strm.avail_in = strlen(header) + 1;
    do {
        strm.avail_out = Z_CHUNK;
        strm.next_out = out;
        if (deflate(&strm, Z_FINISH) < 0) {
            fprintf(stderr, "returned a bad status of.\n");
            exit(0);
        }
        have = Z_CHUNK - strm.avail_out;
        fwrite(out, 1, have, stdout);
    } while (strm.avail_out == 0);
    /* XXX End of writing the header */

    do {
        strm.avail_in = fread(in, 1, Z_CHUNK, source);
        if (ferror(source)) {
            (void)deflateEnd(&strm);
            return Z_ERRNO;
        }
        flush = feof(source) ? Z_FINISH : Z_NO_FLUSH;
        strm.next_in = in;
        do {
            strm.avail_out = Z_CHUNK;
            strm.next_out = out;
            ret = deflate(&strm, flush);
            have = Z_CHUNK - strm.avail_out;
            if (fwrite(out, 1, have, dest) != have || ferror(dest)) {
                (void)deflateEnd(&strm);
                return Z_ERRNO;
            }
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);
} // End of function
deflate is not an archiver. It only compresses a stream. Once the stream is exhausted, your options are very limited. The manual clearly says that
If the parameter flush is set to Z_FINISH, pending input is processed, pending output is flushed and deflate returns with Z_STREAM_END if there was enough output space. If deflate returns with Z_OK or Z_BUF_ERROR, this function must be called again with Z_FINISH and more output space (updated avail_out) but no more input data, until it returns with Z_STREAM_END or an error. After deflate has returned Z_STREAM_END, the only possible operations on the stream are deflateReset or deflateEnd.
However, you are calling deflate() on the file data after you have already used Z_FINISH on the header, so the stream has already ended and the further deflate() calls do not do what you expect. The likely fix is to not use Z_FINISH for the header at all, and to let the other side understand that the first line of the decompressed data is a header (or impose some archiving protocol understood by both sides).
Your first calls of deflate() should use Z_NO_FLUSH, not Z_FINISH. Z_FINISH should only be used when the last of the data to be compressed is provided with the deflate() call.
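A minimal sketch of the header block rewritten along those lines: Z_NO_FLUSH while compressing the header, leaving Z_FINISH to the existing file loop below it (which already switches to Z_FINISH at EOF), so everything ends up in a single deflate stream:

/* XXX Beginning of writing the header */
strm.next_in = (unsigned char *) header;
strm.avail_in = strlen(header) + 1;
do {
    strm.avail_out = Z_CHUNK;
    strm.next_out = out;
    ret = deflate(&strm, Z_NO_FLUSH); /* keep the stream open for the file data */
    if (ret == Z_STREAM_ERROR) {
        (void)deflateEnd(&strm);
        return ret;
    }
    have = Z_CHUNK - strm.avail_out;
    if (fwrite(out, 1, have, dest) != have || ferror(dest)) {
        (void)deflateEnd(&strm);
        return Z_ERRNO;
    }
} while (strm.avail_out == 0);
/* XXX End of writing the header */

The decompressed result is then the header immediately followed by the file contents, so the reader has to know to peel the header off the front.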

Reading and storing data from a TCP buffer in C

I am reading some HTTP POST data from an HTTP-wrapped TCP socket. My code mostly works, but there is a strange symptom. I know what the content length is (via the HTTP Content-Length header), but more often than not I seem to build up a buffer that is 2-3 bytes longer than expected. I know that I am not setting my buffer size on initialization, but when I do I get a lot of compile errors. The following code almost works, but often produces more data in the buffer than there should be.
long bytesRead;
unsigned long bytesRemaining;
sbyte *pBuffer;
sbyte *pTmpBuffer;

pBuffer = malloc(contentLength);
memset(pBuffer, 0, contentLength);
pTmpBuffer = pBuffer;
bytesRemaining = contentLength;

while (bytesRemaining > 0) {
    if (maxBuffSize < bytesRemaining) {
        chunkSize = maxBuffSize;
    }
    else {
        chunkSize = bytesRemaining;
    }
    bytesRead = tcpBlockReader(pHttpData, pTmpBuffer, chunkSize);
    bytesRemaining -= bytesRead;
    pTmpBuffer += bytesRead;
}
printf("Data is %s\n", pBuffer);
printf("Length is %d\n", strlen(pBuffer));
Now sometimes the output will be perfect, i.e.
Data is expected+data
Length is 13
And sometimes it will be
Data is expected+data+(weird characters)
Length is 15
So the problem here, I think, is that I don't set a size for the buffer (i.e. pBuffer[contentLength]). When I do that, though, I get incompatible-type errors and whatnot. I am not a well-versed C programmer (I usually stick to chars and ints). What can I do to ensure that the buffer is not full of extra garbage at the end?
I was missing the elusive NULL terminator.
pBuffer = malloc(contentLength + 1);
...
pBuffer[contentLength] = '\0';
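In context, the corrected allocation and the place where the terminator goes might look like this (a sketch reusing the question's hypothetical tcpBlockReader()/pHttpData names):

pBuffer = malloc(contentLength + 1); /* one extra byte for the terminator */
if (pBuffer == NULL) {
    /* handle the allocation failure and bail out here */
    return;
}
memset(pBuffer, 0, contentLength + 1);
pTmpBuffer = pBuffer;
bytesRemaining = contentLength;

while (bytesRemaining > 0) {
    chunkSize = (maxBuffSize < bytesRemaining) ? maxBuffSize : bytesRemaining;
    bytesRead = tcpBlockReader(pHttpData, pTmpBuffer, chunkSize);
    if (bytesRead <= 0)
        break; /* connection closed or read error */
    bytesRemaining -= bytesRead;
    pTmpBuffer += bytesRead;
}

pBuffer[contentLength] = '\0'; /* printf/strlen now stop where the body ends */
printf("Data is %s\n", pBuffer);
printf("Length is %zu\n", strlen((const char *)pBuffer));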

Using bzip2 low-level routines to compress chunks of data

The Overview
I am using the low-level calls in the libbzip2 library: BZ2_bzCompressInit(), BZ2_bzCompress() and BZ2_bzCompressEnd() to compress chunks of data to standard output.
I am migrating working code from higher-level calls, because I have a stream of bytes coming in and I want to compress those bytes in sets of discrete chunks (a discrete chunk is a set of bytes that contains a group of tokens of interest — my input is logically divided into groups of these chunks).
A complete group of chunks might contain, say, 500 chunks, which I want to compress to one bzip2 stream and write to standard output.
Within a set, using the pseudocode I outline below, if my example buffer is able to hold 101 chunks at a time, I would open a new stream, compress 500 chunks in runs of 101, 101, 101, 101, and one final run of 96 chunks that closes the stream.
The Problem
The issue is that my bz_stream structure instance, which keeps tracks of the number of compressed bytes in a single pass of the BZ2_bzCompress() routine, seems to claim to be writing more compressed bytes than the total bytes in the final, compressed file.
For example, the compressed output could be a file with a true size of 1234 bytes, while the number of reported compressed bytes (which I track while debugging) is somewhat higher than 1234 bytes (say 2345 bytes).
My rough pseudocode is in two parts.
The first part is a rough sketch of what I do to compress a subset of chunks (and I know that I have another subset coming after this one):
bz_stream bzStream;
unsigned char bzBuffer[BZIP2_BUFFER_MAX_LENGTH] = {0};
unsigned long bzBytesWritten = 0UL;
unsigned long long cumulativeBytesWritten = 0ULL;
unsigned char myBuffer[UNCOMPRESSED_MAX_LENGTH] = {0};
size_t myBufferLength = 0;

/* initialize bzStream */
bzStream.next_in = NULL;
bzStream.avail_in = 0U;
bzStream.avail_out = 0U;
bzStream.bzalloc = NULL;
bzStream.bzfree = NULL;
bzStream.opaque = NULL;

int bzError = BZ2_bzCompressInit(&bzStream, 9, 0, 0);
/* bzError checking... */

do
{
    /* read some bytes into myBuffer... */

    /* compress bytes in myBuffer */
    bzStream.next_in = myBuffer;
    bzStream.avail_in = myBufferLength;
    bzStream.next_out = bzBuffer;
    bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
    do
    {
        bzStream.next_out = bzBuffer;
        bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
        bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
        /* error checking... */
        bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
        cumulativeBytesWritten += bzBytesWritten;
        /* write compressed data in bzBuffer to standard output */
        fwrite(bzBuffer, 1, bzBytesWritten, stdout);
        fflush(stdout);
    }
    while (bzError == BZ_OK);
}
while (/* there is a non-final myBuffer full of discrete chunks left to compress... */);
Now we wrap up the output:
/* read the final batch of bytes into myBuffer (with a total byte size of `myBufferLength`)... */

/* compress remaining myBufferLength bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do
{
    bzStream.next_out = bzBuffer;
    bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
    bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
    /* bzError error checking... */

    /* increment cumulativeBytesWritten by `bz_stream` struct `total_out_*` members */
    bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
    cumulativeBytesWritten += bzBytesWritten;

    /* write compressed data in bzBuffer to standard output */
    fwrite(bzBuffer, 1, bzBytesWritten, stdout);
    fflush(stdout);
}
while (bzError != BZ_STREAM_END);

/* close stream */
bzError = BZ2_bzCompressEnd(&bzStream);
/* bzError checking... */
The Questions
Am I calculating cumulativeBytesWritten (or, specifically, bzBytesWritten) incorrectly, and how would I fix that?
I have been tracking these values in a debug build, and I do not seem to be "double counting" the bzBytesWritten value. This value is counted and used once to increment cumulativeBytesWritten after each successful BZ2_bzCompress() pass.
Alternatively, am I not understanding the correct use of the bz_stream state flags?
For example, does the following compress and keep the bzip2 stream open, so long as I keep sending some bytes?
bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
Likewise, does the following statement compress data as long as there are at least some bytes available at the bzStream.next_in pointer (BZ_RUN), and then wrap up the stream when there are no more bytes available (BZ_FINISH)?
bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
Or, am I not using these low-level calls correctly at all? Should I go back to using the higher-level calls to continuously append a grouping of compressed chunks of data to one main file?
There's probably a simple solution to this, but I've been banging my head on the table for a couple days in the course of debugging what could be wrong, and I'm not making much progress. Thank you for any advice.
In answer to my own question, it appears I am miscalculating the number of bytes written. I should not use the total_out_* members. The following correction works properly:
bzBytesWritten = sizeof(bzBuffer) - bzStream.avail_out;
The rest of the calculations follow.
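In context, the inner loop should track only the bytes produced by each BZ2_bzCompress() pass; the total_out_lo32/total_out_hi32 members are running totals for the whole stream, which is why adding them up per pass over-counts. A sketch of the corrected loop body:

do
{
    bzStream.next_out = bzBuffer;
    bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
    bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
    /* error checking... */

    /* bytes produced by this pass only: output space offered minus space left over */
    bzBytesWritten = sizeof(bzBuffer) - bzStream.avail_out;
    cumulativeBytesWritten += bzBytesWritten;

    fwrite(bzBuffer, 1, bzBytesWritten, stdout);
    fflush(stdout);
}
while (bzError == BZ_OK);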

Uncompress a gzip File in Memory Using zlib Version 1.1.3

I have a gzip file that is in memory, and I would like to uncompress it using zlib, version 1.1.3. uncompress() is returning -3, Z_DATA_ERROR, indicating the source data is corrupt.
I know that my in memory buffer is correct - if I write the buffer out to a file, it is the same as my source gzip file.
The gzip file format indicates that there is a 10-byte header, optional headers, the data, and a footer. Is it possible to determine where the data starts, and strip that portion out? I searched on this topic, and a couple of people have suggested using inflateInit2(). However, in my version of zlib, that function is oddly commented out. Are there any other options?
I came across the same problem with a different zlib version (1.2.7).
I don't know why inflateInit2() is commented out.
Without calling inflateInit2() you can do the following:
err = inflateInit(&d_stream);
err = inflateReset2(&d_stream, 31);
inflateReset2() is also called by inflateInit(). Inside inflateInit() the windowBits are set to 15 (1111 binary), but you have to set them to 31 (11111 binary) to get gzip working.
The reason is that inside inflateReset2() the following is done:
wrap = (windowBits >> 4) + 1;
which yields 1 if windowBits is 15 (1111 binary) and 2 if windowBits is 31 (11111 binary).
Now if you call inflate(), the following line in the HEAD state checks state->wrap along with the gzip magic number:
if ((state->wrap & 2) && hold == 0x8b1f) { /* gzip header */
So with the following code I was able to do in-memory gzip decompression:
(Note: this code presumes that the complete data to be decompressed is in memory and that the buffer for decompressed data is large enough)
int err;
z_stream d_stream; // decompression stream

d_stream.zalloc = (alloc_func)0;
d_stream.zfree = (free_func)0;
d_stream.opaque = (voidpf)0;

d_stream.next_in = deflated;      // where deflated is a pointer to the compressed data buffer
d_stream.avail_in = deflatedLen;  // where deflatedLen is the length of the compressed data
d_stream.next_out = inflated;     // where inflated is a pointer to the resulting uncompressed data buffer
d_stream.avail_out = inflatedLen; // where inflatedLen is the size of the uncompressed data buffer

err = inflateInit(&d_stream);
err = inflateReset2(&d_stream, 31); // 31 = 15 window bits, plus gzip framing
err = inflate(&d_stream, Z_FINISH); // the decompression itself; expect Z_STREAM_END here
err = inflateEnd(&d_stream);
Uncommenting inflateInit2() is the other solution; there you can set windowBits directly.
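For reference, in zlib versions where inflateInit2() is available and gzip-aware (1.2.x and later), the same thing could be done with a single call rather than the init-plus-reset pair (a sketch, using the same d_stream setup as above):

err = inflateInit2(&d_stream, 15 + 16); // 15 window bits; +16 selects gzip framing
err = inflate(&d_stream, Z_FINISH);
err = inflateEnd(&d_stream);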
Is it possible to determine where the data starts, and strip that portion out?
Gzip has the following magic number:
static const unsigned char gzipMagicBytes[] = { 0x1f, 0x8b, 0x08, 0x00 };
You can read through a file stream and look for these bytes:
static const int testElemSize = sizeof(unsigned char);
static const int testElemCount = sizeof(gzipMagicBytes);

const char *fn = "foo.bar";
FILE *fp = fopen(fn, "rb");
char testMagicBuffer[sizeof(gzipMagicBytes)] = {0};
unsigned long long testMagicOffset = 0ULL;

if (fp != NULL) {
    /* slide a window of testElemCount bytes through the file, one byte at a time */
    while (fread(testMagicBuffer, testElemSize, testElemCount, fp) == (size_t) testElemCount) {
        if (memcmp(testMagicBuffer, gzipMagicBytes, sizeof(gzipMagicBytes)) == 0) {
            /* we found gzip magic bytes, do stuff here... */
            fprintf(stdout, "gzip stream found at byte offset: %llu\n", testMagicOffset);
            break;
        }
        testMagicOffset += 1; /* advance the window by one byte and rescan */
        fseek(fp, (long) testMagicOffset, SEEK_SET);
    }
    fclose(fp);
}
Once you have the offset, you could do copy and paste operations, or overwrite other bytes, etc.
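For example, once the file is loaded into memory, the located offset can simply become the start of the buffer handed to the inflate code shown earlier (a sketch; fileData and fileLen are hypothetical names for the whole file already read into memory as bytes):

d_stream.next_in = fileData + testMagicOffset; /* start at the detected gzip magic */
d_stream.avail_in = (uInt)(fileLen - testMagicOffset);
d_stream.next_out = inflated;
d_stream.avail_out = inflatedLen;

err = inflateInit(&d_stream);
err = inflateReset2(&d_stream, 31); /* gzip framing, as in the first answer */
err = inflate(&d_stream, Z_FINISH);
err = inflateEnd(&d_stream);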

Resources