C - Decompress Gzipped http response

C - Decompress Gzipped http response - c

I have few issues in decompressing gzipped http response, I separated data part from headers but its gzip header and message contain \0 characters which char * takes as null terminator so the first question is how to extract gzipped chunk ?
I can't use string functions like strcat, strlen because it is compressed gzipped data that contains \0 character at various places within chunk.
I've used libcurl but it is relatively slower than C sockets.
Here is some part of a sample response:
HTTP/1.1 200 OK
Cache-Control: private, max-age=0
Content-Type: text/html; charset=utf-8
P3P: CP="NON UNI COM NAV STA LOC CURa DEVa PSAa PSDa OUR IND"
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 12605
Date: Mon, 05 Mar 2012 11:46:30 GMT
Connection: keep-alive
Set-Cookie: _FP=EM=1; expires=Wed, 05-Mar-2014 11:46:29 GMT; domain=.bing.com; path=/
����ՠ����AU��o�
Sample code:
#define MAXDATASIZE 1024
char *recvData; // Holds entire gzip data
char recvBuff[MAXDATASIZE]; // Holds gzip chunk
int offset=0;
while(1){
recvBytes = recv(sockfd, &recvBuff, MAXDATASIZE-1, 0);
totalRecvBytes += recvBytes;
// get content length, this runs first time only as required
if(!clfnd){
regi = regexec(&clregex, &recvBuff, 3, clmatch, 0);
if(!regi){
strncpy(clarr, recvBuff + clmatch[2].rm_so, clmatch[2].rm_eo-clmatch[2].rm_so);
clarr[clmatch[2].rm_eo-clmatch[2].rm_so] = '\0';
cl = atoi(clarr);
clfnd=1;
regfree(&clregex);
recvData = malloc(cl * sizeof(char));
memset(recvData, 0, sizeof recvData);
}
}
// get data part from 1st iteration, furthur iterations contain only data
if(!datasplit){
int strtidx;
char *datastrt = strstr(&recvBuff, "\r\n\r\n");
if(datastrt != NULL){
strtidx = datastrt - recvBuff + 4;
memcpy(recvData, recvBuff + strtidx, recvBytes-strtidx);
datasplit=1;
offset = recvBytes-strtidx;
}
}
else{
memcpy(recvData + offset, recvBuff, recvBytes);
offset += recvBytes;
}
if (offset >= cl)
break;
}
char *outData = malloc(offset*4 * sizeof(char));
memset(outData, 0, sizeof outData);
int ret = inf(recvData, offset, outData, offset*4);
Inflate function:
int inf(const char *src, int srcLen, const char *dst, int dstLen){
z_stream strm;
strm.zalloc=NULL;
strm.zfree=NULL;
strm.opaque=NULL;
strm.avail_in = srcLen;
strm.avail_out = dstLen;
strm.next_in = (Bytef *)src;
strm.next_out = (Bytef *)dst;
int err=-1, ret=-1;
err = inflateInit2(&strm, MAX_WBITS+16);
if (err == Z_OK){
err = inflate(&strm, Z_FINISH);
if (err == Z_STREAM_END){
ret = strm.total_out;
}
else{
inflateEnd(&strm);
return err;
}
}
else{
inflateEnd(&strm);
return err;
}
inflateEnd(&strm);
printf("%s\n", dst);
return err;
}

No, the type char * says nothing about the contents it points to, nor does it interpret any value as a terminator. The str* functions, on the other hand has an assumption about how strings are represented, and can not be used on binary data, or even text data that has a different representation.
Decompression can be rather complex, but you can have a look at zlib, whcih should be able to help you out.

Content-Length: 12605 means that the gzipped file has a size of 12605 bytes. So just copy 12605 bytes after the message header to a local buffer and give that buffer to the decompression function. Also I am not sure if your socket reading function reads the whole 12605 in one flow. If not, you need to append the rest of the data in the next read to this local buffer and when 12605 bytes are read and then call the decompression function.
There is no problem in using char* as buffer. The issue ur facing is because ur trying to print the gzip data as string.

The beginning of the HTTP payload starts after the "\r\n\r\n" (after the HTTP header).
Use the HTTP field "Content-Length" to get the HTTP payload size.
With this information, you have to create function to decompress de data. With Zlib you can do that.
PS. pay attention if it's using raw format or zlib with header and trailers. Usualy HTTP uses header and trailer, and IMAP4 uses raw format.

Related

When is it valid to call inflateSetDictionary() when trying to decompress raw deflate data that was compressed with a dictionary?

The Problem
When is it valid to call inflateSetDictionary() when trying to decompress raw deflate data that was compressed with a compression dictionary?
According to the zlib manual, it is stated that inflateSetDictionary() can be called "at any time". However, it is unclear to me what "at any time" actually means. If we are allowed to call inflateSetDictionary() "at any time", then I interpret it as being valid to call inflateSetDictionary() after calling inflate(). However, doing so results in inflate() returning an "invalid distance too far back" error.
My Code
I created a simple application to compress the string "hello" using raw deflate, with a compression dictionary that also consists of the byte sequence "hello":
#define BUF_SIZE 16384
#define WINDOW_BITS -15 // Negative for raw.
#define MEM_LEVEL 8
const unsigned char dictionary[] = "hello";
unsigned char uncompressed[BUF_SIZE] = "hello";
unsigned char compressed[BUF_SIZE];
z_stream deflate_stream;
deflate_stream.zalloc = Z_NULL;
deflate_stream.zfree = Z_NULL;
deflate_stream.opaque = Z_NULL;
deflateInit2(&deflate_stream,
Z_DEFAULT_COMPRESSION,
Z_DEFLATED,
WINDOW_BITS,
MEM_LEVEL,
Z_DEFAULT_STRATEGY);
deflateSetDictionary(&deflate_stream, dictionary, sizeof(dictionary));
deflate_stream.avail_in = (uInt)strlen(uncompressed) + 1;
deflate_stream.next_in = (Bytef *)uncompressed;
deflate_stream.avail_out = BUF_SIZE;
deflate_stream.next_out = (Bytef *)compressed;
deflate(&deflate_stream, Z_FINISH);
deflateEnd(&deflate_stream);
This produced 4 bytes of raw deflate data into the compressed buffer:
uLong compressed_size = deflate_stream.total_out;
printf("Compressed size is: %lu\n", compressed_size); // prints Compressed size is: 4
I then attempted to decompress this data back into the string "hello". The zlib manual states that I would need to use raw inflate to decompress raw deflate data:
unsigned char decompressed[BUF_SIZE];
z_stream inflate_stream;
inflate_stream.zalloc = Z_NULL;
inflate_stream.zfree = Z_NULL;
inflate_stream.opaque = Z_NULL;
inflateInit2(&inflate_stream, WINDOW_BITS);
inflate_stream.avail_in = (uInt)compressed_size;
inflate_stream.next_in = (Bytef *)compressed;
inflate_stream.avail_out = BUF_SIZE;
inflate_stream.next_out = (Bytef *)decompressed;
int r = inflate(&inflate_stream, Z_FINISH);
According to the zlib manual, I would expect that inflate() should return Z_NEED_DICT, and I would then call inflateSetDictionary() with a subsequent call to inflate():
// Must be called immediately after a call of inflate, if that call returned Z_NEED_DICT.
if (r == Z_NEED_DICT) {
inflateSetDictionary(&inflate_stream, dictionary, sizeof(dictionary));
r = inflate(&inflate_stream, Z_FINISH);
}
if (r != Z_STREAM_END) {
printf("inflate: %s\n", inflate_stream.msg);
return 1;
}
inflateEnd(&inflate_stream);
printf("Decompressed size is: %lu\n", strlen(decompressed));
printf("Decompressed string is: %s\n", decompressed);
However, what ends up happening is that inflate() will not return Z_NEED_DICT, and instead return Z_DATA_ERROR, with the value of inflate_stream.msg being set to "invalid distance too far back".
Even if I were to adjust my code so that inflateSetDictionary() is called regardless of the return value of inflate(), the subsequent inflate() call will still fail with Z_DATA_ERROR due to "invalid distance too far back".
My Question
So far, my code works correctly if I were to use the default zlib encoding by setting WINDOW_BITS to 15, as opposed to -15 for the raw encoding.
My code also works correctly if I were to move the call for inflateSetDictionary() before the call to inflate().
However, it's not clear to me why my existing code does not allow inflate() to return Z_NEED_DICT, so that I can make a subsequent call to inflateSetDictionary().
Is there a mistake in my code somewhere that is preventing inflate() from returning Z_NEED_DICT? Or can inflateSetDictionary() only be called prior to inflate() for the raw encoding, contrary to what the zlib manual states?

inflate() will only return Z_NEED_DICT for a zlib stream, where the need for a dictionary is indicated by a bit in the zlib header, followed by the Adler-32 of the dictionary that was used for compression to verify or select the dictionary. There is no such indication in a raw deflate stream. There is no way for inflate() to know from a raw deflate stream whether or not the data was compressed with a dictionary. It is up to you to know what is needed for decompression, since you made the raw deflate stream in the first place.
Since you did a deflateSetDictionary() before compressing anything, it is up to you to do an inflateSetDictionary() at the same place, before you decompress, after the inflateInit(). As you have found, you need to insert:
inflateSetDictionary(&inflate_stream, dictionary, sizeof(dictionary));
right after the inflateInit(). Then decompression will be successful.
Yes, you can do inflateSetDictionary() at any point during a raw deflate decompression. However it will only work if you are doing it at the same point at which you did the corresponding deflateSetDictionary() when compressing.

Adding header to zlib compressed file

I am trying to use zlib to deflate (compress?) data from a textfile.
It seems to work when I compress a file, but I am trying to prepend
the zlib compressed file with custom header. Both the file and header
should be compressed. However, when I add the header, the length of
the compressed (deflated) file is much shorter than expected and comes
out as an invalid zlib compressed object.
The code works great, until I add the header block of code between the
XXX comments below.
The "FILE *source" variable is a sample file, I typically use
/etc/passwd and the "char *header" is "blob 2172\0".
Without the header block, the output is 904 bytes and deflatable
(decompressable), but with the header it comes out to only 30 bytes.
It also comes out as an invalid zlib object with the header block of
code.
Any ideas where I am making a mistake, specifically why the output is
invalid and shorter with the header?
If its relevant, I am writing this on FreeBSD.
#define Z_CHUNK16384
#define HEX_DIGEST_LENGTH 257
int
zcompress_and_header(FILE *source, char *header)
{
int ret, flush;
z_stream strm;
unsigned int have;
unsigned char in[Z_CHUNK];
unsigned char out[Z_CHUNK];
FILE *dest = stdout; // This is a temporary test
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL;
ret = deflateInit(&strm, Z_BEST_SPEED);
//ret = deflateInit2(&strm, Z_BEST_SPEED, Z_DEFLATED, 15 | 16, 8,
Z_DEFAULT_STRATEGY);
if (ret != Z_OK)
return ret;
/* XXX Beginning of writing the header */
strm.next_in = (unsigned char *) header;
strm.avail_in = strlen(header) + 1;
do {
strm.avail_out = Z_CHUNK;
strm.next_out = out;
if (deflate (& strm, Z_FINISH) < 0) {
fprintf(stderr, "returned a bad status of.\n");
exit(0);
}
have = Z_CHUNK - strm.avail_out;
fwrite(out, 1, have, stdout);
} while(strm.avail_out == 0);
/* XXX End of writing the header */
do {
strm.avail_in = fread(in, 1, Z_CHUNK, source);
if (ferror(source)) {
(void)deflateEnd(&strm);
return Z_ERRNO;
}
flush = feof(source) ? Z_FINISH : Z_NO_FLUSH;
strm.next_in = in;
do {
strm.avail_out = Z_CHUNK;
strm.next_out = out;
ret = deflate(&strm, flush);
have = Z_CHUNK - strm.avail_out;
if (fwrite(out, 1, have, dest) != have || ferror(dest)) {
(void)deflateEnd(&strm);
return Z_ERRNO;
}
} while(strm.avail_out == 0);
} while (flush != Z_FINISH);
} // End of function

deflate is not an archiver. It only compresses a stream. Once the stream is exhausted, your options are very limited. The manual clearly says that
If the parameter flush is set to Z_FINISH, pending input is processed, pending output is flushed and deflate returns with Z_STREAM_END if there was enough output space. If deflate returns with Z_OK or Z_BUF_ERROR, this function must be called again with Z_FINISH and more output space (updated avail_out) but no more input data, until it returns with Z_STREAM_END or an error. After deflate has returned Z_STREAM_END, the only possible operations on the stream are deflateReset or deflateEnd.
However, you are calling deflate for the file after you Z_FINISH the header, and zlib behaves unpredictably. The likely fix is to not use Z_FINISH for the header at all, and let the other side understand that the first line in the decompressed string is a header (or impose some archiving protocol understood by both sides).

Your first calls of deflate() should use Z_NO_FLUSH, not Z_FINISH. Z_FINISH should only be used when the last of the data to be compressed is provided with the deflate() call.

C. Loop compression + send (gzip) ZLIB

I'm currently building an HTTP server in C.
Please mind this piece of code :
#define CHUNK 0x4000
z_stream strm;
unsigned char out[CHUNK];
int ret;
char buff[200];
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL;
int windowsBits = 15;
int GZIP_ENCODING = 16;
ret = deflateInit2(&strm, Z_BEST_SPEED, Z_DEFLATED, windowsBits | GZIP_ENCODING, 1,
Z_DEFAULT_STRATEGY);
fill(buff); //fill buff with infos
do {
strm.next_in = (z_const unsigned char *)buff;
strm.avail_in = strlen(buff);
do {
strm.avail_out = CHUNK;
strm.next_out = out;
ret = deflate(&strm, Z_FINISH);
} while (strm.avail_out == 0);
send_to_client(out); //sending a part of the gzip encoded string
fill(buff);
}while(strlen(buff)!=0);
The idea is : I'm trying to send gzip'ed buffers, one by one, that (when they're concatened) is a whole body request.
BUT : for now, my client (a browser) only get the infos of the first buffer. No errors at all though.
How do I achieve this job, how to gzip some buffers inside a loop so I can send them everytime (in the loop) ?

First off, you need to do something with the generated deflate data after each deflate() call. Your code discards the compressed data generated in the inner loop. From this example, after the deflate() you would need something like:
have = CHUNK - strm.avail_out;
if (fwrite(out, 1, have, dest) != have || ferror(dest)) {
(void)deflateEnd(&strm);
return Z_ERRNO;
}
That's where your send_to_client needs to be, sending have bytes.
In your case, your CHUNK is so much larger than your buff, that loop is always executing only once, so you are not discarding any data. However that is only happening because of the Z_FINISH, so when you make the next fix, you will be discarding data with your current code.
Second, you are finishing the deflate() stream each time after no more than 199 bytes of input. This will greatly limit how much compression you can get. Furthermore, you are sending individual gzip streams, for which most browsers will only interpret the first one. (This is actually a bug in those browsers, but I don't imagine they will be fixed.)
You need to give the compressor at least 10's to 100's of Kbytes to work with in order get decent compression. You need to use Z_NO_FLUSH instead of Z_FINISH until you get to your last buff you want to send. Then you use Z_FINISH. Again, take a look at the example and read the comments.

How to Encode and Decode websocket permessage-deflate using zlib library in C?

All
I am trying to encode WebSocket payload using per-message-deflate. I am trying to use some test code but my encoding is not proper can you please help me.
Here is my code:
char a[180] = "{\"type\":\"message\",\"channel\":\"C04U0F9CW\",\"text\":\"TestingMessage\",\"id\":757}";
char b[180];
char c[180];
int i;
// deflate
// zlib struct
z_stream defstream;
defstream.zalloc = Z_NULL;
defstream.zfree = Z_NULL;
defstream.opaque = Z_NULL;
defstream.avail_in = (uInt)strlen(a)+1; // size of input, string + terminator
defstream.next_in = (Bytef *)a; // input char array
defstream.avail_out = (uInt)sizeof(b); // size of output
defstream.next_out = (Bytef *)b; // output char array
//deflateInit(&defstream, Z_SYNC_FLUSH);
deflateInit2(&defstream, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15, 8, Z_DEFAULT_STRATEGY);
deflate(&defstream, Z_SYNC_FLUSH);
deflateEnd(&defstream);
for(i=0;i<=(char*)defstream.next_out - b;i++)
{
printf("%X",(Bytef)b[i]);
}
printf("\n%s\n",(Bytef *)b);
// This is one way of getting the size of the output
printf("Deflated size is: %lu\n", (char*)defstream.next_out - b);
my output is
kml#kml-VirtualBox:~/zlib-1.2.11/test$ gcc test.c -lz && ./a.out
789CAA562AA92C4855B252CA4D2D2E4E4C4F55D2514ACE48CCCB4BCD18A391B98841AB8593A873454B522B4A804221A9C5259979E9BE70D599294A56E6A6E6B5C0000FFFF2E
x��V*�,HU�R�M-.NLOU�QJ�H��K��9���Y:�EKR+J�B!��%�y��pՙ)JV����
Deflated size is: 72
Inflate:
73
{"type":"message","channel":"C04U0F9CW","text":"TestingMessage","id":757}
So I am expecting payload will me like this
AA562AA92C4855B252CA4D2D2E4E4C4F55D2514ACE48CCCB4BCD018A391B98841AB8593A8703454B522B4A804221A9C5259979E9BE70D599294A56E6A6E6B50000
This is the fiddler WebSocket traffic capture for that particular message
Please help me out to get proper per-message-deflate encoding.
Here is some more upgrade information
GET https://127.0.0.1/websoc.html HTTP/1.1
Host: mpmulti-yklu.slack-msgs.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate
Sec-WebSocket-Key: FriWg0AWhiLhLXXnfJK8kA==
Connection: keep-alive, Upgrade
Pragma: no-cache
Cache-Control: no-cache
Upgrade: websocket
Response
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: S2Q+O/tRM0/mIG//Er2hDdqD0eY=
Sec-Websocket-Extensions: permessage-deflate; server_no_context_takeover; client_no_context_takeover
X-Via: haproxy-ext90

The code is correct but defstream.avail_in expect exact length, not for the extra null character. If you will provide exact input length it will work properly. Try like this
int l = (uInt)strlen(a);
defstream.avail_in = l; // size of input, string + terminator
Please try to understand how z_stream is assigning value check the code then you will better understand about this.
I think many people are trying the decode and they are not getting proper value for per-message-deflate.
Check your output
"789CAA562AA92C4855B252CA4D2D2E4E4C4F55D2514ACE48CCCB4BCD18A391B98841AB8593A873454B522B4A804221A9C5259979E9BE70D599294A56E6A6E6B5C0000FFFF2E"
Here 789C is header and 0000FFFF or according to your code (00FFFF) is the trailer of deflate message. here 0 is for 00 because your output is print one '0' instead of '00'. you are getting 5C extra because of extra one size. Thye '2E' will also will be removed when you will try below code
int d = memcmp(msgend, &b[defstream.total_out - 4], 4) ? 0 : 4;
for(i=0;i<defstream.total_out - d;i++)
{
printf("%2X",(Bytef)b[i]);
}
So your expected answer is
"AA562AA92C4855B252CA4D2D2E4E4C4F55D2514ACE48CCCB4BCD018A391B98841AB8593A8703454B522B4A804221A9C5259979E9BE70D599294A56E6A6E6B50000"
As per your output, 2 bytes are 0 so as per your code it will be '00' and your output follow printing each nibble (half byte).
Please try this is worked for me and I am now able to decode properly.

Uncompress a gzip File in Memory Using zlib Version 1.1.3

I have a gzip file that is in memory, and I would like to uncompress it using zlib, version 1.1.3. Uncompress() is returning -3, Z_DATA_ERROR, indicating the source data is corrupt.
I know that my in memory buffer is correct - if I write the buffer out to a file, it is the same as my source gzip file.
The gzip file format indicates that there is a 10 byte header, optional headers, the data, and a footer. Is it possible to determine where the data starts, and strip that portion out? I performed a search on this topic, and a couple people have suggested using inflateInit2(). However, in my version of zlib, that function is oddly commented out. Is there any other options?

I came across the same problem, other zlib version (1.2.7)
I don't know why inflateInit2() is commented out.
Without calling inflateInit2 you can do the following:
err = inflateInit(&d_stream);
err = inflateReset2(&d_stream, 31);
the inflateReset2 is also called by inflateInit. Inside of inflateInit the WindowBits are set to 15 (1111 binary). But you have to set them to 31 (11111) to get gzip working.
The reason is here:
inside of inflateReset2 the following is done:
wrap = (windowBits >> 4) + 1;
which leads to 1 if window bits are set 15 (1111 binary) and to 2 if window bits are set 31 (11111)
Now if you call inflate() the following line in the HEAD state checks the state->wrap value along with the magic number for gzip
if ((state->wrap & 2) && hold == 0x8b1f) { /* gzip header */
So with the following code I was able to do in-memory gzip decompression:
(Note: this code presumes that the complete data to be decompressed is in memory and that the buffer for decompressed data is large enough)
int err;
z_stream d_stream; // decompression stream
d_stream.zalloc = (alloc_func)0;
d_stream.zfree = (free_func)0;
d_stream.opaque = (voidpf)0;
d_stream.next_in = deflated; // where deflated is a pointer the the compressed data buffer
d_stream.avail_in = deflatedLen; // where deflatedLen is the length of the compressed data
d_stream.next_out = inflated; // where inflated is a pointer to the resulting uncompressed data buffer
d_stream.avail_out = inflatedLen; // where inflatedLen is the size of the uncompressed data buffer
err = inflateInit(&d_stream);
err = inflateReset2(&d_stream, 31);
err = inflateEnd(&d_stream);
Just commenting in inflateInit2() is the oder solution. Here you can set WindowBits directly

Is it possible to determine where the data starts, and strip that portion out?
Gzip has the following magic number:
static const unsigned char gzipMagicBytes[] = { 0x1f, 0x8b, 0x08, 0x00 };
You can read through a file stream and look for these bytes:
static const int testElemSize = sizeof(unsigned char);
static const int testElemCount = sizeof(gzipMagicBytes);
const char *fn = "foo.bar";
FILE *fp = fopen(fn, "rbR");
char testMagicBuffer[testElemCount] = {0};
unsigned long long testMagicOffset = 0ULL;
if (fp != NULL) {
do {
if (memcmp(testMagicBuffer, gzipMagicBytes, sizeof(gzipMagicBytes)) == 0) {
/* we found gzip magic bytes, do stuff here... */
fprintf(stdout, "gzip stream found at byte offset: %llu\n", testMagicOffset);
break;
}
testMagicOffset += testElemSize * testElemCount;
fseek(fp, testMagicOffset - testElemCount + 1, SEEK_SET);
testMagicOffset -= testElemCount + 1;
} while (fread(testMagicBuffer, testElemSize, testElemCount, fp));
}
fclose(fp);
Once you have the offset, you could do copy and paste operations, or overwrite other bytes, etc.