Resume DEFLATE decompression from flush point - zlib

This is a question specific to the DEFLATE algorithm, but it relates to gzip and zlib.
Suppose I have a gzip file that I know contains several flush points, some made with Z_SYNC_FLUSH and others with Z_FULL_FLUSH. If I scan through the file, I can find all the flush points, because each one immediately follows the byte pattern 00 00 ff ff.
I know that I can resume decompression at a Z_FULL_FLUSH point, because all the information needed to decompress is available there (i.e., the dictionary is reset). However, if I try to decompress from a Z_SYNC_FLUSH point, I usually get a "zlib.error: Error -3 while decompressing: invalid distance too far back" error.
The question is this: if I try to decompress from a Z_SYNC_FLUSH point, am I guaranteed to either:
1. properly decompress that block and all subsequent blocks, or
2. fail with a "distance too far" error?
In other words, am I guaranteed that I will never silently decompress bad data? (I'm not talking about the CRC32 check at the end of the gzip file, but about whether zlib itself will loudly complain.)
Assumptions:
Assume that I am able to identify flush points perfectly. Let's pretend that I never mis-identify random bits as the sync marker, and that the pattern never just happens to appear inside a type 0 block. This is unrealistic, but just assume it's true.
Assume the file is never corrupted and is always a legitimate gzip file.

If a Z_SYNC_FLUSH results in a subsequent stream that does not give a distance-too-far error, then it is, by accident, equivalent to and indistinguishable from a Z_FULL_FLUSH.
I would not expect this to happen very often.
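For concreteness, here is a minimal C sketch of the attempt-and-check approach, assuming data points at the raw deflate bytes starting just past a 00 00 ff ff marker (the function name, buffer size, and offset handling are illustrative, not a prescribed API):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Try to resume raw-deflate decompression at a flush point. Returns Z_OK or
   Z_STREAM_END on success, Z_DATA_ERROR if the stream references history we
   don't have -- the "invalid distance too far back" case. */
int try_resume(const unsigned char *data, size_t len)
{
    z_stream strm;
    unsigned char out[16384];
    int ret;

    memset(&strm, 0, sizeof(strm));
    if (inflateInit2(&strm, -15) != Z_OK)   /* -15 = raw deflate, no wrapper */
        return Z_MEM_ERROR;

    strm.next_in = (unsigned char *)data;
    strm.avail_in = (uInt)len;
    do {
        strm.next_out = out;
        strm.avail_out = sizeof(out);
        ret = inflate(&strm, Z_NO_FLUSH);
        /* ... consume out[0 .. sizeof(out) - strm.avail_out] here ... */
    } while (ret == Z_OK && (strm.avail_in > 0 || strm.avail_out == 0));

    if (ret == Z_DATA_ERROR && strm.msg != NULL)
        fprintf(stderr, "inflate: %s\n", strm.msg);
    inflateEnd(&strm);
    return ret;
}

If the flush point was in fact a full flush (or an accidental equivalent), this runs to the end of the stream; otherwise it fails loudly with Z_DATA_ERROR.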

Related

How to skip detection of random data when attempting to compress?

Do popular compressors such as gzip, 7z, or others that use deflate detect random data strings and skip attempting to compress them for the sake of speed?
If so, can I switch off this setting?
Otherwise, how can I implement deflate to attempt to compress a random data string?
I've looked at zlib's deflate source, and it does not mention the word "random" anywhere; however, I'm concerned that somewhere higher up, zlib detects a random block of bits/bytes and skips over it, overriding deflate.
How can I be sure that a compressor, such as zlib, attempts to compress a block of random data?
Can you give an example command-line expression or code?
Unless you request level 0 (no compression), zlib always attempts to compress the data. For every deflate block, it compares the size of that block using dynamic codes, static codes, and stored (no compression), and emits the smallest of the three. That is all.
There is no detection of "random" data, even if such a thing were possible. (It's not possible, of course. For example, encrypted data is definitely not random, but is indistinguishable from random data if you don't know how to decrypt it.)
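A quick way to see this for yourself, as a minimal sketch (the buffer sizes are arbitrary; 70000 is comfortably above compressBound(65536)):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void)
{
    static unsigned char in[65536], out[70000];
    uLongf outlen = sizeof(out);
    size_t i;

    srand(1);                               /* fill the input with pseudo-random bytes */
    for (i = 0; i < sizeof(in); i++)
        in[i] = (unsigned char)rand();

    /* Level 9: zlib still tries dynamic, static, and stored coding for
       every block and emits whichever is smallest. */
    if (compress2(out, &outlen, in, sizeof(in), 9) != Z_OK)
        return 1;
    printf("in: %lu bytes, out: %lu bytes\n",
           (unsigned long)sizeof(in), (unsigned long)outlen);
    return 0;
}

On a typical run the output comes out slightly larger than the input: zlib did attempt compression, but stored blocks still cost a few bytes of framing each.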

Zlib minimum deflate size

I'm trying to figure out if there's a way to calculate a minimum required size for an output buffer, based on the size of the input buffer.
This question is similar to zlib, deflate: How much memory to allocate?, but not the same. I am asking about each chunk in isolation, rather than the entire stream.
So suppose we have two buffers, INPUT and OUTPUT, and a BUFFER_SIZE of, say, 4096 bytes. (Just a convenient number; there is no particular reason I chose this size.)
If I deflate using:
deflate(stream, Z_PARTIAL_FLUSH)
so that each chunk is compressed, and immediately flushed to the output buffer, is there a way I can guarantee I'll have enough storage in the output buffer without needing to reallocate?
Superficially, we'd assume that the deflated data will always be smaller than the uncompressed input data (assuming we use a compression level greater than 0).
Of course, that's not always the case - especially for small values. For example, if we deflate a single byte, the deflated data will obviously be larger than the uncompressed data, due to the overhead of things like headers and dictionaries in the LZW stream.
Thinking about how LZW works, it would seem that if our input data is at least 256 bytes (so that in the worst-case scenario every single byte is different and we can't really compress anything), then an input size LESS than 256 bytes, plus the zlib headers, could potentially require a LARGER output buffer.
But, generally, real-world applications aren't going to be compressing such small sizes. So, assuming input/output buffers of something more like 4K, is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers that will guarantee that the output data (the compressed stream) will be smaller than the input data? Or is there always some pathological case that can cause the output stream to be larger than the input stream, regardless of size?
Though I can't quite make heads or tails out of your question, I can comment on parts of the question in isolation.
is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?
Absolutely not. It will always be possible for the compressed output to be larger than some input. Otherwise you wouldn't be able to compress other input.
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
Almost no overhead. We're talking a fraction of a percent larger than the input buffer for reasonable sizes.
By the way, deflateBound() provides a bound on the size of the entire output stream as a function of the size of the entire input stream. It can't help you when you are in the middle of a bunch of deflate() calls with incomplete input and insufficient output space. For example, you may still have deflate output pending and delivered by the next deflate() call, without providing any new input at all. Then the expansion ratio is infinite for that isolated call.
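For scale, a sketch of the call itself (stream setup abbreviated; source_len is whatever your total input size is):

z_stream strm = {0};                            /* zalloc/zfree/opaque default to Z_NULL */
deflateInit(&strm, Z_DEFAULT_COMPRESSION);
uLong bound = deflateBound(&strm, source_len);  /* worst case for the whole output stream */
/* allocate bound output bytes, then run the usual deflate() loop */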
due to the overhead of things like headers and dictionaries in the LZW stream.
deflate is not LZW. The approach it uses is called LZ77. It is very different from LZW, which is now obsolete. There are no "dictionaries" stored in compressed deflate data. The "dictionary" is simply the uncompressed data that precedes the data currently being compressed or decompressed.
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers ...
The whole idea behind the zlib interface is for you to not have to worry about what will fit in the buffers. You just keep calling deflate() or inflate() with more input data and more output space until you're done, and all will be well. It does not matter if you need to make more than one call to consume one buffer of input, or more than one call to fill one buffer of output. You just have loops to make more calls, provide more input when needed, and disposition the output when needed and provide fresh output space.
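Condensed from the pattern in zlib's own zpipe.c example, the compression side of that loop looks roughly like this (deflateInit() and I/O error handling are omitted for brevity):

#include <stdio.h>
#include <zlib.h>

#define CHUNK 16384

int def(FILE *source, FILE *dest, z_stream *strm)
{
    unsigned char in[CHUNK], out[CHUNK];
    int flush, ret;

    do {
        strm->avail_in = fread(in, 1, CHUNK, source);
        strm->next_in = in;
        flush = feof(source) ? Z_FINISH : Z_NO_FLUSH;
        do {                            /* run deflate() until it stops filling
                                           the output buffer completely */
            strm->avail_out = CHUNK;
            strm->next_out = out;
            ret = deflate(strm, flush);
            fwrite(out, 1, CHUNK - strm->avail_out, dest);
        } while (strm->avail_out == 0);
    } while (flush != Z_FINISH);        /* all input consumed and flushed */
    return ret == Z_STREAM_END ? Z_OK : Z_DATA_ERROR;
}

The decompression side is symmetric with inflate(). The point is that the buffer sizes never affect correctness, only how many times the loops iterate.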
Information theory dictates that there must always be pathological cases which "compress" to something larger.
This page starts off with the worst-case encoding sizes for zlib: it looks like the worst-case growth is 6 bytes, plus 5 bytes per started 16 KB block. So if you always flush after less than 16 KB, buffers of your flush interval plus 11 bytes should be safe.
Unless you have tight control over the type of data you're compressing, finding pathological cases isn't hard. Any random number generator will find you some pretty quickly.
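Translating those cited figures into code, a sketch of the buffer sizing (this helper is illustrative, not part of zlib's API; the 5-and-6-byte numbers come from the worst-case figures quoted above):

/* Worst-case deflate output for one flushed chunk: 5 bytes of stored-block
   framing per started 16 KB of input, plus 6 bytes of fixed overhead. */
size_t worst_case_out(size_t in_len)
{
    size_t started_blocks = (in_len + 16383) / 16384;
    return in_len + 5 * started_blocks + 6;
}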

gzip decompression using zlib library

I have a problem with decompressing some gzip data. I have an array with pointers to dynamically allocated char strings. Each element of this array is one part of the gzip file that I want to uncompress.
The first thing that comes to mind is to concatenate those strings into one and then decompress the data, but I want to avoid that method because of all the copying.
So the question is: is there any way to decompress data divided into several parts using the zlib library? I tried to do it, but when I decompress the first part I get Z_DATA_ERROR - which is normal, because the data is not complete. Is there any way to "wait" for the rest of the data before continuing to decompress?
Yes. You can simply call inflate() successively with each of the strings in the appropriate order. For each call of inflate(), you can provide a different pointer and length for the compressed data. Each time, make sure that you first consume all of the uncompressed data generated, and that avail_in is zero, before moving on to the next chunk of input.
If you are getting a Z_DATA_ERROR that means that either you are not reassembling the original stream correctly, or that the original stream is not a gzip stream.
Note that to decompress a gzip stream, you need to initialize with inflateInit2() and set the parameters appropriately to request gzip decompression.
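A sketch of that approach in C, assuming parts[] and lengths[] describe your allocated pieces (the names and buffer size are illustrative):

#include <string.h>
#include <zlib.h>

int inflate_parts(unsigned char **parts, size_t *lengths, int nparts)
{
    z_stream strm;
    unsigned char out[16384];
    int i, ret = Z_OK;

    memset(&strm, 0, sizeof(strm));
    if (inflateInit2(&strm, 16 + MAX_WBITS) != Z_OK)  /* 16+ requests a gzip wrapper */
        return Z_MEM_ERROR;

    for (i = 0; i < nparts && ret == Z_OK; i++) {
        strm.next_in = parts[i];
        strm.avail_in = (uInt)lengths[i];
        do {                          /* drain all output before the next part */
            strm.next_out = out;
            strm.avail_out = sizeof(out);
            ret = inflate(&strm, Z_NO_FLUSH);
            /* ... use out[0 .. sizeof(out) - strm.avail_out] ... */
        } while (ret == Z_OK && (strm.avail_in > 0 || strm.avail_out == 0));
    }
    inflateEnd(&strm);
    return ret;   /* Z_STREAM_END = complete; Z_OK = ran out of parts early */
}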

C File Input/Output for Unknown File Types: File Copying

I'm having some issues with a networking assignment. The end goal is a C program that grabs a file from a given URL via HTTP and writes it to a given filename. I've got it working fine for most text files, but I'm running into some issues, which I suspect all stem from the same root cause.
Here's a quick version of the code I'm using to transfer the data from the network file descriptor to the output file descriptor:
unsigned long content_length; // extracted from HTTP header
unsigned long successfully_read = 0;
while(successfully_read != content_length)
{
char buffer[2048];
int extracted = read(connection,buffer,2048);
fprintf(output_file,buffer);
successfully_read += extracted;
}
As I said, this works fine for most text files (though the % symbol confuses fprintf, so it would be nice to have a way to deal with that). The problem is that it just hangs forever when I try to get non-text files (a .png is the basic test file I'm working with, but the program needs to be able to handle anything).
I've done some debugging and I know I'm not going over content_length, getting errors during read, or hitting some network bottleneck. I looked around online but all the C file i/o code I can find for binary files seems to be based on the idea that you know how the data inside the file is structured. I don't know how it's structured, and I don't really care; I just want to copy the contents of one file descriptor into another.
Can anyone point me towards some built-in file i/o functions that I can bludgeon into use for that purpose?
Edit: Alternately, is there a standard field in the HTTP header that would tell me how to handle whatever file I'm working with?
You are using the wrong tool for the job. fprintf takes a format string and extra arguments, like this:
fprintf(output_file, "hello %s, today is the %d", cstring, dayoftheweek);
If you pass the second argument from an unknown source (like the web, which you are doing) you can accidentally have %s or %d or other format specifiers in the string. Then fprintf will try to read more arguments than it was passed, and cause undefined behaviour.
Use fwrite for this:
fwrite(buffer, 1, extracted, output_file);
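Putting it together, the copy loop might look something like this (a sketch; it also checks read()'s return value, and uses < rather than != so an overshoot past content_length cannot spin forever):

/* read() and ssize_t come from <unistd.h> */
unsigned long successfully_read = 0;
while (successfully_read < content_length)
{
    char buffer[2048];
    ssize_t extracted = read(connection, buffer, sizeof(buffer));
    if (extracted <= 0)            /* 0 = connection closed, -1 = error */
        break;
    fwrite(buffer, 1, (size_t)extracted, output_file);
    successfully_read += (unsigned long)extracted;
}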
A couple things with your code:
For fprintf - you are using the data as the second argument, when in fact it should be the format, and the data should be the third argument. This is why you are getting problems with the % character, and why it is struggling when presented with binary data, because it is expecting a format string.
You need to use a different function, such as fwrite, to output the file.
As a side note this is a bit of a security problem - if you fetch a specially crafted file from the server it is possible to expose random areas of your memory.
In addition to Seth's answer: unless you are using a third-party library for handling all the HTTP stuff, you need to deal with the Transfer-Encoding header and the possible compression, or at least detect them and throw an error if you don't know how to handle that case.
In general, it may (or may not) be a good idea to parse the HTTP response headers, and only if they contain exclusively stuff that you understand should you continue to interpret the data that follows the header.
I bet your program is hanging because it's expecting X bytes but receiving Y instead, with X < Y (most likely, sans compression, although PNGs don't compress well with gzip anyway). You'll get chunks [*] of data, with one of the chunks most likely spanning content_length, so your condition while(successfully_read != content_length) is always true.
You could try running your program under strace (or whatever its equivalent is for your OS) to watch how it keeps trying to read data it will never get (because you've likely made an HTTP/1.1 request that holds the connection open, and you haven't made a second request), or how the connection has ended (if the server closes the connection, your repeated calls to read(2) will just return 0, which leaves your still-true loop condition unchanged).
If you are sending your program's output to stdout, you may find that it produces no output; this can happen if the resource you are retrieving contains no newline or other flush-forcing control characters. Other stdio buffering regimes may apply when output goes to a file. (For example, the file may remain empty until the stdio buffers have accumulated at least 4096 bytes.)
[*] Then there's also Transfer-Encoding: chunked, as @roland-illig alludes to, which will ruin the exact equivalence between content_length (presumably derived from the eponymous HTTP header) and the actual number of bytes transferred over the socket.
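Following up on the buffering point above: if you suspect stdio buffering is hiding your output, a minimal fix is to flush after each write, or to disable buffering once after opening the stream:

fwrite(buffer, 1, (size_t)extracted, output_file);
fflush(output_file);   /* or, once after fopen: setvbuf(output_file, NULL, _IONBF, 0); */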
You are opening the file as a text file. On some platforms (notably Windows), this means the C runtime translates \n bytes into \r\n sequences on output, which corrupts binary data. Open the output file in binary mode instead, e.g. fopen(filename, "wb"), and those discrepancies in size will go away.

What happens to a piece of data if you use zlib to decompress it, but it isn't compressed in the first place?

If you decompress data with zlib that isn't compressed, does anything happen?
If it does in fact change the data, how do you check whether data is zlib-compressed in the first place?
There would need to be a valid header. It is extremely unlikely that arbitrary uncompressed data would happen to form an accurately structured (compressed) stream, so zlib would report it as invalid data to inflate.
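If you want a cheap up-front test for the zlib wrapper before calling inflate(), the two header bytes can be checked against RFC 1950 (a heuristic sketch; the authoritative test is still whether inflate() returns Z_DATA_ERROR):

#include <stddef.h>

/* A zlib stream starts with CMF/FLG bytes where the low nibble of CMF is 8
   (deflate) and CMF*256 + FLG is a multiple of 31 (RFC 1950). */
int looks_like_zlib(const unsigned char *p, size_t n)
{
    return n >= 2 && (p[0] & 0x0f) == 8 && ((p[0] << 8) + p[1]) % 31 == 0;
}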
