How to skip detection of random data when attempting to compress? - c

Do popular compressors such as gzip, 7z, or others using deflate detect random data strings and skip attempting to compress those strings for the sake of speed?
If so, can I switch off this setting?
Otherwise, how can I implement deflate to attempt to compress a random data string?
I've looked at zlib's deflate source, and it does not mention the word "random" anywhere; however, I'm concerned that somewhere higher up in zlib it detects a random block of bits/bytes and skips over it, bypassing deflate.
How can I be sure that a compressor, such as zlib, attempts to compress a block of random data?
Can you give an example command-line expression or code?

Unless you request level 0 (no compression), zlib always attempts to compress the data. For every deflate block, it compares the size of that block using dynamic codes, static codes, and stored (no compression), and emits the smallest of the three. That is all.
There is no detection of "random" data, even if such a thing were possible. (It's not possible, of course. For example, encrypted data is definitely not random, but is indistinguishable from random data if you don't know how to decrypt it.)
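As a quick check, here is a minimal sketch using zlib's compress2() on a buffer of pseudo-random bytes (the 1 MiB size, rand() source, and level 9 are arbitrary choices): the output comes back slightly larger than the input, showing that zlib attempted compression on every block and fell back to stored blocks rather than skipping the data.

```c
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void) {
    const uLong SRC_LEN = 1 << 20;              /* 1 MiB of pseudo-random input */
    Bytef *src = malloc(SRC_LEN);
    uLongf dst_len = compressBound(SRC_LEN);    /* worst-case output size */
    Bytef *dst = malloc(dst_len);

    srand(1);
    for (uLong i = 0; i < SRC_LEN; i++)
        src[i] = (Bytef)(rand() & 0xff);

    /* Level 9: maximum effort -- zlib still tries to compress every block. */
    int ret = compress2(dst, &dst_len, src, SRC_LEN, 9);
    if (ret != Z_OK) {
        fprintf(stderr, "compress2 failed: %d\n", ret);
        return 1;
    }
    /* Expect dst_len to be a tiny bit larger than SRC_LEN: the data was
       incompressible, so deflate emitted stored blocks plus a small header. */
    printf("in: %lu  out: %lu\n", (unsigned long)SRC_LEN, (unsigned long)dst_len);
    free(src);
    free(dst);
    return 0;
}
```

On the command line, something like `head -c 1000000 /dev/urandom | gzip -9 | wc -c` should show the same behavior: the gzip output is marginally larger than the input, not skipped and not dramatically expanded.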

Related

Deflate Format: differences between type blocks

I am currently trying to write a compressor and decompressor that follow the DEFLATE specification (RFC 1951).
I'm not able to understand the difference between how blocks are composed when compressing with fixed tables versus dynamic tables. The file is processed by LZ77, generating (distance, length) pairs plus literals.
How do I know the type of block?
Do I have to compress this data?
Given that fixed compression doesn't require sending the tables, how does the encoder know how to encode the data?
Moreover, do I have to send anything before the actual compressed data?
I am confused about the difference between the fixed tables and the table we send in dynamic mode, and how the two block types use them to encode data.
I'm currently reading Data Compression: The Complete Reference. Any advice will be helpful.
Since you are trying to compress, you would pick the smaller of the two. zlib's deflate computes what the size of a fixed block, a dynamic block, and a stored block would be, and emits the smallest of the three.
If you are encoding a fixed block, you encode using the fixed code for literal/lengths and distances. This code is provided in the RFC.
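For reference, here is a sketch of how the fixed literal/length and distance code lengths from RFC 1951, section 3.2.6, could be laid out. zlib builds its equivalent tables internally, so you only need something like this when writing your own encoder or decoder.

```c
/* Fixed Huffman code lengths for DEFLATE (RFC 1951, section 3.2.6).
   Both encoder and decoder derive the actual codes from these lengths
   with the canonical-code algorithm in section 3.2.2, which is why a
   fixed block carries no table at all. */
static void build_fixed_code_lengths(unsigned char litlen[288],
                                     unsigned char dist[30]) {
    int i;
    for (i = 0;   i <= 143; i++) litlen[i] = 8;   /* literals 0..143        */
    for (i = 144; i <= 255; i++) litlen[i] = 9;   /* literals 144..255      */
    for (i = 256; i <= 279; i++) litlen[i] = 7;   /* end-of-block + lengths */
    for (i = 280; i <= 287; i++) litlen[i] = 8;   /* remaining length codes */
    for (i = 0;   i <= 29;  i++) dist[i]   = 5;   /* all distance codes     */
}
```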

Zlib minimum deflate size

I'm trying to figure out if there's a way to calculate a minimum required size for an output buffer, based on the size of the input buffer.
This question is similar to zlib, deflate: How much memory to allocate?, but not the same. I am asking about each chunk in isolation, rather than the entire stream.
So suppose we have two buffers, INPUT and OUTPUT, and a BUFFER_SIZE of, say, 4096 bytes. (Just a convenient number; there's no particular reason I chose this size.)
If I deflate using:
deflate(stream, Z_PARTIAL_FLUSH)
so that each chunk is compressed, and immediately flushed to the output buffer, is there a way I can guarantee I'll have enough storage in the output buffer without needing to reallocate?
Superficially, we'd assume that the DEFLATED data will always be smaller than the uncompressed input data (assuming we use a compression level greater than 0).
Of course, that's not always the case - especially for small values. For example, if we deflate a single byte, the deflated data will obviously be larger than the uncompressed data, due to the overhead of things like headers and dictionaries in the LZW stream.
Thinking about how LZW works, it would seem that if our input data is at least 256 bytes (meaning that, in the worst-case scenario, every single byte is different and we can't really compress anything), an input smaller than 256 bytes plus the zlib headers could potentially require a LARGER output buffer.
But, generally, real-world applications aren't going to be compressing sizes that small. So assuming an input/output buffer of something more like 4K, is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers that will guarantee that the output data (the compressed stream) will be smaller than the input data? Or is there always some pathological case that can cause the output stream to be larger than the input stream, regardless of size?
Though I can't quite make heads or tails out of your question, I can comment on parts of the question in isolation.
is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?
Absolutely not. It will always be possible for the compressed output to be larger than some input. Otherwise you wouldn't be able to compress other input.
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
Almost no overhead. We're talking a fraction of a percent larger than the input buffer for reasonable sizes.
By the way, deflateBound() provides a bound on the size of the entire output stream as a function of the size of the entire input stream. It can't help you when you are in the middle of a bunch of deflate() calls with incomplete input and insufficient output space. For example, you may still have deflate output pending that will be delivered by the next deflate() call, without providing any new input at all. Then the expansion ratio is infinite for that isolated call.
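To see how small the deflateBound() overhead actually is, here is a minimal sketch (4096 is just the buffer size from the question); the bound it prints is only a handful of bytes above the input size.

```c
#include <stdio.h>
#include <zlib.h>

int main(void) {
    z_stream strm = {0};
    if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;
    /* Upper bound on the compressed size of 4096 bytes of arbitrary input. */
    uLong bound = deflateBound(&strm, 4096);
    printf("deflateBound(4096) = %lu\n", (unsigned long)bound);
    deflateEnd(&strm);
    return 0;
}
```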
due to the overhead of things like headers and dictionaries in the LZW stream.
deflate is not LZW. The approach it uses is called LZ77. It is very different from LZW, which is now obsolete. There are no "dictionaries" stored in compressed deflate data. The "dictionary" is simply the uncompressed data that precedes the data currently being compressed or decompressed.
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers ...
The whole idea behind the zlib interface is for you to not have to worry about what will fit in the buffers. You just keep calling deflate() or inflate() with more input data and more output space until you're done, and all will be well. It does not matter if you need to make more than one call to consume one buffer of input, or more than one call to fill one buffer of output. You just have loops to make more calls, provide more input when needed, and disposition the output when needed and provide fresh output space.
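A minimal sketch of that loop (file-to-file, fixed 4096-byte buffers chosen to match the question, error handling trimmed) looks like this:

```c
#include <stdio.h>
#include <zlib.h>

#define CHUNK 4096

/* Compress src to dst until EOF; returns a zlib error code or Z_OK. */
static int def(FILE *src, FILE *dst, int level) {
    unsigned char in[CHUNK], out[CHUNK];
    z_stream strm = {0};
    int ret, flush;

    if ((ret = deflateInit(&strm, level)) != Z_OK)
        return ret;
    do {
        strm.avail_in = fread(in, 1, CHUNK, src);
        strm.next_in = in;
        flush = feof(src) ? Z_FINISH : Z_NO_FLUSH;
        do {            /* keep calling deflate() until it needs more input */
            strm.avail_out = CHUNK;
            strm.next_out = out;
            deflate(&strm, flush);      /* never returns Z_STREAM_ERROR here */
            fwrite(out, 1, CHUNK - strm.avail_out, dst);
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);
    deflateEnd(&strm);
    return Z_OK;
}
```

The point is that neither buffer size constrains the other: the inner loop simply keeps providing fresh output space until deflate() has nothing more to emit for the input it was given.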
Information theory dictates that there must always be pathological cases which "compress" to something larger.
This page starts off with the worst-case encoding sizes for zlib: it looks like the worst-case growth is 6 bytes, plus 5 bytes per started 16 KB block. So if you always flush after less than 16 KB, buffers that are 11 bytes plus your flush interval should be safe.
Unless you have tight control over the type of data you're compressing, finding pathological cases isn't hard. Any random number generator will find you some pretty quickly.
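As a back-of-the-envelope sketch based on the figure quoted above (6 bytes plus 5 bytes per started 16 KB block): treat it as an estimate only, and prefer deflateBound() when you want a number zlib itself guarantees.

```c
/* Rough worst-case output size for n input bytes, using the
   "6 bytes + 5 bytes per started 16 KB block" figure quoted above.
   This is an estimate, not an API guarantee; use deflateBound()
   for a bound zlib actually stands behind. */
static unsigned long rough_deflate_worst_case(unsigned long n) {
    unsigned long blocks = (n + 16383) / 16384;   /* started 16 KB blocks */
    return n + 6 + 5 * blocks;
}
```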

find and replace data on gzip content efficiently

My C Linux-based program's inputs are:
char *in_str, char *find_str, char *replacing_str
in_str is compressed data (gzip).
The program needs to find find_str within the uncompressed input data, replace it with replacing_str, and then recompress the data.
The trivial way to do so is to use one of the many available gzip compress/uncompress libraries to uncompress the data, manipulate the uncompressed data, and then recompress the output. However, I need to make it as efficient as possible (it is a real-time program).
I wonder whether it is more efficient to use an on-the-fly library approach (e.g. zlibc) or simply to do the operation as described above.
Maybe it is important to mention that:
the find_str and replacing_str strings are a small portion of the data
their lengths are not equal
find_str is expected to appear about 4 or 5 times
the uncompressed data length is ~2K - 6K bytes
Is anyone familiar with an efficient way to implement this?
Thanks
You are going to have to decompress no matter what, in order to search for the strings. (You might be able to get away with doing that only once and building an index. However that might be much larger than the uncompressed data, so you might as well just store it uncompressed instead.)
You can avoid recompressing all of it by preparing the gzip file ahead of time to be compressed in smaller, history-less units using, for example, the Z_FULL_FLUSH option of zlib. This will reduce compression slightly, depending on how often you do it, but will greatly speed up building the output if only one of many blocks needs to be recompressed.
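A minimal sketch of preparing such a file is below. It compresses the data in fixed-size units and issues Z_FULL_FLUSH after each one, so every unit can later be replaced and recompressed on its own; the unit size, the gzip wrapper choice, and the requirement that the output buffer be large enough (e.g. sized with deflateBound() plus per-flush overhead) are all assumptions of the sketch.

```c
#include <zlib.h>

/* Compress `len` bytes from `data` into `out` (of size `outsize`),
   erasing the match history every `unit` bytes with Z_FULL_FLUSH so
   that each unit can later be replaced and recompressed independently.
   Returns the number of bytes written, or 0 on error / output too small. */
static unsigned long gzip_with_full_flushes(const unsigned char *data,
                                            unsigned long len,
                                            unsigned long unit,
                                            unsigned char *out,
                                            unsigned long outsize) {
    z_stream strm = {0};
    /* windowBits 15 + 16 asks zlib for a gzip wrapper instead of a zlib one. */
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return 0;
    strm.next_out = out;
    strm.avail_out = outsize;
    for (unsigned long off = 0; off < len; off += unit) {
        unsigned long n = len - off < unit ? len - off : unit;
        strm.next_in = (unsigned char *)data + off;
        strm.avail_in = n;
        int flush = (off + n == len) ? Z_FINISH : Z_FULL_FLUSH;
        int ret = deflate(&strm, flush);
        /* With enough output space, one call both consumes the unit and
           completes the flush; otherwise bail out in this sketch. */
        if (ret == Z_STREAM_ERROR ||
            (flush == Z_FINISH && ret != Z_STREAM_END) ||
            strm.avail_in != 0) {
            deflateEnd(&strm);
            return 0;
        }
    }
    unsigned long written = outsize - strm.avail_out;
    deflateEnd(&strm);
    return written;
}
```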

In what situation would compressed data be larger than input?

I need to handle compression of data that's largely UTF-8 HTML content in a utility I'm working on. The utility uses zLib and the deflate algorithm to compress data. Is it safe to assume that if the input data size is over 1 kB, compressed data will always be smaller than uncompressed input? (Input data below 1 kB will not be compressed.)
I'm trying to think of situations where this assumption would break, but apart from near-perfectly random input, it seems like a safe assumption to me.
Edit: the reason I'm wondering about this assumption is because I already have a buffer allocated that's as big as the input data. If my assumption holds, I can reuse this same buffer and avoid another memory allocation.
No. You can never assume that the compressed data will always be smaller. In fact, if any sequence is compressed by the algorithm, then you are guaranteed that some other sequence is expanded.
You can use zlib's deflate() function to compress as much as it can into your 1K buffer. Do whatever you need to with that result, then continue with another deflate() call writing into that same buffer.
Alternatively you can allocate a buffer big enough for the largest expansion. The deflateBound() or compressBound() functions will tell you how much that is. It's only a small amount more.
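A minimal sketch of that second option (the function name is arbitrary; compressBound() gives the worst-case size for one-shot compression with compress()/compress2()):

```c
#include <stdlib.h>
#include <zlib.h>

/* Compress src[0..src_len) into a freshly allocated buffer that is
   guaranteed to be big enough, even for incompressible input.
   Returns the buffer (caller frees) and stores the size in *out_len,
   or returns NULL on error. */
static unsigned char *compress_whole(const unsigned char *src, uLong src_len,
                                     uLongf *out_len) {
    uLongf bound = compressBound(src_len);     /* worst-case output size */
    unsigned char *dst = malloc(bound);
    if (dst == NULL)
        return NULL;
    if (compress2(dst, &bound, src, src_len, Z_DEFAULT_COMPRESSION) != Z_OK) {
        free(dst);
        return NULL;
    }
    *out_len = bound;                          /* actual compressed size */
    return dst;
}
```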
As far as I know, a sequence of 128 bytes with values 0, 1, 2, ..., 127 will not be compressed by zLib. Technically, it's possible to intentionally create an HTML page that will break your compression scheme, but with normal innocent HTML data you should be almost perfectly safe.
But almost perfectly is not perfectly. If you already have a buffer of that size, I'd advise attempting the compression with this buffer, and if it turns out that the buffer is not big enough (I suppose zLib has a means of indicating that), then allocate a larger buffer or simply store an uncompressed version. And make sure you write these cases to some log so you can see whether it ever fires :)
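A minimal sketch of that fallback strategy, assuming a one-shot compress2() call: compress2() returns Z_BUF_ERROR when the output does not fit in the buffer you gave it, which is exactly the signal you would log and handle.

```c
#include <zlib.h>

/* Try to compress `len` bytes of `in` into `out`, which is the same size
   as the input. Returns 1 and sets *out_len if the compressed data fit,
   or 0 if it did not (the caller should then store the data uncompressed
   and, per the advice above, log the event). */
static int try_compress_same_size(const unsigned char *in, uLong len,
                                  unsigned char *out, uLongf *out_len) {
    uLongf dest_len = len;                      /* output buffer == input size */
    int ret = compress2(out, &dest_len, in, len, Z_DEFAULT_COMPRESSION);
    if (ret == Z_OK) {
        *out_len = dest_len;
        return 1;                               /* compressed copy fits */
    }
    return 0;                                   /* Z_BUF_ERROR: didn't fit */
}
```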

Resume DEFLATE decompression from flush point

This is a question specific to the DEFLATE algorithm, but it relates to gzip and zlib.
Suppose I have a gzip file that I know has several flush points in it. Some were made with Z_SYNC_FLUSH and others with Z_FULL_FLUSH. If I scan through the file, I can find all the flush points because they immediately follow the byte pattern 00 00 ff ff.
I know that I can resume decompression at a Z_FULL_FLUSH point because all the information needed to decompress is available there (i.e., the dictionary is reset). However, if I try to decompress from a Z_SYNC_FLUSH point, I usually get a "zlib.error: Error -3 while decompressing: invalid distance too far back" error.
The question is this: If I try to decompress from a Z_SYNC_FLUSH point, am I guaranteed to either:
Properly decompress that block and subsequent blocks
Fail with "distance too far" error
In other words, am I guaranteed that I will never silently decompress with bad data (I'm not talking about the CRC32 check at the end of the gzip, but whether zlib will loudly complain)?
Assumptions:
Assume that I am able to identify flush points perfectly. Let's pretend that I never misidentify random bits as the sync marker, and that the pattern never just happens to appear inside a type 0 block. This is unrealistic, but just assume it's true.
Assume the file is never corrupted and is always a legitimate gzip file.
If a Z_SYNC_FLUSH results in a subsequent stream that does not give a distance-too-far error, then it is, by accident, equivalent to and indistinguishable from a Z_FULL_FLUSH.
I would not expect this to happen very often.
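A minimal sketch of such an attempt, assuming `data` points just past a 00 00 ff ff marker and that the gzip header was already skipped earlier (so the remainder is treated as raw deflate via negative windowBits):

```c
#include <stddef.h>
#include <zlib.h>

/* Try to resume decompression at a flush point. Returns 1 if inflate
   proceeds without complaint, 0 if it reports an error such as
   "invalid distance too far back" (Z_DATA_ERROR). The decompressed
   bytes themselves are discarded; we only care whether zlib objects. */
static int try_resume_at(const unsigned char *data, size_t avail,
                         unsigned char *out, size_t outsize) {
    z_stream strm = {0};
    /* -15: raw deflate data, no zlib/gzip header expected at this point. */
    if (inflateInit2(&strm, -15) != Z_OK)
        return 0;
    strm.next_in = (unsigned char *)data;
    strm.avail_in = avail;
    int ok = 1;
    while (strm.avail_in > 0 && ok) {
        strm.next_out = out;                    /* reuse the scratch buffer */
        strm.avail_out = outsize;
        int ret = inflate(&strm, Z_NO_FLUSH);
        if (ret == Z_STREAM_END)
            break;
        if (ret != Z_OK)            /* Z_DATA_ERROR: distance too far, etc. */
            ok = 0;
    }
    inflateEnd(&strm);
    return ok;
}
```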
