Compression ratios for different zlib compression levels - zlib

I am deciding what level of zlib compression to use, and I am curious about the compression ratios achieved by the different compression levels that can be specified in zlib calls. The zlib manual has the following constants for specifying the compression level:
#define Z_NO_COMPRESSION 0
#define Z_BEST_SPEED 1
#define Z_BEST_COMPRESSION 9
#define Z_DEFAULT_COMPRESSION (-1)
Clearly, a lower number means lower latency for compression (deflation) at the cost of a less-compressed result, while a higher number favors better compression at the cost of higher latency.
My question is: what compression ratios can be expected at the different compression levels? This zlib web page says, in the context of maximum compression, that "more typical zlib compression ratios are on the order of 2:1 to 5:1", but are there also compression ratios/ranges for the other compression levels as well?

There is no answer to your question that is independent of the data. You can predict neither compression ratios, nor ratios of compression ratios, just from the respective compression levels. The levels reflect how hard the compressor looks for matching strings in the previous 32K of data. That's all.
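Since any real numbers depend entirely on your data, the most practical approach is to measure them yourself. Here is a minimal sketch (assuming zlib's one-shot compress2() API; the input is a synthetic placeholder, so substitute a representative sample of your real data, and build with -lz):

#include <stdio.h>
#include <zlib.h>

static unsigned char input[100000];   /* placeholder; use your real data */
static unsigned char output[120000];

int main(void)
{
    for (size_t i = 0; i < sizeof input; i++)
        input[i] = "hello zlib "[i % 11];          /* synthetic, repetitive */
    for (int level = 0; level <= 9; level++) {
        uLongf out_len = sizeof output;
        if (compress2(output, &out_len, input, sizeof input, level) != Z_OK) {
            fprintf(stderr, "compress2 failed at level %d\n", level);
            continue;
        }
        printf("level %d: %lu -> %lu bytes (ratio %.2f:1)\n",
               level, (unsigned long)sizeof input, (unsigned long)out_len,
               (double)sizeof input / (double)out_len);
    }
    return 0;
}

Expect level 0 to expand the data slightly (stored blocks plus headers), level 1 to be fastest, and the higher levels to trade increasing CPU time for, typically, only marginally better ratios; the exact numbers vary with the input.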

Related

Snowflake: Data loading file size recommendations

https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html#general-file-sizing-recommendations
The number of load operations that run in parallel cannot exceed the number of data files to be loaded. To optimize the number of parallel operations for a load, we recommend aiming to produce data files roughly 100-250 MB (or larger) in size compressed.
I got the above details from the Snowflake docs; they simply say "(or larger)". Can someone explain what the maximum recommended size is?
It's a trade-off between aggregating smaller files (thus reducing overhead) and splitting larger files into smaller ones (thus distributing the workload and increasing parallelism).
The general size recommendation that balances this trade-off is 100-250 MB; that is what's in the docs. The term "or larger" just means that the best file size in your individual scenario can also be above 250 MB, e.g. 300 MB, depending on how that trade-off works out for you.

decompression algorithms that can work virtually without RAM (LZ like if possible)

edit: I'll try to rephrase this to make it as clear as I can :)
I need to choose a suitable compression scheme to store a blob of data (approx. 900KB) in a ROM where the available free space is only about 700KB. If I compress the blob with some modern compression tool (e.g. WinZip/WinRAR) I can achieve the required compression easily.
The catch is that decompression will take place on very, VERY limited hardware, where I can't afford more than a few bytes of RAM (say no more than 100 bytes, for the sake of argument).
I already tried RLE'ing the data... it hardly compresses.
While I'm working on changing the blob format so that it has more redundancy and achieves a better compression ratio, I'm at the same time looking for a compression method that will let me decompress on this limited hardware. I have limited knowledge of compression algorithms, so I'm looking for suggestions/pointers to continue my hunt.
Thanks!
The original question was: "I need info/pointers on decompression algorithms that can work without using the uncompressed data, as this will be unavailable right after decompression. LZ-like approaches would still be preferred."
I'm afraid this is off-topic because it is too broad.
LZW uses a sizable state that is not very different from keeping a slice of uncompressed data. Even if the state is constant and read from ROM, handling it with just registers seems difficult. There are many different algorithms that can use a constant state, but if you really have NO RAM, then only the most basic algorithms can be used.
Look up RLE, run length encoding.
EDIT: OK, no sliding window, but if you can access ROM, 100 bytes of RAM give you quite a few possibilities. You want to implement this in assembly, so stick with very simple algorithms: RLE plus a dictionary. Given your requirements, the choice of algorithm should be based on the type of data you need to decompress.
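For illustration only, here is a bare-bones RLE decoder sketch along those lines. The two-byte count/value record format is an assumption (not any standard), and the ROM is modeled as a const array; the point is that the working state is just a position and two byte-sized temporaries, which on a tiny MCU could live almost entirely in registers:

#include <stdint.h>
#include <stdio.h>

static const uint8_t rom_blob[] = { 5, 'A', 3, 'B', 1, '\n' };  /* example data */

static void emit_byte(uint8_t b)
{
    putchar(b);   /* on the real target: write to the display/peripheral instead */
}

void rle_decode(const uint8_t *rom, uint32_t len)
{
    uint32_t pos = 0;
    while (pos + 1 < len) {
        uint8_t count = rom[pos++];     /* repeat count */
        uint8_t value = rom[pos++];     /* byte to repeat */
        while (count--)
            emit_byte(value);
    }
}

int main(void)
{
    rle_decode(rom_blob, sizeof rom_blob);   /* prints "AAAAABBB" and a newline */
    return 0;
}

An RLE-plus-dictionary variant would replace the value byte with an index into a table of common byte sequences kept in ROM, which still needs no RAM beyond a few counters.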

Which compression to use between embedded processors (known byte distributions)

I'm working with radio-to-radio communications where bandwidth is really, really precious. It's all done with bare-metal C code (no OS, small Atmel 8-bit microcontrollers). So the idea of compression becomes appealing for some large, but rare, transmissions.
I'm no compression expert. I've used the command line tools to shrink files and looked at how much I get. And linked a library or two over the years. But never anything this low level.
In one example, I want to move about 28K over the air between processors. If I just do a simple bzip2 -9 on a representative file, I get about 65% of the original size.
But I'm curious whether I can do better. I am (naively?) under the impression that most basic compression formats put some declaration of metadata up front that describes how to inflate the bitstream that follows. What I don't know is how much space that metadata itself takes up. I histogrammed that same file, and a number of others, and found that due to the nature of what's being transmitted, the histogram is almost always about the same. So I'm curious whether I could hard-code these frequencies in my code so that they were no longer dynamic and also weren't transmitted as part of the packet.
For example, my understanding of Huffman encoding is that there's usually a "dictionary" up front, followed by a bitstream, and that if a compressor works by blocks, each block will have its own dictionary.
On top of this, it's a small processor with a small footprint, so I'd like to keep whatever I do small, simple, and straightforward.
So I guess the basic question is: what, if any, basic compression algorithm would you implement in this kind of environment/scenario, especially taking into account that you can basically precompile a representative histogram of the bytes per transmission?
What you are suggesting, providing preset frequency data, would help very little. Or more likely it would hurt, since you will take a hit by not using optimal codes. As an example, only about 80 bytes at the start of a deflate block is needed to represent the literal/length and distance Huffman codes. A slight increase in the, say, 18 KB of your compressed data could easily cancel that.
With zlib, you could use a representative one of your 28K messages as a dictionary in which to search for matching strings. This could help the compression quite a bit, if there are many common strings in your messages. See deflateSetDictionary() and inflateSetDictionary().
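A rough sketch of that preset-dictionary approach follows. The message and dictionary contents are invented for illustration; in practice both radios would embed the same representative message (up to 32K), and the dictionary bytes must be identical on both ends. Build with -lz:

#include <stdio.h>
#include <zlib.h>

static const unsigned char dict[] =
    "TEMP=23.5;HUM=41;STATUS=OK;SEQ=";          /* hypothetical common strings */

int main(void)
{
    unsigned char msg[] = "TEMP=24.1;HUM=40;STATUS=OK;SEQ=1042";
    unsigned char comp[256], decomp[256];

    /* Compress with a preset dictionary. */
    z_stream c = {0};                            /* zalloc/zfree/opaque = Z_NULL */
    deflateInit(&c, Z_BEST_COMPRESSION);
    deflateSetDictionary(&c, dict, sizeof dict - 1);
    c.next_in = msg;      c.avail_in = sizeof msg - 1;
    c.next_out = comp;    c.avail_out = sizeof comp;
    deflate(&c, Z_FINISH);
    uLong comp_len = sizeof comp - c.avail_out;
    deflateEnd(&c);
    printf("compressed %zu -> %lu bytes\n", sizeof msg - 1, comp_len);

    /* Decompress: inflate() reports Z_NEED_DICT, then we supply the dictionary. */
    z_stream d = {0};
    inflateInit(&d);
    d.next_in = comp;     d.avail_in = (uInt)comp_len;
    d.next_out = decomp;  d.avail_out = sizeof decomp;
    int rc = inflate(&d, Z_FINISH);
    if (rc == Z_NEED_DICT) {
        inflateSetDictionary(&d, dict, sizeof dict - 1);
        rc = inflate(&d, Z_FINISH);
    }
    printf("decompressed %lu bytes, rc=%d\n",
           (unsigned long)(sizeof decomp - d.avail_out), rc);
    inflateEnd(&d);
    return 0;
}

Note that this uses the zlib wrapper; for raw deflate streams (deflateInit2()/inflateInit2() with negative windowBits) the dictionary is set the same way, but inflate() will not return Z_NEED_DICT, so inflateSetDictionary() is called right after inflateInit2().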

What can be the least possible value of data-compression-ratio for any real dataset

I am writing a zlib-like API for an embedded hardware compressor which uses the deflate algorithm to compress a given input stream.
Before going further, I would like to explain data compression ratio. The data compression ratio is defined as the ratio between the uncompressed size and the compressed size.
The compression ratio is usually greater than one, which means the compressed data is usually smaller than the uncompressed data; that is the whole point of compression. But this is not always the case. For example, using the zlib library on pseudo-random data generated on some Linux machine gives a compression ratio of roughly 0.996, which means 9960 bytes compressed into 10000 bytes.
I know zlib handles this situation by using a type 0 (stored) block, where it simply returns the original uncompressed data with roughly a 5-byte header, so it gives only 5 bytes of overhead per data block of up to 64KB. This is an intelligent solution to the problem, but for some reason I cannot use it in my API. I have to provide extra safe space in advance to handle this situation.
Now, if I knew the least possible data compression ratio, it would be easy to calculate the extra space I have to provide. Otherwise, to be safe, I have to provide more extra space than needed, which can be crucial in an embedded system.
While calculating the data compression ratio, I am not concerned with the header, footer, extremely small datasets, or system-specific details, as I am handling those separately. What I am particularly interested in is: does there exist any real dataset, with a minimum size of 1K, which gives a compression ratio of less than 0.99 using the deflate algorithm? In that case the calculation would be:
Compression ratio = uncompressed size / (compressed size using deflate, excluding header, footer, and system-specific overhead)
Please provide feedback. Any help would be appreciated. It would be great if a reference to such a dataset could be provided.
EDIT:
@MSalters' comment indicates that the hardware compressor is not following the deflate specification properly, and this may be a bug in the microcode.
Because of the pigeonhole principle
http://en.wikipedia.org/wiki/Pigeonhole_principle
you will always have strings that get compressed and strings that get expanded.
http://matt.might.net/articles/why-infinite-or-guaranteed-file-compression-is-impossible/
Theoretically you can achieve the best compression with zero-entropy data (an infinite compression ratio) and the worst compression with maximum-entropy data such as AWGN-like noise, where the ratio drops to about 1 or slightly below it because of the format overhead.
I can't tell from your question whether you're using zlib or not. If you're using zlib, it provides a function, deflateBound(), which does exactly what you're asking for, taking an uncompressed size and returning the maximum compressed size. It takes into account how the deflate stream was initialized with deflateInit() or deflateInit2() in computing the proper header and trailer sizes.
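For example, a minimal sketch of sizing an output buffer with deflateBound() (the 10000-byte figure just mirrors the example in the question; build with -lz):

#include <stdio.h>
#include <zlib.h>

int main(void)
{
    z_stream strm = {0};                      /* zalloc/zfree/opaque = Z_NULL */
    if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;
    uLong src_len = 10000;
    uLong max_out = deflateBound(&strm, src_len);   /* worst-case output size */
    printf("worst case for %lu input bytes: %lu output bytes\n", src_len, max_out);
    deflateEnd(&strm);
    return 0;
}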
If you're writing your own deflate, then you will already know what the maximum compressed size is based on how often you allow it to use stored blocks.
Update: The only way to know for sure the maximum data expansion of a hardware deflator is to obtain the algorithm used. Then through inspection you can determine how often it will emit stored blocks for random data.
The only alternative is empirical and unreliable. You can feed the hardware compressor random data, and examine the results. You can use infgen to disassemble the deflate output and see the stored blocks and their sizes. Then you can write a linear bounding formula for the expansion. Then add some margin to the additive and multiplicative terms to cover for situations that you did not observe in your tests.
This will only work if the hardware deflate algorithm is well behaved, which means that it will not write a fixed or dynamic deflate block if a stored block would be smaller. If it is not well behaved, then all bets are off.
The deflate format itself takes a similar approach to the one you describe for zlib: each block starts with a 3-bit header, and the two block-type bits are 00 when the following block is stored, length-prefixed but otherwise uncompressed.
This means the worst case is a one-byte input that blows up to 6 bytes (3 bits of header, 5 bits of padding, 32 bits of length, 8 bits of data), so the worst ratio is 1/6 ≈ 0.17.
This is of course assuming an optimal encoder. A suboptimal encoder would transmit a Huffman table for that one byte.
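As a back-of-the-envelope check, and only under the assumption that the encoder is well behaved in the sense above (stored blocks start on byte boundaries and are used whenever they are smaller), the worst-case output size excluding the stream header and trailer can be bounded like this:

#include <stdio.h>

/* Each stored block covers at most 65535 bytes and costs about 5 bytes:
   the block-type bits padded to a byte boundary plus the 16-bit LEN and
   NLEN fields.  Stream header/trailer are excluded, matching the
   question's definition of the ratio. */
unsigned long stored_block_bound(unsigned long n)
{
    unsigned long blocks = (n + 65534) / 65535;   /* ceil(n / 65535) */
    return n + 5 * blocks;
}

int main(void)
{
    printf("10000 -> at most %lu bytes\n", stored_block_bound(10000));   /* 10005 */
    return 0;
}

If the hardware compressor is not well behaved, this bound does not hold, and only the empirical approach described above remains.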

How do you compress captured video in Silverlight?

One of the big deals in Silverlight v4 is audio/video capture... but I haven't found an example yet that does what I want to do. So:
How do you capture audio/video with Silverlight (from a webcam), and then save it as a compressed format (WMV or MP4)? The idea here is to upload it after compression.
I have already looked at this blog post for the capture piece, but I need to find a way to compress the audio/video for upload.
Silverlight does not support video encoding, and most likely this won't be implemented, at least not by Microsoft. To transmit video over the network, some people use a "pseudo-MJPEG" codec, compressing individual frames as regular JPEG images. Some people have even improved that idea by dividing frames into fixed blocks (say 8x8) and transmitting only the changed blocks (using a lossy comparison).
If you're a veteran programmer and enjoy coding, here is another slightly improved version of the "pseudo-MJPEG" idea:
Divide the current frame into fixed 8x8 blocks
Apply RGB -> YCbCr color space conversion to each block
Downsample the Cb and Cr planes by half
Apply a DCT to the YCbCr data
Quantize the DCT coefficients with a quantization matrix
Compare these quantized coefficients with the previous frame's block; this gives a "perceptually lossy" comparison between consecutive frames (a rough sketch of this step appears below)
Use a bit-wise range coder and encode a flag for unchanged blocks
For changed blocks, transmit the DCT coefficients by modeling them (you can use JPEG's standard zig-zag pattern and zero-run model) and encoding them with the range coder
This is more or less the standard JPEG algorithm, actually. But the actual advantages over standard JPEG are:
Perceptually lossy comparison for blocks
Stronger compression due to both smaller overhead and a stronger entropy coder (range coder)
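Purely as an illustration of the block-comparison step (written in C for brevity rather than Silverlight/C#), here is a rough sketch of how one 8x8 luma block could be DCT-transformed, quantized, and compared against the previous frame's quantized coefficients. The quantization table is JPEG's standard luminance table used as a placeholder, and the naive O(N^4) DCT is only for clarity; build with -lm:

#include <math.h>
#include <stdio.h>
#include <string.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8

/* JPEG's standard luminance quantization table, used here as a placeholder. */
static const int quant[N][N] = {
    {16, 11, 10, 16, 24, 40, 51, 61},
    {12, 12, 14, 19, 26, 58, 60, 55},
    {14, 13, 16, 24, 40, 57, 69, 56},
    {14, 17, 22, 29, 51, 87, 80, 62},
    {18, 22, 37, 56, 68,109,103, 77},
    {24, 35, 55, 64, 81,104,113, 92},
    {49, 64, 78, 87,103,121,120,101},
    {72, 92, 95, 98,112,100,103, 99}
};

/* Naive O(N^4) forward 2-D DCT of one 8x8 block; fine for a sketch. */
static void dct8x8(double in[N][N], double out[N][N])
{
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++) {
            double sum = 0.0;
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    sum += in[x][y]
                         * cos((2 * x + 1) * u * M_PI / (2.0 * N))
                         * cos((2 * y + 1) * v * M_PI / (2.0 * N));
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            out[u][v] = 0.25 * cu * cv * sum;
        }
}

/* "Perceptually lossy" comparison: quantize the current block's coefficients
   and compare them with the previous frame's quantized block.  Returns 1
   (and updates prev_q) if the block changed and must be encoded and sent. */
int block_changed(double cur[N][N], int prev_q[N][N])
{
    double coeff[N][N];
    int q[N][N];
    dct8x8(cur, coeff);
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++)
            q[u][v] = (int)lround(coeff[u][v] / quant[u][v]);
    if (memcmp(q, prev_q, sizeof q) != 0) {
        memcpy(prev_q, q, sizeof q);      /* remember for the next frame */
        return 1;
    }
    return 0;
}

int main(void)
{
    double frame1[N][N] = {{0}}, frame2[N][N] = {{0}};
    int prev[N][N] = {{0}};
    frame2[3][4] = 120.0;                                  /* simulate a change */
    printf("frame 1 changed: %d\n", block_changed(frame1, prev));   /* 0 */
    printf("frame 2 changed: %d\n", block_changed(frame2, prev));   /* 1 */
    return 0;
}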
Another option could be to pay for 3rd-party software (sorry, I don't know of any free software). I did find one such product; I haven't used it at all, but I believe it could be useful for you.
