Which compression to use between embedded processors (known byte distributions) - c

I'm working with radio-to-radio communications where bandwidth is genuinely precious. It's all bare-metal C code (no OS, small Atmel 8-bit microcontrollers). So the idea of compression becomes appealing for some large, but rare, transmissions.
I'm no compression expert. I've used the command line tools to shrink files and looked at how much I get. And linked a library or two over the years. But never anything this low level.
In one example, I want to move about 28K over the air between processors. If I just do a simple bzip2 -9 on a representative file, I get about 65% of the original size.
But I'm curious whether I can do better. I am (naively?) under the impression that most basic compression formats include some declaration of metadata up front that describes how to inflate the bitstream that follows. What I don't know is how much space that metadata itself takes up. I histogrammed that same file, and a number of others, and found that, due to the nature of what's being transmitted, the histogram is almost always about the same. So I'm wondering whether I could hard-code those frequencies so they are no longer computed dynamically, and no longer transmitted as part of the packet.
For example, my understanding of Huffman encoding is that there's usually a "dictionary" up front, followed by a bitstream, and that if a compressor works in blocks, each block will have its own dictionary.
On top of this, it's a small processor with a small footprint, so I'd like to keep whatever I do small, simple, and straightforward.
So I guess the basic question is: what basic compression algorithm, if any, would you implement in this kind of environment/scenario, especially taking into account that you can essentially precompute a representative histogram of the bytes per transmission?

What you are suggesting, providing preset frequency data, would help very little, or more likely would hurt, since you take a hit by not using optimal codes. As an example, only about 80 bytes at the start of a deflate block are needed to represent the literal/length and distance Huffman codes. A slight increase in the, say, 18 KB of your compressed data could easily cancel that out.
With zlib, you could use a representative one of your 28K messages as a dictionary in which to search for matching strings. This could help the compression quite a bit, if there are many common strings in your messages. See deflateSetDictionary() and inflateSetDictionary().
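A minimal sketch of that preset-dictionary route, assuming zlib (or an equivalent) is usable on both ends and that dict/dict_len hold the agreed-upon representative message; the names and error handling are illustrative only:

    #include <string.h>
    #include <zlib.h>

    /* Both sides must be built with exactly the same dictionary bytes. */
    extern const unsigned char dict[];
    extern const unsigned int  dict_len;

    /* Compress src into dst using the shared preset dictionary. */
    int compress_with_dict(const unsigned char *src, unsigned src_len,
                           unsigned char *dst, unsigned dst_cap, unsigned *out_len)
    {
        z_stream s;
        memset(&s, 0, sizeof(s));
        if (deflateInit(&s, Z_BEST_COMPRESSION) != Z_OK)
            return -1;
        deflateSetDictionary(&s, dict, dict_len);
        s.next_in  = (Bytef *)src;  s.avail_in  = src_len;
        s.next_out = dst;           s.avail_out = dst_cap;
        if (deflate(&s, Z_FINISH) != Z_STREAM_END) {
            deflateEnd(&s);
            return -1;
        }
        *out_len = dst_cap - s.avail_out;
        return deflateEnd(&s) == Z_OK ? 0 : -1;
    }

    /* Decompress: with the zlib format, inflate() reports Z_NEED_DICT, at
       which point the same dictionary is supplied and inflate() is resumed. */
    int decompress_with_dict(const unsigned char *src, unsigned src_len,
                             unsigned char *dst, unsigned dst_cap, unsigned *out_len)
    {
        z_stream s;
        int ret;
        memset(&s, 0, sizeof(s));
        if (inflateInit(&s) != Z_OK)
            return -1;
        s.next_in  = (Bytef *)src;  s.avail_in  = src_len;
        s.next_out = dst;           s.avail_out = dst_cap;
        ret = inflate(&s, Z_FINISH);
        if (ret == Z_NEED_DICT) {
            inflateSetDictionary(&s, dict, dict_len);
            ret = inflate(&s, Z_FINISH);
        }
        if (ret != Z_STREAM_END) {
            inflateEnd(&s);
            return -1;
        }
        *out_len = dst_cap - s.avail_out;
        return inflateEnd(&s) == Z_OK ? 0 : -1;
    }

Stock zlib's working memory (about 32 KB just for the inflate window, considerably more for deflate) dwarfs the 2 KB of SRAM on an ATmega328-class part, so treat this as a model to hand-roll against rather than something to link directly on the radio side.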

Related

Fastest disk-based solution to cache trillions of unique MD5 hashes

Is there a very low-latency, disk-based caching solution that I can use to store only unique values (NOT key+value)?
My script needs to keep track of which files it has processed so it doesn't redo any work. I need to check the cache for the MD5 hash of the file; if it isn't there, I process the file and add the hash to the cache.
Is there a faster disk-based caching solution than a key-value store?
Try LevelDB.
It's a key-value store, but it is quite compact because keys are prefix-compressed inside its sorted table files.
Less space usage => less I/O => better performance.
Not sure about "trillions" (a trillion MD5 hashes would be 16 TB), but Bitcoin Core as well as Ethereum implementations all use LevelDB.
In your case, there is no need for an "Ordered Key-Value Store". That is, you can rely on plain key-value stores (direct dbm successors).
Good candidates are:
Tokyo Cabinet: it has a hash-based format that might be faster in your case.
gdbm
In the case where the dataset fits into memory, you might want to try LMDB.
I do not recommend LevelDB because it is slow.
Do the math. 1 trillion MD5s, without any tricks, would take 16TB of disk space. This is, I assume, far more than your RAM size.
Since each MD5 lookup is essentially a 'random' probe into the disk, there will necessarily be about 1 disk hit per check.
If, say, an SSD read is 1ms, that is 1e9 seconds to insert (or check) a trillion hashes. That's 30 years.
There are a lot of flaws in my math, but I think this says that it is not practical today to store and check a trillion of anything random.
If you want to crank it down to a billion MD5s, now we are getting into the range of RAM sizes. But you probably want to have the data persisted? So you really need some database-like tool that will do the persisting for you, while making the checks purely in RAM (at CPU speed).
In any case, I would consider writing code that breaks the MD5 into 2 or 3 chunks, then uses the chunks like a directory structure. At the bottom level, you have a variable-length bunch of values for the last chunk, each perhaps 8 bytes long. That would need a linear or binary search through a bunch of numbers that are half the size of an MD5. The savings here helps compensate for the various overheads in the rest of the structure, plus the need for writing blocks to disk. Hence, I would still expect to need about 16 GB of RAM to house a billion MD5s.
Given that approach, virtually any database engine is already geared up to do most of the work reasonably efficiently. The lowest level would be some type of BLOB containing multiple 8-byte chunks.
Another trick to use... Let's look at just the first 5 bytes of an MD5. There are about a trillion (2^40) different values in 5 bytes. If you have only a billion entries in your dataset, then checking those 5 bytes has a 99.9% chance of correctly saying "the MD5 is not in the dataset", versus roughly a 0.1% chance of saying "the MD5 might be in the dataset". In the former case, you get a quick answer with only 5 GB of RAM for a billion items. In the latter case, you may have to go to disk and be slower. Still, the average time is better. This helps with the speed of checking (but does not address the speed of loading).
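A rough sketch of that prefix filter in C, assuming the 5-byte prefixes are kept sorted in one big in-memory array (the names and the loading of that array are left out):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sorted array of 5-byte MD5 prefixes, one per item already seen.
       For a billion items this is about 5 GB, as noted above. */
    extern const uint8_t *prefixes;
    extern size_t n_prefixes;

    static int cmp5(const void *a, const void *b)
    {
        return memcmp(a, b, 5);
    }

    /* 0: definitely not in the set (no disk access needed).
       1: might be in the set; confirm against the on-disk store. */
    int maybe_present(const uint8_t md5[16])
    {
        return bsearch(md5, prefixes, n_prefixes, 5, cmp5) != NULL;
    }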

C: Most efficient way to store variables where every bit matters

To start off: this might be a duplicate, but I can't seem to find a definitive answer to this question after searching for it on Google.
For a project I am writing a program that makes two ATmega328P chips communicate. At the moment I'm testing the best speed at which to do this, but my goal is to achieve really high baud rates. I have plenty of experience with making code efficient, but not with the memory-management part. The problem:
I want to store a multiple of 8 bits (e.g. 48 bits). My first thought was to use an array of length 6 and type uint8_t, but I don't know how efficient arrays are compared to other types. Some people say pointers are more efficient and others say it doesn't matter, but I can't find a definitive answer on what the case is for really small amounts of memory. Last question: I know the data sent will never be bigger than 64 bits, so would it matter if I just always used uint64_t?
Edit:
To clarify: my goal is to minimize the storage size, not the transmission size.
Edit2:
What I meant by having a varying size: the size is determined at compile time, not while the program is running.
The ATmega328P is an 8-bit processor; it operates on data 8 bits at a time. Nothing will be faster than simply having a uint8_t array.
What you can do, when you compile, is look at your .lss file; it will show you the assembly code, and you can then look up the AVR instruction set and see how many clock cycles each instruction takes. I think you will find that using a uint64_t just adds unnecessary overhead unless you are very careful about the way you put the bytes into it.
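As an illustration, a bit-packing sketch over a plain uint8_t array, sized for the 48-bit example in the question (the names and layout are made up, not from the question):

    #include <stdint.h>

    #define PAYLOAD_BITS  48
    #define PAYLOAD_BYTES ((PAYLOAD_BITS + 7) / 8)

    static uint8_t payload[PAYLOAD_BYTES];   /* 6 bytes, no padding or overhead */

    /* Set or clear bit 'idx' (bit 0 = first bit). Only 8-bit operations,
       so the compiler never has to synthesize 64-bit arithmetic. */
    static inline void payload_set_bit(uint8_t idx, uint8_t value)
    {
        uint8_t mask = (uint8_t)(1u << (idx & 7u));
        if (value)
            payload[idx >> 3] |= mask;
        else
            payload[idx >> 3] &= (uint8_t)~mask;
    }

    static inline uint8_t payload_get_bit(uint8_t idx)
    {
        return (payload[idx >> 3] >> (idx & 7u)) & 1u;
    }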
If the length of your packets might vary, the most efficient approach would be to compress each packet before communication.
For example, the first 3 bits of each packet could encode the size of that packet.
Compressed packets are communicated faster and use up less memory space.
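One possible reading of that 3-bit length field, as a sketch; the exact layout of the header byte is an assumption, not anything standard:

    #include <stdint.h>

    /* First byte of a frame: top 3 bits = payload length in bytes (0..7),
       bottom 5 bits left free for flags or a sequence number (assumed layout). */
    static inline uint8_t frame_header(uint8_t payload_len, uint8_t flags)
    {
        return (uint8_t)(((payload_len & 0x07u) << 5) | (flags & 0x1Fu));
    }

    static inline uint8_t frame_len(uint8_t header)   { return header >> 5; }
    static inline uint8_t frame_flags(uint8_t header) { return header & 0x1Fu; }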

decompression algorithms that can work virtually without RAM (LZ like if possible)

Edit: I have tried to rephrase this to make it as clear as I can :)
I need to find a suitable way / choose a suitable compression to store a blob of data (approx. 900 KB) in a ROM where the amount of free space available is only about 700 KB. If I compress the blob with some modern compression tool (e.g. WinZip/WinRAR) I can achieve the required compression easily.
The catch is that the decompression will take place on very, VERY limited hardware where I can't afford to have more than a few bytes of RAM available (say no more than 100 bytes, for the sake of argument).
I already tried RLE'ing the data... the data hardly compresses.
While I'm trying to change the data blob format so that it could potentially have more redundancy and achieve a better compression ratio, I'm at the same time seeking a compression method that will let me decompress on my limited hardware. My knowledge of compression algorithms is limited, so I'm looking for suggestions/pointers to continue my hunt.
Thanks!
The original question was: "I need info/pointers on decompression algorithms that can work without using the uncompressed data, as this will be unavailable right after decompression. LZ-like approaches would still be preferred."
I'm afraid this is off topic because it is too broad.
LZW uses a sizable state that is not very different from keeping a slice of uncompressed data. Even if the state is constant and read from ROM, handling it with just registers seems difficult. There are many different algorithms that can use a constant state, but if you really have NO RAM, then only the most basic algorithms can be used.
Look up RLE, run length encoding.
EDIT: OK, no sliding window, but if you can access ROM, 100 bytes of RAM give you quite a few possibilities. You will want to implement this in assembly, so stick with very simple algorithms: RLE plus a dictionary. Given your requirements, the choice of algorithm should be based on the type of data you need to decompress.
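To illustrate how little state a simple scheme needs, here is a byte-oriented RLE decoder sketch; the read/emit callbacks stand in for whatever ROM-read and output mechanism the hardware actually provides, and the stream format is an assumption:

    #include <stdint.h>

    /* Assumed stream format: a count byte N followed by one literal byte,
       meaning "emit that byte N times"; N == 0 terminates the stream. */
    void rle_decode(uint8_t (*read_rom)(void), void (*emit)(uint8_t))
    {
        uint8_t count, value;              /* the only state: two bytes */

        for (;;) {
            count = read_rom();
            if (count == 0)
                break;
            value = read_rom();
            while (count--)
                emit(value);
        }
    }

A static dictionary of common byte sequences can live entirely in ROM and be indexed the same way, so it costs no RAM beyond a couple of index bytes.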

What can be the least possible value of data-compression-ratio for any real dataset

I am writing a zlib-like API for an embedded hardware compressor which uses the deflate algorithm to compress a given input stream.
Before going further, I would like to explain the data compression ratio. The data compression ratio is defined as the ratio between the uncompressed size and the compressed size.
The compression ratio is usually greater than one, which means the compressed data is usually smaller than the uncompressed data, which is the whole point of compression. But this is not always the case. For example, using the zlib library on pseudo-random data generated on a Linux machine gives a compression ratio of roughly 0.996, which means 9960 bytes were "compressed" into 10000 bytes.
I know zlib handles this situation by using a type 0 (stored) block, where it simply returns the original uncompressed data with a roughly 5-byte header, so it adds only about 5 bytes of overhead per data block of up to 64 KB. This is an intelligent solution to the problem, but for some reason I cannot use it in my API. I have to provide extra safe space in advance to handle this situation.
Now, if I knew the least possible data compression ratio, it would be easy for me to calculate the extra space I have to provide. Otherwise, to be safe, I would have to provide more extra space than needed, which can matter a lot in an embedded system.
While calculating the data compression ratio, I am not concerned with the header, footer, extremely small datasets, or system-specific details, as I am handling those separately. What I am particularly interested in is whether there exists any real dataset, with a minimum size of 1 KB, that gives a compression ratio of less than 0.99 using the deflate algorithm. In that case the calculation would be:
Compression ratio = uncompressed size / (compressed size using deflate, excluding header, footer, and system-specific overhead)
Please provide feedback. Any help would be appreciated. It would be great if a reference to such a dataset could be provided.
EDIT:
@MSalters' comment indicates that the hardware compressor is not following the deflate specification properly, and that this could be a bug in the microcode.
Because of the pigeonhole principle:
http://en.wikipedia.org/wiki/Pigeonhole_principle
you will always have strings that get compressed and strings that get expanded:
http://matt.might.net/articles/why-infinite-or-guaranteed-file-compression-is-impossible/
Theoretically you can achieve the best compression with zero-entropy data (an unbounded compression ratio) and the worst compression with maximum-entropy data (e.g. AWGN noise), where the ratio drops to about 1, or slightly below once the format overhead is counted.
I can't tell from your question whether you're using zlib or not. If you're using zlib, it provides a function, deflateBound(), which does exactly what you're asking for, taking an uncompressed size and returning the maximum compressed size. It takes into account how the deflate stream was initialized with deflateInit() or deflateInit2() in computing the proper header and trailer sizes.
If you're writing your own deflate, then you will already know what the maximum compressed size is based on how often you allow it to use stored blocks.
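If zlib is available, a minimal sketch of querying that bound (the stream here must be initialized the same way the real compression stream will be):

    #include <zlib.h>

    /* Upper bound on the compressed size of src_len input bytes. */
    unsigned long worst_case_compressed_size(unsigned long src_len)
    {
        z_stream s;
        unsigned long bound;

        s.zalloc = Z_NULL;
        s.zfree  = Z_NULL;
        s.opaque = Z_NULL;
        if (deflateInit(&s, Z_DEFAULT_COMPRESSION) != Z_OK)
            return 0;                      /* caller treats 0 as "unknown" */
        bound = deflateBound(&s, src_len);
        deflateEnd(&s);
        return bound;
    }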
Update: The only way to know for sure the maximum data expansion of a hardware deflator is to obtain the algorithm used. Then through inspection you can determine how often it will emit stored blocks for random data.
The only alternative is empirical and unreliable. You can feed the hardware compressor random data, and examine the results. You can use infgen to disassemble the deflate output and see the stored blocks and their sizes. Then you can write a linear bounding formula for the expansion. Then add some margin to the additive and multiplicative terms to cover for situations that you did not observe in your tests.
This will only work if the hardware deflate algorithm is well behaved, which means that it will not write a fixed or dynamic deflate block if a stored block would be smaller. If it is not well behaved, then all bets are off.
The deflate format itself provides for this, much like the zlib library's approach. Each block starts with a 3-bit header, and the two block-type bits are 00 when the block that follows is stored: length-prefixed but otherwise uncompressed.
This means the worst case is a one-byte input that blows up to 6 bytes (3 bits of header, 5 bits of padding to the byte boundary, 32 bits of length and its complement, 8 bits of data), so the worst ratio is 1/6 ≈ 0.17.
This is of course assuming an optimal encoder. A suboptimal encoder would transmit a Huffman table for that one byte.
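For inputs larger than a byte the same reasoning gives a simple bound, assuming a well-behaved encoder that falls back to stored blocks of at most 65535 bytes each (wrapper headers and trailers not included; zlib's own deflateBound() adds a little more margin):

    #include <stdint.h>

    /* Worst-case deflate output for n input bytes under the assumption above:
       each stored block costs 5 bytes (1 byte of header bits plus padding,
       2 bytes LEN, 2 bytes NLEN) and holds at most 65535 bytes of data. */
    uint64_t deflate_stored_bound(uint64_t n)
    {
        uint64_t blocks = (n + 65534) / 65535;
        if (blocks == 0)
            blocks = 1;                    /* even empty input needs one block */
        return n + 5 * blocks;
    }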

What is an optimal format for saving large amounts of numerical data (GBs) from a C program?

I'm a physicist who normally deals with large amounts of numerical data generated using C programs. Typically, I store everything as columns in ASCII files, but this has led to massively large files. Given that I am limited in space, this is an issue and I'd like to be a little smarter about the whole thing. So ...
Is there a better format than ASCII? Should I be using binary files, or perhaps a custom format from some library?
Should I be compressing each file individually, or the entire directory? In either case, what format should I use?
Thanks a lot!
In your shoes, I would consider the standard scientific data formats, which are much less space- and time-consuming than ASCII, but (while maybe not quite as bit-efficient as pure, machine-dependent binary formats) still offer standard documented and portable, fast libraries to ease the reading and writing of the data.
If you store data in pure binary form, the metadata is crucial to making any sense of the data again (are these numbers single or double precision, or integers and of what length, what are the arrays' dimensions, etc.), and issues with archiving and retrieving the paired data/metadata can, and in practice occasionally do, make perfectly good datasets unusable -- a real pity and waste.
CDF, in particular, is "a self-describing data format for the storage and manipulation of scalar and multidimensional data in a platform- and discipline-independent fashion" with many libraries and utilities to go with it. As alternatives, you might also consider NetCDF and HDF -- I'm less familiar with those (and such tradeoffs as flexibility vs size vs speed issues) but, seeing how widely they're used by scientists in many fields, I suspect any of the three formats could give you very acceptable results.
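As a flavor of what that looks like in practice, a bare-bones NetCDF write of a 1-D array of doubles might look roughly like this (variable and dimension names are placeholders, and error handling is trimmed to the first call):

    #include <stddef.h>
    #include <netcdf.h>

    /* Write n doubles from 'values' into out.nc as a single 1-D variable. */
    int write_netcdf(const double *values, size_t n)
    {
        int ncid, dimid, varid;

        if (nc_create("out.nc", NC_CLOBBER, &ncid) != NC_NOERR)
            return -1;
        nc_def_dim(ncid, "sample", n, &dimid);
        nc_def_var(ncid, "measurement", NC_DOUBLE, 1, &dimid, &varid);
        nc_enddef(ncid);                           /* leave define mode */
        nc_put_var_double(ncid, varid, values);    /* self-describing, portable */
        return nc_close(ncid);
    }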
If you need the files for a long time, because they are important experimental data that prove something for you, don't use raw binary formats. You may not be able to read them when your architecture changes, which is dangerous. Stick to text (yes, ASCII) files.
Choose a compression format that fits your needs. Is compression time an issue? Usually not, but check that for yourself. Is decompression time an issue? Usually yes, if you want to do data analysis on it. Under these conditions I'd go for bzip2. It is quite common nowadays, well tested, and foolproof. I'd compress files individually, since the larger the file, the larger the probability of loss (bit flips, etc.).
A terabyte disk is a hundred bucks; it's hard to run out of space these days. Sure, storing the data in binary saves space, but there's a cost: you'll have far fewer options for getting the data out of the file again.
Check what your operating system can do. Windows, for example, supports automatic compression on folders; the file contents get compressed by the file system without you having to do anything at all. The compression rates should compete well with raw binary data.
There's a lot of info you didn't include, but should think about:
1.) Are you storing integers or floats? What is the typical range of the numbers?
For example: storing small comma-separated integers in ASCII, such as "1,2,4,2,1", will average 2 bytes per datum, but storing them as binary would require 4 bytes per datum.
If your integers are typically 3 digits, then comma-separated vs. binary won't matter much.
On the other hand, storing doubles (8-byte values) will almost certainly be smaller in binary format (see the sketch after this list).
2.) How do you need to access these values? If you are not concerned about access time, compress away! On the other hand, if you need speedy, random access then compression will probably hinder you.
3.) Are some values frequently repeated? Then you may consider a Huffman encoding or a table of "short-cut" values.
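To make point 1 concrete for doubles, a quick sketch of the two layouts (file names arbitrary, error handling omitted):

    #include <stdio.h>

    /* ASCII: up to about 25 bytes per double at full precision, plus separators. */
    void save_ascii(const char *path, const double *v, size_t n)
    {
        FILE *f = fopen(path, "w");
        for (size_t i = 0; i < n; i++)
            fprintf(f, "%.17g\n", v[i]);
        fclose(f);
    }

    /* Binary: exactly 8 bytes per double, but tied to this machine's
       endianness and floating-point representation. */
    void save_binary(const char *path, const double *v, size_t n)
    {
        FILE *f = fopen(path, "wb");
        fwrite(v, sizeof(double), n, f);
        fclose(f);
    }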

Resources