logically Understanding a compression algorithm - arrays

this idea had been flowing in my head for 3 years and i am having problems to apply it
i wanted to create a compression algorithm that cuts the file size in half
e.g. 8 mb to 4 mb
and with some searching and experience in programming i understood the following.
let's take a .txt file with letters (a,b,c,d)
using the IO.File.ReadAllBytes function , it gives the following array of bytes : ( 97 | 98 | 99 | 100 ) , which according to this : https://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart is the decimal value of the letter.
what i thought about was : how to mathematically cut this 4-membered-array to only 2-membered-array by combining each 2 members into a single member but you can't simply mathematically combine two numbers and simply reverse them back as you have many possibilities,e.g.
80 | 90 : 90+80=170 but there is no way to know that 170 was the result of 80+90 not like 100+70 or 110+60.
and even if you could overcome that , you would be limited by the maximum value of bytes (255 bytes) in a single member of the array.
i understand that most of the compression algorithms use the binary compression and they were successful,but imagine cutting a file size in half , i would like to hear your ideas on this.
Best Regards.

It's impossible to make a compression algorithm that makes every file shorter. The proof is called the "counting argument", and it's easy:
There are 256^L possible files of length L.
Lets say there are N(L) possible files with length < L.
If you do the math, you find that 256^L = 255*N(L)+1
So. You obviously cannot compress every file of length L, because there just aren't enough shorter files to hold them uniquely. If you made a compressor that always shortened a file of length L, then MANY files would have to compress to the same shorter file, and of course you could only get one of them back on decompression.
In fact, there are more than 255 times as many files of length L as there are shorter files, so you can't even compress most files of length L. Only a small proportion can actually get shorter.
This is explained pretty well (again) in the comp.compression FAQ:
http://www.faqs.org/faqs/compression-faq/part1/section-8.html
EDIT: So maybe you're now wondering what this compression stuff is all about...
Well, the vast majority of those "all possible files of length L" are random garbage. Lossless data compression works by assigning shorter representations (the output files) to the files we actually use.
For example, Huffman encoding works character by character and uses fewer bits to write the most common characters. "e" occurs in text more often than "q", for example, so it might spend only 3 bits to write "e"s, but 7 bits to write "q"s. bytes that hardly ever occur, like character 131 may be written with 9 or 10 bits -- longer than the 8-bit bytes they came from. On average you can compress simple English text by almost half this way.
LZ and similar compressors (like PKZIP, etc) remember all the strings that occur in the file, and assign shorter encodings to strings that have already occurred, and longer encodings to strings that have not yet been seen. This works even better since it takes into account more information about the context of every character encoded. On average, it will take fewer bits to write "boy" than "boe", because "boy" occurs more often, even though "e" is more common than "y".
Since it's all about predicting the characteristics of the files you actually use, it's a bit of a black art, and different kinds of compressors work better or worse on different kinds of data -- that's why there are so many different algorithms.

Related

LZSS vs. LZ77 compression difference

Can someone please explain the difference between the LZSS and the LZ77 algorithm. I've been looking online for a couple of hours but I couldn't find the difference. I found the LZ77 algorithm and I understood its implementation.
But, how does LZSS differ from LZ77? Let's say if we have an string "abracadabra" how is LZSS gonna compress it differently from LZ77? Is there a C pseudo-code that I could follow?
Thank you for your time!
Unfortunately, both terms LZ77 and LZSS tend to be used very loosely, so they do not really imply very specific algorithms. When people say that they compressed their data using an LZ77 algorithm, they usually mean that they implemented a dictionary based compression scheme, where a fixed-size window into the recently decompressed data serves as the dictionary and some words/phrases during the compression are replaced by references to previously seen words/phrases within the window.
Let us consider the input data in the form of the word
abracadabra
and assume that window can be as large as the input data. Then we can represent "abracadabra" as
abracad(-7,4)
Here we assume that letters are copied as is, and that the meaning of two numbers in brackets is "go 7 positions back from where we are now and copy 4 symbols from there", which reproduces "abra".
This is the basic idea of any LZ77 compressor. Now, the devil is in the detail. Note that the original word "abracadabra" contains 11 letters, so assuming ASCII representation the word, it is 11 bytes long. Our new representation contains 13 symbols, so if we assume the same ASCII representation, we just expanded the original message, instead of compressing it. One can prove that this can sometimes happen to any compressor, no matter how good it actually is.
So, the compression efficiency depends on the format in which you store the information about uncompressed letters and back references. The original paper where the LZ77 algorithm was first described (Ziv, J. & Lempel, A. (1977) A universal algorithm for sequential data compression. IEEE Transactions on information theory, 23(3), 337-343) uses the format that can be loosely described here as
(0,0,a)(0,0,b)(0,0,r)(0,1,c)(0,1,d)(0,3,a)
So, the compressed data is the sequence of groups of three items: the absolute (not relative!) position in the buffer to copy from, the length of the dictionary match (0 means no match was found) and the letter that follows the match. Since most letters did not match anything in the dictionary, you can see that this is not a particularly efficient format for anything but very compressible data.
This inefficiency may well be the reason why the original form of LZ77 has not been used in any practical compressors.
SS in the "LZSS" refers to a paper that was trying to generalize the ideas of dictionary compression with the sliding window (Storer, J. A. & Szymanski, T. G. (1982). Data compression via textual substitution. Journal of the ACM, 29(4), 928-951). The paper itself looks at several variations of dictionary compression schemes with windows, so once again, you will not find an explicit "algorithm" in it. However, the term LZSS is used by most people to describe the dictionary compression scheme with flag bits, e.g. describing "abracadabra" as
|0a|0b|0r|0a|0c|0a|0d|1-7,4|
where I added vertical lines purely for clarity. In this case numbers 0 and 1 are actually prefix bits, not bytes. Prefix bit 0 says "copy the next byte into the output as is". Prefix bit 1 says "next follows the information for copying a match". Nothing else is really specific, term LZSS is used to say something specific about the use of these prefix signal bits. Hopefully you can see how this can be done compactly, in fact much more efficiently than the format described in LZ77 paper.

Suggestions on to make my compressor faster

I have some data which I'm compressing with a custom compressor, the compressed data is fine but the compressor takes ages, and I'm seeking advice on how I could make that faster. Let me give you all the details.
The input data is an array of bytes, maximum 2^16 of them. Since those bytes in the array NEVER assume values between 0x08 and 0x37 (inclusive), I decided that I could exploit that for a simple LZ-like compression scheme that works by replacing any found sequence of 4 to 51 bytes in length that is already been found at a "lower address" (I mean closer to the array's beginning) with a single byte in the 0x08 to 0x37 range that would then be followed by two bytes addressing the low and high byte of the index of the beginning of the sequence, thus giving the decompressor the length (within that single byte) and address of the original data, to rebuild the original array.
The compressor works this way: for any sequence of any length from 51 to 4 bytes (I test longer sequences first) starting from any index (from left to right) I check if there's a correspondence 'left' of that, which means at an index which is lower than the starting point I'm checking. In case there is more than a single match, I choose the match that 'saves' more, which means the longer correspondence starting at the leftmost place.
The results are just perfect... but of course this is over-killing - it's 4 nested 'for' cycles with a memcmp() inside that, and it takes minutes on a modern workstation to compress some 20 KB worth of data, and that's why I'm seeking help.
Code is accessible here, if you need to sneak a peek. The 'job' starts at line 44.
Of course I can give you any detail you need, there's nothing secret here (BTW, just in case... I'm not going to change compression scheme for this reason, as this one works exactly as I need it!)
Thank you in advance.
A really obvious one is that you don't have to loop over the lengths, just find out what the longest match at that position is. That's not a "search", just keep extending the match by 1 for every matching pair of characters. When it stops, you have the longest match at that position (naturally you can force it to stop at 51 too, so it doesn't overrun).
An other typical trick is keeping a hashmap that maps keys of 3 or 4 characters to a list of offsets where they can be found. That way you only need to try positions that have some hope of resulting in a match. This is also described in the DEFLATE RFC all the way at the bottom.

Fast string search using bitwise operators

What is the fastest (parallel?) way to find a substring in a very long string using bitwise operators?
e.g. find all positions of "GCAGCTGAAAACA" sequence in a human genome http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit (770MB)
*the alphabet consists of 4 symbols ('G','C',T,'A') represented using 2 bits:
'G':00, 'A':01, 'T':10, 'C':11
*you can assume the query string (the shorter one) is fixed in length, e.g. 127 characters
*by fastest I mean not including any pre-processing/indexing time
*the file is going to be loaded into memory after pre-processing, basically there will be billions of short strings to be searched for in a larger string, all in-memory.
*bitwise because I'm looking for the simplest, fastest way to search for a bit pattern in a large bit array and stay as close as possible to the silicon.
*KMP wouldn't work well as the alphabet is small
*C code, x86 machine code would all be interesting.
Input format description (.2bit): http://jcomeau.freeshell.org/www/genome/2bitformat.html
Related:
Fastest way to scan for bit pattern in a stream of bits
Algorithm help! Fast algorithm in searching for a string with its partner
http://www.arstdesign.com/articles/fastsearch.html
http://en.wikipedia.org/wiki/Bitap_algorithm
If you're just looking through a file, you're pretty much guaranteed to be io-bound. Use of a large buffer (~16K), and strstr() should be all you need. If the file is encoded in ascii,search just for "gcagctgaaaaca". If it actually is encoded in bits; just permute the possible accepted strings(there should be ~8; lop off the first byte), and use memmem() plus a tiny overlapping bit check.
I'll note here that glibc strstr and memmem already use Knuth-Morris-Pratt to search in linear time, so test that performance. It may surprise you.
If you first encode/compress the DNA string with a lossless coding method (e.g. Huffman, exponential Golumb, etc.) then you get a ranked probability table ("coding tree") for DNA tokens of various combinations of nucleotides (e.g., A, AA, CA, etc.).
What this means is that, once you compress your DNA:
You'll probably be using fewer bits to store GCAGCTGAAAACA and other subsequences, than the "unencoded" approach of always using two bits per base.
You can walk through the coding tree or table to build an encoded search string, which will usually be shorter than the unencoded search string.
You can apply the same family of exact search algorithms (e.g. Boyer-Moore) to locate this shorter, encoded search string.
As for a parallelized approach, split the encoded target string up into N chunks and run the search algorithm on each chunk, using the shortened, encoded search string. By keeping track of the bit offsets of each chunk, you should be able to generate match positions.
Overall, this compression approach would be useful if you plan on doing millions of searches on sequence data that won't change. You'd be searching fewer bits — potentially many fewer, in aggregate.
Boyer-More is a technique used to search for substrings in plain strings. The basic idea is that if your substring is, say, 10 characters long, you can look at the character at position 9 in the string to search. If that character is not part of your search string, you could simply start the search after that character. (If that character is, indeed, in your string, the Boyer-More algorithm use a look-up table to skip the optimal number of characters forward.)
It might be possible to reuse this idea for your packed representation of the genome string. After all, there are only 256 different bytes, so you could safely pre-calculate the skip-table.
The benefit of encoding the alphabet into bit fields is compactness: one byte holds the equivalent of four characters. This is similar to some of the optimizations Google performs searching for words.
This suggests four parallel executions, each with the (transformed) search string offset by one character (two bits). A quick-and-dirty approach might be to just look for the first or second byte of the search string and then check extra bytes before and after matching the rest of the string, masking off the ends if necessary. The first search is handily done by the x86 instruction scasb. Subsequent byte matches can build upon the register values with cmpb.
You could create a state machine. In this topic,
Fast algorithm to extract thousands of simple patterns out of large amounts of text
, I used [f]lex to create the state machine for me. It would require some hackery to use the 4 letter ( := two bit) alphabet, but it can be done using the same tables as generated by [f]lex. (you could even create your own fgetc() like function which extracts two bits at a time from the input stream, and keeps the other six bits for consecutive calls. Pushback will be a bit harder, but not undoable).
BTW: I seriously doubt if there is any gain in compressing the data to two bits per nucleotide, but that is a different matter.
Okay, given your parameters, the problem isn't that hard, just not one you'd approach like a traditional string search problem. It more resembles a database table-join problem, where the tables are much larger than RAM.
select a good rolling hash function aka buzhash. If you have billions of strings, you're looking for a hash with 64-bit values.
create a hash table based on each 127-element search string. The table in memory only needs to store (hash,string-id), not the whole strings.
scan your very large target string, computing the rolling hash and looking up each value of the hash in your table. Whenever there's a match, write the (string-id, target-offset) pair to a stream, possibly a file.
reread your target string and the pair stream, loading search strings as needed to compare them against the target at each offset.
I am assuming that loading all pattern strings into memory at once is prohibitive. There are ways to segment the hash table into something that is larger than RAM but not a traditional random-access hash file; if you're interested, search for "hybrid hash" and "grace hash", which are more common in the database world.
I don't know if it's worth your while, but your pair stream gives you the perfect predictive input to manage a cache of pattern strings -- Belady's classic VM page replacement algorithm.

Hash a byte string

I'm working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings into a structure in such a way that I can quickly check for their existence and retrieve them. I've been operating under the assumption that a hash table would be best suited for this purpose so my question will be pertaining to hash functions. However, if someone can suggest a better alternative to a hash table, I'm all ears.
All right. So the problem is that I can't come up with a good hashing key for these byte strings. Everything I think of either has a very uneven distribution, or is takes too long. Here is a list of the situation I'm working with:
All byte strings will be at least
two bytes in length.
The hash table will have a maximum size of 3839, and it is very likely it will fill.
Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
Otherwise, bytes in the string can be any value from 0 - 255 (I'm working with raw byte-data of any format).
I'm working with the C language in a UNIX environment. I'd prefer to stick with standard libraries, but it doesn't need to be portable to other OSs. (I.E. unistd.h is fine).
Security is of NO concern.
Speed is of a HIGH concern.
The size isn't of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory space could become an issue during the compression.
A trie is better suited to this kind of thing because it lets you store your symbols as a tree and quickly parse it to match values (or reject them).
And as a bonus, you don't need a hash at all. You're storing/retrieving/comparing the entire sequence at once, while still only holding a minimal amount of memory.
Edit: And as an additional bonus, with only a second parse, you can look up sequences that are "close" to your current sequence, so you can get rid of a sequence and use the previous one for both of them, with some internal notation to hold the differences. That will help you compress files better because:
smaller dictionary means smaller files, you have to write the dictionary to your file
smaller number of items can free up space to hold other, more rare sequences if you add a population cap and you hit it with a large file.

How do I choose a good magic number for my file format?

I am designing a binary file format from scratch, and I would like to include some magic bytes at the beginning so that it can be identified easily. How do I go about choosing which bytes? I am not aware of any central registry of magic numbers, so is it just a matter of picking something fairly random that isn't already identified by, say, the file command on a nearby UNIX box?
Stay away from super-short magic numbers. Just because you're designing a binary format doesn't mean you can't use a text string for identifier. Follow that by an EOF char, and as an added bonus people who cat or type your binary file won't get a mangled terminal.
There is no universally correct way. Best practices can be suggested, but these often situational. For example, if you're checking the integrity of volatile memory, which has an undefined initial state when power is applied, it may be beneficial to incorporate many 0s or 1s in a sequence (i.e. FFF0 00FF F000) which can stand out against random noise.
If the file is mostly binary, a popular choice is using a text encoding like ASCII which stands out among the binary data in a hex editor. For example, GIF uses GIF89a, FLAC uses fLaC. On the other hand, a plain text identifier may be falsely detected in a random text file, so invalid/control characters might be incorporated.
In general, it does not matter that much what they are, even a bunch of NULL bytes can be used for file detection. But ideally you want the longest unique identifier you can afford, and at minimum 4 bytes long. Any identifier under 4 bytes will show up more often in random data. The longer it is, the less likely it will ever be detected as a false positive. Some known examples are as long as 40 bytes. In a way, it's like a password.
Also, it doesn't have to be at offset 0. The file signature has conventionally been at offset zero, since it made sense to store it first if it will be processed first.
That said, a single file signature should not be the only line of defense. The actual parsing process itself should be able to verify integrity and weed out invalid files even if the signature matches. This can be done with additional file signatures, using length-sensitive data, value/range checking, and especially, hash/checksum values.

Resources