I have an array of 10345 bytes, I want to compress the array and then decompress, kindly suggest me the compression algorithm which can reduce the size of array. I am using c language, and the array is of unsigned char type.
Rephrased: Can someone suggest a general-purpose compression algorithm (or library) for C/C++?
zlib
Lossless Compression Algorithms
This post's a community wiki. I don't want any points for this -- I've already voted to close the question.
The number of bytes to compress has very little to do with choice of compression algorithm, although it does affect the implementation. For example, when you have fewer than 2^15 bytes to compress, if you are using ZLib, you will want to specify a compression-level of less than 15. The compression-level in Zlib (one of the two such parameters) controls the depth of the "look-back" dictionary. If your file is shorter than 16k bytes, then a 32k look-back dictionary will never half-fill; in that case, use one less bit of pointer into the look-back for a 1/15th edge on the compression compared to setting ZLib to "max."
The content of the data is what matters. If you are sending images with mostly background, then you might want Run Length Encoding (used by Windows .BMP, for example).
If you are sending mostly English text, than you wish you could use something like ZLib, which implements Huffman encoding and LZW-style look-back dictionary compression.
If your data has been encrypted, then attempting to compress it will not succeed.
If your data is a particular type of signal, and you can tolerate some loss of detail, then you may want to transform it into frequency space and send only the principal components. (e.g., JPEG, MP3)
Related
Can someone please explain the difference between the LZSS and the LZ77 algorithm. I've been looking online for a couple of hours but I couldn't find the difference. I found the LZ77 algorithm and I understood its implementation.
But, how does LZSS differ from LZ77? Let's say if we have an string "abracadabra" how is LZSS gonna compress it differently from LZ77? Is there a C pseudo-code that I could follow?
Thank you for your time!
Unfortunately, both terms LZ77 and LZSS tend to be used very loosely, so they do not really imply very specific algorithms. When people say that they compressed their data using an LZ77 algorithm, they usually mean that they implemented a dictionary based compression scheme, where a fixed-size window into the recently decompressed data serves as the dictionary and some words/phrases during the compression are replaced by references to previously seen words/phrases within the window.
Let us consider the input data in the form of the word
abracadabra
and assume that window can be as large as the input data. Then we can represent "abracadabra" as
abracad(-7,4)
Here we assume that letters are copied as is, and that the meaning of two numbers in brackets is "go 7 positions back from where we are now and copy 4 symbols from there", which reproduces "abra".
This is the basic idea of any LZ77 compressor. Now, the devil is in the detail. Note that the original word "abracadabra" contains 11 letters, so assuming ASCII representation the word, it is 11 bytes long. Our new representation contains 13 symbols, so if we assume the same ASCII representation, we just expanded the original message, instead of compressing it. One can prove that this can sometimes happen to any compressor, no matter how good it actually is.
So, the compression efficiency depends on the format in which you store the information about uncompressed letters and back references. The original paper where the LZ77 algorithm was first described (Ziv, J. & Lempel, A. (1977) A universal algorithm for sequential data compression. IEEE Transactions on information theory, 23(3), 337-343) uses the format that can be loosely described here as
(0,0,a)(0,0,b)(0,0,r)(0,1,c)(0,1,d)(0,3,a)
So, the compressed data is the sequence of groups of three items: the absolute (not relative!) position in the buffer to copy from, the length of the dictionary match (0 means no match was found) and the letter that follows the match. Since most letters did not match anything in the dictionary, you can see that this is not a particularly efficient format for anything but very compressible data.
This inefficiency may well be the reason why the original form of LZ77 has not been used in any practical compressors.
SS in the "LZSS" refers to a paper that was trying to generalize the ideas of dictionary compression with the sliding window (Storer, J. A. & Szymanski, T. G. (1982). Data compression via textual substitution. Journal of the ACM, 29(4), 928-951). The paper itself looks at several variations of dictionary compression schemes with windows, so once again, you will not find an explicit "algorithm" in it. However, the term LZSS is used by most people to describe the dictionary compression scheme with flag bits, e.g. describing "abracadabra" as
|0a|0b|0r|0a|0c|0a|0d|1-7,4|
where I added vertical lines purely for clarity. In this case numbers 0 and 1 are actually prefix bits, not bytes. Prefix bit 0 says "copy the next byte into the output as is". Prefix bit 1 says "next follows the information for copying a match". Nothing else is really specific, term LZSS is used to say something specific about the use of these prefix signal bits. Hopefully you can see how this can be done compactly, in fact much more efficiently than the format described in LZ77 paper.
I am currently trying to write a compressor and decompressor with the same purpose as the RFC Deflate specification.
I'm not able to understand the difference between how blocks are composed in the compression with fixed tables and dynamic tables. The file is processed by LZ77 generating (distance, length) + literal.
How do I know the type of block?
Do I have to compress this data?
Given that I use a fixed compression and don't have to send the tables, how would the encoder know how to encode data?
Moreover, do I have to send data before the actual compression executes?
I am confused on the difference between fixed tables and the table we send in the dynamic mode, and how the two blocks use them to encode data.
I'm currently reading Data Compression: The Complete Reference. Any advice will be helpful.
Since you are trying to compress, you would pick the smaller of the two. zlib's deflate computes what the size of a fixed block, a dynamic block, and a stored block would be, and emits the smallest of the three.
If you are encoding a fixed block, you encode using the fixed code for literal/lengths and distances. This code is provided in the RFC.
my c linux based program inputs are:
char *in_str, char *find_str, char *replacing_str
the in_str is a compressed data (gzip).
the program needs to find for the find_str within the uncompressed input data, replace it with replacing_str, and then to recompress the data.
the trivial way to do so is by using one of the many available gzip compress/uncompress libraries to uncompress the data, manipulate the uncompressed data, and then to recompress the output. However i need to make it as efficient as possible (it is a RT program).
i wonder if it is more efficient to use an on-the-fly library (e.g. zlibc) approach or simply do the operation as described above.
maybe it is important to mention that:
the find_str and replacing_str strings are a small portion of the data
their lengths are not equal
the find_str supposed to appear about 4 or 5 times
the uncompressed data len is ~2K - 6K bytes
does anyone familiar with an efficient way to implement this?
Thanks
You are going to have to decompress no matter what, in order to search for the strings. (You might be able to get away with doing that only once and building an index. However that might be much larger than the uncompressed data, so you might as well just store it uncompressed instead.)
You can avoid recompressing all of it by preparing the gzip file ahead of time to be compressed in smaller historyless units using, for example, the Z_FULL_FLUSH option of zlib. This will reduce compression slightly depending on how often you do it, but will speed up building the output greatly if only one of many blocks need to be recompressed.
I've been doing some reading on file formats and I'm very interested in them. I'm wondering what the process is to create a format. For example, a .jpeg, or .gif, or an audio format. What programming language would you use (if you use a programming language at all)?
The site warned me that this question might be closed, but that's just a risk I'll take in the pursuit of knowledge. :)
what the process is to create a format. For example, a .jpeg, or .gif, or an audio format.
Step 1. Decide what data is going to be in the file.
Step 2. Design how to represent that data in the file.
Step 3. Write it down so other people can understand it.
That's it. A file format is just an idea. Properly, it's an "agreement". Nothing more.
Everyone agrees to put the given information in the given format.
What programming language would you use (if you use a programming language at all)?
All programming languages that can do I/O can have file formats. Some have limitations on which file formats they can handle. Some languages don't handle low-level bytes as well as others.
But a "format" is not an "implementation".
The format is a concept. The implementation is -- well -- an implementation.
You do not need a programming language to write the specification for a file format, although a word processor might prove to be a handy tool.
Basically, you need to decide how the information of the file is to be stored as a sequence of bits. This might be trivial, or it might be exceedingly difficult. As a trivial example, a very primitive bitmap image format could start with one unsigned 32-bit integer representing the width of the bitmap, and then one more such integer representing the height of the bitmap. Then you could decide to simply write out the colour of the pixels sequentially, left-to-right and top-to-bottom (row 1 of pixels, row 2 of pixels, ...), using 24-bits per pixel, on the form 8 bits for red + 8 bits for green + 8 bits for blue. For instance, a 8×8 bitmap consisting of alternating blue and red pixels would be stored as
00000008000000080000FFFF00000000FFFF0000...
In a less trivial example, it really depends on the data you wish to save. Typically you would define a lot of records/structures, such as BITMAPINFOHEADER, and specify in what order they should come, how they should be nestled, and you might need to write a lot of indicies and look-up tables. Myself I have written quite a few file formats, most recently the ASD (AlgoSim Data) file format used to save AlgoSim structures. Such files consists of a number of records (maybe nestled), look-up tables, magic words (indicating structure begin, structures end, etc.) and strings in a custom-defined format. One typical thing that often simplifies the file format is that the records contain data about their size, and the sizes of the custom data parts following the record (in case the record is some sort of a header, preceeding data in a custom format, e.g. pixel colours or sound samples).
If you havn't been working with file formats before, I would suggest that you learn a very simple format, such as the Windows 3 Bitmap format, and write your own BMP encoder/decoder, i.e. programs that creates and reads BMP files (from scratch), and displays the read BMP files. Then you now the basic ideas.
Fundamentally, files only exist to store information that needs to be loaded back in the future, either by the same program or a different one. A really good file format is designed so that:
Any programming language can be used to read or write it.
The information a program would most likely need from the file can be accessed quickly and efficiently.
The format can be extended and expanded in the future, without breaking backwards compatibility.
The format should accommodate any special requirements (e.g. error resiliency, compression, encoding, etc.) present in the domain in which the file will be used
You are most certainly interested in looking into Protocol Buffers and Thrift. These tools provide a modern, principled way of designing forwards and backward compatible file formats.
I'm working on a C library that reads tag information from music files. I've already got ID3v2 taken care of, but I can't figure out how Ogg files are structured.
I opened a .ogg file in a hexeditor and I could find the tag data because that was all human readable. But everything from the beginning of the file to the tag data looked like garbage. How is this data encoded?
I don't need any help in the actual code, I just need help visualizing what a Ogg header looks like and what encoding it uses so I that I can read it. I'd like to use a non-hacky approach to reading Ogg files.
I've been looking at the Flac format, which has been helpful.
The Flac file I'm looking at has about 350 bytes between the "fLac" identifier and the human readable Comments section, and none of it is human readable in my hex editor, so I'm sure there has to be something important in there.
I'm using Linux, and I have no intention of porting to Windows or OS X. So if I need to use a glibc only function to convert the encoding, I'm fine with that.
The Ogg file format is documented here. There is a very nice graphical visualization as you requested with a detailed written description.
You may also want to look at libogg which is a open source BSD-licensed library for reading and writing Ogg files.
As is described in the link you provided, the following metadata blocks can occur between the "fLaC" marker and the VORBIS_COMMENT metadata block.
STREAMINFO: This block has information about the whole stream, like sample rate, number of channels, total number of samples, etc. It must be present as the first metadata block in the stream. Other metadata blocks may follow, and ones that the decoder doesn't understand, it will skip.
APPLICATION: This block is for use by third-party applications. The only mandatory field is a 32-bit identifier. This ID is granted upon request to an application by the FLAC maintainers. The remainder is of the block is defined by the registered application. Visit the registration page if you would like to register an ID for your application with FLAC.
PADDING: This block allows for an arbitrary amount of padding. The contents of a PADDING block have no meaning. This block is useful when it is known that metadata will be edited after encoding; the user can instruct the encoder to reserve a PADDING block of sufficient size so that when metadata is added, it will simply overwrite the padding (which is relatively quick) instead of having to insert it into the right place in the existing file (which would normally require rewriting the entire file).
SEEKTABLE: This is an optional block for storing seek points. It is possible to seek to any given sample in a FLAC stream without a seek table, but the delay can be unpredictable since the bitrate may vary widely within a stream. By adding seek points to a stream, this delay can be significantly reduced. Each seek point takes 18 bytes, so 1% resolution within a stream adds less than 2k. There can be only one SEEKTABLE in a stream, but the table can have any number of seek points. There is also a special 'placeholder' seekpoint which will be ignored by decoders but which can be used to reserve space for future seek point insertion.
Just after the above description, there's also the specification of the format of each of those blocks. The link also says
All numbers used in a FLAC bitstream are integers; there are no floating-point representations. All numbers are big-endian coded. All numbers are unsigned unless otherwise specified.
So, what are you missing? You say
I'd like a non-hacky approach to reading Ogg files.
Why re-write a library to do that when they already exist?