File compression and codes - C

I'm implementing a version of LZW. Let's say I start off with 10-bit codes and increase the code size whenever I run out of codes. For example, after 1024 codes I'll need 11 bits to represent the next one. The issue is in expressing the shift.
How do I tell the decoder that I've changed the code size? I thought about using 00 as a marker, but the program can't distinguish between 00 as an increment signal and 00 as just two instances of code zero.
Any suggestions?

You don't. You shift to a new size when the dictionary is full. The decoder's dictionary is built in sync with the encoder's dictionary, so they'll both be full at the same time, and the decoder will shift to the new size exactly when the encoder does.
The time you do have to send a code to signal a change is when you've filled the dictionary completely -- you've used all of the largest codes available. In that case, you generally want to continue using the dictionary until/unless the compression rate starts to drop, then clear the dictionary and start over. You do need to put in some marker to tell the decoder when that happens. Typically, you reserve the single largest code for this purpose, but any code you don't use for any other purpose will work.
Edit: as an aside, note that you normally want to start with codes exactly one bit larger than the input symbols, so if you're compressing 8-bit bytes, you want to start with 9-bit codes.

This is part of the LZW algorithm.
When decompressing you automatically build up the code dictionary again. When a new code exactly fills the current number of bits, the code size has to be increased.
For the details see Wikipedia.

You increase the number of bits when you create the code for 2^n - 1. So when you create code 1023, increase the bit size immediately. You can find a better description of this in the GIF compression scheme. Note that LZW was a patented scheme (which partly drove the creation of PNG); the patent expired back in 2003/2004.

Since the decoder builds the same table as the compressor, its table is full upon reaching the last element (1023 in your example), and as a consequence the decoder knows that the next code will be 11 bits.
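Putting these answers together, here is a minimal sketch of the width-bump rule, assuming 8-bit input symbols and hypothetical hooks called whenever a side adds a dictionary entry (constants and names are illustrative, not from the question; only one side runs in a given program):

    enum { MAX_BITS = 16 };

    unsigned code_bits = 9;      /* one bit wider than the 8-bit input */
    unsigned next_code = 258;    /* 256 literals + reserved clear/end codes */

    /* Encoder: widen as soon as the next free slot no longer fits. */
    void encoder_added_entry(void)
    {
        if (++next_code == (1u << code_bits) && code_bits < MAX_BITS)
            code_bits++;
    }

    /* Decoder: it runs one dictionary entry behind the encoder, so it
     * widens one code early, on creating code (1 << code_bits) - 1,
     * as described in the answers above. */
    void decoder_added_entry(void)
    {
        if (++next_code == (1u << code_bits) - 1 && code_bits < MAX_BITS)
            code_bits++;
    }

Because both sides apply the same deterministic rule to the same dictionary state, no size-change marker ever appears in the code stream.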

Related

Suggestions on how to make my compressor faster

I have some data which I'm compressing with a custom compressor; the compressed data is fine, but the compressor takes ages, and I'm seeking advice on how to make it faster. Let me give you all the details.
The input data is an array of bytes, at most 2^16 of them. Since the bytes in the array NEVER take values between 0x08 and 0x37 (inclusive), I decided I could exploit that for a simple LZ-like compression scheme. It replaces any sequence of 4 to 51 bytes that has already occurred at a "lower address" (i.e. closer to the array's beginning) with a single byte in the 0x08 to 0x37 range encoding the length, followed by two bytes holding the low and high byte of the index where the earlier sequence begins. That gives the decompressor the length (from the single byte) and the address of the original data, so it can rebuild the original array.
The compressor works this way: for every sequence length from 51 down to 4 bytes (I test longer sequences first), starting at every index (left to right), I check whether there's a correspondence 'left' of it, meaning at an index lower than the starting point I'm checking. If there is more than one match, I choose the match that 'saves' the most, which means the longest correspondence starting at the leftmost place.
The results are just perfect... but of course this is overkill - it's 4 nested 'for' loops with a memcmp() inside, and it takes minutes on a modern workstation to compress some 20 KB worth of data. That's why I'm seeking help.
Code is accessible here, if you need to sneak a peek. The 'job' starts at line 44.
Of course I can give you any detail you need; there's nothing secret here. (BTW, just in case... I'm not going to change the compression scheme, as this one works exactly as I need it!)
Thank you in advance.
A really obvious one is that you don't have to loop over the lengths; just find out what the longest match at that position is. That's not a "search": just keep extending the match by 1 for every matching pair of characters. When it stops, you have the longest match at that position (naturally you can force it to stop at 51 too, so it doesn't overrun).
Another typical trick is keeping a hashmap that maps keys of 3 or 4 characters to a list of offsets where they can be found. That way you only need to try positions that have some hope of resulting in a match; see the sketch below. This is also described in the DEFLATE RFC, all the way at the bottom.
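Here is a rough sketch of both ideas combined, sized to the question's constraints (at most 2^16 input bytes, matches of 4 to 51 bytes); the 3-byte key, the hash function, and the table size are illustrative choices, not taken from the original code:

    #include <stdint.h>

    #define HASH_BITS 15
    #define HASH_SIZE (1 << HASH_BITS)
    #define MIN_MATCH 4
    #define MAX_MATCH 51

    /* head[h] is the most recent position whose 3-byte prefix hashes to h;
     * prev_pos[] chains earlier positions. Initialise both to -1 first. */
    static int32_t head[HASH_SIZE];
    static int32_t prev_pos[1 << 16];

    static uint32_t hash3(const uint8_t *p)
    {
        /* Illustrative multiplicative hash of a 3-byte key. */
        return (((uint32_t)p[0] << 16 | p[1] << 8 | p[2]) * 2654435761u)
               >> (32 - HASH_BITS);
    }

    /* Longest earlier match for data[pos..]: extend byte by byte instead
     * of re-running memcmp for every candidate length. The caller must
     * ensure pos + 2 < len so the 3-byte key exists. */
    static int longest_match(const uint8_t *data, int len, int pos,
                             int *match_pos)
    {
        int best = 0;
        for (int32_t c = head[hash3(&data[pos])]; c >= 0; c = prev_pos[c]) {
            int n = 0;
            while (n < MAX_MATCH && pos + n < len &&
                   data[c + n] == data[pos + n])
                n++;
            /* '>=' lets older (leftmost) candidates win ties, matching
             * the question's preference. */
            if (n >= MIN_MATCH && n >= best) { best = n; *match_pos = c; }
        }
        return best;
    }

    /* Call once per position as the scan advances left to right. */
    static void insert_pos(const uint8_t *data, int pos)
    {
        uint32_t h = hash3(&data[pos]);
        prev_pos[pos] = head[h];
        head[h] = pos;
    }

With this, the outer scan visits each position once, only hashed candidates are tried, and the loop over candidate lengths disappears entirely.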

Using bsdiff with the source and target being the same file. Are other diff algorithms suitable for this?

I am trying to patch a file using bsdiff. My problem is that I have to do it with very little memory available. Because of this constraint, I need to modify the source file in place with the patch in order to obtain the target file.
The bsdiff basics are as follows:
header: not very relevant to this explanation.
Control data block:
mixlen -> number of bytes to be modified by combining the bytes from the source file with the bytes obtained from the diff block.
copylen -> number of bytes to be added. This is totally new extra data that needs to be added to our file; these bytes are read from the extra block.
seeklen -> number used to know where we have to read from in the source file.
Compressed control block.
Compressed diff block.
Compressed extra block.
Patch file format:
0 8 BSDIFF_CONFIG_MAGIC
8 8 X
16 8 Y
24 8 sizeof(newfile)
32 X control block
32+X Y diff block
32+X+Y ??? extra block
with control block a set of triples (x,y,z) meaning "add x bytes
from oldfile to x bytes from the diff block; copy y bytes from the
extra block; seek forwards in oldfile by z bytes".
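To make those triples concrete, here is a minimal sketch of a bspatch-style apply loop; read_ctrl/read_diff/read_extra are hypothetical helpers reading from the three decompressed blocks, and error handling is omitted:

    #include <stdint.h>

    int64_t read_ctrl(void);                   /* next control integer */
    void read_diff(uint8_t *dst, int64_t n);   /* n bytes of diff block */
    void read_extra(uint8_t *dst, int64_t n);  /* n bytes of extra block */

    void apply_patch(const uint8_t *oldbuf, uint8_t *newbuf, int64_t newsize)
    {
        int64_t oldpos = 0, newpos = 0;
        while (newpos < newsize) {
            int64_t x = read_ctrl(), y = read_ctrl(), z = read_ctrl();

            /* "add x bytes from oldfile to x bytes from the diff block" */
            read_diff(newbuf + newpos, x);
            for (int64_t i = 0; i < x; i++)
                newbuf[newpos + i] += oldbuf[oldpos + i];

            /* "copy y bytes from the extra block" */
            read_extra(newbuf + newpos + x, y);

            /* "seek forwards in oldfile by z bytes" (z may be negative
             * in practice) */
            oldpos += x + z;
            newpos += x + y;
        }
    }

Note how every "add" step reads oldbuf at oldpos, and the seeks can revisit earlier regions; once you overwrite the source in place, later triples read corrupted data, which matches the behavior described below.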
So the problem is that bsdiff assumes I always have the source file unmodified, so it uses it to read data that I have already modified (since I treat the source as the same file as the target). First I tried to reorder the modifications, but in some cases a modification affects memory that will be needed later by another modification. Maybe the algorithm is not suitable for what I want.
Does another algorithm exist that is suitable for this? Is there any implementation of BSDIFF or similar that does what I need?
Before going more in depth with bsdiff I did some research and found VCDIFF (used by Xdelta), but it also seems to have the same behavior. I haven't dug into the code, though, so I don't know yet whether it generates the patch in the same way bsdiff does.
Another point worth noting: I am trying to implement this in C.
Edited 04/10/2016:
I have tried to reorder the patch: with the addresses to modify ordered from smallest to largest, I thought I could handle the problem by buffering the original data until the next modification that required it had been applied. But it seems the patch order matters as well; maybe bsdiff modifies the same part of memory several times until it gets the right data. Any idea will be very welcome if someone knows more about this.
Best regards,
Iván
We cannot eliminate the dependency on the source data without impacting the compressed delta size. So you will need to have the source data unmodified to make BSDIFF work in the scenario you described.

LabVIEW 2009 holding onto data when I don't want it to

I'm new to LabVIEW but have been building a signal analyser code that takes the required data and prints it out to text files after the data has been taken. The problem I'm having is that when it makes a new file it holds on to the data from the previous run and prints that too, which is not what I want. I've attached the LabVIEW VI (ver. 2009); any help with this would be greatly appreciated.
Also, if someone knows a better way of RMS-ing the data after each iteration than my mess of shift registers, I'd be happy to see it.
frequency analyser (fixed).vi
To answer your main question: the part of the code that builds the string (the for loop with a shift register) stores the previous data each time you re-run the VI. What you need is to initialise the shift register with an empty string:
Also a couple of notes/suggestions:
You could avoid using shift registers in this case. Divide the DAQ part of the code into, say, 3 parts: acquire data in the first for loop (store it into an array), modify the array (you could then perhaps use the built-in RMS VI), and visualise it on the UI
Build the code in smaller chunks, use subVi's
Keep the code small, nice and tidy (check coding standards), add comments - this will really help you later
Since you asked for advice on the RMS functionality you used, I took a more detailed look at your code. I may be harsh, but it doesn't make sense (point by point):
You ask the end user for a number of runs, and then you subtract one. Why? I guess it's because of the read data before the for loop (remove that one).
The Frequency RMS function you use has support for averaging, with no limit on the number of averages. Specify the following configuration:
This will add RMS averaging to your output data, and you can drop all your own calculations with shift registers.
The following code is just plain wrong:
You only shift the data, without actually changing it. By incrementing the starting frequency you shift the FFT, so a signal that was detected at 55 Hz is now plotted at 56 Hz. To your end user this is misleading.
One thing you need to be aware of is that your code does not do continuous sampling: on each iteration of your for loop the data acquisition is started and stopped. You can verify this by plotting the t0's of the captured waveforms; you'll notice they don't start at a constant interval.
A better approach is to use the task created by the Express VI in the first iteration:
However, you should then change the acquisition mode to 'continuous samples':
Do not forget to close the task in the last iteration:
Instead of the shift register, you should work with an array which you empty before each run.

GIF LZW decompression hints?

I've read through numerous articles on GIF LZW decompression, but I'm still confused as to how it works and how to handle, in code, the more fiddly parts.
As I understand it, when I get to the byte stream in the GIF for the LZW compressed data, the stream tells me:
Minimum code size, AKA the number of bits the codes start off with.
Now, as I understand it, I have to either add one to this for the clear code, or add two for the clear code and the EOI code, but I'm confused as to which of these it is.
So say I have 3 colour codes (01, 10, 11), with the EOI code assumed (as 00): will the codes that follow the minimum code size (of 2) be 2 bits, or 3 bits to make room for the clear code? Or are the clear code and EOI code both already factored into the minimum size?
The second question is: what is the easiest way to read dynamically sized codes from a file? Reading an odd number of bits (3 bits, 12 bits, etc.) out of 8-bit bytes sounds like it could get messy and buggy.
To start with your second question: yes, you have to read the dynamically sized codes from an 8-bit byte stream. You have to keep track of the code size you are reading and of the number of unused bits left over from previous read operations (used to correctly merge in the next byte from the file).
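A minimal sketch of such a reader, assuming the LZW bytes have already been de-blocked (GIF's sub-block length prefixes stripped and the data concatenated) and using GIF's LSB-first bit-packing order:

    #include <stdint.h>

    /* Read one code of 'code_bits' bits from an LSB-first packed stream.
     * 'bitpos' is a running bit offset maintained by the caller. */
    static uint32_t read_code(const uint8_t *data, uint64_t *bitpos,
                              int code_bits)
    {
        uint32_t code = 0;
        for (int i = 0; i < code_bits; i++, (*bitpos)++) {
            if (data[*bitpos >> 3] & (1u << (*bitpos & 7)))
                code |= 1u << i;
        }
        return code;
    }

A per-bit loop like this is slow but hard to get wrong; once it works you can switch to refilling a wider accumulator a byte at a time and masking off code_bits bits per code.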
IIRC, for 8-bit data the minimum code size is 8 bits, which would give you a clear code of 256 (base 10) and an End Of Information code of 257. The first stored code is then 258.
I am not sure why you did not look up the source of one of the public-domain graphics libraries. I know I did not, because back in 1989 (!) there were no libraries to use and no internet with complete descriptions. I had to implement a decoder from an example executable (for MS-DOS, from CompuServe) that could display images, plus a few GIF files, so I know it can be done (but it is not the most efficient way of spending your time).

Hash a byte string

I'm working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings in a structure such that I can quickly check for their existence and retrieve them. I've been operating under the assumption that a hash table would be best suited for this purpose, so my question pertains to hash functions. However, if someone can suggest a better alternative to a hash table, I'm all ears.
All right. So the problem is that I can't come up with a good hash key for these byte strings. Everything I think of either has a very uneven distribution or takes too long. Here is a list of the constraints I'm working with:
All byte strings will be at least two bytes in length.
The hash table will have a maximum size of 3839, and it is very likely it will fill.
Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
Otherwise, bytes in the string can be any value from 0 - 255 (I'm working with raw byte-data of any format).
I'm working with the C language in a UNIX environment. I'd prefer to stick to standard libraries, but it doesn't need to be portable to other OSs (i.e. unistd.h is fine).
Security is of NO concern.
Speed is of a HIGH concern.
The size isn't of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory could become an issue during compression.
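For what it's worth, one standard fast choice that fits these constraints (not taken from the question; just a common example) is FNV-1a, reduced modulo the table size:

    #include <stdint.h>
    #include <stddef.h>

    #define TABLE_SIZE 3839   /* the maximum table size given above */

    /* FNV-1a: a simple, fast, non-cryptographic byte-string hash. */
    static uint32_t fnv1a(const uint8_t *s, size_t len)
    {
        uint32_t h = 2166136261u;            /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= s[i];
            h *= 16777619u;                  /* FNV prime */
        }
        return h;
    }

    static uint32_t bucket(const uint8_t *s, size_t len)
    {
        return fnv1a(s, len) % TABLE_SIZE;
    }

The multiply mixes every input byte across all 32 bits, so the biased high bit mentioned above does not noticeably skew the distribution.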
A trie is better suited to this kind of thing because it lets you store your symbols as a tree and quickly traverse it to match values (or reject them).
And as a bonus, you don't need a hash at all. You're storing/retrieving/comparing the entire sequence at once, while still only holding a minimal amount of memory.
Edit: and as an additional bonus, with only a second pass you can look up sequences that are "close" to your current sequence, so you can drop a sequence and use the previous one for both of them, with some internal notation to hold the differences. That will help you compress files better because:
a smaller dictionary means smaller files, since you have to write the dictionary to your file
a smaller number of items can free up space to hold other, rarer sequences if you add a population cap and hit it with a large file (see the trie sketch below).
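A minimal sketch of such a byte trie, assuming a full 256-way fanout per node (an illustrative layout that trades memory for lookup speed):

    #include <stdint.h>
    #include <stdlib.h>

    /* One node per stored prefix; child[b] follows byte value b. */
    typedef struct trie_node {
        struct trie_node *child[256];
        int code;                 /* symbol code, or -1 if none ends here */
    } trie_node;

    static trie_node *node_new(void)
    {
        trie_node *n = calloc(1, sizeof *n);
        if (n) n->code = -1;
        return n;
    }

    /* Associate a byte string with 'code'. */
    static void trie_insert(trie_node *root, const uint8_t *s, size_t len,
                            int code)
    {
        for (size_t i = 0; i < len; i++) {
            if (!root->child[s[i]])
                root->child[s[i]] = node_new();
            root = root->child[s[i]];
        }
        root->code = code;
    }

    /* Return the code for a byte string, or -1 if it was never stored. */
    static int trie_lookup(const trie_node *root, const uint8_t *s, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            root = root->child[s[i]];
            if (!root)
                return -1;
        }
        return root->code;
    }

For LZW-style dictionaries you rarely need the full lookup anyway: you hold the current node and extend it by one byte per input symbol, which is O(1) per step.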
