Only decompress a specific bzip2 block

Say I have a bzip2 file (over 5GB), and I want to decompress only block #x, because that is where my data is (the block is different every time). How would I do this?
I thought about making an index of where all the blocks are, then cut the block I need from the file and apply bzip2recover to it.
I also thought about compressing say 1MB at a time, then appending this to a file (and recording the location), and simply grabbing the file when I need it, but I'd rather keep the original bzip2 file intact.
My preferred language is Ruby, but any language's solution is fine by me (as long as I understand the principle).

There is a tool at http://bitbucket.org/james_taylor/seek-bzip2
Grab the source, compile it.
Run with
./seek-bzip2 32 < bzip_compressed.bz2
to test.
The only parameter is the bit offset of the desired block's header. You can find it by searching for the hex string "31 41 59 26 53 59" in the binary file. THIS WAS INCORRECT: a block start may not be aligned to a byte boundary, so you should search for the "31 41 59 26 53 59" hex string at every possible bit shift, as bzip2recover does - http://www.bzip.org/1.0.3/html/recovering.html
32 is the bit size of the "BZh1" stream header, where the "1" can be any digit from "1" to "9" (in classic bzip2) - it indicates the (uncompressed) block size in hundreds of kilobytes (not exact).
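To illustrate, here is a small standalone C sketch (my own, not part of seek-bzip2) of that bit-shifted search: it slides a 48-bit window over the file and prints the bit offset of every match of the block magic. For the first block this should print 32, matching the example above; the offsets are the numbers you would then feed to seek-bzip2.
/* sketch: print the bit offset of every bzip2 block magic in a file      */
/* note: chance matches inside compressed data are possible, just as with */
/* bzip2recover, so treat the output as candidate offsets                 */
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file.bz2\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    const uint64_t magic = 0x314159265359ULL;   /* block header magic ("pi")  */
    const uint64_t mask  = (1ULL << 48) - 1;    /* keep only the last 48 bits */
    uint64_t window = 0, bitpos = 0;
    int c;

    while ((c = fgetc(f)) != EOF) {
        for (int i = 7; i >= 0; i--) {          /* bzip2 streams are MSB-first */
            window = ((window << 1) | ((unsigned)(c >> i) & 1u)) & mask;
            bitpos++;
            if (bitpos >= 48 && window == magic)
                /* the magic started 48 bits before the current position,  */
                /* e.g. bit 32 for the first block, as in the example above */
                printf("block header at bit %llu\n",
                       (unsigned long long)(bitpos - 48));
        }
    }
    fclose(f);
    return 0;
}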

It's true that bzip-table is almost as slow as decompressing, but of course you only have to do it once, and you can store the output in some fashion to use as an index. This is perfect for what I need but may not be what everybody needs.
I did need a little help getting it to compile on Windows though.

Related

Understanding a compression algorithm

This idea has been floating around in my head for 3 years and I am having trouble applying it.
I wanted to create a compression algorithm that cuts the file size in half,
e.g. 8 MB to 4 MB.
With some searching and some programming experience, I understood the following.
Let's take a .txt file with the letters (a, b, c, d).
Using the IO.File.ReadAllBytes function, it gives the following array of bytes: (97 | 98 | 99 | 100), which according to https://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart is the decimal value of each letter.
What I thought about was how to mathematically cut this 4-member array down to a 2-member array by combining each pair of members into a single member. But you can't simply combine two numbers mathematically and then reverse the operation, because there are many possibilities, e.g.
80 | 90: 80+90=170, but there is no way to know that 170 was the result of 80+90 rather than 100+70 or 110+60.
Even if you could overcome that, you would be limited by the maximum value of a byte (255) for a single member of the array.
I understand that most compression algorithms work at the binary level and have been successful, but imagine cutting a file's size in half. I would like to hear your ideas on this.
Best Regards.
It's impossible to make a compression algorithm that makes every file shorter. The proof is called the "counting argument", and it's easy:
There are 256^L possible files of length L.
Let's say there are N(L) possible files with length < L.
If you do the math, you find that 256^L = 255*N(L)+1
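Spelling out that arithmetic (my addition, just the geometric series behind the identity):
N(L) \;=\; \sum_{i=0}^{L-1} 256^{\,i} \;=\; \frac{256^{L}-1}{256-1}
\qquad\Longrightarrow\qquad
256^{L} \;=\; 255\,N(L) + 1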
So. You obviously cannot compress every file of length L, because there just aren't enough shorter files to hold them uniquely. If you made a compressor that always shortened a file of length L, then MANY files would have to compress to the same shorter file, and of course you could only get one of them back on decompression.
In fact, there are more than 255 times as many files of length L as there are shorter files, so you can't even compress most files of length L. Only a small proportion can actually get shorter.
This is explained pretty well (again) in the comp.compression FAQ:
http://www.faqs.org/faqs/compression-faq/part1/section-8.html
EDIT: So maybe you're now wondering what this compression stuff is all about...
Well, the vast majority of those "all possible files of length L" are random garbage. Lossless data compression works by assigning shorter representations (the output files) to the files we actually use.
For example, Huffman encoding works character by character and uses fewer bits to write the most common characters. "e" occurs in text more often than "q", for example, so it might spend only 3 bits to write "e"s, but 7 bits to write "q"s. Bytes that hardly ever occur, like character 131, may be written with 9 or 10 bits - longer than the 8-bit bytes they came from. On average you can compress simple English text by almost half this way.
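As a toy illustration of that idea (my own sketch, with made-up frequencies for a hypothetical 5-symbol alphabet): the usual merge-the-two-lightest-nodes construction gives short codes to frequent symbols and long codes to rare ones. With the numbers below the average is about 1.95 bits per symbol, versus the 3 bits a fixed-length code for 5 symbols would need.
/* sketch: compute Huffman code lengths for 5 symbols with made-up frequencies */
#include <stdio.h>

#define N 5

int main(void)
{
    const char *sym[N]  = { "e", "t", "a", "q", "z" };   /* hypothetical alphabet   */
    double      freq[N] = { 0.45, 0.25, 0.20, 0.06, 0.04 }; /* made-up frequencies  */

    double weight[2 * N];   /* tree nodes: leaves first, then merged nodes */
    int    parent[2 * N];   /* parent index, -1 while the node is a root   */
    int    nodes = N;

    for (int i = 0; i < N; i++) { weight[i] = freq[i]; parent[i] = -1; }

    /* repeatedly merge the two lightest roots (naive O(n^2) Huffman) */
    while (nodes < 2 * N - 1) {
        int a = -1, b = -1;
        for (int i = 0; i < nodes; i++) {
            if (parent[i] != -1) continue;
            if (a == -1 || weight[i] < weight[a])      { b = a; a = i; }
            else if (b == -1 || weight[i] < weight[b]) { b = i; }
        }
        weight[nodes] = weight[a] + weight[b];
        parent[nodes] = -1;
        parent[a] = parent[b] = nodes;
        nodes++;
    }

    /* the code length of a leaf is its depth in the tree */
    for (int i = 0; i < N; i++) {
        int len = 0;
        for (int j = i; parent[j] != -1; j = parent[j]) len++;
        printf("symbol %s (freq %.2f) -> %d-bit code\n", sym[i], freq[i], len);
    }
    return 0;
}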
LZ and similar compressors (like PKZIP, etc) remember all the strings that occur in the file, and assign shorter encodings to strings that have already occurred, and longer encodings to strings that have not yet been seen. This works even better since it takes into account more information about the context of every character encoded. On average, it will take fewer bits to write "boy" than "boe", because "boy" occurs more often, even though "e" is more common than "y".
Since it's all about predicting the characteristics of the files you actually use, it's a bit of a black art, and different kinds of compressors work better or worse on different kinds of data -- that's why there are so many different algorithms.

What is an efficient way to map a file to array of structures?

I have a file in my system with 1024 rows,
student_db.txt
Name Subject-1 Subject-2 Subject-3
----- --------- --------- ---------
Alex 98 90 80
Bob 87 95 73
Mark 90 83 92
.... .. .. ..
.... .. .. ..
I have an array of structures in my C code,
typedef struct
{
char name[10];
int sub1;
int sub2;
int sub3;
} student_db;
student_db stud_db[1024];
What is an efficient way to read this file and map it to this array of structures?
If the number of entries were small, we could go for a normal fgets in a while loop with strtok, but here the number of entries is 1024.
So please suggest some efficient way to do this task.
Check the size of the file. I think it is at most 100 KByte. That is practically nothing; even a poorly written PHP script can read it in a few milliseconds. There is no slow way to load such a small quantity of data.
I assume loading this file is only the first step, and the real task will be to process this list (search, filter etc.). Instead of optimizing the loading speed, you should focus on the processing speed.
Premature optimization is evil. Write a working, unoptimized version first and see whether you are satisfied with the result and the speed. Most likely you will never need to optimize it.
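To make that concrete, a minimal unoptimized sketch (it assumes the two header lines shown in the question and whitespace-separated fields; fscanf is used instead of fgets/strtok purely for brevity):
/* sketch: read student_db.txt straight into the array of structures */
#include <stdio.h>

typedef struct
{
    char name[10];
    int sub1;
    int sub2;
    int sub3;
} student_db;

int main(void)
{
    static student_db stud_db[1024];
    FILE *fp = fopen("student_db.txt", "r");
    if (fp == NULL) { perror("student_db.txt"); return 1; }

    char line[256];
    /* skip the "Name Subject-1 ..." and "----- ..." header lines */
    if (!fgets(line, sizeof line, fp) || !fgets(line, sizeof line, fp)) {
        fclose(fp);
        return 1;
    }

    int n = 0;
    while (n < 1024 &&
           fscanf(fp, "%9s %d %d %d", stud_db[n].name,
                  &stud_db[n].sub1, &stud_db[n].sub2, &stud_db[n].sub3) == 4)
        n++;
    fclose(fp);

    printf("loaded %d records\n", n);
    if (n > 0)
        printf("first: %s %d %d %d\n", stud_db[0].name,
               stud_db[0].sub1, stud_db[0].sub2, stud_db[0].sub3);
    return 0;
}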
You can try to store the data in binary form. The way the file is written now is as text, so everything (numbers, strings and the rest) is represented as characters.
If you store things in binary, you store numbers, strings and other data in their in-memory representation. So when you write an integer to the file and later open it in a text editor, you will see whatever characters happen to correspond (e.g. in ASCII) to the bytes you wrote, not the digits of the number.
You use the standard functions to store things in binary form:
FILE *someFile = fopen("test.bin", "wb");
w stands for write and b stands for binary format. The function to write with is:
fwrite(&someInt, sizeof(someInt), 1, someFile);
where someInt is the variable you want to write (the function takes a pointer),
sizeof(someInt) is the size of that int element,
1 is the number of elements (use a larger count if the first argument is an array), and someFile is the file where you want to store your data.
This way the size of the file can be reduced, so loading it will also be faster. It is also simpler to process the data in the file.
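A short sketch of that binary round trip (my own example; note that the raw struct layout is only guaranteed to match when the same machine and compiler both write and read the file):
/* sketch: dump the whole array with one fwrite, load it back with one fread */
#include <stdio.h>

typedef struct
{
    char name[10];
    int sub1;
    int sub2;
    int sub3;
} student_db;

int main(void)
{
    static student_db stud_db[1024];
    /* ... fill stud_db from the text file as before ... */

    /* write the whole array in one call */
    FILE *out = fopen("student_db.bin", "wb");
    if (out == NULL) { perror("student_db.bin"); return 1; }
    fwrite(stud_db, sizeof(student_db), 1024, out);
    fclose(out);

    /* read it back just as directly */
    static student_db loaded[1024];
    FILE *in = fopen("student_db.bin", "rb");
    if (in == NULL) { perror("student_db.bin"); return 1; }
    size_t got = fread(loaded, sizeof(student_db), 1024, in);
    fclose(in);

    printf("read back %zu records\n", got);
    return 0;
}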

Cut a file in parts (without using more space) [duplicate]

I have a file, say 100MB in size. I need to split it into (for example) 4 different parts.
Let's say first file from 0-20MB, second 20-60MB, third 60-70MB and last 70-100MB.
But I do not want to do a safe split - into 4 output files. I would like to do it in place. So the output files should use the same place on the hard disk that is occupied by this one source file, and literally split it, without making a copy (so at the moment of the split, we lose the original file).
In other words, the input file is the output files.
Is this possible, and if yes, how?
I was thinking maybe to manually add a record to the filesystem, saying that a file A starts here and ends here (in the middle of another file), do it 4 times and afterwards remove the original file. But for that I would probably need administrator privileges, and it probably wouldn't be safe or healthy for the filesystem.
Programming language doesn't matter, I'm just interested if it would be possible.
The idea is not as mad as some comments paint it. It would certainly be possible to have a file system API that supports such reinterpreting operations (to be sure, the desired split is probably not exactly aligned to block boundaries, but you could reallocate just those few boundary blocks and still save a lot of temporary space).
None of the common file system abstraction layers support this; but recall that they don't even support something as reasonable as "insert mode" (which would rewrite only one or two blocks when you insert something into the middle of a file, instead of all blocks), only an overwrite and an append mode. The reasons for that are largely historical, but the current model is so entrenched that it is unlikely a richer API will become common any time soon.
As I explain in this question on SuperUser, you can achieve this using the technique outlined by Tom Zych in his comment.
bigfile="mybigfile-100Mb"
chunkprefix="chunk_"
# Chunk offsets
OneMegabyte=1048576
chunkoffsets=(0 $((OneMegabyte*20)) $((OneMegabyte*60)) $((OneMegabyte*70)))
currentchunk=$((${#chunkoffsets[@]}-1))
while [ $currentchunk -ge 0 ]; do
# Print current chunk number, so we know it is still running.
echo -n "$currentchunk "
offset=${chunkoffsets[$currentchunk]}
# Copy the end of $bigfile (from $offset on) to a new chunk file
tail -c +$((offset+1)) "$bigfile" > "$chunkprefix$currentchunk"
# Chop that end off $bigfile
truncate -s $offset "$bigfile"
currentchunk=$((currentchunk-1))
done
You need to give the script the starting position of each chunk (the offset in bytes; zero means a chunk starting at bigfile's first byte), in ascending order, as on the fifth line.
If necessary, automate it using seq: the following command fills chunkoffsets with one chunk at 0, then one starting at 100kB, then one for every megabyte over the range 1-10MB (note the -1 on the last parameter, so the end is excluded), then one chunk every two megabytes over the range 10-20MB.
OneKilobyte=1024
OneMegabyte=$((1024*OneKilobyte))
chunkoffsets=(0 $((100*OneKilobyte)) $(seq $OneMegabyte $OneMegabyte $((10*OneMegabyte-1))) $(seq $((10*OneMegabyte-1)) $((2*OneMegabyte)) $((20*OneMegabyte-1))))
To see which chunk offsets you have set:
for offset in "${chunkoffsets[@]}"; do echo "$offset"; done
0
102400
1048576
2097152
3145728
4194304
5242880
6291456
7340032
8388608
9437184
10485759
12582911
14680063
16777215
18874367
20971519
This technique has the drawback that it needs free space at least the size of the largest chunk (you can mitigate that by making smaller chunks and concatenating them somewhere else, though). Also, it copies all the data, so it's nowhere near instant.
As to the fact that some hardware video recorders (PVRs) manage to split videos within seconds, they probably only store a list of offsets for each video (a.k.a. chapters), and display these as independent videos in their user interface.

File compression and codes

I'm implementing a version of LZW. Let's say I start off with 10-bit codes and increase the width whenever I max out on codes. For example, after 1024 codes I'll need 11 bits to represent the 1025th. The issue is in expressing the switch.
How do I tell the decoder that I've changed the code size? I thought about using 00, but the program can't distinguish between 00 as an increment marker and 00 as just two instances of code zero.
Any suggestions?
You don't. You shift to a new size when the dictionary is full. The decoder's dictionary is built synchronized with the encoder's dictionary, so they'll both be full at the same time, and the decoder will shift to the new size exactly when the encoder does.
The time you have to send a code to signal a change is when you've filled the dictionary completely -- you've used all of the largest codes available. In this case, you generally want to continue using the dictionary until/unless the compression rate starts to drop, then clear the dictionary and start over. You do need to put some marker in to tell when that happens. Typically, you reserve the single largest code for this purpose, but any code you don't use for any other purpose will work.
Edit: as an aside, note that you normally want to start with codes exactly one bit larger than the codes for the input, so if you're compressing 8-bit bytes, you want to start with 9 bit codes.
This is part of the LZW algorithm.
When decompressing you automatically build up the code dictionary again. When a new code exactly fills the current number of bits, the code size has to be increased.
For the details see Wikipedia.
You increase the number of bits when you create the code for 2^n - 1. So when you create the code 1023, increase the bit size immediately. You can get a better description from the GIF compression scheme. Note that this was a patented scheme (which partly caused the creation of PNG). The patent has probably expired by now.
Since the decoder builds the same table as the compressor, its table is full on reaching the last element (so 1023 in your example), and as a consequence, the decoder knows that the next element will be 11 bits.
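A minimal sketch of that synchronized width bump (my own code; it assumes 8-bit input symbols, 9-bit starting codes as suggested above, and sequentially assigned dictionary slots - the exact point of the switch differs by one entry between implementations such as GIF, so treat the rule as illustrative):
/* sketch: both encoder and decoder run this same rule while adding   */
/* dictionary entries, so they widen at the same moment and no        */
/* explicit "switch width" marker has to be sent                      */
#include <stdio.h>

#define START_BITS 9   /* one bit wider than the 8-bit input symbols */
#define MAX_BITS   12

int main(void)
{
    int code_bits = START_BITS;
    int next_code = 256;   /* first free dictionary slot after the literals */

    /* simulate handing out entries; a real codec would stop adding    */
    /* (or reset) once the table is full, as described in the answers  */
    for (int i = 0; i < 4000; i++) {
        int created = next_code++;            /* a new string gets this code */
        if (created == (1 << code_bits) - 1 && code_bits < MAX_BITS) {
            printf("created code %d, the largest %d-bit value -> "
                   "switch to %d-bit codes\n",
                   created, code_bits, code_bits + 1);
            code_bits++;
        }
    }
    return 0;
}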

Resources