Can I make zlib fully inflate a known number of bytes from the beginning?

I have a data stream I want to compress and decompress with zlib. The stream consists of N sequential blocks of varying size. The data are similar, so I want to concatenate the blocks to achieve better compression and keep a separate index of sizes to split them apart later. But I don't always have to decompress all of them, only up to M blocks from the beginning. Can I do this? I.e. can I tell zlib to keep decompressing the stream until I get 123456 bytes' worth of decompressed data? Is it as simple as telling zlib the size of the output buffer and waiting for inflate() to return Z_OK, or do I need to specify one of the FLUSH constants?
Usage scenario: This can be a set of updates for a database, the most recent is packed first, the oldest last. Assume I have updates from 0 (no db) to 5. I pack them in reversed order: 5, 4, 3, 2, 1, 0. When the database does not exist, I extract them all and process from the last one to the first, fully creating the database and applying all the updates. But if the database is already there and is at v3, I only extract and apply updates 5 and 4, i.e. the first two blocks. I can, of course, pack the blocks separately, but if I concatenate them I'll get better compression.

Yes. Just provide an output buffer of the desired size, and make compressed data available as input to inflate() until it fills the output.
Note that to decompress block k, you will need to first decompress blocks 1..k-1.
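For illustration, here is a minimal C sketch of that loop (my own code, not part of the answer; the function name inflate_prefix and the 16 KB input chunk size are arbitrary, and error handling is trimmed). It keeps feeding compressed data to inflate() until the output buffer, sized to the prefix you want, is full:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 16384

/* Decompress at most target bytes of the stream in src into out.
   Returns the number of decompressed bytes actually produced.
   (Assumes target fits in a uInt, for brevity.) */
size_t inflate_prefix(FILE *src, unsigned char *out, size_t target)
{
    z_stream strm;
    unsigned char in[CHUNK];
    size_t have;
    int ret;

    memset(&strm, 0, sizeof(strm));          /* zalloc/zfree/opaque = Z_NULL */
    if (inflateInit(&strm) != Z_OK)
        return 0;

    strm.next_out = out;
    strm.avail_out = (uInt)target;

    while (strm.avail_out > 0) {
        if (strm.avail_in == 0) {
            strm.avail_in = (uInt)fread(in, 1, CHUNK, src);
            if (strm.avail_in == 0)
                break;                        /* ran out of compressed input */
            strm.next_in = in;
        }
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret == Z_STREAM_END)
            break;                            /* whole stream ended early */
        if (ret != Z_OK && ret != Z_BUF_ERROR)
            break;                            /* corrupt or truncated data */
    }
    have = target - strm.avail_out;
    inflateEnd(&strm);
    return have;
}

Once it returns, out holds the decompressed prefix and the rest of the compressed input can simply be ignored.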

Related

Is Minecraft missing zlib uncompressed size in its chunk/region data?

Info on minecraft's region files
Minecraft's region files are stored in three sections, the first two giving information about where the chunks are stored and about the chunks themselves. In the final section, each chunk is given as a 4-byte length, a byte for the type of compression it uses (almost always zlib, RFC 1950), and then the compressed chunk data itself.
Here's more (probably better) information: https://minecraft.gamepedia.com/Region_file_format
The problem
I have a program that successfully loads chunk data. However, I'm not able to find how big the chunks will be when decompressed, and so I just use a maximum amount it could take when allocating space.
In the player data files, they do give the size that it takes when decompressed, and (I think) it uses the same type of compression.
The end of a player.dat file giving the size of the decompressed data (in little-endian):
This is the start of the chunk data, with the first 4 bytes giving how many bytes are in the following compressed data:
Mystery data
However, if I look where the compressed data specifically "ends", there's still a lot of data after it. This data doesn't seem to have a use, but if I try to decompress any of it with the rest of the chunk, I get an error.
Highlighted chunk data, and unhighlighted mystery data:
Missing decompressed size (header?)
And there's no decompressed size (or header? I could be wrong here) given.
The final size of this example chunk is 32,562 bytes, and this number (or any close neighbour) is nowhere to be found within the chunk data or mystery data. (Checked both big-endian and little-endian.)
Decompressed data terminating at index 32562, (Visual Studio locals watch):
Final Questions
Is there something I'm missing? Is this compression actually different from the player data compression? What's the mystery data? And am I stuck loading in 1<<20 bytes every time I want to load a chunk from a region file?
Thank you for any answers or suggestions
Files used
Isolated chunk data: https://drive.google.com/file/d/1n3Ix8V8DAgR9v0rkUCXMUuW4LJjT1L8B/view?usp=sharing
Full region data: https://drive.google.com/file/d/15aVdyyKazySaw9ZpXATR4dyvhVtrL6ZW/view?usp=sharing
(Not linking player data for possible security reasons)
In the region data, the chunk data starts at index 1208320 (or 0x127000)
The format information you linked is quite helpful. Have you read it?
In there it says: "The remainder of the file consists of data for up to 1024 chunks, interspersed with unused space." Furthermore, "Minecraft always pads the last chunk's data to be a multiple-of-4096B in length" (Italics mine.) Everything is in multiples of 4K, so the end of every chunk is padded to the next 4K boundary.
So your "mystery" data is not a mystery at all, as it is entirely expected per the format documentation. That data is simply junk to be ignored.
Note, from the documentation, that the data "length" in the first four bytes of the chunk is actually one more than the number of bytes of data in the chunk (following the five-byte header).
Also from the documentation, there is indeed no uncompressed size provided in the format.
zlib was designed for streaming data, where you don't know ahead of time how much there will be. You can use inflate() to decompress into whatever buffer size you like. If there's not enough room to finish, you can either do something with that data and then repeat into the same buffer, or you can grow the buffer with realloc() in C, or the equivalent for whatever language you're using. (Not noted in the question or tags.)
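For completeness, here is a hedged sketch of that grow-the-buffer approach in C (my own illustration; the name inflate_all and the 64 KB starting size are arbitrary choices, nothing Minecraft-specific):

#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Inflate srclen bytes of zlib data from src, growing the output buffer
   with realloc() whenever it fills. Returns a malloc'ed buffer (caller
   frees) and stores the decompressed size in *outlen, or NULL on error. */
unsigned char *inflate_all(const unsigned char *src, size_t srclen, size_t *outlen)
{
    size_t cap = 1 << 16;                 /* initial guess; grows as needed */
    unsigned char *out = malloc(cap);
    z_stream strm;
    int ret;

    if (out == NULL)
        return NULL;
    memset(&strm, 0, sizeof(strm));
    strm.next_in = (Bytef *)src;
    strm.avail_in = (uInt)srclen;
    if (inflateInit(&strm) != Z_OK) {
        free(out);
        return NULL;
    }
    do {
        strm.next_out = out + strm.total_out;
        strm.avail_out = (uInt)(cap - strm.total_out);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret == Z_OK && strm.avail_out == 0) {
            unsigned char *tmp = realloc(out, cap * 2);   /* out of room: double it */
            if (tmp == NULL) { ret = Z_MEM_ERROR; break; }
            out = tmp;
            cap *= 2;
        }
    } while (ret == Z_OK);
    inflateEnd(&strm);
    if (ret != Z_STREAM_END) {            /* data error, truncated stream, or OOM */
        free(out);
        return NULL;
    }
    *outlen = strm.total_out;
    return out;
}

With the chunk's compressed length known from its header, only the output side needs to grow.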

Zlib minimum deflate size

I'm trying to figure out if there's a way to calculate a minimum required size for an output buffer, based on the size of the input buffer.
This question is similar to zlib, deflate: How much memory to allocate?, but not the same. I am asking about each chunk in isolation, rather than the entire stream.
So suppose we have two buffers, INPUT and OUTPUT, and a BUFFER_SIZE which is, say, 4096 bytes. (Just a convenient number, no particular reason I chose this size.)
If I deflate using:
deflate(stream, Z_PARTIAL_FLUSH)
so that each chunk is compressed, and immediately flushed to the output buffer, is there a way I can guarantee I'll have enough storage in the output buffer without needing to reallocate?
Superficially, we'd assume that the DEFLATED data will always be smaller than the uncompressed input data (assuming we use a compression level that is greater than 0.)
Of course, that's not always the case - especially for small values. For example, if we deflate a single byte, the deflated data will obviously be larger than the uncompressed data, due to the overhead of things like headers and dictionaries in the LZW stream.
Thinking about how LZW works, it would seem that if our input data is at least 256 bytes (meaning that, worst case, every single byte is different and we can't really compress anything), we should realize that an input size of LESS than 256 bytes plus zlib headers could potentially require a LARGER output buffer.
But, generally, realworld applications aren't going to be compressing small sizes like that. So assuming an input/output buffer of something more like 4K, is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers that will guarantee that the output data (the compressed stream) will be smaller than the input data? Or is there always some pathological case that can cause the output stream to be larger than the input stream, regardless of size?
Though I can't quite make heads or tails out of your question, I can comment on parts of the question in isolation.
is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?
Absolutely not. It will always be possible for the compressed output to be larger than some input. Otherwise you wouldn't be able to compress other input.
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
Almost no overhead. We're talking a fraction of a percent larger than the input buffer for reasonable sizes.
By the way, deflateBound() provides a bound on the size of the entire output stream as a function of the size of the entire input stream. It can't help you when you are in the middle of a bunch of deflate() calls with incomplete input and insufficient output space. For example, you may still have deflate output pending and delivered by the next deflate() call, without providing any new input at all. Then the expansion ratio is infinite for that isolated call.
due to the overhead of things like headers and dictionaries in the LZW stream.
deflate is not LZW. The approach it uses is called LZ77. It is very different from LZW, which is now obsolete. There are no "dictionaries" stored in compressed deflate data. The "dictionary" is simply the uncompressed data that precedes the data currently being compressed or decompressed.
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers ...
The whole idea behind the zlib interface is for you to not have to worry about what will fit in the buffers. You just keep calling deflate() or inflate() with more input data and more output space until you're done, and all will be well. It does not matter if you need to make more than one call to consume one buffer of input, or more than one call to fill one buffer of output. You just have loops to make more calls, provide more input when needed, and disposition the output when needed and provide fresh output space.
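That pattern is essentially what zlib's own zpipe.c example spells out. A condensed sketch (my wording of it; the 4 KB CHUNK and FILE-based I/O are just placeholders, and fwrite errors are ignored for brevity):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 4096

int deflate_file(FILE *src, FILE *dst)
{
    unsigned char in[CHUNK], out[CHUNK];
    z_stream strm;
    int flush, ret;

    memset(&strm, 0, sizeof(strm));
    if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
        return Z_ERRNO;
    do {
        strm.avail_in = (uInt)fread(in, 1, CHUNK, src);
        strm.next_in = in;
        flush = feof(src) ? Z_FINISH : Z_NO_FLUSH;
        do {                                   /* drain all pending output */
            strm.next_out = out;
            strm.avail_out = CHUNK;
            ret = deflate(&strm, flush);
            fwrite(out, 1, CHUNK - strm.avail_out, dst);
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);
    deflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : Z_ERRNO;
}

Neither buffer ever has to be "big enough" for the whole stream; the loops simply keep going until the input is consumed and the output is drained.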
Information theory dictates that there must always be pathological cases which "compress" to something larger.
This page starts off with the worst-case encoding sizes for zlib - it looks like the worst-case growth is 6 bytes, plus 5 bytes for every 16 KB block started. So if you always flush after less than 16 KB, buffers that are 11 bytes plus your flush interval should be safe.
Unless you have tight control over the type of data you're compressing, finding pathological cases isn't hard. Any random number generator will find you some pretty quickly.
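If you want to see that for yourself, here is a tiny test harness (mine, not the answerer's; the 4096-byte size and the rand() seed are arbitrary): it compresses a buffer of random bytes and prints both sizes, and with incompressible input the deflate output comes out slightly larger than the input.

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void)
{
    unsigned char in[4096], out[8192];     /* output generously oversized */
    uLongf outlen = sizeof(out);
    size_t i;

    srand(1);
    for (i = 0; i < sizeof(in); i++)
        in[i] = (unsigned char)rand();     /* effectively incompressible input */
    if (compress2(out, &outlen, in, sizeof(in), Z_BEST_COMPRESSION) != Z_OK)
        return 1;
    printf("in: %zu bytes, out: %lu bytes\n", sizeof(in), (unsigned long)outlen);
    return 0;                              /* typically reports out > in */
}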

In what situation would compressed data be larger than input?

I need to handle compression of data that's largely UTF-8 HTML content in a utility I'm working on. The utility uses zLib and the deflate algorithm to compress data. Is it safe to assume that if the input data size is over 1 kB, compressed data will always be smaller than uncompressed input? (Input data below 1 kB will not be compressed.)
I'm trying to see situations where this assumption would break but apart from near-perfect random input, it seems a safe assumption to me.
Edit: the reason I'm wondering about this assumption is because I already have a buffer allocated that's as big as the input data. If my assumption holds, I can reuse this same buffer and avoid another memory allocation.
No. You can never assume that the compressed data will always be smaller. In fact, if any sequence is compressed by the algorithm, then you are guaranteed that some other sequence is expanded.
You can use zlib's deflate() function to compress as much as it can into your 1K buffer. Do whatever you need to with that result, then continue with another deflate() call writing into that same buffer.
Alternatively you can allocate a buffer big enough for the largest expansion. The deflateBound() or compressBound() functions will tell you how much that is. It's only a small amount more.
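As a small sketch of that second route (illustrative only; the 1 kB size is just the threshold from the question): compressBound() reports the worst-case compressed size up front, so a buffer of that size can never be too small.

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void)
{
    uLong srclen = 1024;                      /* e.g. the question's 1 kB threshold */
    uLong worst = compressBound(srclen);      /* upper bound on compressed size */
    unsigned char *src = calloc(srclen, 1);
    unsigned char *dst = malloc(worst);
    uLongf dstlen = worst;

    printf("compressBound(%lu) = %lu\n", srclen, worst);
    if (src && dst && compress(dst, &dstlen, src, srclen) == Z_OK)
        printf("compressed %lu -> %lu bytes\n", srclen, (unsigned long)dstlen);
    free(src);
    free(dst);
    return 0;
}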
As far as I know, a sequence of 128 bytes with values 0, 1, 2, ..., 127 will not be compressed by zLib. Technically, it's possible to intentionally create an HTML page that will break your compression scheme, but with normal innocent HTML data you should be almost perfectly safe.
But almost perfectly is not perfectly. If you already have a buffer of that size, I'd advise attempting the compression with this buffer, and if it turns out that the buffer is not enough (I suppose zLib has means of indicating that), then allocate a larger buffer or simply store an uncompressed version. And make sure you are writing these cases into some log so you can see if it ever fires :)
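One way that fallback could look in code, assuming the zlib utility interface (the helper name and the "return 0 means store it raw" convention are my own; compress2() signals a too-small output buffer with Z_BUF_ERROR):

#include <zlib.h>

/* Try to compress in into out, where out is the same size as the input.
   Returns the compressed length, or 0 if the result did not fit and the
   caller should store the data uncompressed (and log the event). */
uLong compress_or_store(const unsigned char *in, uLong inlen, unsigned char *out)
{
    uLongf outlen = inlen;                /* reuse a buffer of the input size */
    int ret = compress2(out, &outlen, in, inlen, Z_DEFAULT_COMPRESSION);
    if (ret == Z_OK)
        return (uLong)outlen;             /* it fit: keep the compressed form */
    return 0;                             /* Z_BUF_ERROR: did not fit */
}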

How to determine the actual usage of a malloc'ed buffer

I have some compressed binary data and an API call to decompress it which requires a pre-allocated target buffer. The API provides no way of telling me the size of the decompressed data. So I can malloc an oversized buffer to decompress into, but I would then like to resize (or copy this to) a memory buffer of the correct size. So, how do I (indeed, can I) determine the actual size of the decompressed binary data in the oversized buffer?
(I do not control the compression of the data so I do not know in advance what size to expect and I cannot write a header for the file.)
As others have said, there is no good way to do this if your API doesn't provide it.
I almost don't want to suggest this for fear that you'll take this suggestion and have some mission-critical piece of your application depend on it, but...
A heuristic would be to fill your buffer with some 'poison' pattern before decompressing into it. Then, after decompression, scan the buffer for the first occurrence of the poison pattern.
This is a heuristic because it's perfectly conceivable that the decompressed data could just happen to have an occurrence of your poison pattern. Unless you have exact domain knowledge of what the data will be, and can choose a pattern specifically that you know cannot exist.
Even still, an imperfect solution at best.
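For what it's worth, a sketch of that heuristic in C (my own code; the 0xA5 sentinel byte and the decompress_into callback standing in for the opaque API are assumptions, and this variant scans backwards from the end rather than for the first occurrence of the pattern, which is the same idea in a slightly different form):

#include <stddef.h>
#include <string.h>

#define POISON 0xA5   /* arbitrary byte assumed unlikely to end the real data */

/* decompress_into() is a placeholder for the real API call. */
size_t guess_used_length(unsigned char *buf, size_t cap,
                         void (*decompress_into)(unsigned char *, size_t))
{
    size_t n = cap;

    memset(buf, POISON, cap);             /* poison the whole buffer first */
    decompress_into(buf, cap);            /* let the API fill part of it */
    while (n > 0 && buf[n - 1] == POISON) /* strip what still looks poisoned */
        n--;
    return n;       /* heuristic: undercounts if the data really ends in 0xA5 */
}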
Usually this information is supplied at compression time (take a look at 7-zip's LZMA SDK, for example).
There is no way to know the actual size of the decompressed data (or the size of the part that is actually in use) with the information you are giving now.
If the decompression step doesn't give you the decompressed size as a return value or "out" parameter in some way, you can't.
There is no way to determine how much data was written in the buffer (outside of debugger/valgrind-type checks).
A complex way to answer this problem is by decompressing twice into an over-sized buffer.
Both times, you first fill the buffer with a "random pattern". Starting from the end, you count the number of bytes which correspond to the pattern, and detect the end of the decompressed sequence where it differs.
Or does it? Maybe, by chance, one of the final bytes of the decompressed sequence corresponds to the random byte at that exact position. So the final decompressed size might be larger than the detected one. If your pattern is truly random, it should not be off by more than a few bytes.
For the second pass, you fill the buffer again with a random pattern, but a different one. Ensure that, at each position, the new random pattern has a different value than the old one. For speed, you are not obliged to fill the full buffer: you may limit the new pattern to a few bytes before and some more bytes after the first detected end. 32 bytes should be enough, since it is improbable that so many bytes correspond by chance to the first random pattern.
Decompress a second time. Detect again where the pattern differs. Take the larger of the two detected ends. That is your decompressed size.
You could check how free() works for your compiler/OS and do the same.
free() doesn't take the size of the malloc'ed data, but it somehow knows how much to free, right?
Usually the size is stored just before the allocated buffer, though exactly how many bytes before depends on the OS/architecture/compiler.
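If you want to poke at that bookkeeping rather than reimplement it, glibc exposes it as malloc_usable_size() (glibc-specific; note it reports the size of the allocation, not how much of the buffer the decompressor actually wrote, so it still does not answer the original question):

#include <malloc.h>   /* glibc: malloc_usable_size() */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = malloc(1000);

    if (p != NULL) {
        /* Typically prints a value >= 1000: the allocator rounds requests up. */
        printf("requested 1000, usable %zu\n", malloc_usable_size(p));
        free(p);
    }
    return 0;
}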

Split file occupying the same memory space as source file

I have a file, say 100MB in size. I need to split it into (for example) 4 different parts.
Let's say first file from 0-20MB, second 20-60MB, third 60-70MB and last 70-100MB.
But I do not want to do a safe split - into 4 output files. I would like to do it in place. So the output files should use the same place on the hard disk that is occupied by the one source file, and literally split it, without making a copy (so at the moment of the split, we lose the original file).
In other words, the input file is the output files.
Is this possible, and if yes, how?
I was thinking maybe to manually add a record to the filesystem, that a file A starts here, and ends here (in the middle of another file), do it 4 times and afterwards remove the original file. But for that I would probably need administrator privileges, and probably wouldn't be safe or healthy for the filesystem.
Programming language doesn't matter, I'm just interested if it would be possible.
The idea is not so mad as some comments paint it. It would certainly be possible to have a file system API that supports such reinterpreting operations (to be sure, the desired split is probably not exactly aligned to block boundaries, but you could reallocate just those few boundary blocks and still save a lot of temporary space).
None of the common file system abstraction layers support this; but recall that they don't even support something as reasonable as "insert mode" (which would rewrite only one or two blocks when you insert something into the middle of a file, instead of all blocks), only an overwrite and an append mode. The reasons for that are largely historical, but the current model is so entrenched that it is unlikely a richer API will become common any time soon.
As I explain in this question on SuperUser, you can achieve this using the technique outlined by Tom Zych in his comment.
bigfile="mybigfile-100Mb"
chunkprefix="chunk_"
# Chunk offsets
OneMegabyte=1048576
chunkoffsets=(0 $((OneMegabyte*20)) $((OneMegabyte*60)) $((OneMegabyte*70)))
currentchunk=$((${#chunkoffsets[#]}-1))
while [ $currentchunk -ge 0 ]; do
# Print current chunk number, so we know it is still running.
echo -n "$currentchunk "
offset=${chunkoffsets[$currentchunk]}
# Copy end of $archive to new file
tail -c +$((offset+1)) "$bigfile" > "$chunkprefix$currentchunk"
# Chop end of $archive
truncate -s $offset "$archive"
currentchunk=$((currentchunk-1))
done
You need to give the script the starting position (offset in bytes, zero means a chunk starting at bigfile's first byte) of each chunk, in ascending order, like on the fifth line.
If necessary, automate it using seq: the following command builds a chunkoffsets array with one chunk at 0, then one starting at 100 kB, then one for every megabyte in the range 1--10 MB (note the -1 on the last parameter, so the upper bound is excluded), then one chunk every two megabytes in the range 10--20 MB.
OneKilobyte=1024
OneMegabyte=$((1024*OneKilobyte))
chunkoffsets=(0 $((100*OneKilobyte)) $(seq $OneMegabyte $OneMegabyte $((10*OneMegabyte-1))) $(seq $((10*OneMegabyte-1)) $((2*OneMegabyte)) $((20*OneMegabyte-1))))
To see which chunks you have set:
for offset in "${chunkoffsets[@]}"; do echo "$offset"; done
0
102400
1048576
2097152
3145728
4194304
5242880
6291456
7340032
8388608
9437184
10485759
12582911
14680063
16777215
18874367
20971519
This technique has the drawback that it needs at least the size of the largest chunk available (you can mitigate that by making smaller chunks, and concatenating them somewhere else, though). Also, it will copy all the data, so it's nowhere near instant.
As to the fact that some hardware video recorders (PVRs) manage to split videos within seconds, they probably only store a list of offsets for each video (a.k.a. chapters), and display these as independent videos in their user interface.
