Zlib decompress bytes with unknown decompressed length in C

I am trying to write my own PNG reader without any external libraries. I need to use zlib to decompress the PNG's IDAT chunk. I managed to do it in Python using zlib.decompress(), and I am trying to replicate it in C. I was reading over zlib's docs and found uncompress(), however it requires a destination length, which I would not know.
I could set a destination much larger than the PNG could possibly need, but this seems like a cop-out and would break my program if I had a really big picture. However, I have found a function inflate() which can be called multiple times. If I could do this, I could realloc() memory as needed with each call. Yet I don't understand the docs for it very well and have not found many examples of this type of thing. Could anyone provide some code or help point me in the right direction?

You do know the destination length. Exactly. The PNG header information tells you how many rows, how many columns, and how many bytes per pixel. Multiply it all out, add a byte per row for the filtering, and you have your answer.
Allocate that amount of memory, and decompress into that.
Note that there can be multiple IDAT chunks, but combined they contain a single zlib stream.
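For a non-interlaced image that arithmetic is short. A minimal sketch, assuming the IHDR fields have already been parsed (the function name and parameters here are illustrative, not part of any library):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: width, height, channels and bit depth are assumed
 * to have been parsed from the IHDR chunk already. */
static size_t idat_decompressed_size(uint32_t width, uint32_t height,
                                     unsigned channels, unsigned bit_depth)
{
    /* Bytes per scanline, rounding up for bit depths below 8. */
    size_t bytes_per_row = ((size_t)width * channels * bit_depth + 7) / 8;

    /* One filter-type byte precedes every scanline. */
    return (size_t)height * (bytes_per_row + 1);
}
```

Note that an Adam7-interlaced image is stored as seven sub-images, each of whose rows carries its own filter byte, so the count comes out slightly differently there.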

Related

Is Minecraft missing zlib uncompressed size in its chunk/region data?

Info on Minecraft's region files
Minecraft's region files are stored in three sections, the first two giving information about where the chunks are stored and information about the chunks themselves. In the final section, each chunk is stored as a 4-byte length, a byte for the type of compression it uses (almost always zlib, RFC 1950), and then the compressed data itself.
Here's more (probably better) information: https://minecraft.gamepedia.com/Region_file_format
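For reference, a minimal sketch of reading that 5-byte chunk header (layout per the wiki page above; the function and variable names are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: read the 5-byte header that precedes each chunk's
 * compressed data in a region file.  4-byte big-endian length (which
 * counts the type byte), then a 1-byte compression type. */
int read_chunk_header(FILE *f, uint32_t *compressed_len, uint8_t *type)
{
    uint8_t hdr[5];
    if (fread(hdr, 1, sizeof hdr, f) != sizeof hdr)
        return -1;
    uint32_t len = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
                   ((uint32_t)hdr[2] << 8)  |  (uint32_t)hdr[3];
    *compressed_len = len - 1;   /* length field includes the type byte */
    *type = hdr[4];              /* 1 = gzip, 2 = zlib */
    return 0;
}
```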
The problem
I have a program that successfully loads chunk data. However, I'm not able to find out how big a chunk will be once decompressed, so I just allocate the maximum amount of space it could take.
In the player data files, they do give the size that it takes when decompressed, and (I think) it uses the same type of compression.
The end of a player.dat file giving the size of the decompressed data (in little-endian):
This is the start of the chunk data, first 4 bytes giving how many bytes is in the following compressed data:
Mystery data
However, if I look where the compressed data specifically "ends", there's still a lot of data after it. This data doesn't seem to have a use, but if I try to decompress any of it along with the rest of the chunk, I get an error.
Highlighted chunk data, and unhighlighted mystery data:
Missing decompressed size (header?)
And there's no decompressed size (or header? I could be wrong here) given.
The final size of this example chunk is 32,562 bytes, and this number (or any close neighbour) is nowhere to be found within the chunk data or mystery data. (Checked both big-endian and little-endian.)
Decompressed data terminating at index 32562 (Visual Studio locals watch):
Final Questions
Is there something I'm missing? Is this compression actually different from the player data compression? What's the mystery data? And am I stuck loading in 1<<20 bytes every time I want to load a chunk from a region file?
Thank you for any answers or suggestions
Files used
Isolated chunk data: https://drive.google.com/file/d/1n3Ix8V8DAgR9v0rkUCXMUuW4LJjT1L8B/view?usp=sharing
Full region data: https://drive.google.com/file/d/15aVdyyKazySaw9ZpXATR4dyvhVtrL6ZW/view?usp=sharing
(Not linking player data for possible security reasons)
In the region data, the chunk data starts at index 1208320 (or 0x127000)
The format information you linked is quite helpful. Have you read it?
In there it says: "The remainder of the file consists of data for up to 1024 chunks, interspersed with unused space." Furthermore, "Minecraft always pads the last chunk's data to be a multiple-of-4096B in length" (Italics mine.) Everything is in multiples of 4K, so the end of every chunk is padded to the next 4K boundary.
So your "mystery" data is not a mystery at all, as it is entirely expected per the format documentation. That data is simply junk to be ignored.
Note that, per the documentation, the data "length" in the first four bytes of the chunk is actually one more than the number of bytes of data in the chunk (following the five-byte header).
Also from the documentation, there is indeed no uncompressed size provided in the format.
zlib was designed for streaming data, where you don't know ahead of time how much there will be. You can use inflate() to decompress into whatever buffer size you like. If there's not enough room to finish, you can either do something with that data and then repeat into the same buffer, or you can grow the buffer with realloc() in C, or the equivalent for whatever language you're using. (Not noted in the question or tags.)
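Here is a sketch of such an inflate() loop in C, doubling the output buffer whenever it fills. The function name is illustrative, and it assumes the whole compressed stream is already in memory and that the result fits in memory:

```c
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Inflate a complete zlib stream into a growing buffer.  Returns the
 * decompressed data (caller frees) or NULL on error; *out_len receives
 * the decompressed size. */
unsigned char *inflate_all(const unsigned char *src, size_t src_len,
                           size_t *out_len)
{
    z_stream strm;
    memset(&strm, 0, sizeof strm);
    if (inflateInit(&strm) != Z_OK)
        return NULL;

    size_t cap = 16384;                 /* initial guess, grows as needed */
    unsigned char *out = malloc(cap);
    if (out == NULL) {
        inflateEnd(&strm);
        return NULL;
    }

    strm.next_in = (unsigned char *)src;
    strm.avail_in = (uInt)src_len;

    int ret;
    do {
        if (strm.total_out == cap) {    /* output buffer full: double it */
            cap *= 2;
            unsigned char *tmp = realloc(out, cap);
            if (tmp == NULL) {
                free(out);
                inflateEnd(&strm);
                return NULL;
            }
            out = tmp;
        }
        strm.next_out = out + strm.total_out;
        strm.avail_out = (uInt)(cap - strm.total_out);
        ret = inflate(&strm, Z_NO_FLUSH);
    } while (ret == Z_OK);

    if (ret != Z_STREAM_END) {          /* corrupt or truncated stream */
        free(out);
        inflateEnd(&strm);
        return NULL;
    }
    *out_len = strm.total_out;
    inflateEnd(&strm);
    return out;
}
```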

Using bsdiff with the source and the target file being the same one. Are other difference algorithms suitable for this?

I am trying to patch a file using bsdiff; my problem is that I have to do it with very little memory available. Given this constraint, I need to modify the source file in place with the patch in order to get the target file.
Bsdiff basics are as follows:
header: not very relevant in this explanation.
Control data block:
mixlen -> number of bytes to be modified by combining the bytes from the source file with the bytes obtained from the diff block.
copylen -> number of bytes to be added. This is totally new extra data that needs to be added to our file. These bytes are read from the extra block.
seeklen -> number used to know from where we have to read next in the source file.
Compressed control block.
Compressed diff block.
Compressed extra block.
Patch file format:
offset  len  content
0       8    BSDIFF_CONFIG_MAGIC
8       8    X
16      8    Y
24      8    sizeof(newfile)
32      X    control block
32+X    Y    diff block
32+X+Y  ???  extra block
with control block a set of triples (x,y,z) meaning "add x bytes
from oldfile to x bytes from the diff block; copy y bytes from the
extra block; seek forwards in oldfile by z bytes".
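To make the data dependency concrete, here is a sketch of how a patcher consumes one such triple (all names are made up for illustration, bounds checks are omitted). Note that the diff step reads the unmodified old file, which is exactly what breaks in-place patching:

```c
#include <stdint.h>

/* Illustrative sketch of applying one bsdiff control triple (x, y, z). */
typedef struct {
    int64_t x;   /* bytes combined from old file + diff block */
    int64_t y;   /* bytes copied verbatim from the extra block */
    int64_t z;   /* seek applied to the old-file cursor afterwards */
} ctrl_triple;

static void apply_triple(const ctrl_triple *c,
                         const uint8_t *old, int64_t *old_pos,
                         const uint8_t *diff, int64_t *diff_pos,
                         const uint8_t *extra, int64_t *extra_pos,
                         uint8_t *newbuf, int64_t *new_pos)
{
    /* Diff step: new byte = old byte + diff byte.  This must read
     * *unmodified* old data, and the old offsets are not monotonic,
     * so earlier in-place writes can clobber bytes that a later
     * triple still needs. */
    for (int64_t i = 0; i < c->x; i++)
        newbuf[*new_pos + i] = old[*old_pos + i] + diff[*diff_pos + i];
    *new_pos += c->x; *old_pos += c->x; *diff_pos += c->x;

    /* Extra step: copy brand-new bytes with no source counterpart. */
    for (int64_t i = 0; i < c->y; i++)
        newbuf[*new_pos + i] = extra[*extra_pos + i];
    *new_pos += c->y; *extra_pos += c->y;

    /* Seek step: move the old-file cursor for the next triple. */
    *old_pos += c->z;
}
```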
So the problem is that bsdiff assumes I always have the source file without any modification, so it uses source data that I have already modified (when the source is the same file as the target). Firstly I tried to reorder the modifications, but in some cases a modification affects memory that will be used later by another modification. Maybe the algorithm is just not suitable for what I want.
Does another algorithm suitable for this exist? Is there any implementation of BSDIFF or something similar that does what I need?
Before going more in depth with bsdiff I did some research and found VCDIFF (used by Xdelta), but it also seems to have the same behaviour. I haven't dug into the code, though, so I don't know yet whether it generates the patch in the same way as bsdiff does.
Another point to remark would be I am trying to implement it in C.
Edited 04/10/2016:
I have tried to reorder the patch, because with the addresses to modify ordered from smallest to largest I thought I could handle this problem by buffering the original data until the last modification that requires it has been done. But it seems that the patch order is also important; maybe bsdiff modifies the same part of memory several times until it gets the right data. Any idea will be very welcome if someone knows more about this.
Best regards,
Iván
We cannot eliminate the dependency on source data without impacting the compressed delta size. So, you will need to have source data unmodified to make BSDIFF work in the scenario you explained.

Find and replace data in gzip content efficiently

My C Linux-based program's inputs are:
char *in_str, char *find_str, char *replacing_str
in_str is compressed (gzip) data.
The program needs to search for find_str within the uncompressed input data, replace it with replacing_str, and then recompress the data.
The trivial way to do this is to use one of the many available gzip compress/uncompress libraries to uncompress the data, manipulate the uncompressed data, and then recompress the output. However, I need to make it as efficient as possible (it is an RT program).
I wonder whether it is more efficient to use an on-the-fly library (e.g. zlibc) approach or to simply do the operation as described above.
Maybe it is important to mention that:
the find_str and replacing_str strings are a small portion of the data
their lengths are not equal
find_str is expected to appear about 4 or 5 times
the uncompressed data length is ~2K - 6K bytes
Does anyone know of an efficient way to implement this?
Thanks
You are going to have to decompress no matter what, in order to search for the strings. (You might be able to get away with doing that only once and building an index. However that might be much larger than the uncompressed data, so you might as well just store it uncompressed instead.)
You can avoid recompressing all of it by preparing the gzip file ahead of time to be compressed in smaller history-less units using, for example, the Z_FULL_FLUSH option of zlib. This will reduce compression slightly, depending on how often you do it, but will greatly speed up building the output if only one of many blocks needs to be recompressed.
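For illustration, a sketch of preparing such a stream (the function name and unit size are illustrative; error handling is trimmed and the output buffer is assumed large enough; this produces zlib-wrapped output, whereas a gzip wrapper would need deflateInit2() with windowBits 15 + 16):

```c
#include <string.h>
#include <zlib.h>

/* Compress `data` in fixed-size units, ending each unit with
 * Z_FULL_FLUSH so no back-references cross the flush point and each
 * unit can later be recompressed independently. */
size_t deflate_with_flush_points(const unsigned char *data, size_t len,
                                 size_t unit,
                                 unsigned char *out, size_t out_cap)
{
    z_stream strm;
    memset(&strm, 0, sizeof strm);
    deflateInit(&strm, Z_DEFAULT_COMPRESSION);

    strm.next_out  = out;
    strm.avail_out = (uInt)out_cap;

    for (size_t off = 0; off < len; off += unit) {
        size_t n = (len - off < unit) ? len - off : unit;
        strm.next_in  = (unsigned char *)data + off;
        strm.avail_in = (uInt)n;
        /* Z_FULL_FLUSH resets the history; Z_FINISH ends the stream. */
        deflate(&strm, off + n == len ? Z_FINISH : Z_FULL_FLUSH);
    }

    size_t produced = strm.total_out;
    deflateEnd(&strm);
    return produced;
}
```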

How to determine the actual usage of a malloc'ed buffer

I have some compressed binary data and an API call to decompress it which requires a pre-allocated target buffer. There is no means via the API of learning the size of the decompressed data. So I can malloc an oversized buffer to decompress into, but I would then like to resize that buffer (or copy it to one) of the correct size. So, how do I (indeed, can I) determine the actual size of the decompressed binary data in the oversized buffer?
(I do not control the compression of the data so I do not know in advance what size to expect and I cannot write a header for the file.)
As others have said, there is no good way to do this if your API doesn't provide it.
I almost don't want to suggest this for fear that you'll take this suggestion and have some mission-critical piece of your application depend on it, but...
A heuristic would be to fill your buffer with some 'poison' pattern before decompressing into it. Then, after decompression, scan the buffer for the first occurrence of the poison pattern.
This is a heuristic because it's perfectly conceivable that the decompressed data could just happen to contain an occurrence of your poison pattern, unless you have exact domain knowledge of what the data will be and can choose a pattern that you know cannot occur in it.
Even then, it is an imperfect solution at best.
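A minimal single-byte version of that trick might look like this; the poison byte is arbitrary and decompress_api() stands in for the real, unnamed API. The two-pass random-pattern variant described below tightens it:

```c
#include <stdlib.h>
#include <string.h>

#define POISON 0xA5   /* arbitrary illustrative poison byte */

/* Scan backwards for the last byte that differs from the poison value. */
size_t detect_used_length(const unsigned char *buf, size_t cap)
{
    size_t used = cap;
    while (used > 0 && buf[used - 1] == POISON)
        used--;
    return used;   /* under-counts if the real data ends in POISON bytes */
}

/* Usage sketch, with decompress_api() as a placeholder:
 *
 *   unsigned char *buf = malloc(cap);
 *   memset(buf, POISON, cap);
 *   decompress_api(src, src_len, buf, cap);
 *   size_t used = detect_used_length(buf, cap);
 */
```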
Usually this information is supplied at compression time (take a look at 7-Zip's LZMA SDK, for example).
There is no way to know the actual size of the decompressed data (or the size of the part that is actually in use) with the information you are giving now.
If the decompression step doesn't give you the decompressed size as a return value or "out" parameter in some way, you can't.
There is no way to determine how much data was written in the buffer (outside of debugger/valgrind-type checks).
A complex way to answer this problem is by decompressing twice into an over-sized buffer.
In both cases, you need a "random pattern". Starting from the end, you count the number of bytes which match the pattern, and detect the end of the decompressed sequence where the buffer first differs from it.
Or does it end there? Maybe, by chance, one of the final bytes of the decompressed sequence matches the random byte at that exact position, so the real decompressed size might be a little larger than the detected one. If your pattern is truly random, it should not be off by more than a few bytes.
You then need to fill the buffer again with a random pattern, but a different one. Ensure that, at each position, the new random pattern has a different value from the old one. For speed, you are not obliged to fill the full buffer: you may limit the new pattern to a few bytes before and some more bytes after the first detected end. 32 bytes shall be enough, since it is improbable that so many bytes would match the first random pattern by chance.
Decompress a second time. Detect again where the pattern differs. Take the larger of the two detected ends. That is your decompressed size.
You should check how free works for your compiler/OS, and do the same.
free doesn't take the size of the malloc'ed data, but it somehow knows how much to free, right? ;)
Usually the size is stored just before the allocated buffer; I don't know exactly how many bytes before, though, as it again depends on the OS/arch/compiler.
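On glibc that bookkeeping is exposed directly as malloc_usable_size(). Note, though, that it reports the size of the allocation (possibly rounded up from what was requested), not how many bytes the decompressor actually wrote, so by itself it does not answer the question:

```c
#include <malloc.h>   /* glibc-specific header for malloc_usable_size() */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned char *buf = malloc(1000);
    if (buf == NULL)
        return 1;
    /* Reports the usable size of the allocation, which may be larger
     * than requested, but NOT how much of it was actually written. */
    printf("usable size: %zu\n", malloc_usable_size(buf));
    free(buf);
    return 0;
}
```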

Reading tag data for Ogg/Flac files

I'm working on a C library that reads tag information from music files. I've already got ID3v2 taken care of, but I can't figure out how Ogg files are structured.
I opened a .ogg file in a hex editor and I could find the tag data because it was all human-readable. But everything from the beginning of the file to the tag data looked like garbage. How is this data encoded?
I don't need any help with the actual code, I just need help visualizing what an Ogg header looks like and what encoding it uses so that I can read it. I'd like to use a non-hacky approach to reading Ogg files.
I've been looking at the FLAC format, which has been helpful.
The FLAC file I'm looking at has about 350 bytes between the "fLaC" identifier and the human-readable comments section, and none of it is human-readable in my hex editor, so I'm sure there has to be something important in there.
I'm using Linux, and I have no intention of porting to Windows or OS X. So if I need to use a glibc only function to convert the encoding, I'm fine with that.
The Ogg file format is documented here. There is a very nice graphical visualization as you requested with a detailed written description.
You may also want to look at libogg, which is an open-source BSD-licensed library for reading and writing Ogg files.
As is described in the link you provided, the following metadata blocks can occur between the "fLaC" marker and the VORBIS_COMMENT metadata block.
STREAMINFO: This block has information about the whole stream, like sample rate, number of channels, total number of samples, etc. It must be present as the first metadata block in the stream. Other metadata blocks may follow, and ones that the decoder doesn't understand, it will skip.
APPLICATION: This block is for use by third-party applications. The only mandatory field is a 32-bit identifier. This ID is granted upon request to an application by the FLAC maintainers. The remainder of the block is defined by the registered application. Visit the registration page if you would like to register an ID for your application with FLAC.
PADDING: This block allows for an arbitrary amount of padding. The contents of a PADDING block have no meaning. This block is useful when it is known that metadata will be edited after encoding; the user can instruct the encoder to reserve a PADDING block of sufficient size so that when metadata is added, it will simply overwrite the padding (which is relatively quick) instead of having to insert it into the right place in the existing file (which would normally require rewriting the entire file).
SEEKTABLE: This is an optional block for storing seek points. It is possible to seek to any given sample in a FLAC stream without a seek table, but the delay can be unpredictable since the bitrate may vary widely within a stream. By adding seek points to a stream, this delay can be significantly reduced. Each seek point takes 18 bytes, so 1% resolution within a stream adds less than 2k. There can be only one SEEKTABLE in a stream, but the table can have any number of seek points. There is also a special 'placeholder' seekpoint which will be ignored by decoders but which can be used to reserve space for future seek point insertion.
Just after the above description, there's also the specification of the format of each of those blocks. The link also says
All numbers used in a FLAC bitstream are integers; there are no floating-point representations. All numbers are big-endian coded. All numbers are unsigned unless otherwise specified.
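Putting that layout into code: each metadata block is preceded by a 4-byte header (a 1-bit last-block flag, a 7-bit block type, and a 24-bit big-endian body length), so skipping to the VORBIS_COMMENT block (type 4) is a short loop. A sketch, with an illustrative function name:

```c
#include <stdint.h>
#include <stdio.h>

/* Skip the "fLaC" marker, then walk the metadata block headers until
 * the VORBIS_COMMENT block.  Returns the file offset of the block
 * body, or -1 if it is absent. */
long find_vorbis_comment(FILE *f)
{
    uint8_t hdr[4];

    if (fseek(f, 4, SEEK_SET) != 0)   /* skip the 4-byte "fLaC" marker */
        return -1;

    for (;;) {
        if (fread(hdr, 1, 4, f) != 4)
            return -1;
        int last     = hdr[0] & 0x80;            /* last-metadata-block flag */
        int type     = hdr[0] & 0x7F;            /* 4 = VORBIS_COMMENT */
        uint32_t len = ((uint32_t)hdr[1] << 16) |
                       ((uint32_t)hdr[2] << 8)  | hdr[3];
        if (type == 4)
            return ftell(f);                     /* body starts right here */
        if (last)
            return -1;                           /* no more metadata blocks */
        if (fseek(f, (long)len, SEEK_CUR) != 0)  /* skip this block's body */
            return -1;
    }
}
```

One quirk worth knowing: unlike the rest of FLAC, the lengths inside the Vorbis comment body itself are little-endian, per the Vorbis comment specification.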
So, what are you missing? You say
I'd like a non-hacky approach to reading Ogg files.
Why re-write a library to do that when they already exist?
