Problems with portability: aligning data, endianness issues, etc. (C)

I'm writing a toy database management system, and running up against some alignment and endianness issues.
First, let me explain the data being stored and where it lives. Some definitions: the layout of a record is broken up into a Record Directory and Record Data.
[Field count=N] [Field offset[0]] [...] [Field offset[N-1]] [Data for fields 0 to N-1]
The field count and offsets combined are called the Record Directory.
The data is called the Record Data.
The field count is of type uint16_t.
The field offset is of type uint16_t.
The data fields can be treated as a variable length byte buffer pointed to by (uint8_t *) with a length of at least N bytes.
The field count cannot exceed 4095 (0x0FFF).
The records are stored in a Page:
Pages are of size: 4096 bytes.
Pages need to store 2 bytes of data for each record.
The last 6 bytes of the page store the running free-space offset and data for a slot directory. That metadata is irrelevant to the question, so I won't bore anyone with the details.
We store records on the page by appending at the running free-space offset and then advancing it. Records can later be altered and deleted, which leaves unused fragments of space on the page. This space is not reused until compaction.
At the moment, we store a fragment byte of 0x80 in unused space (since the field count cannot exceed 0x0FFF, the first byte of a record will never be 0x80 on a big-endian machine).
However, this becomes a problem at compaction time. We end up scanning until we hit the first byte that is not 0x80, and we consider that the start of the free space. Unfortunately, this is not portable and will only work on big-endian machines.
To restate the issue: the problem is distinguishing between the byte sequences 0x80 0x80 0x00 and 0x80 0x00 0x80, where, depending on the platform's endianness, the first two bytes may form a valid field count or a pair of fragment bytes.
I want to try aligning records on even bytes. I just don't have the foresight to see if this would be a correct workaround for this issue.
At any given time, the free space offset should always sit on an even byte boundary. This means after inserting a record, you advance the free space pointer to the next even boundary.
The problem then becomes an issue of marking the fragments. Fragments are created upon deletion or altering a record (growing/shrinking by some number of bytes). I wanted to store what I would call 2-byte fragment markers: 0xFFFF. But that doesn't seem possible when altering.
This is where I'm stuck. Sorry for the long-winded problem explanation. My partner and I (this is an academic assignment) have battled this data-ambiguity problem several times, and it keeps resurfacing under different solutions.
Any insight would help. I hope the problem statement can be followed.

I would try this:
Align records to at least 2-byte boundaries.
Scan the page for free space as an array of uint16_t rather than char, then look for length & 0x8000.
If you let the machine interpret integers as such instead of trying to scan them as characters, endianness shouldn't be an issue here (at least until you want to read your database on a different machine than the one that wrote it).
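For concreteness, here is a minimal sketch of that scan in C, assuming records start on even byte offsets and fragments are marked by storing the 16-bit value 0x8000 through the same uint16_t view of the page (the names and the exact marker value are mine):

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096

    /* Find the byte offset of the first 16-bit word at or after the even
     * offset 'start' that is NOT a fragment marker. A real field count
     * can never have the top bit set (max 0x0FFF), so 'word & 0x8000'
     * uniquely identifies fragment words. */
    size_t skip_fragments(const uint16_t *page_words, size_t start)
    {
        size_t i = start / 2;
        while (i < PAGE_SIZE / 2 && (page_words[i] & 0x8000u))
            i++;                    /* skip fragment words */
        return i * 2;               /* byte offset of the first live word */
    }

Because the marker is written and read through the same uint16_t type on the same machine, byte order never enters into it: the test is on the value, not on individual bytes. (Advancing the free-space offset to the next even boundary is just off = (off + 1) & ~(size_t)1.)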

Related

Is Minecraft missing the zlib uncompressed size in its chunk/region data?

Info on Minecraft's region files
Minecraft's region files are stored in 3 sections, the first two giving information about where the chunks are stored and about the chunks themselves. In the final section, each chunk is given as a 4-byte length, a byte for the type of compression it uses (almost always zlib, RFC 1950), and then the compressed data itself.
Here's more (probably better) information: https://minecraft.gamepedia.com/Region_file_format
The problem
I have a program that successfully loads chunk data. However, I'm not able to find out how big the chunks will be when decompressed, so I just allocate the maximum amount of space they could take.
In the player data files, they do give the size that it takes when decompressed, and (I think) it uses the same type of compression.
The end of a player.dat file giving the size of the decompressed data (in little-endian):
This is the start of the chunk data, first 4 bytes giving how many bytes is in the following compressed data:
Mystery data
However, if I look where the compressed data specifically "ends", there's still a lot of data after it. This data doesn't seem to have a use, but if I try to decompress any of it with the rest of the chunk, I get an error.
Highlighted chunk data, and unhighlighted mystery data:
Missing decompressed size (header?)
And there's no decompressed size (or header? I could be wrong here) given.
The final size of this example chunk is 32,562 bytes, and this number (or any close neighbour) is nowhere to be found within the chunk data or mystery data. (I checked both big-endian and little-endian.)
Decompressed data terminating at index 32562, (Visual Studio locals watch):
Final Questions
Is there something I'm missing? Is this compression actually different from the player data compression? What's the mystery data? And am I stuck loading in 1<<20 bytes every time I want to load a chunk from a region file?
Thank you for any answers or suggestions
Files used
Isolated chunk data: https://drive.google.com/file/d/1n3Ix8V8DAgR9v0rkUCXMUuW4LJjT1L8B/view?usp=sharing
Full region data: https://drive.google.com/file/d/15aVdyyKazySaw9ZpXATR4dyvhVtrL6ZW/view?usp=sharing
(Not linking player data for possible security reasons)
In the region data, the chunk data starts at index 1208320 (or 0x127000)
The format information you linked is quite helpful. Have you read it?
In there it says: "The remainder of the file consists of data for up to 1024 chunks, interspersed with unused space." Furthermore, "Minecraft always pads the last chunk's data to be a multiple-of-4096B in length" (Italics mine.) Everything is in multiples of 4K, so the end of every chunk is padded to the next 4K boundary.
So your "mystery" data is not a mystery at all, as it is entirely expected per the format documentation. That data is simply junk to be ignored.
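So, to find where the padding ends, you just round the chunk's end offset up to the next multiple of 4096. A one-line sketch (the function name is mine; this is simply the arithmetic implied by the documentation):

    #include <stdint.h>

    /* Round 'len' up to the next 4096-byte boundary; everything between
     * 'len' and the boundary is padding to be skipped. */
    static uint32_t next_4k_boundary(uint32_t len)
    {
        return (len + 4095u) & ~4095u;
    }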
Note from the documentation that the data "length" in the first four bytes of the chunk is actually one more than the number of bytes of compressed data (which follows the five-byte header).
Also from the documentation, there is indeed no uncompressed size provided in the format.
zlib was designed for streaming data, where you don't know ahead of time how much there will be. You can use inflate() to decompress into whatever buffer size you like. If there's not enough room to finish, you can either do something with that data and then repeat into the same buffer, or you can grow the buffer with realloc() in C, or the equivalent for whatever language you're using. (Not noted in the question or tags.)
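As a sketch of that approach with zlib in C (the function name, starting size, and buffer-doubling strategy are my choices, not anything prescribed by the library):

    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    /* Inflate a zlib (RFC 1950) stream of unknown decompressed size,
     * doubling the output buffer whenever inflate() runs out of room.
     * Returns a malloc'd buffer (caller frees) and sets *outlen, or
     * returns NULL on error. */
    unsigned char *inflate_all(const unsigned char *src, size_t srclen,
                               size_t *outlen)
    {
        size_t cap = 16384, have = 0;
        unsigned char *out = malloc(cap);
        z_stream s;

        memset(&s, 0, sizeof s);          /* zalloc/zfree/opaque = Z_NULL */
        if (out == NULL || inflateInit(&s) != Z_OK) {
            free(out);
            return NULL;
        }
        s.next_in = (unsigned char *)src;
        s.avail_in = (uInt)srclen;

        for (;;) {
            s.next_out = out + have;
            s.avail_out = (uInt)(cap - have);
            int ret = inflate(&s, Z_NO_FLUSH);
            have = cap - s.avail_out;
            if (ret == Z_STREAM_END)
                break;                    /* 'have' is now the true size */
            if ((ret != Z_OK && ret != Z_BUF_ERROR) || s.avail_out != 0) {
                inflateEnd(&s);           /* corrupt or truncated input */
                free(out);
                return NULL;
            }
            unsigned char *p = realloc(out, cap * 2);  /* grow and retry */
            if (p == NULL) {
                inflateEnd(&s);
                free(out);
                return NULL;
            }
            out = p;
            cap *= 2;
        }
        inflateEnd(&s);
        *outlen = have;
        return out;
    }

After the loop, have is exactly the decompressed chunk size, so there is no need to know it up front or to allocate 1<<20 bytes every time.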

Suggestions on how to make my compressor faster

I have some data which I'm compressing with a custom compressor, the compressed data is fine but the compressor takes ages, and I'm seeking advice on how I could make that faster. Let me give you all the details.
The input data is an array of bytes, at most 2^16 of them. Since the bytes in the array NEVER take values between 0x08 and 0x37 (inclusive), I decided I could exploit that for a simple LZ-like compression scheme. It works by replacing any sequence of 4 to 51 bytes that has already been seen at a "lower address" (meaning closer to the array's beginning) with a single byte in the 0x08 to 0x37 range, followed by two bytes holding the low and high byte of the index where the sequence begins. That gives the decompressor the length (within that single byte) and the address of the original data, so it can rebuild the original array.
The compressor works this way: for every sequence length from 51 down to 4 bytes (longer sequences are tested first), starting from each index (from left to right), I check whether there is a correspondence "left" of that, meaning at an index lower than the starting point I'm checking. If there is more than a single match, I choose the match that saves the most, which means the longest correspondence starting at the leftmost place.
The results are just perfect... but of course this is overkill: it's 4 nested 'for' loops with a memcmp() inside, and it takes minutes on a modern workstation to compress some 20 KB worth of data. That's why I'm seeking help.
Code is accessible here, if you need to sneak a peek. The 'job' starts at line 44.
Of course I can give you any detail you need, there's nothing secret here (BTW, just in case... I'm not going to change compression scheme for this reason, as this one works exactly as I need it!)
Thank you in advance.
A really obvious one is that you don't have to loop over the lengths; just find out what the longest match at that position is. That's not a "search": just keep extending the match by 1 for every matching pair of characters. When it stops, you have the longest match at that position (naturally you can force it to stop at 51 too, so it doesn't overrun).
Another typical trick is keeping a hashmap that maps keys of 3 or 4 characters to a list of offsets where they can be found. That way you only need to try positions that have some hope of resulting in a match. This is also described in the DEFLATE RFC, all the way at the bottom.
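A sketch of those two ideas combined, in the spirit of DEFLATE's hash chains (the table sizes, the multiplicative hash, and all names are my choices; head[] must be filled with -1 before each buffer, and hash4() must not be called within 3 bytes of the end):

    #include <stdint.h>
    #include <string.h>

    #define HASH_BITS 15
    #define HASH_SIZE (1 << HASH_BITS)
    #define MIN_MATCH 4
    #define MAX_MATCH 51

    /* head[h] = most recent position whose first 4 bytes hash to h;
     * prev_pos[p] = previous position with the same hash (or -1). */
    static int32_t head[HASH_SIZE];
    static int32_t prev_pos[1 << 16];

    static uint32_t hash4(const uint8_t *p)
    {
        uint32_t v;
        memcpy(&v, p, 4);
        return (v * 2654435761u) >> (32 - HASH_BITS);
    }

    /* Longest match strictly before 'pos', extended byte by byte. */
    static int longest_match(const uint8_t *buf, int32_t pos, int32_t end,
                             int32_t *match_pos)
    {
        int best = 0;
        int32_t limit = (end - pos < MAX_MATCH) ? end - pos : MAX_MATCH;

        for (int32_t p = head[hash4(buf + pos)]; p >= 0; p = prev_pos[p]) {
            int len = 0;
            while (len < limit && buf[p + len] == buf[pos + len])
                len++;                  /* extend; no memcmp loop needed */
            if (len > best) {
                best = len;
                *match_pos = p;
            }
        }
        return (best >= MIN_MATCH) ? best : 0;
    }

    /* Call for every position you move past, matched or not. */
    static void insert_pos(const uint8_t *buf, int32_t pos)
    {
        uint32_t h = hash4(buf + pos);
        prev_pos[pos] = head[h];
        head[h] = pos;
    }

A real compressor would also cap how many chain entries it follows per position, trading a little compression ratio for a lot of speed, just as DEFLATE implementations do.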

Why is the first byte shown like that in the Beyond Compare tool?

I have two binary files that are supposed to be the same, but they are not, so I used binary diff tools to look at them. But two different tools, Beyond Compare and UltraCompare, give me different results for one file at the first byte.
I used the HxD tool to verify the content, and HxD seems to agree with UltraCompare.
Can anybody tell me what that means in Beyond Compare? Does it mean Beyond Compare is not reliable in some cases?
In Beyond Compare, spaces with the cross-hatched ▨ background indicate missing (added or deleted) bytes. In your image, the file on the left starts with a 0x00 byte that the one on the right doesn't have. BC shows a gap in the file content to make the rest of the bytes line up visually. That's also indicated by the hex addresses shown as "line numbers" being different on the two sides, and it is the reason the rest of the file shows as black (exact matches). Gaps don't have any effect on the content of the files; they're just a method of presenting the alignment more clearly.
UltraCompare apparently isn't adjusting the alignment in this case, so every 0xC8 byte is lined up with a 0x00 one and vice versa, which is why the entire comparison is shown as a difference (red).
HxD is just showing a single file, not a comparison, so it doesn't need to use gaps to show alignment. Whether UltraCompare is better or not depends on what you want the comparison to do. It is just comparing byte 1 to byte 1, byte 2 to byte 2, etc., while BC is aligning the files taking adds and deletes into account. In this case, it's showing that byte 1 on the left was added, so it doesn't match anything on the right, while byte 2 on the left matches byte 1 on the right, byte 3 on the left matches byte 2 on the right, etc.
If the binary data can have inserts and deletes (e.g., if it contains textual strings or variable length headers), then BC's approach is better because it avoids showing the entire file as different if one side just has an added byte (as in this case).
If the binary data is fixed size, for example a bitmap, then what UltraCompare is doing is better, since it's not adjusting offsets to line things up better. Since your filenames are labeled "pixelData" I assume that's the behavior you would prefer. In that case, in Beyond Compare you can change that by using the Session menu's Session Settings... command and switching the "Comparison" alignment setting from "Complete" to "None".

How to determine the actual usage of a malloc'ed buffer

I have some compressed binary data and an API call to decompress it, which requires a pre-allocated target buffer. The API provides no way to learn the size of the decompressed data. So I can malloc an oversized buffer to decompress into, but I would then like to resize it (or copy it to) a memory buffer of the correct size. So, how do I (indeed, can I) determine the actual size of the decompressed binary data in the oversized buffer?
(I do not control the compression of the data so I do not know in advance what size to expect and I cannot write a header for the file.)
As others have said, there is no good way to do this if your API doesn't provide it.
I almost don't want to suggest this for fear that you'll take this suggestion and have some mission-critical piece of your application depend on it, but...
A heuristic would be to fill your buffer with some 'poison' pattern before decompressing into it. Then, after decompression, scan the buffer for the first occurrence of the poison pattern.
This is a heuristic because it's perfectly conceivable that the decompressed data could just happen to contain an occurrence of your poison pattern, unless you have exact domain knowledge of what the data will be and can choose a pattern you know cannot occur. Even then, it's an imperfect solution at best.
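A minimal sketch of that heuristic, assuming the API call looks like decompress_into(buf, cap) (the function pointer and the 0xA5 byte are placeholders; scanning from the end, as a later answer suggests, is less fragile than stopping at the first interior occurrence):

    #include <stddef.h>
    #include <string.h>

    #define POISON 0xA5   /* arbitrary; pick a byte unlikely in your data */

    /* Fill with poison, decompress, then trim trailing poison bytes.
     * Can under-count if the real data happens to end in POISON bytes,
     * which is exactly what makes this a heuristic. */
    size_t guess_decompressed_size(unsigned char *buf, size_t cap,
                                   void (*decompress_into)(unsigned char *,
                                                           size_t))
    {
        memset(buf, POISON, cap);
        decompress_into(buf, cap);

        size_t n = cap;
        while (n > 0 && buf[n - 1] == POISON)
            n--;
        return n;
    }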
Usually this information is supplied at compression time (take a look at 7-Zip's LZMA SDK, for example).
There is no way to know the actual size of the decompressed data (or the size of the part that is actually in use) with the information you have given.
If the decompression step doesn't give you the decompressed size as a return value or "out" parameter in some way, you can't.
There is no way to determine how much data was written in the buffer (outside of debugger/valgrind-type checks).
A complex way to answer this problem is by decompressing twice into an over-sized buffer.
In both cases, you need a "random" fill pattern. Starting from the end, you count the number of bytes that still match the pattern, and treat the point where the buffer first differs as the end of the decompressed sequence.
Or does it? Maybe, by chance, one of the final bytes of the decompressed sequence matches the random byte at that exact position. If so, the real decompressed size is larger than the detected one. If your pattern is truly random, the error should not be more than a few bytes.
For the second pass, fill the buffer again with a different random pattern. Ensure that, at each position, the new pattern has a different value than the old one. For speed, you don't have to fill the whole buffer: you can limit the new pattern to a few bytes before and some more bytes after the first detected end. 32 bytes should be enough, since it is improbable that so many bytes match the first pattern by chance.
Decompress a second time and again detect where the pattern differs. Take the larger of the two detected ends: that is your decompressed size.
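Here is a sketch of that two-pass idea. For simplicity it refills the whole buffer and uses the bitwise complement as the second pattern, which guarantees the two fills differ at every position, so a given data byte can match the pattern in at most one pass (decompress_into() again stands in for the real API call):

    #include <stddef.h>

    /* Cheap deterministic "random-looking" pattern; pass 1 is the
     * complement of pass 0. */
    static unsigned char pat(size_t i, int pass)
    {
        unsigned char p = (unsigned char)(i * 131u + 89u);
        return pass ? (unsigned char)~p : p;
    }

    size_t detect_size_two_pass(unsigned char *buf, size_t cap,
                                void (*decompress_into)(unsigned char *,
                                                        size_t))
    {
        size_t end[2];

        for (int pass = 0; pass < 2; pass++) {
            for (size_t i = 0; i < cap; i++)
                buf[i] = pat(i, pass);     /* fill with this pass's pattern */
            decompress_into(buf, cap);

            size_t n = cap;
            while (n > 0 && buf[n - 1] == pat(n - 1, pass))
                n--;                       /* trim bytes still matching */
            end[pass] = n;
        }
        /* A trailing data byte can collide with at most one of the two
         * patterns, so the larger detected end is the true size. */
        return end[0] > end[1] ? end[0] : end[1];
    }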
You should check how free() works for your compiler/OS and do the same. free() doesn't take the size of the malloc'd data, but it somehow knows how much to free, right? ;)
Usually the size is stored just before the allocated buffer, though I don't know exactly how many bytes before; again, it depends on the OS/arch/compiler.
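For what it's worth, glibc exposes this through malloc_usable_size(), so you don't have to poke at the allocator's private metadata yourself. But note that it reports the block's capacity, often rounded up past what you asked for, and says nothing about how many bytes you actually wrote, so it still can't answer the original decompression question:

    #include <malloc.h>   /* glibc-specific; not part of standard C */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        void *p = malloc(100);
        if (p == NULL)
            return 1;
        /* May print a value larger than 100 due to allocator rounding. */
        printf("usable: %zu bytes\n", malloc_usable_size(p));
        free(p);
        return 0;
    }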

Hash a byte string

I'm working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings into a structure in such a way that I can quickly check for their existence and retrieve them. I've been operating under the assumption that a hash table would be best suited for this purpose so my question will be pertaining to hash functions. However, if someone can suggest a better alternative to a hash table, I'm all ears.
All right. So the problem is that I can't come up with a good hashing key for these byte strings. Everything I think of either has a very uneven distribution or takes too long. Here is a list of the constraints I'm working with:
All byte strings will be at least two bytes in length.
The hash table will have a maximum size of 3839, and it is very likely it will fill.
Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
Otherwise, bytes in the string can be any value from 0 - 255 (I'm working with raw byte-data of any format).
I'm working with the C language in a UNIX environment. I'd prefer to stick with standard libraries, but it doesn't need to be portable to other OSs (i.e., unistd.h is fine).
Security is of NO concern.
Speed is of a HIGH concern.
The size isn't of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory space could become an issue during the compression.
A trie is better suited to this kind of thing because it lets you store your symbols as a tree and quickly parse it to match values (or reject them).
And as a bonus, you don't need a hash at all. You're storing/retrieving/comparing the entire sequence at once, while still only holding a minimal amount of memory.
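A minimal sketch of such a trie over raw bytes (the 256-way node layout is the simplest possible and deliberately wasteful; a real implementation would use a sparser node representation):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct trie_node {
        struct trie_node *child[256];
        int code;                 /* dictionary symbol, or -1 if none */
    } trie_node;

    static trie_node *node_new(void)
    {
        trie_node *n = calloc(1, sizeof *n);  /* zeroes all child slots */
        if (n != NULL)
            n->code = -1;
        return n;
    }

    /* Store 'len' bytes of 's' under dictionary symbol 'code'. */
    static void trie_insert(trie_node *root, const uint8_t *s, size_t len,
                            int code)
    {
        trie_node *n = root;
        for (size_t i = 0; i < len; i++) {
            if (n->child[s[i]] == NULL &&
                (n->child[s[i]] = node_new()) == NULL)
                return;           /* out of memory; sketch just gives up */
            n = n->child[s[i]];
        }
        n->code = code;
    }

    /* Walk as far as the trie matches; return the code of the longest
     * stored prefix of 's' (or -1) and its length in *matched. */
    static int trie_longest_prefix(const trie_node *root, const uint8_t *s,
                                   size_t len, size_t *matched)
    {
        int best = -1;
        const trie_node *n = root;

        *matched = 0;
        for (size_t i = 0; i < len && n != NULL; i++) {
            n = n->child[s[i]];
            if (n != NULL && n->code >= 0) {
                best = n->code;
                *matched = i + 1;
            }
        }
        return best;
    }

The longest-prefix walk is what makes this attractive for a compressor dictionary: a single pass over the input both finds the longest stored match and tells you exactly where a new, longer entry would be inserted.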
Edit: And as an additional bonus, with only a second parse, you can look up sequences that are "close" to your current sequence, so you can get rid of a sequence and use the previous one for both of them, with some internal notation to hold the differences. That will help you compress files better because:
a smaller dictionary means smaller files, since you have to write the dictionary to your file;
a smaller number of items can free up space to hold other, rarer sequences, if you add a population cap and hit it with a large file.
