Is GZIP compression output stable?

I need to store some chunks of data remotely and compare them to see if there are duplicates.
I will compile a specific C program, and I would like to compress these chunks with GZIP.
My question is: if I compress the same chunk of data with the same C program using a gzip library on different computers, will it give the exact same result or could it give different compressed results?
The target PCs/servers could run different Linux distributions such as Ubuntu, CentOS, Debian, etc.
Can I force the same result by statically linking a specific gzip library?

if I compress the same chunk of data with the same C program using a gzip library on different computers, will it give the exact same result or could it give different compressed results?
While it may be true in the majority of cases, I don't think you can safely make that assumption. The compressed output can differ depending on the default compression level and coding used by the library. For example, the GNU gzip tool uses LZ77, while OpenBSD's gzip uses compress (according to Wikipedia). I don't know whether this difference comes from different libraries or from different configurations of the same library, but nonetheless I would really avoid assuming that a generic chunk of gzipped data comes out exactly the same when compressed by different implementations.
Can I force the same result by statically linking a specific gzip library?
Yes, this could be a solution. Using the same version of the same library with the same configuration across different systems would give you the same compressed output.
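For illustration, here is a minimal sketch of what that could look like, assuming zlib as the statically linked library and an explicitly pinned compression level (both assumptions on my part, not something from the question). compress2() emits the zlib (RFC 1950) wrapper, which carries no timestamp or OS byte, so for a given library version, level and input the bytes should come out the same:

/* Minimal sketch: compress a chunk deterministically with zlib.
 * Assumes you pin one zlib version (e.g. by static linking) and one level. */
#include <stdlib.h>
#include <zlib.h>

int compress_chunk(const unsigned char *in, uLong in_len,
                   unsigned char **out, uLong *out_len)
{
    uLong bound = compressBound(in_len);   /* worst-case output size */
    unsigned char *buf = malloc(bound);
    if (!buf)
        return -1;

    *out_len = bound;
    /* Always pass the same explicit level; don't rely on the default
     * staying identical across library versions. */
    if (compress2(buf, out_len, in, in_len, 6) != Z_OK) {
        free(buf);
        return -1;
    }
    *out = buf;
    return 0;
}

If you specifically need the gzip wrapper rather than the zlib one, you could use deflateInit2() with windowBits + 16 and keep the gzip header fields (in particular the mtime) fixed, since a varying header timestamp is the usual source of instability.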
You could also avoid this problem in other ways:
Perform the compression on the server, and only send uncompressed data (this is probably not a good solution as sending uncompressed data is slow).
Use hashes of the uncompressed data, store them on the server, and have the client send a hash first, then the compressed data only if the server says the hash doesn't match (i.e. the chunk is not a duplicate). This also has the advantage of only needing to check the hash (and avoiding compression altogether if the hash matches).
Similar to option 2, use hashes of the uncompressed data, but always send compressed data to the server. The server then does decompression (which can be easily done in memory using a relatively small buffer) and hashes the uncompressed data to check if the received chunk is a duplicate before storing it.

No, not unless you are 100% certain you are using exactly the same version of the same source code with the same settings, and that you have disabled the modified timestamp in the gzip header.
It's not clear what you're doing with these compressed chunks, but if the idea is to have less to transmit and compare, then you can do far better with a hash. Use a SHA-256 on your uncompressed chunks, and then you can transmit and compare those in no time. The probability of an accidental match is so infinitesimally small, you'd have to wait for all the stars to go out to see one such occurrence.
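As a sketch of that approach (using OpenSSL's one-shot SHA256() helper purely as an example; any SHA-256 implementation works the same way, and linking with -lcrypto is assumed):

#include <stdio.h>
#include <openssl/sha.h>

/* Fingerprint an uncompressed chunk; the 32-byte digest is what you store,
 * transmit and compare instead of the chunk itself. */
void hash_chunk(const unsigned char *chunk, size_t len,
                unsigned char digest[SHA256_DIGEST_LENGTH])
{
    SHA256(chunk, len, digest);            /* one-shot convenience call */
}

void print_digest(const unsigned char digest[SHA256_DIGEST_LENGTH])
{
    for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
        printf("%02x", digest[i]);         /* hex form, e.g. as a lookup key */
    putchar('\n');
}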

Can I force the same result by statically linking a specific gzip library?
That's not enough; you also need the same compression level at the very least, as well as any other options your particular library might have (usually it's just the level).
If you use the same version of the library and the same compression level, then it's likely that the output will be identical (or stable, as you call it). That's not a very strong guarantee, however; I'd recommend using a hashing function instead, since that's what they're meant for.

Related

What is an efficient way of breaking data into chunks in C?

I've been searching for hours and my google-fu has failed me, so I thought I'd just ask. Is there an easy and efficient way of breaking data into small chunks in C?
For example: say I collect a bunch of info from somewhere (a database, a file, user input, whatever), then maybe use a serialization library or something to create a single large object in memory, and I have the pointer to this object. Let's say this object ends up being around 500 KB. If your goal was to break this down into 128-byte sections, what would you do? I would like a fairly general answer, whether you wanted to send these chunks over a network, store them in a bunch of little files, or pass them through a looped process or something. If there isn't a single simple approach for all of these, but there are approaches for specific use cases, that'd be cool to know too.
What has brought this question about: I've been learning about network sockets and protocols. I often see discussion about packet fragmentation and the like, and lots of talk about chunking things and sending them in smaller parts. But I can never seem to find what they use to do this before they move on to how they send it over the network, which seems like the easy part... So I started wondering how large data would manually be broken up into chunks to send small bits over the socket at a time. And here we are.
Thanks for any help!
Is there an easy and efficient way of breaking data into small chunks in C?
In practice, data is just a consecutive sequence of bytes.
You could use memmove to copy or move it and slice it into smaller chunks (e.g. of 1024 bytes each); for non-overlapping data, consider memcpy. In practice, a byte is usually a char (perhaps an unsigned char or a signed char), but see also the standard uint8_t type and related types. You can generally cast void* from or to char* on Von Neumann architectures (like x86 or RISC-V).
Beware of undefined behavior.
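As a rough sketch of the mechanics (handle_chunk and the 128-byte size are placeholders taken from the question, not a real API):

#include <stdio.h>
#include <stddef.h>

#define CHUNK_SIZE 128                      /* the size the question mentions */

/* Placeholder consumer: replace with a socket send, fwrite to a file, etc. */
static void handle_chunk(const unsigned char *chunk, size_t len)
{
    (void)chunk;
    printf("got a chunk of %zu bytes\n", len);
}

/* Walk a large in-memory buffer and hand it out in fixed-size pieces.
 * Often a pointer plus a length is enough; only memcpy a piece if the
 * consumer has to own its own copy of the bytes. */
void for_each_chunk(const unsigned char *data, size_t total)
{
    size_t offset = 0;
    while (offset < total) {
        size_t len = total - offset;
        if (len > CHUNK_SIZE)
            len = CHUNK_SIZE;               /* the last chunk may be shorter */
        handle_chunk(data + offset, len);
        offset += len;
    }
}

The same loop works whether the consumer writes to a socket, a set of small files, or a looped in-process routine.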
In practice I would recommend organizing data at a higher level.
If your operating system is Linux, Windows, macOS, or Android, you could consider using a database library such as SQLite (or indexed files à la Tokyo Cabinet). It is open source software, and it does such slicing at the disk level for you.
If you have no operating system and your C code is freestanding (read the C11 standard n1570 for the terminology), things become different. For example, a typical computer mouse contains a micro-controller whose code is mostly in C. Look into Arduino (and also the Raspberry Pi) for inspiration. You'll then have to handle data at the bit level.
But I can never seem to find what they use to do this before they move on to how they send it over the network, which seems like the easy part...
You'll find lots of open source network code.
The Linux kernel has some. FreeRTOS has some. FreeBSD has some. Xorg has some. Contiki has some. OSdev links to more resources (notably on github or gitlab). You could download such source code and study it.
You'll find many HTTP-related (libonion, libcurl, etc.) or SMTP-related (postfix, vmime, etc.) open source networking programs on Linux, and other network programs too (PostgreSQL, etc.). Study their source code.

How to make NFS support posix_fallocate?

My full-text search engine stores its indexed data on an NFS store.
Because of the frequent read/write I/O, I want to preallocate large contiguous disk space for each table file, and so I resorted to posix_fallocate.
On an NFS volume, my little demo failed: the posix_fallocate call came back with "EOPNOTSUPP".
Does the NFS protocol/specification cover the posix_fallocate scenario?
(While the title of the question mentions posix_fallocate, the definition of posix_fallocate means EOPNOTSUPP is not an error code it will return. It seems highly likely the questioner was actually using fallocate on Linux, as that CAN return EOPNOTSUPP.)
NFS can support fallocate on Linux (http://wiki.linux-nfs.org/wiki/index.php/Fallocate) but:
Your client and server have to be using NFS 4.2 or later.
An NFS 4.2+ server doesn't HAVE to implement fallocate - it's optional (see the posting Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661] in a libc-alpha mailing list thread).
Re posix_fallocate: you can never know whether a Linux glibc posix_fallocate call was real or emulated without careful observation or manual checking beforehand (e.g. by making a platform-native fallocate call and checking whether it failed or wasn't supported), and if you've done the native call, why did you need the posix_fallocate call? Further, different platforms have different calls for a native fallocate (or lack one completely), so if portability to non-Linux platforms is a concern you will need to write appropriate native fallocate wrappers for each platform. If you have to do preallocation even when the layer below you has no supported call for it, you will have to fall back to doing it by hand (e.g. via nice chunky writes, or hacks like glibc's posix_fallocate...).
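A Linux-flavoured sketch of that "try the native call first" idea (the function name and the fallback policy are illustrative, not a drop-in implementation):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Try the real, fast preallocation first; surface EOPNOTSUPP so the caller
 * can decide whether to emulate it (chunky writes / posix_fallocate) or skip. */
int preallocate(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, 0, offset, len) == 0)
        return 0;                           /* filesystem did it via metadata */

    if (errno == EOPNOTSUPP)
        return -EOPNOTSUPP;                 /* e.g. NFS older than 4.2 */

    return -errno;                          /* ENOSPC, EBADF, ... */
}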
Note: When it is possible to do a real pre-allocation (e.g. via fallocate) it can be DRAMATICALLY faster than having to do full writes because the filesystem can fulfill it by just setting appropriate metadata.

DB vs. filesystems - for non-image files and endianness

I've read the many discussions about Databases vs. file systems for storing files. Most of these discussions talk about images and media files. My question is:
1) Do the same arguments apply to storing .doc, .pdf, .xls, .txt? Is there anything special about document files I should be aware of?
2) If I store files in a database as binary, will there be endian issues if my host swaps machines? E.g., I insert into the database on a big-endian machine, it gets ported to a little-endian machine, and then I try to extract (e.g., write to a file, send it to my desktop, then try to open it).
Thanks for any guidance!
1) Yes, pretty much the same arguments apply to storing PDFs and whatnot... anything that's compressed also comes to mind.
Every file format that's non-text has to deal with the question of endianness if it wants to be portable across hosts of different endianness. Most formats do so by defining the endianness of every binary field within the file that is longer than one byte. Software that writes and reads the format then has to take special care to byte-swap iff it's running on a platform of the opposite endianness. Images are no different from other binary file formats. The choice is arbitrary, but big endian (network byte order) is a popular choice, especially with network software, because of the ubiquity of C macros that deal with this almost automatically.
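For a 32-bit field, that byte-swapping discipline boils down to something like this (a generic sketch using the htonl/ntohl pair those networking macros provide):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>                      /* htonl / ntohl */

/* Store a 32-bit value into a buffer in big-endian (network) byte order. */
void put_u32_be(unsigned char *buf, uint32_t value)
{
    uint32_t be = htonl(value);             /* host order -> big endian */
    memcpy(buf, &be, sizeof be);
}

/* Read it back into host order, regardless of the host's own endianness. */
uint32_t get_u32_be(const unsigned char *buf)
{
    uint32_t be;
    memcpy(&be, buf, sizeof be);
    return ntohl(be);                       /* big endian -> host order */
}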
Another way of defining binary file formats so that they are endian-portable is to support either endianness for binary fields, and include a marker in the header to say which one was used. On opening the file, readers consult the marker. That way the file can be read back slightly more efficiently on the same host where it was written or other hosts with the same endianness (which is the common case) while hosts of the opposite endianness need to expend a little bit more effort.
As for the database, assuming you are using a field type like a BLOB, you'll get back the very same byte stream when you read as the one you wrote, so you don't have to worry about the endianness of the database client or server.
2) That depends on the database. The database might use an underlying on-disk format that is compatible with any endianness, by defining its on-disk format as described above.
Databases aren't often aiming for portability of their underlying file formats though, considering (correctly) that moving the underlying data files to a database host of different endianness is rare. According to this answer, for example, MySQL's MyISAM is not endian-portable.
I don't think you need to worry about this too much though. If the database server is ever switched to a host of different endianness, ensuring that the data remains readable is an important step of the process and the DBA handling the task (perhaps yourself?) won't forget to do it, because if they do forget, then nothing will work (that is, the breakage won't be limited to binary BLOBs!)

Zip on-the-fly compression library in C for streaming

Is there a library for creating zip files (the zip file format, not gzip or any other compression format) on the fly (so I can start sending the file while it is still compressing) for very large files (4 GB and above)?
The compression ratio does not matter much (mostly media files).
The library has to have a C interface and work on Debian and OSX.
libarchive supports any format you want, works on the fly, and even supports in-memory files.
zlib supports compressing by chunks. You should be able to start sending a small chunk right after compressing it, while the library is still compressing the next chunk (see this example).
(unfortunately, the file table is stored at the end of the zip file, so the file will be unusable until it is complete on the receiver side)
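For the zlib-by-chunks part, the core loop looks roughly like this (a condensed sketch along the lines of zlib's own zpipe example; write_out and the buffer size are placeholders, and this produces a zlib-wrapped stream, so a real ZIP entry would use raw deflate via deflateInit2() with negative windowBits plus the ZIP local headers and central directory written around it):

#include <stddef.h>
#include <stdio.h>
#include <zlib.h>

/* Placeholder sink: in practice this would write to a socket or pipe. */
static int write_out(const unsigned char *buf, size_t len)
{
    return fwrite(buf, 1, len, stdout) == len ? 0 : -1;
}

/* Compress one input chunk and push output as soon as it is produced, so the
 * receiver can start getting data while later chunks are still compressing.
 * zs must have been set up with deflateInit()/deflateInit2() beforehand, and
 * deflateEnd() is called after the last chunk (last_chunk != 0). */
int stream_compress_chunk(z_stream *zs, const unsigned char *in, size_t in_len,
                          int last_chunk)
{
    unsigned char out[16384];

    zs->next_in  = (Bytef *)in;
    zs->avail_in = (uInt)in_len;

    do {
        zs->next_out  = out;
        zs->avail_out = sizeof out;
        if (deflate(zs, last_chunk ? Z_FINISH : Z_NO_FLUSH) == Z_STREAM_ERROR)
            return -1;
        if (write_out(out, sizeof out - zs->avail_out) != 0)
            return -1;
    } while (zs->avail_out == 0);

    return 0;
}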
While this question is old and already answered, I will note a new potential solution for those who find this.
I needed something very similar: a portable and very small library that creates ZIP archives in a streaming fashion in C. Not finding anything that fit the bill, I created one that uses zlib, available here:
https://github.com/CTrabant/fdzipstream
That code only depends on zlib and essentially provides a simple interface for creating ZIP archives. Most importantly (for me), the output can be streamed to a pipe, socket, or whatever, as the output stream does not need to be seekable. The code is very small: a single source file and a header file. It works on OSX and Linux and probably elsewhere. Hope it helps someone beyond just me...

Reading complex binary file formats

Is there any book or tutorial that can teach me how to read binary files with a complex structure? I have made a lot of attempts at a program that has to read a complex file format and save it in a struct, but they always failed because of heap overruns etc. that made the program crash.
Probably your best bet is to look for information on binary network protocols rather than file formats. The main issues (byte order, structure packing, serializing and unserializing pointers, ...) are the same but networking people tend to be more aware of the issues and more explicit in how they are handled. Reading and writing a blob of binary to or from a wire really isn't much different than dealing with binary blobs on disk.
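One concrete habit that helps: read the file header field by field into explicit-width integers rather than fread()ing a packed struct, so structure padding and host byte order never enter the picture. A sketch, with a made-up header layout just for illustration:

#include <stdint.h>
#include <stdio.h>

/* A made-up header: 4-byte magic, 2-byte version, 4-byte payload length,
 * all stored big-endian on disk (example layout only). */
struct header {
    uint32_t magic;
    uint16_t version;
    uint32_t payload_len;
};

static uint32_t read_u32_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static uint16_t read_u16_be(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

int read_header(FILE *f, struct header *h)
{
    unsigned char raw[10];                  /* 4 + 2 + 4 bytes on disk */
    if (fread(raw, 1, sizeof raw, f) != sizeof raw)
        return -1;                          /* short read: truncated file */
    h->magic       = read_u32_be(raw);
    h->version     = read_u16_be(raw + 4);
    h->payload_len = read_u32_be(raw + 6);
    return 0;
}

Checking the magic and sanity-checking payload_len before allocating or reading that many bytes is what keeps the heap overruns away.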
You could also find a lot of existing examples in open source graphics packages (such as netpbm or The Gimp). An open source office package (such as LibreOffice) would also give you lots of example code that deals with complex and convoluted binary formats.
There might even be something of use for you in Google's Protocol Buffers or old-school ONC RPC and XDR.
I don't know any books or manuals on such things but maybe a bunch of real life working examples will be more useful to you than a HOWTO guide.
One of the best tools to debug memory access problems is valgrind. I'd give that a try next time. As for books, you'd need to be more specific about what formats you want to parse. There are lots of formats and many of them are radically different from each other.
Check out Flavor. It allows you to specify the format using a C-like structure and will auto-generate the parser for the data in C++ or Java.
