deflateSetDictionary usage - zlib

It appears that deflateSetDictionary() can be called (a) only once when using the zlib format, or (b) multiple times when using the raw deflate format. Is my understanding correct?
With the raw option and multiple dictionary calls, how would the compressor and decompressor agree on when to make the dictionary calls? I am imagining that they would agree in advance on some block size: the compressor would flush (one of the Z_*_FLUSH options) at each block boundary and then insert a dictionary, and the decompressor would stop at each block boundary with Z_BLOCK and insert the same dictionary.
What is the typical use case, (a) or (b)?
Here is the zlib.h excerpt: https://github.com/madler/zlib/blob/master/zlib.h#L618

Correct.
It is entirely up to you when to insert dictionaries, and how to indicate that in the data stream. There is no typical use case. As far as I know, zlib dictionaries are rarely used.
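As an illustration of case (b), below is a minimal sketch of setting a dictionary on a raw deflate stream and on the matching raw inflate stream. The single-shot deflate()/inflate() calls, caller-sized buffers, and thin error handling are simplifying assumptions, not a general-purpose implementation.

#include <string.h>
#include <zlib.h>

/* Compress src into dst using raw deflate with a preset dictionary.
   Returns 0 on success; *dstlen is updated to the compressed size. */
static int raw_deflate_with_dict(const unsigned char *dict, unsigned dictlen,
                                 const unsigned char *src, unsigned srclen,
                                 unsigned char *dst, unsigned long *dstlen)
{
    z_stream s;
    memset(&s, 0, sizeof(s));
    /* negative windowBits (-15) selects the raw deflate format */
    if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -15, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;
    /* in raw mode the dictionary may also be reset later, at any deflate
       block boundary -- here it is set once, before any input */
    deflateSetDictionary(&s, dict, dictlen);
    s.next_in  = (z_const Bytef *)src;  s.avail_in  = srclen;
    s.next_out = dst;                   s.avail_out = (uInt)*dstlen;
    if (deflate(&s, Z_FINISH) != Z_STREAM_END) { deflateEnd(&s); return -1; }
    *dstlen = s.total_out;
    return deflateEnd(&s) == Z_OK ? 0 : -1;
}

/* Decompress with raw inflate; the decompressor must supply the same
   dictionary, since a raw stream carries no Adler-32 dictionary check. */
static int raw_inflate_with_dict(const unsigned char *dict, unsigned dictlen,
                                 const unsigned char *src, unsigned srclen,
                                 unsigned char *dst, unsigned long *dstlen)
{
    z_stream s;
    memset(&s, 0, sizeof(s));
    if (inflateInit2(&s, -15) != Z_OK)
        return -1;
    /* for raw inflate, the dictionary can be set at any time,
       including right after init */
    inflateSetDictionary(&s, dict, dictlen);
    s.next_in  = (z_const Bytef *)src;  s.avail_in  = srclen;
    s.next_out = dst;                   s.avail_out = (uInt)*dstlen;
    if (inflate(&s, Z_FINISH) != Z_STREAM_END) { inflateEnd(&s); return -1; }
    *dstlen = s.total_out;
    return inflateEnd(&s) == Z_OK ? 0 : -1;
}

Because a raw stream has no built-in dictionary identifier, the two sides must agree out of band on which dictionary to use and when, which is exactly the coordination problem the question describes.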


Deflate format: differences between block types

I am currently trying to write a compressor and decompressor that follow the RFC's deflate specification.
I'm not able to understand the difference between how blocks are composed when compressing with fixed tables versus dynamic tables. The file is processed by LZ77, generating (distance, length) pairs plus literals.
How do I know the type of a block?
Do I have to compress this data?
Given that I use fixed compression and don't have to send the tables, how would the encoder know how to encode the data?
Moreover, do I have to send any data before the actual compression executes?
I am confused about the difference between the fixed tables and the tables we send in dynamic mode, and about how the two block types use them to encode data.
I'm currently reading Data Compression: The Complete Reference. Any advice will be helpful.
Each deflate block begins with a 3-bit header: one bit marking whether it is the last block, and two bits giving the block type (00 = stored, 01 = fixed codes, 10 = dynamic codes); that is how the decoder knows the type. Since you are trying to compress, you would pick the smaller of the two. zlib's deflate computes what the size of a fixed block, a dynamic block, and a stored block would be, and emits the smallest of the three.
If you are encoding a fixed block, you encode using the fixed code for literals/lengths and distances. This code is provided in the RFC.
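For reference, here is a sketch (the function name is made up) of the fixed code lengths exactly as RFC 1951, section 3.2.6 defines them; deriving the canonical Huffman codes from these lengths works the same way as for a dynamic block:

/* Fixed deflate code lengths from RFC 1951, section 3.2.6. */
static void fixed_code_lengths(unsigned char litlen[288], unsigned char dist[32])
{
    int i;
    /* literal/length symbols */
    for (i = 0;   i <= 143; i++) litlen[i] = 8;
    for (i = 144; i <= 255; i++) litlen[i] = 9;
    for (i = 256; i <= 279; i++) litlen[i] = 7;
    for (i = 280; i <= 287; i++) litlen[i] = 8;
    /* all 32 distance symbols are coded with 5 bits
       (symbols 30 and 31 can never occur in valid data) */
    for (i = 0; i < 32; i++) dist[i] = 5;
}

The point is that the fixed tables are baked into both the encoder and the decoder, so nothing has to be sent; a dynamic block instead transmits its (compressed) code lengths before the data.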

Find and replace data on gzip content efficiently

My C, Linux-based program's inputs are:
char *in_str, char *find_str, char *replacing_str
in_str is compressed (gzip) data.
The program needs to search for find_str within the uncompressed input data, replace it with replacing_str, and then recompress the data.
The trivial way to do this is to use one of the many available gzip compress/uncompress libraries to uncompress the data, manipulate the uncompressed data, and then recompress the output. However, I need to make it as efficient as possible (it is a real-time program).
I wonder whether it is more efficient to use an on-the-fly library (e.g. zlibc) or simply to do the operation as described above.
Maybe it is important to mention that:
the find_str and replacing_str strings are a small portion of the data
their lengths are not equal
find_str is expected to appear about 4 or 5 times
the uncompressed data length is ~2K - 6K bytes
Is anyone familiar with an efficient way to implement this?
Thanks
You are going to have to decompress no matter what, in order to search for the strings. (You might be able to get away with doing that only once and building an index. However, the index might be much larger than the uncompressed data, so you might as well just store the data uncompressed instead.)
You can avoid recompressing all of it by preparing the gzip file ahead of time to be compressed in smaller, history-less units using, for example, the Z_FULL_FLUSH option of zlib. This will reduce compression slightly, depending on how often you flush, but will greatly speed up building the output when only one of many blocks needs to be recompressed.
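A minimal sketch of that preparation step, assuming the whole input fits in memory and the output buffer is large enough (the 1K unit size and function name are illustrative choices, and error handling is elided):

#include <string.h>
#include <zlib.h>

#define UNIT 1024  /* assumed segment size; tune for your data */

/* Compress src in UNIT-sized pieces, ending each with Z_FULL_FLUSH so
   that every piece can later be replaced and recompressed without
   touching the others. windowBits 15+16 selects the gzip wrapper. */
static int gzip_in_units(const unsigned char *src, unsigned long srclen,
                         unsigned char *dst, unsigned long *dstlen)
{
    z_stream s;
    unsigned long pos = 0;
    memset(&s, 0, sizeof(s));
    if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;
    s.next_out = dst;
    s.avail_out = (uInt)*dstlen;
    while (pos < srclen) {
        uInt n = srclen - pos > UNIT ? UNIT : (uInt)(srclen - pos);
        int last = (pos + n == srclen);
        s.next_in = (z_const Bytef *)(src + pos);
        s.avail_in = n;
        /* Z_FULL_FLUSH empties the sliding window, cutting the history
           dependency between this unit and the next */
        deflate(&s, last ? Z_FINISH : Z_FULL_FLUSH);
        pos += n;
    }
    *dstlen = s.total_out;
    return deflateEnd(&s) == Z_OK ? 0 : -1;
}

At replacement time you re-deflate only the unit whose text changed and copy the untouched compressed units through verbatim; the gzip trailer's CRC-32 still has to be recomputed over the full replaced text, but that is cheap compared to deflating it.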

zlib compression: big buffer vs. many small?

I'm compressing a data structure that has many fields. Which is the better approach: to use gzwrite to compress and write each of the fields to the file, or to write all of the fields to a buffer and compress that?
Separate calls of gzwrite won't make the fields' compression separate: they'll end up in a single compressed stream, as if you had written them with one call. If you were to gzclose and reopen in between, then there would be a difference.
(I think you know the tradeoffs for separate streams vs. a single stream: with a single one, compression is better, but you are unable to decompress only the fields you need. But again, there is no such tradeoff in your question: call gzwrite as it's convenient for you; the result will be the same.)
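A sketch of that equivalence using zlib's gzFile interface (the field contents and file name are invented for the example): both variants feed exactly the same bytes into the same compressed stream, so the two resulting files are identical.

#include <string.h>
#include <zlib.h>

/* Write two "fields" either with one gzwrite per field or with a single
   gzwrite of a concatenated buffer; the compressed output is the same. */
static int write_fields(const char *path, int per_field)
{
    const char a[] = "field-one";
    const char b[] = "field-two";
    gzFile f = gzopen(path, "wb");
    if (f == NULL)
        return -1;
    if (per_field) {
        gzwrite(f, a, sizeof(a));
        gzwrite(f, b, sizeof(b));
    } else {
        char buf[sizeof(a) + sizeof(b)];
        memcpy(buf, a, sizeof(a));
        memcpy(buf + sizeof(a), b, sizeof(b));
        gzwrite(f, buf, sizeof(buf));
    }
    return gzclose(f);
}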

Reading tag data for Ogg/Flac files

I'm working on a C library that reads tag information from music files. I've already got ID3v2 taken care of, but I can't figure out how Ogg files are structured.
I opened a .ogg file in a hex editor and I could find the tag data, because that was all human readable. But everything from the beginning of the file to the tag data looked like garbage. How is this data encoded?
I don't need any help with the actual code, I just need help visualizing what an Ogg header looks like and what encoding it uses so that I can read it. I'd like to use a non-hacky approach to reading Ogg files.
I've been looking at the FLAC format, which has been helpful.
The FLAC file I'm looking at has about 350 bytes between the "fLaC" identifier and the human-readable comments section, and none of it is human readable in my hex editor, so I'm sure there has to be something important in there.
I'm using Linux, and I have no intention of porting to Windows or OS X. So if I need to use a glibc-only function to convert the encoding, I'm fine with that.
The Ogg file format is documented here. There is a very nice graphical visualization as you requested with a detailed written description.
You may also want to look at libogg, which is an open-source, BSD-licensed library for reading and writing Ogg files.
As is described in the link you provided, the following metadata blocks can occur between the "fLaC" marker and the VORBIS_COMMENT metadata block.
STREAMINFO: This block has information about the whole stream, like sample rate, number of channels, total number of samples, etc. It must be present as the first metadata block in the stream. Other metadata blocks may follow, and ones that the decoder doesn't understand, it will skip.
APPLICATION: This block is for use by third-party applications. The only mandatory field is a 32-bit identifier. This ID is granted upon request to an application by the FLAC maintainers. The remainder of the block is defined by the registered application. Visit the registration page if you would like to register an ID for your application with FLAC.
PADDING: This block allows for an arbitrary amount of padding. The contents of a PADDING block have no meaning. This block is useful when it is known that metadata will be edited after encoding; the user can instruct the encoder to reserve a PADDING block of sufficient size so that when metadata is added, it will simply overwrite the padding (which is relatively quick) instead of having to insert it into the right place in the existing file (which would normally require rewriting the entire file).
SEEKTABLE: This is an optional block for storing seek points. It is possible to seek to any given sample in a FLAC stream without a seek table, but the delay can be unpredictable since the bitrate may vary widely within a stream. By adding seek points to a stream, this delay can be significantly reduced. Each seek point takes 18 bytes, so 1% resolution within a stream adds less than 2k. There can be only one SEEKTABLE in a stream, but the table can have any number of seek points. There is also a special 'placeholder' seekpoint which will be ignored by decoders but which can be used to reserve space for future seek point insertion.
Just after the above description, there's also the specification of the format of each of those blocks. The link also says
All numbers used in a FLAC bitstream are integers; there are no floating-point representations. All numbers are big-endian coded. All numbers are unsigned unless otherwise specified.
So, what are you missing? You say
I'd like a non-hacky approach to reading Ogg files.
Why rewrite a library to do that when libraries already exist?
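To make those 350 mystery bytes concrete, here is a sketch of walking the metadata blocks that follow the "fLaC" marker. Each block starts with a 4-byte header (1 bit last-block flag, 7 bits block type, 24-bit big-endian length), and VORBIS_COMMENT is block type 4. The function name is made up and error handling is minimal.

#include <stdio.h>
#include <string.h>

/* Walk the FLAC metadata blocks and return the file offset where the
   VORBIS_COMMENT data (block type 4) starts, or -1 if not found. */
static long find_vorbis_comment(FILE *f)
{
    unsigned char hdr[4];
    if (fread(hdr, 1, 4, f) != 4 || memcmp(hdr, "fLaC", 4) != 0)
        return -1;                   /* not a FLAC file */
    for (;;) {
        unsigned long len;
        int last, type;
        if (fread(hdr, 1, 4, f) != 4)
            return -1;
        last = hdr[0] >> 7;          /* high bit: last metadata block */
        type = hdr[0] & 0x7f;        /* 0=STREAMINFO ... 4=VORBIS_COMMENT */
        len  = ((unsigned long)hdr[1] << 16) | (hdr[2] << 8) | hdr[3];
        if (type == 4)
            return ftell(f);         /* comment data starts here */
        if (last)
            return -1;               /* no comment block present */
        fseek(f, (long)len, SEEK_CUR);
    }
}

One wrinkle worth knowing: although FLAC numbers are big-endian, the lengths inside the VORBIS_COMMENT block itself are little-endian, because that block reuses the Vorbis comment layout unchanged.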

Writing more to a file than just plain text

I have always been able to read and write basic text files in C++, but so far no one has discussed much more than that.
My question is this:
If developing a file type by myself for use by an application I also create, how would I go about writing the data to a file while preserving the layout, formatting, etc.? Are there any standards, or does it just depend on the creativity of the programmer?
You basically have to come up with your own file format and write binary data.
You can also serialize your object model and write the output to a file, but that's usually less efficient.
It is better to use an existing database, or XML (or another format) for simple needs. If you want to write a file in a format that already exists, find a library that supports it.
You have to know the binary file format for the file you are trying to create. Consider Joel's post on this topic: the Excel 97-2003 binary file format alone is a 349-page spec.
Nearly all the time, to do something like that, you use an API to avoid the grunt work. Be careful, however, because figuring out "what works" by trial and error can result in an upgrade of the program breaking your code. Plus you have to take into account other operating systems, minor version differences, patches, etc.
There are a number of standards, of course. The likely one to use is some flavor of XML, since libraries and tools already exist to help you work with it, but nothing is stopping you from inventing your own.
Well, you could store the data in a format you could read but which maintains the integrity of your data (XML or JSON, for instance).
Or (shudder) you could come up with your own proprietary binary format and use that.
You would go at it exactly the same way as you would a text file: writing your data byte by byte, encoded in such a way that when you read the file you know what you are reading.
For a spreadsheet application you could even use a text format (OOXML, OpenDocument) to store presentation and content information.
Or you could define binary data structures and write those directly to the file.
The choice between a text and a binary format depends on the application. For a configuration file you may prefer a text file that can be modified outside your app; for a database you will most likely choose a binary format for performance reasons.
See wotsit.org for information on file formats for various file types. For example, you can find out exactly how a .BMP file is composed and how to write one out.
Writing to a database can be done by using a wrapper class in your language, mainly passing it SQL commands.
If you create a binary file, you can write any file to it. The only drawback is that you have to know exactly where it starts and where it ends.
Use XML (something open, descriptive, and validatable) and stick with text. There are standards for this sort of thing as well, including ODF.
You can open the file as binary instead of text (how one does this depends somewhat on the platform); from there you can write the data directly out to disk. The only real caveat is endianness, which can become an issue when moving the files from one architecture to another (x86 to PPC, for instance).
Writing binary data to disk is really no harder than writing text, and really, your creativity is key for how you store the data.
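A sketch of the endianness caveat: write multi-byte values one byte at a time in a fixed order, so the file means the same thing on x86 and PPC alike (the file name is invented):

#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value as four explicit little-endian bytes, so the
   on-disk layout no longer depends on the host architecture. */
static int put_u32_le(FILE *f, uint32_t v)
{
    unsigned char b[4];
    b[0] = (unsigned char)(v & 0xff);
    b[1] = (unsigned char)((v >> 8) & 0xff);
    b[2] = (unsigned char)((v >> 16) & 0xff);
    b[3] = (unsigned char)((v >> 24) & 0xff);
    return fwrite(b, 1, 4, f) == 4 ? 0 : -1;
}

int main(void)
{
    FILE *f = fopen("data.bin", "wb");  /* "b" selects binary mode */
    if (f == NULL)
        return 1;
    put_u32_le(f, 0xDEADBEEF);          /* same bytes on any host */
    return fclose(f);
}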
The general problem is usually referred to as serialization of your application state and in your case with a source/target of a file in whatever format makes sense for you. These days the preferred input/output format is XML, and you may want to look into the existing standards in this field. The problem then becomes how do I map from the state of my system to the particular schema. Boost has a serialization framework that you may want to check out.
/Allan
There are a variety of approaches you can take, but in general you'll want some sort of serialization library. Boost.Serialization or Google's Protocol Buffers are good examples. The basic idea is that you have memory structures (classes and objects) that represent your data, and you want to write that data to a file in a way that can be used to reconstruct those structures again.
If you're hesitant to use a library, you can do it all manually, but realize that you can end up writing a lot of redundant code, or developing your own library. See fopen, fread, fwrite and fclose for a starting point.
A typical binary file format for custom data is an "indexed file format" consisting of
-------
|index|
-------
|data |
-------
Where the index contains records "pointing" to the data.
The index consists of records containing an offset and a size. The offset tells you where in the file the data is stored and the size tells you the size of the data at that offset (i.e. the number of bytes to read).
typedef struct {
    size_t offset;   /* where in the file this record's data begins */
    size_t size;     /* number of bytes to read at that offset */
} Index;

typedef struct {
    int ID;
    char First[20];
    char Last[20];
    char *RandomInfo; /* variable-length part: store its length and bytes,
                         not the pointer value itself */
} Data;
Suppose you want to store 50 records in the file: you would create 50 index structures and 50 data structures. The 50 index structures would be written to the file first, followed by the 50 data structures.
To read the file you would read in the 50 index structures; then, from the read-in index structures, you can tell where to seek to read each data record.
Look up fopen, fread, fwrite, fclose, and ftell for functions to read/write the data.
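Under the same assumptions, a sketch of the read-back side, reusing the Index struct above (the record count and function name are illustrative):

#include <stdio.h>
#include <stdlib.h>

#define NRECORDS 50

/* Read the index array from the front of the file, then seek to and
   read the bytes of one record. The caller frees the returned buffer. */
static unsigned char *read_record(FILE *f, int which, size_t *size)
{
    Index idx[NRECORDS];
    unsigned char *buf;
    if (fread(idx, sizeof(Index), NRECORDS, f) != NRECORDS)
        return NULL;
    if (fseek(f, (long)idx[which].offset, SEEK_SET) != 0)
        return NULL;
    buf = malloc(idx[which].size);
    if (buf == NULL)
        return NULL;
    if (fread(buf, 1, idx[which].size, f) != idx[which].size) {
        free(buf);
        return NULL;
    }
    *size = idx[which].size;
    return buf;
}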
You usually use a third-party library for these things. For example, you would link in a database library for, say, Oracle that would allow you to talk to the database. Because the underlying file types (i.e. Excel spreadsheet vs. OpenOffice, Oracle vs. MySQL, etc.) differ, these libraries abstract away your need to care how the file is constructed.
Hope that helps you find what you're looking for!
1985 called, and said they have some help IFF you are willing to read up. The interchange file format is still in use today and provides some basic metadata around binary files, such as RIFF or WAV audio. (Unfortunately, TIFF is a false friend.) It allegedly even inspired PNG, so it can't be that bad.
