Library to compress text data and store it as text - zlib

I want to store web pages in compressed text files (CSV). To achieve optimal compression, I would like to provide a set of 1000 web pages. The library should then spend some time creating the optimal "dictionary" for this content. One obvious "dictionary" entry could be <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">, which could get stored as %1 or something like that because it is present on almost all web pages. By creating a customized dictionary like this, the compression rate should be around 99% in my case.
My question is: does a library for doing this, with MIT or similarly liberal licensing, exist on Windows? If not, are there any general-purpose compression libraries you would recommend? I have tried zlib a bit, but it outputs binary data. If I converted this binary data into text, I am worried that the result might be longer than the original text.
EDIT: I need to be able to store the text in CSV files and still be able to import them into a database or even Excel.

"text files (not binary)" is a little too general. If you mean that some
byte values (00,1A or whatever) can't be used, then any binary method +
something like base64 coding can be used. (Although I'd suggest a more efficient method
from Coroutine demo source).
To be specific, you can use any general-purpose compressor to compress your base file, then the base file + target file, then diff these, and you'd get dictionary compression (binary), which can then be converted to "text" with base64 or yenc or whatever.
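As a concrete illustration of the simpler route the question already hints at: zlib itself supports a preset dictionary via deflateSetDictionary. A minimal C++ sketch; the dictionary string is just an illustrative guess at phrases common to the stored pages, and the output is still binary, so it needs base64 (or similar) before it can go into a CSV cell:

#include <zlib.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical dictionary: phrases expected to recur across the stored pages.
static const std::string kDict =
    "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" "
    "\"http://www.w3.org/TR/html4/strict.dtd\"><html><head><title>";

std::vector<unsigned char> compressWithDict(const std::string& text)
{
    z_stream zs{};                                   // zalloc/zfree/opaque start out as Z_NULL
    if (deflateInit(&zs, Z_BEST_COMPRESSION) != Z_OK)
        throw std::runtime_error("deflateInit failed");
    deflateSetDictionary(&zs,
                         reinterpret_cast<const Bytef*>(kDict.data()),
                         static_cast<uInt>(kDict.size()));

    std::vector<unsigned char> out(deflateBound(&zs, static_cast<uLong>(text.size())));
    zs.next_in   = reinterpret_cast<Bytef*>(const_cast<char*>(text.data()));
    zs.avail_in  = static_cast<uInt>(text.size());
    zs.next_out  = out.data();
    zs.avail_out = static_cast<uInt>(out.size());
    deflate(&zs, Z_FINISH);                          // single shot; buffer sized by deflateBound
    out.resize(zs.total_out);
    deflateEnd(&zs);
    return out;                                      // binary: base64-encode before writing to CSV
}

On decompression, inflate returns Z_NEED_DICT and the exact same dictionary must then be supplied via inflateSetDictionary, so whatever imports the CSV needs the dictionary too.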
Alternatively, there are some coders with built-in support for that, for example
http://compression.ru/ds/ppmtrain.rar
http://code.google.com/p/lzham/
If you actually want to have common phrases replaced with references, and
all other things left untouched (which is kind of implied, but not the same as "text output"),
you can use text preprocessors like:
http://xwrt.sourceforge.net/
http://compression.ru/ds/liptify.rar
(There were more, AFAIR.)
Also, a hybrid method is possible. You can use a general-purpose LZ compressor as in [1], for example LZMA, then replace its entropy coding with something text-based.
For example, http://nishi.dreamhosters.com/u/lzmarec_v1_bin.rar contains a utility which removes LZMA's entropy coding, and it's pretty easy to convert its output to text.

Related

How to use/configure Win10 Explorer's "search" to actually let me find programming artefacts?

I find the default behaviour of Explorer's search functionality so annoying that I cannot help but wonder what this punishment is meant for.
My typical use case is that I want to find all the files that some literal text appears in (considering filenames as well as content). One would think that it cannot get any simpler than that! I want NO restriction on the file types and just want to search for a bloody string like "foo.h" or "sin(". I do NOT want to use extra prefixes (like "name:=foo.h", which works for the filename but not for the content).
Is there a way to configure Explorer's default behavior to do just that? (I'd prefer to completely turn off the "smart" garbage and put it in a simple full-text search mode - I don't care about whatever useless indexing Windows may or may not have done; my SSD is fast enough to rescan all the raw data that I want to search.) Or do I have to keep using 3rd-party tools that actually know what a bloody plain-text search is supposed to do? What would be the best tool to use as a workaround, if necessary?

a clear understanding of file, file encoding, file format

I lack a clear understanding of the concepts of file, file encoding and file format. Google helped up to a point.
From what I understand so far, all files are binary, i.e., each byte in such a file can contain any of the 256 possible bit patterns. ASCII files (and here's where we get to the encoding part) are a subset of binary files, where each byte uses only 7 bits.
And here's where things get mixed up. A file format seems to be a way to interpret the bytes in a file, and file extensions seem to be one of the most used ways of identifying a file format.
Does this mean there are formats defined for binary files and formats defined for ASCII files? Are formats like xml, pdf, doc, rtf, html, xls, sql, tex, java, cs "referring" to ASCII files? Whereas formats like jpg, mp3, avi, eps, obj, out, dll are a clue that we're talking about binary files?
I don't think you can talk about ASCII and BINARY files, but rather about TEXT and BINARY files.
In that sense, these are text files: XML, HTML, RTF, SQL, TEXT, JAVA, CSS, EPS.
And these are binary files: PDF, DOC, XLS, JPG, MP3, AVI, OBJ, DLL.
ASCII is just a table of characters used in the early days of computing to represent text, but it is nowadays somewhat discouraged, since it can't represent text in languages such as Chinese, Arabic, Spanish (words with ñ, Ñ, accents), French and others. Nowadays other CHARACTER REPRESENTATIONS are encouraged instead of ASCII. The most well known is probably UTF-8. But there are others, like ISO-8859-1, ISO-8859-3 and such. Take a look at this article by Joel Spolsky talking about UNICODE. It's very enlightening.
File formats are a very different issue. File formats are protocols which programs agree on to represent information. In that sense, a JPG file is an image that has a certain (well-known) internal format that allows programs (browsers, spreadsheets, word processors) to use it as an image.
Text files also have formats (i.e., there are specifications for text files like XML and HTML). Their format, as with JPG and other binary files, permits applications to use them in a coherent and specific way to achieve something: e.g., render a WEB PAGE (the HTML and XHTML file formats).
The actual way the file is stored on the hard drive is defined by the OS. The actual content of the file can be described as an array of bytes, each of which can hold any of 256 possible values.
Text files either use the 128-character ASCII set (or an 8-bit extension of it), in which case you can read them easily, or a wider character set, in which case only suitable apps can read them.
The rest, what you might call binary (and any other format which is "unreadable" by "text" viewers), are formats designed to be read by certain other apps or by the OS.
If it's an executable, the OS can read and execute it; others, like JPG, are designed to be "understood" by photo viewers, etc.
This is an old question but still very relevant. I was confused by this as well, and asked around for clarification. Here's the summary (hope it helps someone):
Format: File/record format is the way data is represented. You might use CSV, TSV, JSON, Apache log format, Thrift format, Protobuf format, etc. to represent your data. The format is responsible for ensuring the data is structured properly and correctly represented. Ex: when you read a JSON file, you should get nested key-value pairs; that guarantee is always present.
{
    "story": {
        "title": "beauty and the beast"
    }
}
Encoding: Encoding basically transforms your data (in any format, or plain text) into a specific scheme. Now, what is this scheme? The scheme is specific to the purpose of the encoding. For example, while transferring data over the wire (the internet), we want to make sure the above example JSON reaches the other side correctly and is not corrupted. To ensure this, we add some meta info, like a checksum, that can be used to verify the data's correctness. Other uses of encoding involve shortening data, exchanging secrets, etc.
Base64 encoding of the above JSON example:
ew0KICAgICAgICAic3RvcnkiOiB7DQogICAgICAgICAgICAidGl0bGUiOiAiYmVhdXR5IGFuZCB0aGUgYmVhc3QiDQogICAgICAgIH0NCn0=
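For illustration, here is a minimal base64 encoder sketch (standard alphabet, "=" padding) showing the kind of transformation that produced the string above; it is not tied to any particular library:

#include <string>
#include <vector>

// Minimal base64 encoder: packs 3 input bytes into 4 output characters.
std::string base64Encode(const std::vector<unsigned char>& in)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((in.size() + 2) / 3 * 4);
    for (size_t i = 0; i < in.size(); i += 3) {
        unsigned int n = in[i] << 16;
        if (i + 1 < in.size()) n |= in[i + 1] << 8;
        if (i + 2 < in.size()) n |= in[i + 2];
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += (i + 1 < in.size()) ? tbl[(n >> 6) & 63] : '=';   // pad when input is short
        out += (i + 2 < in.size()) ? tbl[n & 63] : '=';
    }
    return out;
}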
I think it is worth noting that with media files, MPEG and the like are media codecs. They describe how digital data can express visual and audio content. They are generally housed in a media container such as an AVI file, which is really a RIFF file type intended for media.

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean for some idealized general purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness reasons; encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size), which is sometimes prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
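Here is a sketch of that binary search over sorted, fixed-length records; the Record layout and key field are made up for illustration, and only O(log n) records are ever read from disk:

#include <cstdint>
#include <cstdio>

#pragma pack(push, 1)              // assume tightly packed 64-byte records
struct Record {
    std::uint32_t key;             // file is sorted by this key
    char payload[60];
};
#pragma pack(pop)

bool findRecord(std::FILE* f, std::uint32_t wanted, Record& out)
{
    std::fseek(f, 0, SEEK_END);
    long count = std::ftell(f) / long(sizeof(Record));
    long lo = 0, hi = count - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        std::fseek(f, mid * long(sizeof(Record)), SEEK_SET);   // jump straight to record 'mid'
        if (std::fread(&out, sizeof(Record), 1, f) != 1) return false;
        if (out.key == wanted) return true;
        if (out.key < wanted) lo = mid + 1; else hi = mid - 1;
    }
    return false;
}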
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes, you'll have two binary files, one with data, and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this. It defines an index format, a pack format, and an object format that all work together to save the history of your project in a readily accessible, but compact way.

What should I know before poking around an unknown archive file for things?

A game that I play stores all of its data in a .DAT file. There has been some work done by people in examining the file. There are also some existing tools, but I'm not sure about their current state. I think it would be fun to poke around in the data myself, but I've never tried to examine a file, much less anything like this before.
Is there anything I should know about examining a file format for data extraction purposes before I dive headfirst into this?
EDIT: I would like very general tips, as examining file formats seems interesting. I would like to be able to take File X and learn how to approach the problem of learning about it.
You'll definitely want a hex editor before you get too far. It will let you see the raw data as numbers instead of as large empty blocks in whatever font Notepad (or whatever text editor you use) is displaying.
Try opening it in any archive extractors you have (e.g. zip, 7z, rar, gz, tar, etc.) to see if it's just a renamed archive format (.PK3 is something like that).
Look for headers of known file formats somewhere within the file, which will help you discover where certain parts of the data are stored (e.g. do a search for the PNG signature bytes to find any (uncompressed) PNG files embedded within).
If you do find where a certain piece of data is stored, take a note of its location and length, and see if you can find numbers equal to either of those values near the beginning of the file, which usually act as pointers to the actual data.
Sometimes you just have to guess, or intuit what a certain value means, and if you're wrong, well, keep moving. There's not much you can do about it.
I have found that http://www.wotsit.org is particularly useful for known file type formats, for help finding headers within the .dat file.
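Tying the header-scanning and pointer-hunting tips above together, here is a small hypothetical scanner: it looks for the PNG file signature and then checks whether any 32-bit little-endian value in the first 4 KB equals a hit's offset (both the 4 KB window and the little-endian guess are assumptions you would adjust per file):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

int main(int argc, char** argv)
{
    if (argc != 2) { std::fprintf(stderr, "usage: %s file.dat\n", argv[0]); return 1; }
    std::ifstream f(argv[1], std::ios::binary);
    std::vector<unsigned char> buf((std::istreambuf_iterator<char>(f)),
                                   std::istreambuf_iterator<char>());

    const unsigned char sig[4] = { 0x89, 'P', 'N', 'G' };        // PNG file signature
    const size_t headerLimit = std::min<size_t>(4096, buf.size());

    for (size_t i = 0; i + 4 <= buf.size(); ++i) {
        if (!std::equal(sig, sig + 4, buf.begin() + i)) continue;
        std::printf("PNG signature at offset %zu\n", i);
        // Hunt for a little-endian uint32 equal to this offset near the start of the
        // file, which would suggest an index entry pointing at the embedded PNG.
        for (size_t j = 0; j + 4 <= headerLimit; ++j) {
            std::uint32_t v = buf[j]
                            | (std::uint32_t(buf[j + 1]) << 8)
                            | (std::uint32_t(buf[j + 2]) << 16)
                            | (std::uint32_t(buf[j + 3]) << 24);
            if (v == i) std::printf("  possible pointer at offset %zu\n", j);
        }
    }
    return 0;
}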
Back up the file first. Once you've restricted the amount of damage you can do, just poke around as Ed suggested.
Looking at your rep level, I guess a basic primer on hexadecimal numbers, endianness, representations for various data types, and all that would be a bit superfluous. A good tool that can show the data in hex is of course essential, as is the ability to write quick scripts to test complex assumptions about the data's structure. All of these should be obvious to you, but might perhaps help someone else so I thought I'd mention them.
One of the best ways to attack unknown file formats, when you have some control over contents is to take a differential approach. Save a file, make a small and controlled change, and save again. Do a binary compare of the files to find the difference - preferably using a tool that can detect inserts and deletions. If you're dealing with an encrypted file, a small change will trigger a massive difference. If it's just compressed, the difference will not be localized. And if the file format is trivial, a simple change in state will result in a simple change to the file.
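A naive helper along those lines (byte-by-byte comparison only, so it cannot detect inserts or deletions, but it quickly shows whether a small edit produced a localized change, a widespread one, or a size change):

#include <algorithm>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

int main(int argc, char** argv)
{
    if (argc != 3) { std::fprintf(stderr, "usage: %s old.dat new.dat\n", argv[0]); return 1; }
    std::ifstream fa(argv[1], std::ios::binary), fb(argv[2], std::ios::binary);
    std::vector<char> a((std::istreambuf_iterator<char>(fa)), std::istreambuf_iterator<char>());
    std::vector<char> b((std::istreambuf_iterator<char>(fb)), std::istreambuf_iterator<char>());

    size_t n = std::min(a.size(), b.size());
    size_t diffs = 0, first = n, last = 0;
    for (size_t i = 0; i < n; ++i)
        if (a[i] != b[i]) { ++diffs; if (first == n) first = i; last = i; }

    std::printf("sizes: %zu vs %zu, differing bytes: %zu\n", a.size(), b.size(), diffs);
    if (diffs) std::printf("first difference at offset %zu, last at %zu\n", first, last);
    return 0;
}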
The other thing is to look at some of the common compression techniques, notably zip and gzip, and learn their "signatures". Most of these formats are "self identifying" so when they start decompressing, they can do quick sanity checks that what they're working on is in a format they understand.
Barring encryption, an archive file format is basically some kind of indexing mechanism (a directory of sorts) and a way to locate those elements within the archive via pointers in the index.
With the ubiquity of the standard compression algorithms, it's mostly a matter of finding where those blocks start and trying to hunt down the index, or table of contents.
Some will have the index all in one spot (like a file system does); others will simply precede each element within the archive with its identity information. But in the end, somewhere, there is information about offsets from one block to another, information about data types (for example, if they're storing GIF files, GIFs have a signature as well), etc.
Those are the patterns that you're trying to hunt down within the file.
It would be nice if you could somehow get your hands on two versions of data using the same format. For example, for a game, you might be able to get the initial version off the CD and a newer, patched version. These can really highlight the information you're looking for.

Writing more to a file than just plain text

I have always been able to read and write basic text files in C++, but so far no one has discussed much more than that.
My question is this:
If I develop a file type by myself for use by an application I also create, how would I go about writing the data to a file while preserving the layout, formatting, etc.? Are there any standards, or does it just depend on the creativity of the programmer?
You basically have to come up with your own file format and write binary data.
You can also serialize your object model and write the output to a file, but that's usually less efficient.
Better to use an existing database, or use xml (or other) for simple needs. If you want to write a file in a format that already exists, find a library that supports it.
You have to know the binary file format for the file you are trying to create. Consider Joel's post on this topic: the 97-2003 file format is a 349-page spec.
Nearly all the time, to do something like that, you use an API to avoid the grunt work. Be careful, however, because figuring out "what works" by trial and error can result in an upgrade of the program breaking your code. Plus you have to take into account other operating systems, minor version differences, patches, etc.
There are a number of standards of course. The likely one to use is some flavor of xml since there are libraries and tools that already exist to help you work with it, but nothing is stopping you from inventing your own.
Well you could store the data in a format you could read, but which maintained the integrity of your data (XML or JSON for instance).
Or (shudder) you could come up with your own proprietary binary format, and use that.
You would go at it exactly the same way as you would a text file: writing your data byte by byte, encoded in such a way that when you read the file back you know what you are reading.
For a spreadsheet application you could even use a text format (OOXML, OpenDocument) to store presentation and content information.
Or you could define binary data structures and write them directly to the file.
The choice between a text and a binary format depends on the application. For a configuration file you may prefer a text file which can be modified outside your app; for a database you will most likely choose a binary format for performance reasons.
See wotsit.org for information on file formats for various file types. Example: You can figure out exactly how to write out a .BMP file and how it is composed.
Writing to a database can be done by using a wrapper class in your language, mainly passing it SQL commands.
If you create a binary file, you can write any file to it. The only drawback is that you have to know exactly where it starts and where it ends.
Use XML (something open, descriptive, and validatable), and stick with text. There are standards for this sort of thing as well, including ODF.
You can open the file as binary instead of text (how one does this depends somewhat on the platform); from there you can write the data directly out to disk. The only real caveat to this is endianness, which can become an issue when moving the files from one architecture to another (x86 to PPC, for instance).
Writing binary data to disk is really no harder than writing text, and really, your creativity is key for how you store the data.
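A minimal sketch of that: writing a 32-bit value with an explicit (little-endian here) byte order so the file reads back identically on x86 or PPC; the file name is just an example:

#include <cstdint>
#include <fstream>

// Write a 32-bit integer in a fixed byte order, independent of the host CPU.
void writeU32LE(std::ofstream& out, std::uint32_t v)
{
    char b[4] = {
        char(v & 0xFF), char((v >> 8) & 0xFF),
        char((v >> 16) & 0xFF), char((v >> 24) & 0xFF)
    };
    out.write(b, 4);
}

int main()
{
    std::ofstream out("data.bin", std::ios::binary);   // open in binary mode, not text
    writeU32LE(out, 0xDEADBEEF);                        // always stored as EF BE AD DE
    return 0;
}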
The general problem is usually referred to as serialization of your application state and in your case with a source/target of a file in whatever format makes sense for you. These days the preferred input/output format is XML, and you may want to look into the existing standards in this field. The problem then becomes how do I map from the state of my system to the particular schema. Boost has a serialization framework that you may want to check out.
/Allan
There are a variety of approaches you can take, but in general you'll want some sort of serialization library. Boost.Serialization or Google's Protocol Buffers are good examples of these. The basic idea is that you have memory structures (classes and objects) that represent your data, and you want to write that data to a file in a way that can be used to reconstruct those structures again.
If you're hesitant to use a library, you can do it all manually, but realize that you can end up writing a lot of redundant code, or developing your own library. See fopen, fread, fwrite and fclose for a starting point.
A typical binary file format for custom data is an "indexed file format" consisting of
-------
|index|
-------
|data |
-------
Where the index contains records "pointing" to the data.
The index consists of records containing an offset and a size. The offset tells you where in the file the data is stored and the size tells you the size of the data at that offset (i.e. the number of bytes to read).
typedef struct {
    size_t offset;      /* where in the file the data is stored */
    size_t size;        /* number of bytes to read at that offset */
} Index;

typedef struct {
    int ID;
    char First[20];
    char Last[20];
    char *RandomInfo;   /* variable-length part, stored separately */
} Data;
Suppose you want to store 50 records in the file: you would create 50 index structures and 50 data structures. The 50 index structures would be written to the file first, followed by the 50 data structures.
To read the file you would read in the 50 index structures, then from the data in the read-in index structures you could tell where to "seek" to read the data records.
Look up (fopen, fread, fwrite, fclose, ftell) for functions to read/write the data.
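Here is a rough sketch of writing and reading that layout with those calls; error checking, struct padding, and endianness are ignored for brevity, and IndexEntry plus the string blobs are simplified stand-ins for the Index and Data structures above:

#include <cstdio>
#include <string>
#include <vector>

struct IndexEntry { long offset; long size; };

void writeArchive(const char* path, const std::vector<std::string>& blobs)
{
    std::FILE* f = std::fopen(path, "wb");
    std::vector<IndexEntry> index(blobs.size());
    long pos = long(sizeof(IndexEntry) * blobs.size());   // data starts right after the index
    for (size_t i = 0; i < blobs.size(); ++i) {            // compute offsets up front
        index[i] = { pos, long(blobs[i].size()) };
        pos += long(blobs[i].size());
    }
    std::fwrite(index.data(), sizeof(IndexEntry), index.size(), f);
    for (const std::string& b : blobs)
        std::fwrite(b.data(), 1, b.size(), f);
    std::fclose(f);
}

std::string readBlob(const char* path, size_t recordNo)
{
    std::FILE* f = std::fopen(path, "rb");
    IndexEntry e;
    std::fseek(f, long(recordNo * sizeof(IndexEntry)), SEEK_SET);
    std::fread(&e, sizeof(IndexEntry), 1, f);
    std::string data(size_t(e.size), '\0');
    std::fseek(f, e.offset, SEEK_SET);                      // "seek" straight to the record
    std::fread(&data[0], 1, data.size(), f);
    std::fclose(f);
    return data;
}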
You usually use a third-party library for these things. For example, you would link in a database library for, say, Oracle that would allow you to talk to the database. Because the underlying file types (e.g. Excel spreadsheet vs. OpenOffice, Oracle vs. MySQL, etc.) differ, these libraries abstract away your need to care how the file is constructed.
Hope that helps you find what you're looking for!
1985 called, and said they have some help IFF you are willing to read up. The Interchange File Format is still in use today and provides some basic structure and metadata around binary data; RIFF and WAV audio are examples. (Unfortunately, TIFF is a false friend.) It allegedly even inspired PNG, so it can't be that bad.
