As it becomes ever easier to use computers in general and get into programming in particular, an increasing fraction of beginners seem to lack certain fundamental understanding that was once taken for granted in programming circles. Meanwhile as technology advances, the details of that understanding have grown more complex (I personally was programming before Unicode existed, let alone, say, JSON or XML). So, for the sake of having a solid reference, it seems apropos to ask:
What exactly is in a file, anyway? What do we mean when we say that we "open" and "read" a file - what are we getting out of it? I know the term "data", but just giving a name to something is not a real explanation.
More importantly, how do we make sense of data? If I try simply reading some data from a file and outputting it to the console, why does it often look like garbage? Why do some other files appear to have some text scattered among that garbage, while yet others seem to be mostly or completely text? Why isn't it sufficient to ask the program to read, say, an image file, in order to display the image? Again, I know the term "format", but this doesn't explain the concept. If we say, for example, that we make sense of data according to its format, then that only raises two more questions - how do we determine the format, and how does it actually help?
Related: What exactly causes binary file "gibberish"?.
Data, bits and bytes
Everyone who has had to buy hardware, or arrange a network connection, should have some familiarity with the concept of a "bit" and of a "byte". They're used to measure the capacity of storage devices and transfer rates. In short, they measure data: the amount of data that can be stored on a disk, or the amount of data transferred along a cable (or via a wireless connection) per second.
Data is essentially information - a record of some kind of knowledge. The bit is the fundamental unit of information, representing the smallest possible amount of knowledge: the answer to a yes-or-no question, a choice between two options, a record of a decision between two alternatives. (There would need to be at least two possibilities; with only one, there was no answering, choice or decision necessary, and thus nothing is learned by seeing that single possibility arise.)
A byte is simply a grouping of bits in a standard size. Almost everyone nowadays defines a byte to mean 8 bits, mainly because all contemporary consumer hardware is designed around that concept. In some very specific technical contexts (such as certain C or C++ language standard documents), "byte" may have a broader meaning, and octet is used to be precise about 8-bit groupings. We will stick with "byte" here, because we don't need to worry about ancient hardware or idiosyncratic compiler implementations for now.
Data storage devices - both permanent ones like HDDs and SSDs, and temporary ones like RAM - use a huge amount of individual components (depending on the device) to represent data, each of which can conceptually be in either of two states (we commonly use "on or off", "1 or 0" etc. as metaphors). Because there's a decision to be made between those two states, the component thus represents one bit of data. The data isn't a physical thing - it's not the component itself. It's the state of that component: the answer to the question "which of the two possible ways is this component configured right now?".
How data is made useful
It's clear to see how we can use a bit to represent a number, if there are only two possible numbers we are interested in. Suppose those numbers are 0 and 1; then we can ask, "is the number 1?", and according to the bit that tells us the answer to that question, we know which number is represented.
It turns out that in fact this is all we need in order to represent all kinds of numbers. For example, if we need to represent a number from {0, 1, 2, 3}, we can use two bits: one that tells us whether the represented number is in {0, 1} or {2, 3}, and one that tells us whether it's in {0, 2} or {1, 3}. If we can answer those two questions, we can identify the number. This technique generalizes, using base two arithmetic, to represent any integer: essentially, each bit corresponds to a value from the geometric sequence 1, 2, 4, 8, 16..., and then we just add up (implicitly) the values that were chosen by the bits. By tweaking this convention slightly, we can represent negative integers as well. If we let some bits correspond to binary fractions as well (1/2, 1/4, 1/8...), we can approximate real numbers (including the rationals) as closely as we want, depending on how many bits we use for the fractional part. Alternately, we can just use separate groups of bits to represent the numerator and denominator of a rational number - or, for that matter, the real and imaginary parts of a complex number.
Furthermore, once we can represent numbers, we can represent all kinds of answers to questions. For example, we can agree on a sequence of symbols that are used in text; and then, implicitly, a number represents the symbol at that position in the sequence. So we can use some amount of bits to represent a symbol; and by representing individual symbols repeatedly, we can represent text.
Similarly, we can represent the height of a sound wave at a given instant in time; by repeating this process a few tens of thousands of times per second, we can represent sound audible to humans.
Similarly, having studied how the human eye works, we find that we can analyze colours as combinations of three intensity values (i.e., numbers) representing "components" of the colour. By describing colours at many points a small distance apart (like with the sound wave, but in a two-dimensional grid), we can represent images. By considering images across time (a few tens of times per second), we can represent animations.
And so on, and so on.
Choosing an interpretation
There's a problem, here, though. All of this simply talks about possibilities for what data could represent. How do we know what it does represent?
Plainly, the raw data stored by a computer doesn't inherently represent anything specific. Because it's all in the same regular, sequence-of-bits form, nothing stops us from taking any arbitrary chunk of data and interpreting it by any of the schemes described above.
It just... isn't likely to appear like anything meaningful, that way.
However, the choice of interpretations is a choice... which means it can be encoded and recorded in this raw-data form. We say that such data is metadata: data that tells us about the meaning of other data. This could take many forms: the names of our files and the folder structure (telling us how those files relate to each other, and how the user intends to keep track of them); extensions on file names, special data at the beginning of files or other notes made within the file system (telling us what type of file it is, corresponding to a file format - keep reading); documentation (something that humans can read in order to understand how another file is intended to work); and computer programs (data which tells the computer what steps to take, in order to present the file's contents to the user).
What is a (file) format?
Quite simply, a format is the set of rules that describes a way to interpret some data (typically, the contents of a file). When we say that a file is "in" a particular format, we mean that it a) has a valid interpretation according to that format (not every possible chunk of data will meet the requirements, in general) and b) is intended to be interpreted that way.
Put another way: a format is the meaning represented by some metadata.
A format can be a subset or refinement of some other format. For example, JSON documents are also text documents, using UTF-8 encoding. The JSON format adds additional meaning to the text that was represented, by describing how specific text sequences are used to represent structured data. A programming language can also be thought of as this kind of format: it gives additional meaning to text, by explaining how that text can be translated into instructions a computer can follow. (A computer's "machine code" is also a kind of format, that gets interpreted directly by the hardware rather than by a program.)
(Recall: we established that a computer program can be a kind of metadata, and that a programming language can be a kind of format, and that metadata represents a format. To close the loop: of course, one can have a computer program that implements a programming language - that's what a compiler is.)
A format can also involve multiple steps, explained by separate standards. For example, Unicode is the de facto standard text format, but it only describes how abstract numbers correspond to text symbols. It doesn't directly say how to convert the bits into numbers (and this does need to be specified; "treat each byte as a number from 0..255" a) would still be making a choice of many possible ways to do it; b) isn't really sufficient, because there are a lot more possible text symbols than that). To represent text, we also need an encoding, i.e. the rest of the rules for the data format, specifically to convert bits to numbers. UTF-8 is one such encoding, and has become dominant.
What actually happens when we read the file?
Raw data is transferred from the file on disk, into the program's memory.
That's it.
Some languages offer convenience functionality, for the common case of treating the data like text. This might mean doing some light processing on the data (because operating systems disagree about which text symbols, in what order represent "the end of a line"), and loading the data into the language's built-in "string" data structure, using some kind of encoding. (Yes, even if the encoding is "each byte represents a number from 0 to 255 inclusive, which represents the corresponding Unicode code point", that is an encoding - even if it doesn't represent all text and thus isn't a proper Unicode encoding - and it is being used even if the programmer did nothing to specify it; there is no such thing as "plain text", and ignoring this can have all kinds of strange consequences.)
But fundamentally, the reading is really just a transfer of data. Text conversion is often treated as special because, for a long time, programmers were sloppy about treating text properly as an interpretation of data; for decades there was an interpretation of data as text - one byte per text symbol (incidentally, "character" does not mean the same thing as a Unicode code point) - so well established that everyone started forgetting they were actually using it. Programmers forgot about this even though it only actually specifies what half the possible values of a byte mean and leaves the other half up to a local interpretation, and even though that scheme is still woefully inadequate for many world languages, such that programmers in many other countries came up with their own solutions. The solution - the Unicode standard, mentioned several times above - had its first release in 1991, but there are still a few programmers today blithely ignoring it.
But enough ranting.
How does interpreting a file work?
In order to display an image, render a web page, play sound or anything else from a file, we need to:
Have data that is actually intended to represent the corresponding thing;
Know the format that is used by the data to represent the thing;
Load the data (read the file, or read data from a network connection, or create the data by some other process);
Process the data according to the format.
This happens for even the simplest cases, and it can involve multiple programs. For example, a simple command-line program that inputs text from the user (from the "standard input stream") and outputs text back (to the "standard output stream"), generally, is not actually causing the text to appear on screen, or figuring out what keys were pressed on the keyboard. Instead: the operating system interprets signals from the keyboard, in order to create readable data; after the program writes out its response to the input, another program (the terminal) will translate the text into pixel colour values (getting help from the operating system to choose images from a font); then the operating system will arrange to send the appropriate data to the monitor (according to the terminal window's position etc.).
Can someone please explain the difference between the LZSS and the LZ77 algorithm. I've been looking online for a couple of hours but I couldn't find the difference. I found the LZ77 algorithm and I understood its implementation.
But, how does LZSS differ from LZ77? Let's say if we have an string "abracadabra" how is LZSS gonna compress it differently from LZ77? Is there a C pseudo-code that I could follow?
Thank you for your time!
Unfortunately, both terms LZ77 and LZSS tend to be used very loosely, so they do not really imply very specific algorithms. When people say that they compressed their data using an LZ77 algorithm, they usually mean that they implemented a dictionary based compression scheme, where a fixed-size window into the recently decompressed data serves as the dictionary and some words/phrases during the compression are replaced by references to previously seen words/phrases within the window.
Let us consider the input data in the form of the word
abracadabra
and assume that window can be as large as the input data. Then we can represent "abracadabra" as
abracad(-7,4)
Here we assume that letters are copied as is, and that the meaning of two numbers in brackets is "go 7 positions back from where we are now and copy 4 symbols from there", which reproduces "abra".
This is the basic idea of any LZ77 compressor. Now, the devil is in the detail. Note that the original word "abracadabra" contains 11 letters, so assuming ASCII representation the word, it is 11 bytes long. Our new representation contains 13 symbols, so if we assume the same ASCII representation, we just expanded the original message, instead of compressing it. One can prove that this can sometimes happen to any compressor, no matter how good it actually is.
So, the compression efficiency depends on the format in which you store the information about uncompressed letters and back references. The original paper where the LZ77 algorithm was first described (Ziv, J. & Lempel, A. (1977) A universal algorithm for sequential data compression. IEEE Transactions on information theory, 23(3), 337-343) uses the format that can be loosely described here as
(0,0,a)(0,0,b)(0,0,r)(0,1,c)(0,1,d)(0,3,a)
So, the compressed data is the sequence of groups of three items: the absolute (not relative!) position in the buffer to copy from, the length of the dictionary match (0 means no match was found) and the letter that follows the match. Since most letters did not match anything in the dictionary, you can see that this is not a particularly efficient format for anything but very compressible data.
This inefficiency may well be the reason why the original form of LZ77 has not been used in any practical compressors.
SS in the "LZSS" refers to a paper that was trying to generalize the ideas of dictionary compression with the sliding window (Storer, J. A. & Szymanski, T. G. (1982). Data compression via textual substitution. Journal of the ACM, 29(4), 928-951). The paper itself looks at several variations of dictionary compression schemes with windows, so once again, you will not find an explicit "algorithm" in it. However, the term LZSS is used by most people to describe the dictionary compression scheme with flag bits, e.g. describing "abracadabra" as
|0a|0b|0r|0a|0c|0a|0d|1-7,4|
where I added vertical lines purely for clarity. In this case numbers 0 and 1 are actually prefix bits, not bytes. Prefix bit 0 says "copy the next byte into the output as is". Prefix bit 1 says "next follows the information for copying a match". Nothing else is really specific, term LZSS is used to say something specific about the use of these prefix signal bits. Hopefully you can see how this can be done compactly, in fact much more efficiently than the format described in LZ77 paper.
I have several address list's on my TBIRD address book.
every time I need to edit an address that is contained in several lists, is a pain on the neck to find which list contains the address to be modified.
As a help tool I want to read the several files and just gave the user a list of which
xxx.MAB files includes the searched address on just one search.
having the produced list, the user can simply go to edit just the right address list's.
Will like to know a minimum about the format of mentioned MAB files, so I can OPEN + SEARCH for strings into the files.
thanks in advance
juan
PD have asked mozilla forum, but there are no plans from mozilla to consolidate the address on one master file and have the different list's just containing links to the master. There is one individual thinking to do that, but he has no idea when due to lack of resources,
on this forum there is a similar question mentioning MORK files, but my actual TBIRD looks like to have all addresses contained on MAB files
I am afraid there is no answer that will give you a proper solution for this question.
MORK is a textual database containing the files Address Book Data (.mab files) and Mail Folder Summaries (.msf files).
The format, written by David McCusker, is a mix of various numerical namespaces and is undocumented and seem to no longer be developed/maintained/supported. The only way you would be able to get the grips of it is to reverse engineer it parallel with looking at source code using this format.
However, there have been experienced people trying to write parsers for this file format without any success. According to Wikipedia former Netscape engineer Jamie Zawinski had this to say about the format:
...the single most brain-damaged file format that I have ever seen in
my nineteen year career
This page states the following:
In brief, let's count its (Mork's) sins:
Two different numerical namespaces that overlap.
It can't decide what kind of character-quoting syntax to use: Backslash? Hex encoding with dollar-sign?
C++ line comments are allowed sometimes, but sometimes // is just a pair of characters in a URL.
It goes to all this serious compression effort (two different string-interning hash tables) and then writes out Unicode strings
without using UTF-8: writes out the unpacked wchar_t characters!
Worse, it hex-encodes each wchar_t with a 3-byte encoding, meaning the file size will be 3x or 6x (depending on whether whchar_t is 2
bytes or 4 bytes.)
It masquerades as a "textual" file format when in fact it's just another binary-blob file, except that it represents all its magic
numbers in ASCII. It's not human-readable, it's not hand-editable, so
the only benefit there is to the fact that it uses short lines and
doesn't use binary characters is that it makes the file bigger. Oh
wait, my mistake, that isn't actually a benefit at all."
The frustration shines through here and it is obviously not a simple task.
Consequently there apparently exist no parsers outside Mozilla products that is actually able to parse this format.
I have reversed engineered complex file formats in the past and know it can be done with the patience and right amount of energy.
Sadly, this seem to be your only option as well. A good place to start would be to take a look at Thunderbird's source code.
I know this doesn't give you a straight-up solution but I think it is the only answer to the question considering the circumstances for this format.
And of course, you can always look into the extension API to see if that allows you to access the data you need in a more structured way than handling the file format directly.
Sample code which reads mork
Node.js: https://www.npmjs.com/package/mork-parser
Perl: http://metacpan.org/pod/Mozilla::Mork
Python: https://github.com/KevinGoodsell/mork-converter
More links: https://wiki.mozilla.org/Mork
Two days ago i attended an interview.I had been asked a question and i am still finding answer.The Question Was tell me the test cases of atof(const char *str) function in c.I told him various test cases like
I have to check the given string should contain only numeric.
Given string contain one decimal point.
it should not overflow after conversion.
string should not be null.
but interviewer was not satisfied and asking for give me the answer in structured format.now my question is how to represent this answer in structure format so that in future i could not make same mistake.
I'm not sure what the interviewer means by "structured format", but I would do this by writing down the BNF syntax for floating point numbers (the C language specifies them), and then presenting test cases that test for each path through the syntax. Your cases notably do not cover the sign or exponent, and the number need not contain a decimal point.
A structural approach breaks the problem down into subproblems. Syntax is one subproblem, and the syntax chart or BNF provides a natural way to break that down into subproblems. An additional subproblem is boundary conditions ... there should be test cases for the minimum (> 0) and maximum valid values. There should also be test cases for handling of invalid inputs, but as lundin noted in a comment, that's impossible for atof as the behavior for invalid inputs is undefined.
Maybe you can structure your answer by what you are testing, like giving bad formated string (null, empty, etc ...) and by giving bad arguments like bad "numbers" (0 prefix/suffix 2.0, 0.4 etc...) you can also tests negative float numbers, put more than one dot in the string or whatever. I hope i have answer your question, if not, i think i haven't understood the question well.
I understand the term "test cases" differently than you.
I think what he wants are various inputs to atof and their expected results. For example:
1. atof("1.5") should return 1.5.
2. atof("-7") should retutn -7.0.
3. atof("Hello, world") should fail. But following Lundin's comment, there's no defined failure behavior for atof, so you can't really test this.
The test cases should cover all the different things the function needs to test. But you don't need to write down these things - just the example inputs and expected outputs.
Writing this in a structured format is easy.
We used use atof in our code most of the time we need to handle Internationalization/Localization in many languages 10.0 get converted to 10,0.
before calling atof you need to set locale and after completing the functionality you have to reset the locale.
This is to extend the question: Tools to help reverse engineer binary file formats
Are there any tools that are publicly available that uses clustering and/or data mining techniques to reverse engineer file formats?
For example, with the tool you would have a collection of files that have the same format and the output of the tool would be the generic structure?
If one had a truly efficient binary encoding format (ZIP files are an example), then the information content in each bit is high. Essentially, it will look like a perfect random number.
You can't infer anything from that without additional knowledge.
If the binary encoding isn't efficient, in theory, you have some faint chance of seeing structure. But this still sounds really hard; how do you even begin guessing where the boundaries of fields are?
The AI machine learning types will tell you, you can't learn anything unless you already "almost" know it. Often they succeed by encoding the the problem with problem-tokens that at least you can reason about.
I don't think you can do this without providing more information. Do you know anything about the file formats? Field sizes are always less than N bits? Only ASCII strings are encoded or vice versa?