A clear understanding of file, file encoding, and file format

I lack a clear understanding of the concepts of file, file encoding and file format. Google helped up to a point.
From what I understand so far, all files are binary, i.e., each byte in a file can hold any of the 256 possible bit patterns. ASCII files (and here's where we get to the encoding part) are a subset of binary files, in which each byte uses only 7 of its 8 bits.
And here's where things get mixed up. A file format seems to be a way to interpret the bytes in a file, and file extensions seem to be one of the most common ways of identifying a file format.
Does this mean there are formats defined for binary files and formats defined for ASCII files? Are formats like xml, pdf, doc, rtf, html, xls, sql, tex, java, cs "referring" to ASCII files, whereas formats like jpg, mp3, avi, eps, obj, out, dll are a clue that we're talking about binary files?

I don't think you can talk about ASCII and BINARY files, but rather about TEXT and BINARY files.
In that sense, these are text files: XML, HTML, RTF, SQL, TEXT, JAVA, CSS, EPS.
And these are binary files: PDF, DOC, XLS, JPG, MP3, AVI, OBJ, DLL.
ASCII is just a table of characters used in the early days of computing to represent text, but it is nowadays somewhat discouraged, since it can't represent text in languages such as Chinese, Arabic, Spanish (words with ñ, Ñ, and accents), French, and others. Other CHARACTER REPRESENTATIONS are encouraged instead of ASCII. The most well known is probably UTF-8, but there are others, like ISO-8859-1, ISO-8859-3, and so on. Take a look at the article by Joel Spolsky on UNICODE. It's very enlightening.
File formats are a very different issue. File formats are protocols which programs agree on to represent information. In that sense, a JPG file is an image that has a certain (well-known) internal format that allows programs (browsers, spreadsheets, word processors) to use it as an image.
Text files also have formats (i.e., there are specifications for text formats like XML and HTML). Their format, as with JPG and other binary files, permits applications to use them in a coherent and specific way to achieve something, e.g., render a web page (the HTML and XHTML file formats).

The actual way a file is stored on the hard drive is defined by the OS. The content of a file can be described as an array of bytes, each of which can take any of 256 possible values.
Text files use either a small character set like ASCII, in which case you can read them easily, or a wider character set, in which case only suitable apps can read them.
The rest, what you might call binary (any format which is "unreadable" by "text" viewers), are formats designed to be read by certain other apps or by the OS.
If a file is executable, the OS can read it and execute it; others, like JPG, are designed to be "understood" by photo viewers, etc.

This is an old question but still very relevant. I was confused by this as well, and asked around for clarification. Here's the summary (hope it helps someone):
Format: File/record format is the way data is represented. You might use CSV, TSV, JSON, the Apache log format, the Thrift format, the Protobuf format, etc., to represent your data. The format is responsible for ensuring the data is structured properly and correctly represented. For example, when you read a JSON file, you should get nested key-value pairs; that guarantee is always present.
{
        "story": {
            "title": "beauty and the beast"
        }
}
Encoding: Encoding basically transforms your data (in any format, or plain text) according to a specific scheme. Now, what is this scheme? The scheme is specific to the purpose of the encoding. For example, while transferring data over the wire (the internet), we want to make sure the example JSON above reaches the other side uncorrupted. To ensure this, we can add some meta information, like a checksum, that can be used to verify the data's correctness. Other uses of encoding include shortening data, exchanging secrets, etc.
Base64 encoding of the above JSON example:
ew0KICAgICAgICAic3RvcnkiOiB7DQogICAgICAgICAgICAidGl0bGUiOiAiYmVhdXR5IGFuZCB0aGUgYmVhc3QiDQogICAgICAgIH0NCn0=
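A minimal Python sketch of this round trip (assuming Python 3 and its standard base64 module; the mechanics, not the exact bytes above, are the point):

import base64

json_text = '{"story": {"title": "beauty and the beast"}}'

# Encoding transforms the bytes into a transport-safe scheme...
encoded = base64.b64encode(json_text.encode("utf-8"))
print(encoded.decode("ascii"))

# ...and decoding reverses it; the JSON format itself is untouched.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == json_text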

I think it is worth noting that with media files, MPEG and the others are media codecs: they define how digital data expresses video and audio. Codecs are generally housed in a media container such as an AVI file, which is really a RIFF container type intended for media.

Related

How are the binary data inside a certain format parsed?

Considering that binary data (video/images/audio/executables) can be regarded as a long sequence of arbitrary bytes,
when the data is inside a particular format (SQL, a BLOB in a database, MP3, JSON, XML, etc.), how does the parser know that a special character (or sequence of characters, like {, }, \t, space, EOF) is used for formatting and is not part of the binary data, and vice versa?
Also, I am not quite sure which category this question fits in, so I tagged it lexical analysis and linguistics. What subject/field of computer science studies this?
This is indeed an odd place for this question. I'm a little unclear on exactly what you're asking here, but in sum, not all binary data (assuming you mean machine-readable data) are equal. For instance, audio, images, and video are not executable data; they are parsed data, and as such they are handled differently.
Also, "binary data" are not as arbitrary as you might think upon opening a hex editor for the first time :). Executables are structured into DATA and CODE segments, so with those flags the computer knows how to treat things appropriately. As for the other three types you mentioned, they are all structured differently depending on their file format, which is why so many different file formats are out there! The executable program which parses these files knows how to handle them based on information in its code about the file format. That means the program has to know how the format is segmented in order to load it properly, which is why you can't open an MP3 in Microsoft Paint.
As for the study of file formats and data storage, it has applications in many areas; it's not really a field unto itself so much as a topic that comes up again and again. Information theory, reverse engineering, natural language processing, and many other fields have uses for understanding different file types and how they store data. Anyhow, this was only a brief, cursory explanation, and there's plenty you can google (try .exe file formats or .jpg/.png file formats to start).
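To make the "how does the parser know" question concrete: binary formats typically avoid delimiter characters altogether. Each field declares its own size, so any byte value may appear in the payload. Here is a minimal Python sketch of a hypothetical length-prefixed record format (invented purely for illustration):

import struct

def write_record(buf, payload):
    # Each record: a 4-byte big-endian length, then the raw payload.
    buf += struct.pack(">I", len(payload)) + payload

def read_records(data):
    # No delimiters needed: the declared length tells the parser exactly
    # where a payload ends, even if it contains '{', '\t' or 0x00 bytes.
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        yield data[offset:offset + length]
        offset += length

buf = bytearray()
write_record(buf, b"hello")
write_record(buf, b"\x00{\t} arbitrary bytes")
print(list(read_records(bytes(buf))))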

How do different file types differ?

So far, I understood:
Files have some information in their 'headers' which programs use to detect the file type.
Is it always true? If so, how can I see the header?
Is it the only way of defining a file's type?
Is there a way to manually create an empty file (at least in Linux) with no headers at all?
If so, can I manually write its header and create a simple 'jpg' file?
No, files simply contain bytes plus some metadata like a filename, permissions, and a last-modified time. The format of those bytes is completely free and unconstrained by any universal convention. Certainly some file types, like JPEGs, GIFs, and audio and video files, have headers specified in their formats. Viewing a header depends entirely on the format involved; headers are normally comprised of byte codes meaningless to the human eye, so some software is normally required to decode and view them.
Yes:
touch emptyFile
Sounds painful. Use a library to write a JPEG; headers are not necessarily easy to create. Someone else has done this hard work for you, so I'd use it.
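For example, a minimal sketch (assuming the third-party Pillow library is installed) that produces a valid JPEG without hand-crafting any headers:

from PIL import Image

# Pillow writes the JPEG headers (SOI marker, JFIF segment, etc.) for you.
img = Image.new("RGB", (8, 8), color=(255, 0, 0))
img.save("red_square.jpg", format="JPEG")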
A file is nothing more than a sequence of bytes, and it has no default, internal structure. It is an abstraction made by the OS to make it more convenient to store and manipulate data.
Files may represent different types of things like images, videos, audio, and plain text, so they need to be interpreted in a certain way in order to interact with their contents. For instance, an image is opened in an image viewer, a PDF document in a PDF viewer, and an audio file in a media player. That doesn't mean you cannot open an image in a text editor; the file's contents will simply be interpreted differently.
The closest things to file metadata in UNIX and Linux are the inode, which stores information about a file but is not part of the file itself, and the file's magic number. Use stat to inspect the inode and use file to determine a file's type (often based on its magic number).
Also check out man file for more information about file types.

Library to compress text data and store it as text

I want to store web pages in compressed text files (CSV). To achieve optimal compression, I would like to provide a set of 1000 web pages. The library should then spend some time creating the optimal "dictionary" for this content. One obvious "dictionary" entry could be <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">, which could get stored as %1 or something like that, because it is present on almost all web pages. By creating a customized dictionary like this, the compression rate should be 99% in my case.
My question is: does a library for doing this exist on Windows, with MIT or similarly liberal licensing? If not, are there any general-purpose compression libraries you would recommend? I have tried a bit with zlib, but it outputs binary data. If I converted this binary data into text, I am worried that the result might be longer than the original text.
EDIT: I need to be able to store the text in CSV files and still be able to import them into a database or even Excel.
"text files (not binary)" is a little too general. If you mean that some
byte values (00,1A or whatever) can't be used, then any binary method +
something like base64 coding can be used. (Although I'd suggest a more efficient method
from Coroutine demo source).
To be specific, you can use any general-purpose compressor to compress your
base file, then base file + target file, then diff these, and you'd get
a dictionary compression (binary), which can be then converted to "text"
with base64 or yenc or whatever.
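A minimal Python sketch of the dictionary-plus-text-coding idea, using zlib's preset-dictionary support (the zdict parameter, available since Python 3.3) rather than the diff trick, and base64 to keep the result CSV-safe. The dictionary contents here are only a placeholder; in practice you would build it from phrases common to your 1000 pages:

import base64
import zlib

# Hypothetical shared dictionary of common phrases (placeholder contents).
zdict = b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'

page = zdict + b'<html><head><title>example</title></head></html>'

# Compress with the preset dictionary; matching phrases become back-references.
comp = zlib.compressobj(level=9, zdict=zdict)
compressed = comp.compress(page) + comp.flush()

# base64 makes the binary output safe to store in a CSV cell.
cell = base64.b64encode(compressed).decode("ascii")

# Decompression must use the same dictionary.
decomp = zlib.decompressobj(zdict=zdict)
assert decomp.decompress(base64.b64decode(cell)) == page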
Alternatively, there are some coders with built-in support for that, for example:
http://compression.ru/ds/ppmtrain.rar
http://code.google.com/p/lzham/
If you actually want to have common phrases replaced with references, and all other things left untouched (which is kind of implied, but not the same as "text output"), you can use text preprocessors like:
http://xwrt.sourceforge.net/
http://compression.ru/ds/liptify.rar
(There were more, AFAIR.)
Also, a hybrid method is possible. You can use a general-purpose LZ compressor as above, for example LZMA, then replace its entropy coding with something text-based. For example, http://nishi.dreamhosters.com/u/lzmarec_v1_bin.rar contains a utility which removes LZMA's entropy coding, and it's pretty easy to convert its output to text.

How do you create a file format?

I've been doing some reading on file formats and I'm very interested in them. I'm wondering what the process is to create a format. For example, a .jpeg, or .gif, or an audio format. What programming language would you use (if you use a programming language at all)?
The site warned me that this question might be closed, but that's just a risk I'll take in the pursuit of knowledge. :)
what the process is to create a format. For example, a .jpeg, or .gif, or an audio format.
Step 1. Decide what data is going to be in the file.
Step 2. Design how to represent that data in the file.
Step 3. Write it down so other people can understand it.
That's it. A file format is just an idea. Properly, it's an "agreement". Nothing more.
Everyone agrees to put the given information in the given format.
What programming language would you use (if you use a programming language at all)?
Any programming language that can do I/O can work with file formats. Some have limitations on which file formats they can handle, and some languages don't handle low-level bytes as well as others.
But a "format" is not an "implementation".
The format is a concept. The implementation is -- well -- an implementation.
You do not need a programming language to write the specification for a file format, although a word processor might prove to be a handy tool.
Basically, you need to decide how the information in the file is to be stored as a sequence of bits. This might be trivial, or it might be exceedingly difficult.
As a trivial example, a very primitive bitmap image format could start with one unsigned 32-bit integer representing the width of the bitmap, followed by one more such integer representing the height. Then you could simply write out the colour of the pixels sequentially, left-to-right and top-to-bottom (row 1 of pixels, row 2 of pixels, ...), using 24 bits per pixel in the form 8 bits for red + 8 bits for green + 8 bits for blue. For instance, an 8×8 bitmap consisting of alternating blue and red pixels would be stored as
00000008000000080000FFFF00000000FFFF0000...
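A minimal Python sketch of this toy format (the format is invented above purely for illustration), writing exactly the big-endian layout shown in the hex dump:

import struct

def write_toy_bitmap(path, width, height, pixels):
    # pixels: a list of (r, g, b) tuples, row by row, top to bottom.
    with open(path, "wb") as f:
        f.write(struct.pack(">II", width, height))  # two unsigned 32-bit ints
        for r, g, b in pixels:
            f.write(struct.pack("BBB", r, g, b))    # 24 bits per pixel

# An 8x8 bitmap of alternating blue and red pixels, as in the hex dump.
blue, red = (0, 0, 255), (255, 0, 0)
pixels = [blue if i % 2 == 0 else red for i in range(8 * 8)]
write_toy_bitmap("toy.img", 8, 8, pixels)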
In a less trivial example, it really depends on the data you wish to save. Typically you would define a lot of records/structures, such as BITMAPINFOHEADER, and specify in what order they should come, how they should be nested, and you might need to write a lot of indices and look-up tables. Myself, I have written quite a few file formats, most recently the ASD (AlgoSim Data) file format used to save AlgoSim structures. Such files consist of a number of records (possibly nested), look-up tables, magic words (indicating structure begin, structure end, etc.), and strings in a custom-defined format. One thing that often simplifies a file format is having records contain data about their own size, and about the sizes of the custom data parts following them (in case the record is some sort of header preceding data in a custom format, e.g. pixel colours or sound samples).
If you haven't worked with file formats before, I would suggest that you learn a very simple format, such as the Windows 3 bitmap format, and write your own BMP encoder/decoder, i.e., programs that create and read BMP files from scratch and display the read BMP files. Then you'll know the basic ideas.
Fundamentally, files only exist to store information that needs to be loaded back in the future, either by the same program or a different one. A really good file format is designed so that:
Any programming language can be used to read or write it.
The information a program would most likely need from the file can be accessed quickly and efficiently.
The format can be extended and expanded in the future, without breaking backwards compatibility.
The format should accommodate any special requirements (e.g. error resiliency, compression, encoding) present in the domain in which the file will be used.
You may also be interested in looking into Protocol Buffers and Thrift. These tools provide a modern, principled way of designing forward- and backward-compatible file formats.

How are file formats created? If it is all binary how does encoding change the file type?

I have read a few links on the topic of file formats and encoding, but how is it done?
If all data is binary, what splits data into different file formats? What exactly does encoding the data involve? How is it done?
As per theatrus' response, it's all a matter of interpretation.
Typically the file extension (.txt, .jpg, .pdf etc.) provides enough information to determine which program should handle the file - and then the program will know how to handle the format it's given (or produce this format when saving to that particular file type).
Each file format has a (hopefully!) well-defined structure; for example, a PDF file will always start with a line that reads "%PDF-x.y", where x.y is the version number, e.g. 1.6. This enables the likes of Acrobat to determine that a file 'is most likely a PDF file' and to decide how to handle it (different versions have different internal structures).
.txt files are usually just sequences of 'characters' encoded in a particular way: plain English text is easily encoded, while more complex languages with thousands of characters require more capable encodings (Unicode, most commonly serialized as UTF-8, a variable-length encoding of Unicode code points).
Try opening up a few non-critical files in a hex-editor and get your hands on some format specifications and see what you can find!
File formats describe data in a specific representation. For example, jpeg, bmp, png and tiff all describe images whereas html and rtf describe text documents.
A file format typically includes a header that describes information about the contained data (image dimensions, compressed file name, etc.). Headers often contain identifying signatures that mark the file as being of a specific type:
Windows executables start with 'MZ'
jpeg images have JFIF in the first 20 bytes or so (can't remember the exact offset)
HTML documents have <html (upper or lower case) near the start of the document
This is the concept behind the unix file command and libmagic API.
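A minimal Python sketch of this signature-sniffing idea (a toy version of what file/libmagic do, with simplified offsets and signatures):

def sniff(path):
    with open(path, "rb") as f:
        head = f.read(32)
    if head.startswith(b"MZ"):
        return "Windows executable"
    if b"JFIF" in head:              # JPEG/JFIF marker near the start
        return "JPEG image"
    if b"<html" in head.lower():     # HTML, upper or lower case
        return "HTML document"
    return "unknown"

print(sniff("example.bin"))  # hypothetical file name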
Text encoding is the character set the text is encoded in. This matters because programs historically used single-byte arrays (char * in C/C++) to represent strings, and one byte per character is not enough to represent most human languages. The text encoding says "this text is Simplified Chinese" or "this text is Cyrillic".
How text encodings are selected depends on the file format being used. Plain text formats (text, html, xml) can have a "byte-order-mark" at the beginning that identifies that text as UTF-32 (little endian or big endian), UTF-16 (little endian or big endian), or UTF-8. These are different representations of Unicode characters.
XML allows you to specify the encoding in the <?xml?> declaration -- e.g. <?xml version="1.0" encoding="ShiftJIS"?>. HTML allows you to specify the encoding in a <meta> tag -- e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
You can see examples where text is encoded in one form but decoded as another (the text is mangled) in some emails and other places. These will look like â€¢ (a bullet character, the middle black dot •, encoded in UTF-8 but decoded as a single-byte Western encoding). You can reproduce this in Firefox by going to the View > Character Encoding menu and changing the encoding to Western (ISO-8859-1), especially on pages with non-Western characters.
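A minimal Python sketch reproducing this kind of mangling (encode with one character set, decode with another):

bullet = "\u2022"                      # the bullet character •

utf8_bytes = bullet.encode("utf-8")    # b'\xe2\x80\xa2'

# A reader that wrongly assumes Windows-1252 sees three characters instead:
mangled = utf8_bytes.decode("cp1252")
print(mangled)                         # prints: â€¢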
You can also have other types of encoding. For example, email can be wrapped in base64 during transport.
The main ways to decide what format something is are by file extension or by MIME type - and less frequently by "magic numbers".
The file extension will be checked by the OS or an application to decide what to do with the file (which app to open it in, or which part of code to execute for it).
MIME types are used where an extension (or filename) isn't always applicable, for example when downloading a file over HTTP, where the URI might be something like ~.php?id=12973. The file type cannot be determined from this alone, but the HTTP protocol sends a "Content-Type" header to say what format the file is, and the browser handles it accordingly; e.g. a Content-Type: image/png tells the browser to pass the file to a PNG-decoding function.
When the application knows what the file format is, it'll pass the data to code which is written specifically for that format. If the program doesn't have code to read a format, it will fail to read it.
How a file is encoded is specific to the file. Most standard formats will have a specification to describe their binary encoding, and any application reading that file type must implement code to match the specification. (Although this is usually done by using a library which already does the reading for you).
To give an example of how binary encodings work, consider an image. The specification might say that bytes 10-13 signify the width of the image, and bytes 14-17 signify the height. In order to read those pieces of information from the file, the code must explicitly read the correct amount of data at the correct locations indicated by the spec, e.g. fseek(f, 10, SEEK_SET); fread(&width, 4, 1, f); /* read 4 bytes at offset 10 into width */. I think your confusion is "what separates pieces of data in binary files?" (i.e., in text files this can be done by newlines, spaces, comma-separated values (CSV), etc.). The answer is: usually the size of the data determines where it ends. A specification will say what the binary type of each field is (perhaps it says int32, indicating 32 bits / 4 bytes).
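The same fixed-offset read, sketched in Python (the offsets follow the hypothetical spec above; little-endian is an assumption a real spec would pin down):

import struct

with open("image.dat", "rb") as f:     # hypothetical file laid out per the spec
    f.seek(10)                         # bytes 10-13: width, bytes 14-17: height
    width, height = struct.unpack("<II", f.read(8))
print(width, height)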
Other than that, there can be ambiguities in file formats. This usually happens with text files, where the text inside can be read to determine the format. It isn't always reliable, because often a text file will simply have the extension ".txt", so the character encoding of the text can be unknown to the application. (This was, and still is, a problem for applications which do not use Unicode.)
All data is binary, including this web page you are viewing right now. It's the interpretation of the data that matters.
For instance, pretend you have four bytes:
0xaa 0x00 0x00 0x55
That could be (in no particular order):
The number 43520 followed by the number 85
The decimal number 170 followed by 21760
The decimal number 2852126805
Hundreds of other interpretations
And this is only the unsigned numbers. Any of those bytes or bits could be markers, order indicators, strings, position indicators, etc.
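A minimal Python sketch showing a few of these interpretations of the same four bytes:

import struct

data = bytes([0xAA, 0x00, 0x00, 0x55])

print(struct.unpack(">HH", data))  # (43520, 85): two big-endian 16-bit ints
print(struct.unpack(">I", data))   # (2852126805,): one big-endian 32-bit int
print(struct.unpack("<I", data))   # (1426063530,): same bytes, little-endian
print(list(data))                  # [170, 0, 0, 85]: four single bytes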
