As it becomes ever easier to use computers in general and get into programming in particular, an increasing fraction of beginners seem to lack certain fundamental understanding that was once taken for granted in programming circles. Meanwhile as technology advances, the details of that understanding have grown more complex (I personally was programming before Unicode existed, let alone, say, JSON or XML). So, for the sake of having a solid reference, it seems apropos to ask:
What exactly is in a file, anyway? What do we mean when we say that we "open" and "read" a file - what are we getting out of it? I know the term "data", but just giving a name to something is not a real explanation.
More importantly, how do we make sense of data? If I try simply reading some data from a file and outputting it to the console, why does it often look like garbage? Why do some other files appear to have some text scattered among that garbage, while yet others seem to be mostly or completely text? Why isn't it sufficient to ask the program to read, say, an image file, in order to display the image? Again, I know the term "format", but this doesn't explain the concept. If we say, for example, that we make sense of data according to its format, then that only raises two more questions - how do we determine the format, and how does it actually help?
Related: What exactly causes binary file "gibberish"?
Data, bits and bytes
Everyone who has had to buy hardware, or arrange a network connection, should have some familiarity with the concept of a "bit" and of a "byte". They're used to measure the capacity of storage devices and transfer rates. In short, they measure data: the amount of data that can be stored on a disk, or the amount of data transferred along a cable (or via a wireless connection) per second.
Data is essentially information - a record of some kind of knowledge. The bit is the fundamental unit of information, representing the smallest possible amount of knowledge: the answer to a yes-or-no question, a choice between two options, a record of a decision between two alternatives. (There would need to be at least two possibilities; with only one, there was no answering, choice or decision necessary, and thus nothing is learned by seeing that single possibility arise.)
A byte is simply a grouping of bits in a standard size. Almost everyone nowadays defines a byte to mean 8 bits, mainly because all contemporary consumer hardware is designed around that concept. In some very specific technical contexts (such as certain C or C++ language standard documents), "byte" may have a broader meaning, and octet is used to be precise about 8-bit groupings. We will stick with "byte" here, because we don't need to worry about ancient hardware or idiosyncratic compiler implementations for now.
Data storage devices - both permanent ones like HDDs and SSDs, and temporary ones like RAM - use a huge number of individual components (the kind depends on the device) to represent data, each of which can conceptually be in either of two states (we commonly use "on or off", "1 or 0" etc. as metaphors). Because there's a decision to be made between those two states, the component thus represents one bit of data. The data isn't a physical thing - it's not the component itself. It's the state of that component: the answer to the question "which of the two possible ways is this component configured right now?".
How data is made useful
It's clear to see how we can use a bit to represent a number, if there are only two possible numbers we are interested in. Suppose those numbers are 0 and 1; then we can ask, "is the number 1?", and according to the bit that tells us the answer to that question, we know which number is represented.
It turns out that in fact this is all we need in order to represent all kinds of numbers. For example, if we need to represent a number from {0, 1, 2, 3}, we can use two bits: one that tells us whether the represented number is in {0, 1} or {2, 3}, and one that tells us whether it's in {0, 2} or {1, 3}. If we can answer those two questions, we can identify the number. This technique generalizes, using base two arithmetic, to represent any integer: essentially, each bit corresponds to a value from the geometric sequence 1, 2, 4, 8, 16..., and then we just add up (implicitly) the values that were chosen by the bits. By tweaking this convention slightly, we can represent negative integers as well. If we let some bits correspond to binary fractions as well (1/2, 1/4, 1/8...), we can approximate real numbers (including the rationals) as closely as we want, depending on how many bits we use for the fractional part. Alternately, we can just use separate groups of bits to represent the numerator and denominator of a rational number - or, for that matter, the real and imaginary parts of a complex number.
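As a minimal sketch of the "add up the values chosen by the bits" idea, here is one way it might look in C. The function names are made up for illustration, and the signed variant assumes the common two's-complement convention for the "slight tweak" that allows negative integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine n bits (most significant first, each 0 or 1) into an unsigned
   value. Each bit selects one power of two; the selected powers are summed. */
uint32_t bits_to_unsigned(const int *bits, size_t n)
{
    uint32_t value = 0;
    assert(bits != NULL && n <= 32);
    for (size_t i = 0; i < n; i++) {
        /* Doubling the running value shifts every earlier bit up one place. */
        value = (value << 1) | (uint32_t)(bits[i] & 1);
    }
    return value;
}

/* Reinterpret the low 8 bits as a signed two's-complement number:
   the top bit counts as -128 instead of +128. */
int bits8_to_signed(uint32_t value)
{
    value &= 0xFFu;
    return (value & 0x80u) ? (int)value - 256 : (int)value;
}
```

The same eight bits, all set, mean 255 under the first convention and -1 under the second; nothing in the bits themselves says which is intended.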
Furthermore, once we can represent numbers, we can represent all kinds of answers to questions. For example, we can agree on a sequence of symbols that are used in text; and then, implicitly, a number represents the symbol at that position in the sequence. So we can use some amount of bits to represent a symbol; and by representing individual symbols repeatedly, we can represent text.
Similarly, we can represent the height of a sound wave at a given instant in time; by repeating this process a few tens of thousands of times per second, we can represent sound audible to humans.
Similarly, having studied how the human eye works, we find that we can analyze colours as combinations of three intensity values (i.e., numbers) representing "components" of the colour. By describing colours at many points a small distance apart (like with the sound wave, but in a two-dimensional grid), we can represent images. By considering images across time (a few tens of times per second), we can represent animations.
And so on, and so on.
Choosing an interpretation
There's a problem, here, though. All of this simply talks about possibilities for what data could represent. How do we know what it does represent?
Plainly, the raw data stored by a computer doesn't inherently represent anything specific. Because it's all in the same regular, sequence-of-bits form, nothing stops us from taking any arbitrary chunk of data and interpreting it by any of the schemes described above.
It just... isn't likely to appear like anything meaningful, that way.
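To make that concrete, here is a small C sketch that applies two different interpretations to the same four bytes. The helper names are hypothetical, and "least significant byte first" is just one of several possible conventions for turning bytes into a number.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Interpret four bytes as an unsigned integer, least significant byte first. */
uint32_t as_u32_le(const unsigned char b[4])
{
    assert(b != NULL);
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Interpret the same four bytes as text, one ASCII character per byte.
   out must have room for 5 chars (4 plus the terminating NUL). */
void as_ascii(const unsigned char b[4], char out[5])
{
    assert(b != NULL && out != NULL);
    memcpy(out, b, 4);
    out[4] = '\0';
}
```

Given the bytes 0x44 0x61 0x74 0x61, the second function yields the text "Data", while the first yields the number 0x61746144. Both are valid readings; only one is likely to be what the writer meant.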
However, the choice of interpretations is a choice... which means it can be encoded and recorded in this raw-data form. We say that such data is metadata: data that tells us about the meaning of other data. This could take many forms: the names of our files and the folder structure (telling us how those files relate to each other, and how the user intends to keep track of them); extensions on file names, special data at the beginning of files or other notes made within the file system (telling us what type of file it is, corresponding to a file format - keep reading); documentation (something that humans can read in order to understand how another file is intended to work); and computer programs (data which tells the computer what steps to take, in order to present the file's contents to the user).
What is a (file) format?
Quite simply, a format is the set of rules that describes a way to interpret some data (typically, the contents of a file). When we say that a file is "in" a particular format, we mean that it a) has a valid interpretation according to that format (not every possible chunk of data will meet the requirements, in general) and b) is intended to be interpreted that way.
Put another way: a format is the meaning represented by some metadata.
A format can be a subset or refinement of some other format. For example, JSON documents are also text documents, using UTF-8 encoding. The JSON format adds additional meaning to the text that was represented, by describing how specific text sequences are used to represent structured data. A programming language can also be thought of as this kind of format: it gives additional meaning to text, by explaining how that text can be translated into instructions a computer can follow. (A computer's "machine code" is also a kind of format, that gets interpreted directly by the hardware rather than by a program.)
(Recall: we established that a computer program can be a kind of metadata, and that a programming language can be a kind of format, and that metadata represents a format. To close the loop: of course, one can have a computer program that implements a programming language - that's what a compiler is.)
A format can also involve multiple steps, explained by separate standards. For example, Unicode is the de facto standard text format, but it only describes how abstract numbers correspond to text symbols. It doesn't directly say how to convert the bits into numbers (and this does need to be specified; "treat each byte as a number from 0..255" a) would still be making a choice of many possible ways to do it; b) isn't really sufficient, because there are a lot more possible text symbols than that). To represent text, we also need an encoding, i.e. the rest of the rules for the data format, specifically to convert bits to numbers. UTF-8 is one such encoding, and has become dominant.
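As an illustration of the "encoding" step, the following C sketch turns one Unicode code point (an abstract number) into its UTF-8 bytes. The function name is made up; the bit layout follows the published UTF-8 scheme, but treat this as a sketch rather than a validated library routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8. Writes 1 to 4 bytes to out and
   returns the count, or 0 if cp cannot be encoded (a surrogate, or a value
   above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 7 bits: same byte as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 16 bits: 1110xxxx + 2 more */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 21 bits: 11110xxx + 3 more */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note how code points below 128 come out as a single byte, which is why ASCII text is also valid UTF-8, while everything else needs bytes that "treat each byte as a number" would misread.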
What actually happens when we read the file?
Raw data is transferred from the file on disk, into the program's memory.
That's it.
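In C, that transfer might look like the sketch below: bytes go from the stream into a buffer, and nothing is interpreted along the way. The function name read_all is an assumption for illustration, not a standard library call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read everything remaining in an open stream into a malloc'd buffer.
   Returns the buffer (the caller frees it) and stores the byte count in
   *len; returns NULL on failure. */
unsigned char *read_all(FILE *fp, size_t *len)
{
    size_t cap = 4096, used = 0;
    unsigned char *buf;
    assert(fp != NULL && len != NULL);
    buf = malloc(cap);
    if (buf == NULL) return NULL;
    for (;;) {
        size_t n = fread(buf + used, 1, cap - used, fp);
        used += n;
        if (n == 0) break;                 /* end of file, or an error */
        if (used == cap) {                 /* buffer full: double it */
            unsigned char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) { free(buf); return NULL; }
    *len = used;
    return buf;
}
```

When opening a file for this purpose, pass "rb" to fopen() so that the C library does not apply any text-mode translation on platforms that have one.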
Some languages offer convenience functionality, for the common case of treating the data like text. This might mean doing some light processing on the data (because operating systems disagree about which text symbols, in what order represent "the end of a line"), and loading the data into the language's built-in "string" data structure, using some kind of encoding. (Yes, even if the encoding is "each byte represents a number from 0 to 255 inclusive, which represents the corresponding Unicode code point", that is an encoding - even if it doesn't represent all text and thus isn't a proper Unicode encoding - and it is being used even if the programmer did nothing to specify it; there is no such thing as "plain text", and ignoring this can have all kinds of strange consequences.)
But fundamentally, the reading is really just a transfer of data. Text conversion is often treated as special because, for a long time, programmers were sloppy about treating text properly as an interpretation of data; for decades there was an interpretation of data as text - one byte per text symbol (incidentally, "character" does not mean the same thing as a Unicode code point) - so well established that everyone started forgetting they were actually using it. Programmers forgot about this even though it only actually specifies what half the possible values of a byte mean and leaves the other half up to a local interpretation, and even though that scheme is still woefully inadequate for many world languages, such that programmers in many other countries came up with their own solutions. The solution - the Unicode standard, mentioned several times above - had its first release in 1991, but there are still a few programmers today blithely ignoring it.
But enough ranting.
How does interpreting a file work?
In order to display an image, render a web page, play sound or anything else from a file, we need to:
Have data that is actually intended to represent the corresponding thing;
Know the format that is used by the data to represent the thing;
Load the data (read the file, or read data from a network connection, or create the data by some other process);
Process the data according to the format.
This happens for even the simplest cases, and it can involve multiple programs. For example, a simple command-line program that inputs text from the user (from the "standard input stream") and outputs text back (to the "standard output stream"), generally, is not actually causing the text to appear on screen, or figuring out what keys were pressed on the keyboard. Instead: the operating system interprets signals from the keyboard, in order to create readable data; after the program writes out its response to the input, another program (the terminal) will translate the text into pixel colour values (getting help from the operating system to choose images from a font); then the operating system will arrange to send the appropriate data to the monitor (according to the terminal window's position etc.).
I recently came across the term encoding. I learned that encoding is used to standardize different characters, and that databases also encode data to standardize table data. My questions: if my column contains only decimal numbers, is encoding still needed? And does encoding have anything to do with the memory size of the data?
By encoding, we mean giving a semantic value a binary representation (binary in most modern computing contexts; in theory, encoding is not restricted to computer bits and bytes).
Computers work differently than we do, so we need to encode everything so that the computer can interpret it as numbers. Sometimes the encoding is implicit, and sometimes it is outside the CPU's knowledge.
In a strict sense of encoding (now out of fashion), when you decide the size of an integer (and whether it is signed or unsigned), you are partially choosing an encoding (some parts are often still implicit: the representation of signed values, the endianness of numbers, etc.). As long as you are not reading the database's binary files directly, these details don't matter much, but the size changes the amount of memory and disk used. (Note: for numbers, we tend to call this the type, not the encoding.)
But as you have seen, encoding is now mostly used for strings (for decades we have not needed to care much about the encoding of numbers). Under this definition (and if you see "Unicode", "UTF-8" or something similar, the discussion is only about string encoding), the encoding of a number doesn't matter, only its type.
But then there are locales (i.e. internationalization): you may need a specific setting so that, e.g., the thousands separator is printed correctly, but this may be specified as a locale/collation, as the database encoding, or in the client (the program in charge of displaying the data).
So: for memory and disk (on the database engine side), the encoding (as in string encoding) of decimals and floats doesn't matter; just select the right type. For blobs and strings, the (string) encoding matters (though even here you may need to check which encodings the engine supports: sometimes the engine uses only a few encodings internally and converts to others in software, so nothing changes on disk or in memory). Note: numbers are numbers, so it doesn't matter much for them, but for strings, many SQL functions depend on encoding and locale on the server side (the LIKE keyword, sorting, etc.).
Your first question: if a column contains decimal numbers, is encoding still needed?
Answer: Columns in a table can contain any data, whether decimal numbers or anything else.
All data, in whatever form (decimal, floating point, characters, etc.), needs encoding to make it secure. Especially when you work with sensitive data, or are looking for a way to organize your data in an archive, you should consider encoding your data.
Your second question: does encoding have anything to do with the memory size of the data?
Answer: Since encoding removes redundancies from data, the size of your files will be a lot smaller. This results in faster input speed when data is saved.
Because encoded data is smaller in size, you should be able to save space on your storage devices. This is ideal if you have large amounts of data that need to be archived.
I am learning file I/O in C and am a little confused about binary files. My question is: what is the use of having binary files when we can always use files in ASCII or some other format that can be easily understood? Also, in what applications are binary files more useful?
Any help on this is really appreciated. Thanks!
All files are binary in nature. ASCII files are the subset of binary files that contain what can be considered 'human-readable' data. A pure binary file is not constrained to that subset of readable characters.
Reasons to use a binary format include:
Speed of access
Obfuscation
The ability to write native objects to file without creating big serialised files.
ASCII is easily understandable by humans, but for many other purposes, it's more efficient and easier for the computer to store things in a binary format. For example, if you want to keep a sequence of integers, it's easier for the computer to read or write the 4 bytes it takes to represent an int than it is to write out the ASCII representation of the number and then parse it while reading.
It is critically important that any byte value can be stored; for example, programs are binary, and any possible byte value may be part of a program instruction for the CPU.
ASCII only defines 7-bit values, so half the possible values of each byte are wasted.
Further, what would an integer be stored as?
The number 4294967295 can be stored in 4 bytes (32 bits), but if it were stored in ASCII, as a decimal number, it would require 10 characters. Further, it would require processing to convert it into the 32-bit number. Neither of those things is good.
The 32-bit number is a fixed size, so it is easy to get to value number 234856 in the file (counting from zero): just seek to position 4*234856.
If 32-bit numbers are stored as ASCII, either they must always take 10 bytes, making the file 2.5 times bigger, or they are stored with variable size, making it virtually impossible to seek to a particular value without reading the whole file.
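The fixed-size seek described above can be sketched in C as follows. The function names are hypothetical, and the byte order (most significant byte first) is an arbitrary choice made for the example; any fixed order works as long as reader and writer agree.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write one value as exactly 4 bytes, most significant byte first.
   Returns 1 on success, 0 on failure. */
int write_u32(FILE *fp, uint32_t v)
{
    unsigned char b[4];
    assert(fp != NULL);
    b[0] = (unsigned char)(v >> 24);
    b[1] = (unsigned char)((v >> 16) & 0xFF);
    b[2] = (unsigned char)((v >> 8) & 0xFF);
    b[3] = (unsigned char)(v & 0xFF);
    return fwrite(b, 1, 4, fp) == 4;
}

/* Read the value at zero-based position index by seeking straight to
   byte offset 4 * index. Returns 1 on success, 0 on failure. */
int read_u32_at(FILE *fp, long index, uint32_t *out)
{
    unsigned char b[4];
    assert(fp != NULL && out != NULL && index >= 0);
    if (fseek(fp, 4L * index, SEEK_SET) != 0) return 0;
    if (fread(b, 1, 4, fp) != 4) return 0;
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
    return 1;
}
```

No value before the requested one is read or parsed; the cost of reaching entry 234856 is the same as the cost of reaching entry 0.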
Edit:
It is worth adding that (in normal use) a human cannot see the data held in a file. The only way to examine the contents of files is by running programs which can read and use the data. So the convenience of a human is a small consideration.
In general, data is stored in the most convenient form for the program's use, and the form is designed to fit the program's purpose. ASCII is a format designed for text editing programs to create human-readable documents and support simple ways to display the text, which are limited to English letters, numbers and some punctuation. When we want to support all human written language, ASCII is far too limited.
Unicode has room for over one million code points to represent human written languages (and some other pictures), and characters have not yet been assigned for all human languages.
UTF-8 is a way to represent the written characters we have so far, using one to four bytes each. UTF-8 uses all 8 bits of each byte, which is beyond the 7-bit range of ASCII.
Think of a binary file as a true representation of data to be interpreted directly by a computer program and not to be read by humans. It would be a lot of overhead for a program to write out all its data, whether text or numeric, in an ASCII format. Most likely, the programmer would have to invent a protocol for writing arrays, structs, and scalars out into a file in ASCII form, so they could be human-readable and also be read back in by the program and converted back to binary form.
A database table is a good example. Whether or not there are text or numeric fields in the table, the database manager reads and writes that data in binary format. It is easier to write out, read in, and then convert as needed to display any data you can read.
Perception gave a great answer I had never considered before: all data is binary and ASCII is a subset. That answer made me think of ftp and setting the mode to ascii or binary. If I'm shuttling Windows binaries that are stored on a Linux system, I tell ftp to transfer them as binary. That means: don't interpret the data as an ASCII file and add \r at the end of each line. There are even times I'll transfer .csv and .txt data as binary, because I know Windows Excel knows how to interpret those non-DOS files.
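To see why an ASCII-mode transfer is harmful for binaries, here is a simplified C sketch of the kind of line-ending translation involved. It is an illustration with a made-up function name, not the code of any real ftp client.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst, writing CR LF for every LF that is not already
   preceded by CR. dst must have room for up to 2 * len bytes.
   Returns the number of bytes written. */
size_t lf_to_crlf(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '\n' && (i == 0 || src[i - 1] != '\r')) {
            dst[out++] = '\r';             /* insert the carriage return */
        }
        dst[out++] = src[i];
    }
    return out;
}
```

For a text file this is exactly what you want. For an executable or an image, every byte that happens to have the value 0x0A gains an extra byte in front of it, which shifts everything after it and corrupts the file.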
I would not want to write a program that had to encode/decode images, or audio files, or GIS data, or spacecraft telemetry, or <fill in the blank> as ASCII.
I am creating a file archiver/extractor (like tar), using POSIX API system calls in C. I have done part of the archiving bit.
I would like to know if anyone could help me with some C source code (using the above) to create a file header for a file in C (where the header acts as an index), which describes the file's attributes/metadata (name, date, time, etc.). All I have done so far is understand (not sure if that's even correct) that to create a file header it needs
a struct to hold the meta data, and lseek is needed to seek to beginning/end of file
like:
FileName=file.txt FileSize=0
FileDir=./blah/blah
FilePerms=000
\n\n
The archiving part of program has this process:
1. Get a list of all the files from the command line. (I can do this part)
2. Create a structure to hold the meta data about each file: name (255 char), size (64-bit int), date and time, and permissions.
3. For each file, get its stats.
4. Store the stats of each file within an array of structures.
5. Open the archive for writing. (I can do this part)
6. Write the header structure.
7. For each file, append its content to the archive file (at the end/start of each file).
8. Close the archive file. (I can do this part)
I am having difficulty creating the file header as a whole, even though I know what it needs to do. Of the numbered points above, the steps I can't do are 2, 3, 4, 6 and 7.
Any help will be appreciated.
Thanks.
As ijw notes, there are several ways to create an archive file header. If cross-platform portability is going to be an issue at all - or if you need to switch between 32-bit and 64-bit builds of the software on the same platform, even - then you need to ensure that the sizes and layouts of the fields are fully understood on all platforms.
Per-file Metadata
One way to do that is to use a fixed format binary header with types of known size and endianness. This is what ijw suggested. However, you will need to handle long file names, and so you will need to store a length (probably in a 2-byte unsigned integer) and then follow that with the actual pathname.
The alternative, and generally now favoured technique, is to use printable fields (often called ASCII format, though that is something of a misnomer). The time is recorded as the decimal number of seconds since the Epoch converted to a string, etc. This is what modern ar archives use; it is what GNU tar does (more or less; there are some historical quirks that make that more confusing); it is what cpio -c (which is usually the default these days) does. The fields might be separated by nulls or spaces; there is an easy way to detect the end of the header; the header contains information about the file name (not necessarily as directly as you'd like or expect, but again, that is usually because the format has evolved over the years), and then is followed by the actual data. Somehow, you know the size of each field, and the file which the header describes, so that you can read the data reliably.
Efficiency is a red herring. The conversion to/from the text format is so swift by comparison with the first disk access that there is essentially no measurable performance issue. And the guaranteed portability typically far outweighs the (microscopic) performance benefit from using binary data format instead - doubly so when the binary data has to be transformed on input or output anyway to get it into an architecture-neutral format.
Central Index vs Distributed Index
The other issue to consider is whether the index of files in the archive is centralized (at the front, or at the end) or distributed (the metadata for each file immediately precedes the data for the file). There are some advantages to each format - generally, systems use the distributed version because you can write the information for each file without knowing how many files there are to process in total (for example, because you are recursively archiving a directory's contents). Having a central index up front means you can list the files without reading the whole archive - distributed metadata means you have to read the whole file. However, the central index complicates the building of the archive.
Note that even with a distributed index, you will normally need a header for the archive as a whole so that you can detect that the file is in the format you expect. Typically, there is some sort of marker information (!<arch>\n for an ar archive, usually; %PDF-1.2\n at the start of a PDF file, etc) to reassure you that the file contains what you expect. There might be some overall (archive-level) metadata. Then you will have the first file metadata followed by the file data, repeating until the end of archive (which might, or might not, have a formal end marker - more metadata).
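Checking such a marker takes only a few lines. The sketch below tests a buffer for the ar marker quoted above; the function name is made up, and a real reader would go on to validate the rest of the archive rather than trusting the marker alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The 8-byte marker that starts an ar archive. */
static const char AR_MAGIC[] = "!<arch>\n";

/* Return 1 if the first bytes of buf match the ar marker, else 0. */
int looks_like_ar(const unsigned char *buf, size_t len)
{
    size_t n = sizeof AR_MAGIC - 1;        /* exclude the terminating NUL */
    assert(buf != NULL);
    return len >= n && memcmp(buf, AR_MAGIC, n) == 0;
}
```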
[H]ow would I go about implementing it in the 'fixed format binary header' you suggested. I am having trouble with deciding what commands/functions are needed.
I intended to suggest that you do not go with a fixed format binary header; you should use a text-based header format. If you can work out how to do the binary format, be my guest (I've done it numerous times over the years - that doesn't mean I think it is a good idea).
So, some pointers here towards the 'text header' format.
For the file metadata, you might define that you include:
size
mode (permissions, type)
owner
group
modification time
length of name
name
You might reasonably decide that your file sizes are limited to 64-bit unsigned integer quantities, which means 20 decimal digits. The mode might be printed as a 16-bit octal number, requiring 6 octal digits. The owner and group might be printed as UID and GID values (rather than name), in which case you could use 10 digits for each. Alternatively, you could decide to use names, but you should then allow for names up to say 32 characters each. Note that names are typically more portable than numbers. Neither name nor number is of much relevance on the receiving machine unless you extract the data as root (but why would you want to do that?). The modification time is classically a 32-bit signed integer, representing the number of seconds since the Epoch (1970-01-01 00:00:00Z). You should allow for the Y2038 bug by allowing the number of seconds to grow bigger than the 32-bit quantity; you might decide that 12 leading digits will take you beyond the Y10K crisis (by a factor of 4 or so), and this is good enough; you might decide to allow for fractional seconds too. Together, this suggests that 26 spaces for the timestamp should be overkill. You can decide that each field will be separated from the next by a space (for legibility - think 'ease of debugging'!). You might reasonably decide that the length of every file name will fit in 4 decimal digits.
You need to know how to format the types portably - #include <inttypes.h> is your friend.
You then devise a format string for printing (writing) the file metadata, and a parallel string for scanning (reading) the file metadata.
Printing:
"%20" PRIu64 " %06o %-32.32s %-32.32s %26" PRIu64 " %-4d %s\n"
This prints the name too. It terminates the header with a newline. The total size is 127 bytes plus the length of the file name. That's probably excessive, but you can tweak the numbers to suit yourself.
Scanning:
"%" SCNu64 " %o %32s %32s %" SCNu64 " %d"
This does not scan the name; you need to create the scanner for the name carefully, not least because you need to read spaces in the name. In fact, the code for scanning the user name and group name assumes no spaces either. If that is not acceptable (that is, names may contain spaces), then you need a more complex scan format, or something other than sscanf() to process the input data.
I'm assuming a 64-bit integer for the time field, rather than mixing up fractional seconds, etc, even though there's enough space to allow for fractional seconds. You'd likely save some space here.
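Putting the two format strings to work might look like the sketch below. The struct and function names are hypothetical, the owner and group are assumed to be non-empty names without spaces, and the name length is printed right-justified (%4d rather than %-4d) so that exactly one space separates it from the name, which lets the parser recover names that contain spaces.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-file metadata record; the field names are illustrative. */
struct file_meta {
    uint64_t size;
    unsigned mode;
    char owner[33];
    char group[33];
    uint64_t mtime;
    char name[256];
};

/* Format the header into buf; returns its length, or -1 if it does not fit. */
int format_header(char *buf, size_t cap, const struct file_meta *m)
{
    int n;
    assert(buf != NULL && m != NULL);
    n = snprintf(buf, cap,
        "%20" PRIu64 " %06o %-32.32s %-32.32s %26" PRIu64 " %4d %s\n",
        m->size, m->mode, m->owner, m->group, m->mtime,
        (int)strlen(m->name), m->name);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Parse a header produced by format_header; returns 1 on success. */
int parse_header(const char *buf, struct file_meta *m)
{
    int name_len = 0, pos = 0;
    assert(buf != NULL && m != NULL);
    /* %n records how many characters were consumed before the name. */
    if (sscanf(buf, "%" SCNu64 " %o %32s %32s %" SCNu64 " %d%n",
               &m->size, &m->mode, m->owner, m->group, &m->mtime,
               &name_len, &pos) != 6)
        return 0;
    if (name_len < 0 || name_len >= (int)sizeof m->name) return 0;
    if (buf[pos] != ' ' || strlen(buf + pos + 1) < (size_t)name_len + 1)
        return 0;
    /* The name is taken by length, so spaces inside it are preserved. */
    memcpy(m->name, buf + pos + 1, (size_t)name_len);
    m->name[name_len] = '\0';
    return buf[pos + 1 + name_len] == '\n';
}
```

With these widths the fixed part of the header is 127 bytes, so a header for an 11-character name occupies 138 bytes.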
The getting of information for each file you can do with the stat() system call.
For the writing of the header, here are two solutions.
Trivial but evil:
struct file_header {
... data you want to put in
} fhdr;
fwrite(&fhdr, sizeof(fhdr), 1, file);
This is evil because structure packing varies from machine to machine, as does byte order and the size of basic types like 'int'. A file written by your program may not be readable by your program when it's compiled on another machine, or even with another compiler on the same machine in some cases.
Non-trivial but safe:
char name[xxx];
uint32_t length; /* Fixed byte length across architectures */
...
fwrite(name, sizeof(name), 1, file);
length = htonl(length); /* Or something else that converts
the length to a known endianness */
fwrite(&length, sizeof(length), 1, file);
Personally I'm not a fan of htonl() and friends; I prefer to write something that converts a uint32_t to a uchar[4] using shift operators (which is trivial to write), because C doesn't pin down the format of even an integer in memory. In practice you'd be hard pushed to find something that doesn't store a uint32_t as 4 bytes of 8 bits, but it's a thing to consider.
The variables listed above can be structure members in your structure. Reversing the process on read is left as an exercise to the reader.
I am designing a binary file format from scratch, and I would like to include some magic bytes at the beginning so that it can be identified easily. How do I go about choosing which bytes? I am not aware of any central registry of magic numbers, so is it just a matter of picking something fairly random that isn't already identified by, say, the file command on a nearby UNIX box?
Stay away from super-short magic numbers. Just because you're designing a binary format doesn't mean you can't use a text string for the identifier. Follow that by an EOF char, and as an added bonus people who cat or type your binary file won't get a mangled terminal.
There is no universally correct way. Best practices can be suggested, but these are often situational. For example, if you're checking the integrity of volatile memory, which has an undefined initial state when power is applied, it may be beneficial to incorporate many 0s or 1s in a sequence (e.g. FFF0 00FF F000) which can stand out against random noise.
If the file is mostly binary, a popular choice is using a text encoding like ASCII which stands out among the binary data in a hex editor. For example, GIF uses GIF89a, FLAC uses fLaC. On the other hand, a plain text identifier may be falsely detected in a random text file, so invalid/control characters might be incorporated.
In general, it does not matter that much what they are; even a bunch of NUL bytes can be used for file detection. But ideally you want the longest unique identifier you can afford, and at minimum 4 bytes long. Any identifier under 4 bytes will show up more often in random data. The longer it is, the less likely it will ever be detected as a false positive. Some known examples are as long as 40 bytes. In a way, it's like a password.
Also, it doesn't have to be at offset 0. The file signature has conventionally been at offset zero, since it made sense to store it first if it will be processed first.
That said, a single file signature should not be the only line of defense. The actual parsing process itself should be able to verify integrity and weed out invalid files even if the signature matches. This can be done with additional file signatures, using length-sensitive data, value/range checking, and especially, hash/checksum values.
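A layered check along those lines might be sketched as follows. Everything about the layout is an assumption made up for the example: an 8-byte signature (printable text ending in 0x1A), a 4-byte big-endian payload length, a 4-byte big-endian checksum, then the payload. FNV-1a is used as the checksum only because it is short; it is not cryptographic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte signature: printable text plus 0x1A (Ctrl-Z). */
static const unsigned char MAGIC[8] = { 'D', 'E', 'M', 'O', 'F', 'M', 'T', 0x1A };
enum { HEADER_SIZE = 16 };   /* signature + 4-byte length + 4-byte checksum */

/* FNV-1a, 32-bit: a simple non-cryptographic hash used as a checksum. */
uint32_t fnv1a32(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Read a 32-bit big-endian value from 4 bytes. */
static uint32_t load_u32_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return 1 only if the signature, declared length and checksum all agree. */
int file_is_valid(const unsigned char *buf, size_t len)
{
    uint32_t payload_len;
    assert(buf != NULL);
    if (len < HEADER_SIZE || memcmp(buf, MAGIC, sizeof MAGIC) != 0) return 0;
    payload_len = load_u32_be(buf + 8);
    if (payload_len != len - HEADER_SIZE) return 0;   /* length check */
    return fnv1a32(buf + HEADER_SIZE, payload_len) == load_u32_be(buf + 12);
}
```

A file that merely starts with the right eight bytes fails the length or checksum test, so a false positive on the signature alone does not get as far as the parser.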