What do files actually contain, and how are they "read"? What is a "format" and why should I worry about them? - file

As it becomes ever easier to use computers in general and get into programming in particular, an increasing fraction of beginners seem to lack certain fundamental understanding that was once taken for granted in programming circles. Meanwhile as technology advances, the details of that understanding have grown more complex (I personally was programming before Unicode existed, let alone, say, JSON or XML). So, for the sake of having a solid reference, it seems apropos to ask:
What exactly is in a file, anyway? What do we mean when we say that we "open" and "read" a file - what are we getting out of it? I know the term "data", but just giving a name to something is not a real explanation.
More importantly, how do we make sense of data? If I try simply reading some data from a file and outputting it to the console, why does it often look like garbage? Why do some other files appear to have some text scattered among that garbage, while yet others seem to be mostly or completely text? Why isn't it sufficient to ask the program to read, say, an image file, in order to display the image? Again, I know the term "format", but this doesn't explain the concept. If we say, for example, that we make sense of data according to its format, then that only raises two more questions - how do we determine the format, and how does it actually help?
Related: What exactly causes binary file "gibberish"?.

Data, bits and bytes
Everyone who has had to buy hardware, or arrange a network connection, should have some familiarity with the concept of a "bit" and of a "byte". They're used to measure the capacity of storage devices and transfer rates. In short, they measure data: the amount of data that can be stored on a disk, or the amount of data transferred along a cable (or via a wireless connection) per second.
Data is essentially information - a record of some kind of knowledge. The bit is the fundamental unit of information, representing the smallest possible amount of knowledge: the answer to a yes-or-no question, a choice between two options, a record of a decision between two alternatives. (There would need to be at least two possibilities; with only one, there was no answering, choice or decision necessary, and thus nothing is learned by seeing that single possibility arise.)
A byte is simply a grouping of bits in a standard size. Almost everyone nowadays defines a byte to mean 8 bits, mainly because all contemporary consumer hardware is designed around that concept. In some very specific technical contexts (such as certain C or C++ language standard documents), "byte" may have a broader meaning, and octet is used to be precise about 8-bit groupings. We will stick with "byte" here, because we don't need to worry about ancient hardware or idiosyncratic compiler implementations for now.
Data storage devices - both permanent ones like HDDs and SSDs, and temporary ones like RAM - use a huge amount of individual components (depending on the device) to represent data, each of which can conceptually be in either of two states (we commonly use "on or off", "1 or 0" etc. as metaphors). Because there's a decision to be made between those two states, the component thus represents one bit of data. The data isn't a physical thing - it's not the component itself. It's the state of that component: the answer to the question "which of the two possible ways is this component configured right now?".
How data is made useful
It's clear to see how we can use a bit to represent a number, if there are only two possible numbers we are interested in. Suppose those numbers are 0 and 1; then we can ask, "is the number 1?", and according to the bit that tells us the answer to that question, we know which number is represented.
It turns out that in fact this is all we need in order to represent all kinds of numbers. For example, if we need to represent a number from {0, 1, 2, 3}, we can use two bits: one that tells us whether the represented number is in {0, 1} or {2, 3}, and one that tells us whether it's in {0, 2} or {1, 3}. If we can answer those two questions, we can identify the number. This technique generalizes, using base two arithmetic, to represent any integer: essentially, each bit corresponds to a value from the geometric sequence 1, 2, 4, 8, 16..., and then we just add up (implicitly) the values that were chosen by the bits. By tweaking this convention slightly, we can represent negative integers as well. If we let some bits correspond to binary fractions as well (1/2, 1/4, 1/8...), we can approximate real numbers (including the rationals) as closely as we want, depending on how many bits we use for the fractional part. Alternately, we can just use separate groups of bits to represent the numerator and denominator of a rational number - or, for that matter, the real and imaginary parts of a complex number.
Furthermore, once we can represent numbers, we can represent all kinds of answers to questions. For example, we can agree on a sequence of symbols that are used in text; and then, implicitly, a number represents the symbol at that position in the sequence. So we can use some amount of bits to represent a symbol; and by representing individual symbols repeatedly, we can represent text.
Similarly, we can represent the height of a sound wave at a given instant in time; by repeating this process a few tens of thousands of times per second, we can represent sound audible to humans.
Similarly, having studied how the human eye works, we find that we can analyze colours as combinations of three intensity values (i.e., numbers) representing "components" of the colour. By describing colours at many points a small distance apart (like with the sound wave, but in a two-dimensional grid), we can represent images. By considering images across time (a few tens of times per second), we can represent animations.
And so on, and so on.
Choosing an interpretation
There's a problem, here, though. All of this simply talks about possibilities for what data could represent. How do we know what it does represent?
Plainly, the raw data stored by a computer doesn't inherently represent anything specific. Because it's all in the same regular, sequence-of-bits form, nothing stops us from taking any arbitrary chunk of data and interpreting it by any of the schemes described above.
It just... isn't likely to appear like anything meaningful, that way.
However, the choice of interpretations is a choice... which means it can be encoded and recorded in this raw-data form. We say that such data is metadata: data that tells us about the meaning of other data. This could take many forms: the names of our files and the folder structure (telling us how those files relate to each other, and how the user intends to keep track of them); extensions on file names, special data at the beginning of files or other notes made within the file system (telling us what type of file it is, corresponding to a file format - keep reading); documentation (something that humans can read in order to understand how another file is intended to work); and computer programs (data which tells the computer what steps to take, in order to present the file's contents to the user).
What is a (file) format?
Quite simply, a format is the set of rules that describes a way to interpret some data (typically, the contents of a file). When we say that a file is "in" a particular format, we mean that it a) has a valid interpretation according to that format (not every possible chunk of data will meet the requirements, in general) and b) is intended to be interpreted that way.
Put another way: a format is the meaning represented by some metadata.
A format can be a subset or refinement of some other format. For example, JSON documents are also text documents, using UTF-8 encoding. The JSON format adds additional meaning to the text that was represented, by describing how specific text sequences are used to represent structured data. A programming language can also be thought of as this kind of format: it gives additional meaning to text, by explaining how that text can be translated into instructions a computer can follow. (A computer's "machine code" is also a kind of format, that gets interpreted directly by the hardware rather than by a program.)
(Recall: we established that a computer program can be a kind of metadata, and that a programming language can be a kind of format, and that metadata represents a format. To close the loop: of course, one can have a computer program that implements a programming language - that's what a compiler is.)
A format can also involve multiple steps, explained by separate standards. For example, Unicode is the de facto standard text format, but it only describes how abstract numbers correspond to text symbols. It doesn't directly say how to convert the bits into numbers (and this does need to be specified; "treat each byte as a number from 0..255" a) would still be making a choice of many possible ways to do it; b) isn't really sufficient, because there are a lot more possible text symbols than that). To represent text, we also need an encoding, i.e. the rest of the rules for the data format, specifically to convert bits to numbers. UTF-8 is one such encoding, and has become dominant.
What actually happens when we read the file?
Raw data is transferred from the file on disk, into the program's memory.
That's it.
Some languages offer convenience functionality, for the common case of treating the data like text. This might mean doing some light processing on the data (because operating systems disagree about which text symbols, in what order represent "the end of a line"), and loading the data into the language's built-in "string" data structure, using some kind of encoding. (Yes, even if the encoding is "each byte represents a number from 0 to 255 inclusive, which represents the corresponding Unicode code point", that is an encoding - even if it doesn't represent all text and thus isn't a proper Unicode encoding - and it is being used even if the programmer did nothing to specify it; there is no such thing as "plain text", and ignoring this can have all kinds of strange consequences.)
But fundamentally, the reading is really just a transfer of data. Text conversion is often treated as special because, for a long time, programmers were sloppy about treating text properly as an interpretation of data; for decades there was an interpretation of data as text - one byte per text symbol (incidentally, "character" does not mean the same thing as a Unicode code point) - so well established that everyone started forgetting they were actually using it. Programmers forgot about this even though it only actually specifies what half the possible values of a byte mean and leaves the other half up to a local interpretation, and even though that scheme is still woefully inadequate for many world languages, such that programmers in many other countries came up with their own solutions. The solution - the Unicode standard, mentioned several times above - had its first release in 1991, but there are still a few programmers today blithely ignoring it.
But enough ranting.
How does interpreting a file work?
In order to display an image, render a web page, play sound or anything else from a file, we need to:
Have data that is actually intended to represent the corresponding thing;
Know the format that is used by the data to represent the thing;
Load the data (read the file, or read data from a network connection, or create the data by some other process);
Process the data according to the format.
This happens for even the simplest cases, and it can involve multiple programs. For example, a simple command-line program that inputs text from the user (from the "standard input stream") and outputs text back (to the "standard output stream"), generally, is not actually causing the text to appear on screen, or figuring out what keys were pressed on the keyboard. Instead: the operating system interprets signals from the keyboard, in order to create readable data; after the program writes out its response to the input, another program (the terminal) will translate the text into pixel colour values (getting help from the operating system to choose images from a font); then the operating system will arrange to send the appropriate data to the monitor (according to the terminal window's position etc.).

Related

Why encoding needed in Database

I have recently crossed the term encoding. I learned encoding is used to standardize different characters. Databases also encode data to standardize the table data. My question is If my column only contains decimal numbers still encoding is needed? , Is encoding have anything to do with the memory size of the data?.
With the term encoding, we means putting a semantic value in a binary representation (binary: for most modern computer context; in theory encoding is nor restricted to computer bits/bytes).
The computers works differently than us, so we need to encode everything, so that computer can interpret the numbers. Sometime the encoding is implicit, or sometime outside CPU knowledge.
On a strict form of encoding (now out of fashion) when you decide the size of an integer (and whenever it is signed or unsigned), you are choosing (partially) an encoding (some part is still often implicit: type of representation of signed characters, endness of numbers, etc.). As long you are not reading the binary file of database, these doesn't matter much, but the size change the amount of memory/disk used. (note, we tend to call this type, not encoding, for numbers)
But as you see, now encoding is often used for strings (since tens of years we do care much about encoding for numbers). With this definition (and if you see "Unicode" or "UTF-8" or something similar, this mean that we are discussing just for string encoding), it doesn't matter the encoding of a number, but just the type.
But than there are locales (e.g. internationalization): you may need a specific encoding, so that e.g. thousand separator will be printed correctly, but this may be specified as locale/collation, or as database encoding, or in the client (which is the program in charge to display data).
So: for memory and disk (database engine side), the encoding (as string encoding) of a decimal and floats doesn't matter, just select the right type. For blobs and strings, the (string) encoding matter (but also for this case, you may need to check what are the available encodings on the engine: sometime the engine will use few encodings, and transform to other on software, so no changes on disk/memory). Note: numbers are numbers, so it doesn't matter much, but for string, many SQL functions depends on encoding and locale (so on "server side", like LIKE keyword, sorting, etc.).
Your first question:If column contains decimal numbers still encoding is needed?
Answer : Columns in a table can and contain any data whether it is in form of decimal numbers or any other.
Every data no matter it is in which form either decimal,floating point,characters etc needs encoding to make it secure.Especially when you work with sensitive data or are looking for a way to organize your data in an archive you should consider encoding your data.
Your second question:Is encoding have anything to do with the memory size of the data?.
Answer : Since encoding removes redundancies from data, the size of your files will be a lot smaller. This results in faster input speed when data is saved.
Encoded data is smaller in size, you should be able to save space on your storage devices. This is ideal if you have large amounts of data that need to be archived.

Efficient Output Format for Huge Data Sets?

I have written a program that writes output to a file. The output is in 6-column, n-row format, and all values are double-precision float. It is very common in my code for n to become extremely large (1e20 or so) and, hence, the output data file also becomes extremely large.
I am currently storing everything in *.csv format, which obviously results in huge data files. Is there any more efficient way to store such values? Any new format of file or any new method that would decrease the file size significantly?
For clarification:
The data does not need to be human readable, binary would do just fine.
I will further process the data in the file to get some important parameters from the runs, probably the distance of travel, time of exit at a particular point etc.
The code is actually an astrophysical simulation of moving particles and for about 1e10 particles for a million time steps each, it gets quite high for size.
When designing a file format you have to consider various things, like:
a) Is there a chance that the file could have been corrupted, or maliciously tampered with (or is there any kind of confidentiality requirement)? The answer to this is almost always "yes". To guard against these things you need to consider some kind of checksum and/or encryption. You may also need to consider whether partial recovery is desirable (e.g. is it beneficial to split the file into multiple blocks/sections where each block has its own checksum/encryption, so that if 4 bytes in one block/section are corrupted you can still recover the majority of the data).
b) Is there a portability concern? For example, if you store raw double values in the file will it create a problem on other computers that have a different binary format for "double"?
c) For each type of value; what is the range that actually needs to be represented and what is the precision requirements? Typically software uses "larger and more precise" than necessary (often because its faster to select the next largest type the CPU supports); but for file formats this causes an unnecessary increase in file sizes. For a simple example; maybe you could convert a (64-bit) double into a 32-bit fixed point format and halve the space used while still achieving the range and precision that's actually needed.
d) Are there "clever" ways to reduce the range and precision required for some of the values? For a simple example; maybe you have "starting value" and "ending value" which both do need 64 bits; but you could convert it into "starting value" and "difference" (so that "ending value" can be calculated as "starting value + difference") where "difference" values have less range and only needs 32 bits to store.
e) Is any kind of indexing beneficial? For a simple example; if the file might contain 1 million entries and you only want to find one, then you might be able to use the index to find the offset for the entry you want and only load that one entry (and avoid loading all 1 million entries).
f) What other meta-data might you want? This can be things like a "magic signature" (so that software can check if a file is supposed to comply with the file format and the user didn't give your program the wrong type of file), a "file format version number" (so that the program can do "auto-update to new file format" or at least detect when the file uses an obsolete/deprecated file format that is no longer supported). It can also include information to identify things like who the author was, where the data came from, when the data was obtained, which program created/prepared the file, etc. Sometimes there is also optional data and flags to say if the optional data is/isn't included in the file. You might also want things like "number of entries" and "offset in file for each different area", etc.
g) What kind of allowances do you need to make for extensibility (and backward compatibility, and forward compatibility)? Typically people leave things like (e.g.) "reserved for future use" fields in headers so that they can add/change/extend the file format in future without breaking everything. Sometimes this is even more specific about what software should do when it sees values in reserved fields that it doesn't support - e.g. "reserved for future use, should be zero, if non-zero software should ignore this value" vs. "reserved for future use, should be zero, if non-zero (due to future use) software should generate an error and not use the file"
h) Are any kind of compression techniques useful? For a simple example, if you have "6-columns, N-rows" with an index, and sometimes the data for 2 or more rows happens to be the same; then maybe you can only store one copy of the data for those rows and then use the index to figure out which row uses which data (a bit like "row[n] = unique_row_data[ index[n] ]").

Convert COMP and COMP-3 Packed Decimal into readable value with C

I have an EBCDIC flat file to be processed from a mainframe into a C module. What can be a good process in converting the COMP and COMP-3 values into readable values? Do I have to convert the ebcdic characters to ascii then hex for COMP-3? What about for COMP? Thanks
Bill Woodger has given you some very good advice through his comments to your question, actually he answered the question and should have
posted his comments as an answer.
I would like to reiterate a few of his points and expand on a few others.
If you need to convert a file created from what is probably a COBOL application so it may be read
by some other non-COBOL program, possibly on a machine with an architecture unlike the one where it was created, then
you should demand that the file be created using only display formatted data (i.e. all character data). Mashing non-display
(binary, packed, encoded) data outside of the operating environment where it was created is just a formula for
long term pain. You will be subjected to the joys of sorting out various endianness issues
between architectures and code page conversions. These are the things that
file transfer protocols are designed to manage - they do it well so don't try to reinvent them. Short answer, use FTP or
similar file transport mechanism to move data between machines. And only transport display (character) based data.
Packed Decimal (COMP-3) data types occupy a varying number of bytes depending on their specific PICTURE layout. The position of the decimal point
is implied so cannot be determined without reference to the PICTURE used to define it. Packed Decimal fields may be either signed
or unsigned. If signed, the sign is imbedded in the low 4 bits of the least significant digit. Each byte of a Packed Decimal
data type contains two digits, except possibly the first and last bytes. The first byte contains only 1 digit if the field is signed
and contains an even number of digits. The last byte contains 2 digits if unsigned but only 1 if signed. There are several other subtlies that
you need to be aware of if you want to do your own Packed Decimal to character conversions. At this point I hope you can see
that this is not going to be a trivial exercise.
Binary (COMP) data types have a different but no less complex set of issues to resolve. Again, not a trivial exercise.
So what should you be doing? Basically, do as Bill suggested. Have the program that generates this file use display formats
for output (meaning you have to do nothing). Or, failing that, use a utility program such as DFSORT/SYNCSORT do the conversions
for you. Going the utility
route still requires that you have the original COBOL file layout (and that you understand it) in order to do the conversion.
The last resort is simply writing a simple read-a-record-write-a-record COBOL program that takes in the unformatted data, MOVEes
each COMP-whatever field to a corresponding DISPLAY field and write it out again.
As Bill said, if the group that produced this file tells you that it is too difficult/expensive to produce a DISPLAY formatted
output file they are lying to you or they are incompetent or just too lazy to
do the job they were hired to do. I can think of no other excuses.
Use XML to transport data.
That is, write a program that converts your file into characters (if on mainframe, stay with the EBCIDIC but numeric fields are unpacked, etc.) and then enclose each record and each field in XML tags.
This avoids formatting issues (what field is in column 1, what field in column 2, are the delimters spaces or commas or either, etc. ad nauseum).
Then transmit the XML file with your favorite utility that converts from EBCIDIC to ASCII.

How do you create a file format?

I've been doing some reading on file formats and I'm very interested in them. I'm wondering what the process is to create a format. For example, a .jpeg, or .gif, or an audio format. What programming language would you use (if you use a programming language at all)?
The site warned me that this question might be closed, but that's just a risk I'll take in the pursuit of knowledge. :)
what the process is to create a format. For example, a .jpeg, or .gif, or an audio format.
Step 1. Decide what data is going to be in the file.
Step 2. Design how to represent that data in the file.
Step 3. Write it down so other people can understand it.
That's it. A file format is just an idea. Properly, it's an "agreement". Nothing more.
Everyone agrees to put the given information in the given format.
What programming language would you use (if you use a programming language at all)?
All programming languages that can do I/O can have file formats. Some have limitations on which file formats they can handle. Some languages don't handle low-level bytes as well as others.
But a "format" is not an "implementation".
The format is a concept. The implementation is -- well -- an implementation.
You do not need a programming language to write the specification for a file format, although a word processor might prove to be a handy tool.
Basically, you need to decide how the information of the file is to be stored as a sequence of bits. This might be trivial, or it might be exceedingly difficult. As a trivial example, a very primitive bitmap image format could start with one unsigned 32-bit integer representing the width of the bitmap, and then one more such integer representing the height of the bitmap. Then you could decide to simply write out the colour of the pixels sequentially, left-to-right and top-to-bottom (row 1 of pixels, row 2 of pixels, ...), using 24-bits per pixel, on the form 8 bits for red + 8 bits for green + 8 bits for blue. For instance, a 8×8 bitmap consisting of alternating blue and red pixels would be stored as
00000008000000080000FFFF00000000FFFF0000...
In a less trivial example, it really depends on the data you wish to save. Typically you would define a lot of records/structures, such as BITMAPINFOHEADER, and specify in what order they should come, how they should be nestled, and you might need to write a lot of indicies and look-up tables. Myself I have written quite a few file formats, most recently the ASD (AlgoSim Data) file format used to save AlgoSim structures. Such files consists of a number of records (maybe nestled), look-up tables, magic words (indicating structure begin, structures end, etc.) and strings in a custom-defined format. One typical thing that often simplifies the file format is that the records contain data about their size, and the sizes of the custom data parts following the record (in case the record is some sort of a header, preceeding data in a custom format, e.g. pixel colours or sound samples).
If you havn't been working with file formats before, I would suggest that you learn a very simple format, such as the Windows 3 Bitmap format, and write your own BMP encoder/decoder, i.e. programs that creates and reads BMP files (from scratch), and displays the read BMP files. Then you now the basic ideas.
Fundamentally, files only exist to store information that needs to be loaded back in the future, either by the same program or a different one. A really good file format is designed so that:
Any programming language can be used to read or write it.
The information a program would most likely need from the file can be accessed quickly and efficiently.
The format can be extended and expanded in the future, without breaking backwards compatibility.
The format should accommodate any special requirements (e.g. error resiliency, compression, encoding, etc.) present in the domain in which the file will be used
You are most certainly interested in looking into Protocol Buffers and Thrift. These tools provide a modern, principled way of designing forwards and backward compatible file formats.

How do I choose a good magic number for my file format?

I am designing a binary file format from scratch, and I would like to include some magic bytes at the beginning so that it can be identified easily. How do I go about choosing which bytes? I am not aware of any central registry of magic numbers, so is it just a matter of picking something fairly random that isn't already identified by, say, the file command on a nearby UNIX box?
Stay away from super-short magic numbers. Just because you're designing a binary format doesn't mean you can't use a text string for identifier. Follow that by an EOF char, and as an added bonus people who cat or type your binary file won't get a mangled terminal.
There is no universally correct way. Best practices can be suggested, but these often situational. For example, if you're checking the integrity of volatile memory, which has an undefined initial state when power is applied, it may be beneficial to incorporate many 0s or 1s in a sequence (i.e. FFF0 00FF F000) which can stand out against random noise.
If the file is mostly binary, a popular choice is using a text encoding like ASCII which stands out among the binary data in a hex editor. For example, GIF uses GIF89a, FLAC uses fLaC. On the other hand, a plain text identifier may be falsely detected in a random text file, so invalid/control characters might be incorporated.
In general, it does not matter that much what they are, even a bunch of NULL bytes can be used for file detection. But ideally you want the longest unique identifier you can afford, and at minimum 4 bytes long. Any identifier under 4 bytes will show up more often in random data. The longer it is, the less likely it will ever be detected as a false positive. Some known examples are as long as 40 bytes. In a way, it's like a password.
Also, it doesn't have to be at offset 0. The file signature has conventionally been at offset zero, since it made sense to store it first if it will be processed first.
That said, a single file signature should not be the only line of defense. The actual parsing process itself should be able to verify integrity and weed out invalid files even if the signature matches. This can be done with additional file signatures, using length-sensitive data, value/range checking, and especially, hash/checksum values.

Resources