How do different file types differ? - file

So far, I understood:
Files have some information in their 'headers' which programs use to detect the file type.
Is it always true? If so how can I see the header?
Is it the only way of defining a file's type?
Is there a way to manually create an empty file (at least in Linux) with no headers at all?
If so, can I manually write its header and create a simple 'jpg' file

No, files simply have bytes and some metadata like a filename, permissions, last modified time. The format of those bytes is completely free and open without convention. Certainly some file types like jpegs,gifs,audio and video files have headers specified in their formats. Viewing a header depends completely on the format involved. They will normally be comprised of byte codes meaningless to the human eye so some software would normally be required to decode and view them.
yes.
touch emptyFile
Sounds painful. Use a library to write a jpeg. Headers are not necessarily easy to create. Someone else has done this hard work for you so I'd use it.

A file is nothing more than a sequence of bytes, and it has no default, internal structure. It is an abstraction made by the OS to make it more convenient to store and manipulate data.
Files may represent different types of things like images, videos, audio and plain text, so these need to be interpreted in a certain way in order to interact with their contents. For instance; an image is opened in an image viewer, a PDF document is opened in a PDF viewer; an audio file is opened in a media player. That doesn't mean you cannot open an image in a text editor - the file's contents will only be interpreted differently.
The closest thing to file metadata in UNIX and Linux is the inode - which stores information about files, but are not part of the files themselves - and the files's magic number. Use stat to inspect the inode and use file to determine its type (sometimes based on its magic number).
Also check out man file for more information about file types.

Related

C - Storing a large group of files as a single resource

Please forgive me if there is a glaringly obvious answer to this question; I haven't found it because I'm not entire sure what I'm looking for. It may well be this duplicates a question I haven't found; sorry.
I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly (not that I anticipate it being distributed I'm looking to package my own work for convenience).
In my own opinion it would be more convenient if the file library was stored in a single file that remained accessible to the application for example alongside /usr/bin/APPLICATION or in the most appropriate location; accessed by the executable when required.
I searched for questions similar and found suggestions that indicated two possible options Resource Files which appear to be native to Windows and Including files at compile. The first question leads to an answer similar to the second and doesn't answer the question relating to the existence of resource files for linux executables. It (like the second) looks at including the datafile in the compilation process. This is not so useful as if I only want to update my resources I'm forced to recompile the entire application (the media is dynamically added).
QUESTION: Is there a way to store a variety of file types in one single file accessible to an executable in linux, and if so how would you implement this?
My thoughts on this initially were to create a .zip or .gz file which might also offer compression as an added bonus but I have no idea how (or if it is even possible) to access data within such a file on the fly. I'm equally uncertain if there is a specific file type or library that offers a more suitable solution. Also I know virtually nothing about .dat files could these be used in this context on a linux system?
I do not understand why you would use a single file at all. Considering the added complexity (and increased chance of bugs creeping in) of file extraction and the associated overheads, I do not see how it would be "more convenient".
I have a C executable that uses text, audio, video, icons and a variety of different file types.
So do many other Linux applications. The normal approach, when using package management, is to put the architecture independent data (icons, audio, video, and so on) for application /usr/bin/YOURAPP in /usr/share/YOURAPP/, and architecture dependent data (like helper binaries) in /usr/lib/YOURAPP. It is extremely common for the latter two to be full directory trees, sometimes quite deep and wide.
For locally compiled stuff, it is common to put these in /usr/local/bin/YOURAPP, /usr/local/share/YOURAPP/, and /usr/local/share/YOURAPP/ instead, just to avoid confusing the package manager. (If you check ./configure scripts or read Makefiles, this is the chief purpose of the PREFIX variable they support.)
It is also common for the /usr/bin/YOURAPP to be a simple shell script, setting environment variables, or checking for user-specific overrides (from $HOME/.YOURAPP/), ending up with exec /usr/lib/YOURAPP/YOURAPP.bin [parameters...], which replaces the shell with the actual binary executable without leaving the shell in memory.
As an example, /usr/share/octave/ on my machine contains a total of 138 directories (in a hierarchy of up to 7 directories deep) and 1463 files; about ten megabytes of "stuff" all told. LibreOffice, Eagle, Fritzing, and KiCAD take hundreds of megabytes there each, so Octave is not an extreme example in any way either.
You have several alternatives (TODO: add more ;)):
You can read some archiver file format specifications, writting code to read/write to those archivers, and waste your time doing so.
You can invent a dirty, simple file format, for example ("dsa" stands for "Dirty and Simple Archiver"):
#include <stdint.h>
// Located at the beginning of the file
struct DSAHeader {
char magic[3]; // Shall be (char[]) { 'D', 'S', 'A' }
unsigned char endianness; // The rest of the file is translated according to this field. 0 means little-endian, 1 means big-endian.
unsigned char checksum[16]; // MD5 sum of the whole file. (when calculating checksums, this field is psuedo-filled with zeros).
uint32_t fileCount;
uint32_t stringTableOffset; // A table containing the files' names.
};
// A dsaHeader.fileCount-sized array of DSAInodeHeader follows the DSAHeader.
struct DSANodeHeader {
unsigned char type; // 0 means directory, 1 means regular file.
uint32_t parentOffset; // Pointer to the parent directory, or zero if the node is in the root.
uint32_t offset; // The node's type-dependent header starts here.
uint32_t nodeSize; // In bytes for files, and in number of entries for directories.
uint32_t dataOffset; // The file's data starts at this offset for files, and a pointer to the first DSADirectoryEntryHeader for directories.
uint32_t filenameOffset; // Relative to the string table.
};
typedef uint32_t DSADirectoryEntryHeader; // Offset to the entry's DSANodeHeader
The "string table" is a contiguous sequence of null-terminated character strings.
This format is greatly simple (and portable ;)). And, as a bonus, if you want (de)compression, you can use something like Zip, BZ2, or XZ to (de)compress your file (those programs/formats are archiver-agnostic, i.e, not dependent on tar, as commonly believed).
As last last (or first?) resort, you may use an existent library/API for manipulating archivers and compressed file formats.
Edit: Added support for directories :).
I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly.
Considering the added complexity of associated differrent file types alongwith folder structure large and deep and required installed with application. Adding a single resources file would be difficult or would say near to immpossible to trace changes in case if you want to change resources dynamically. Certainly, adding resources to executable file is not an option as it will be increase the size of executable file and needed frequent re-complation in case of update of resources.
After giving consideration on all aspects of your project it seems to me the solution would be using INI file. INI would be stored at definate location and other resources location should be prived in INI File. As with INI you can store the locations of resources, hash keys and sizes easily and would easy check the changes or update the resources.
Since you are using already compressed versions of File type and thus General Zipping algos would not work as the rate would be very low. Thus recommend to use 7z algos for compression. From various algo I would suggest to opt of xz zipping algo as it is currently used by many opensource project to compress the binaries and decrease the size.
Foreach file compression its crc32 or hash value should also included in INI file to check the validity of data transfered.
Lets say you have:
top-level-folder/
|
- your-linux-executable
- icon-files-folder/
- image-files-folder/
- other-folders/
- other-files
Do this (inside top-level-folder)
tar zcvf my-package.tgz top-level-folder
To expand, do this:
tar zxvf my-package.tgz

How are the binary data inside a certain format parsed?

Considering a binary data (video/images/audio/executable) can be regarded as a long sequence of random bytes,
when the data is inside a special format (SQL, BOLB in database, MP3, JSON, XML etc), how does the parser know that a special char(or sequence of chars, like {,},\t,space,EOF) is used in formatting, not a part of the binary data and vice versa?
Also, I am not quite sure which category this question fits in, so I put lexical analysis and linguistics. What subject/fields of computer science studies this?
This is indeed an odd place for this question. I'm a little unclear on exactly what you're asking here, but in sum, not all binary data (assuming you mean machine readable data) are equal. For instance: audio, images, and video are not executable data, they are parsed data; as such they are handled differently.
Also, "binary data" are not as arbitrary as you might think upon opening a hex editor for the first time :). Executables are structured into DATA and CODE segments, so with those flags the computer knows how to treat things appropriately. As for the other three types you mentioned, they are all structured differently depending upon their file format, which is why so many different file formats are out there! The executable program which parses these files knows how to handle them based upon information contained in the code about the file format, which of course means that the program has to know how to handle the file format and have info on how it is segmented to load it properly, which is why you can't open an MP3 in Microsoft Paint.
As for the study of file formats and data storage, that has applications in a lot of areas, it's not really a field unto itself so much as a topic that comes up in a lot of areas. Information Theory, Reverse Engineering, Natural Language Processing, and many others have uses for understanding different file types and how they store data. Anyhow, this was only a brief, cursory explanation, and there's plenty of things you can google (try .exe file formats or .jpg/.png file formats to start).

Why do file formats have magic numbers?

For example, Portable Executable has several, including the famous "MZ" at the beginning, as well as the "PE\0\0" at the start of the PE header. The Rar file format has the "Rar!" header at the beginning, and several others have similar "magic values" in the file.
What purpose do such magic values serve?
Because users change the file extension, or other programs steal the file extension, it allows the application to cancel processing of a file in an unknown format instead of trying its best and then failing anyway.
the concept of magic numbers goes back to unix and pre-dates the use of file extensions.
The original idea of the shell was that all 'executable' would look the same - it didn't matter how the file had been created or what program should be used to evaluate it. The shell would look at the contents of the file and determine the appropriate file. Microsoft came along and chose a different approach and the era of file extensions was born. Then to make things 'nicer' for users microsoft chose to 'hide' these extensions and the era of trojan files which look like they are of one type but really have a different extension and are processed by a different file was born.
If two applications store data differently, but are constructed such that a file for one might possibly also be a valid (but meaningless) file for the other, very bad things can happen. A program may think it has successfully loaded the file (unaware that the data is meaningless) and then write back a file which to it would be semantically identical, but which would no longer be meaningfully readable by the application that wrote it (or anything else for that matter).
Using magic numbers doesn't entirely prevent this, but it can help at least somewhat.
BTW, trying to guess about the format of data is often very dangerous. For example, suppose one has a list of what are probably dates in the format nn-nn-nn. If one doesn't know what format the dates are in, there may be enough information to pretty well guess the format (e.g. if one of the records is 12-31-99, then absent information to the contrary, the dates are probably mm-dd-yy) but if all dates are within the first 12 days of a month, the data could easily be misinterpreted. Suppose, though, the data were preceded by something saying "MM-DD-YY". Then the risks of misinterpretation could be reduced.
To quickly identify the type of the file, or the positions within it.
Your question should not be “why do file formats have magic number”, but rather “what are the advantages of file formats having magic number”!
Suggestions:
Programs that undelete files by reading disk free space may recognize file types
Your UNIX knows whether an executable file is to be interpreted (she-bang) or is binary
When you lose extensions, programs like file can detect what your files are
Designer of file formats consider it is always safer when applications can easily ensure they are reading a file which has the good format.
As you have a header, it does not cost much to put it at header start.

Reading tag data for Ogg/Flac files

I'm working on a C library that reads tag information from music files. I've already got ID3v2 taken care of, but I can't figure out how Ogg files are structured.
I opened a .ogg file in a hexeditor and I could find the tag data because that was all human readable. But everything from the beginning of the file to the tag data looked like garbage. How is this data encoded?
I don't need any help in the actual code, I just need help visualizing what a Ogg header looks like and what encoding it uses so I that I can read it. I'd like to use a non-hacky approach to reading Ogg files.
I've been looking at the Flac format, which has been helpful.
The Flac file I'm looking at has about 350 bytes between the "fLac" identifier and the human readable Comments section, and none of it is human readable in my hex editor, so I'm sure there has to be something important in there.
I'm using Linux, and I have no intention of porting to Windows or OS X. So if I need to use a glibc only function to convert the encoding, I'm fine with that.
The Ogg file format is documented here. There is a very nice graphical visualization as you requested with a detailed written description.
You may also want to look at libogg which is a open source BSD-licensed library for reading and writing Ogg files.
As is described in the link you provided, the following metadata blocks can occur between the "fLaC" marker and the VORBIS_COMMENT metadata block.
STREAMINFO: This block has information about the whole stream, like sample rate, number of channels, total number of samples, etc. It must be present as the first metadata block in the stream. Other metadata blocks may follow, and ones that the decoder doesn't understand, it will skip.
APPLICATION: This block is for use by third-party applications. The only mandatory field is a 32-bit identifier. This ID is granted upon request to an application by the FLAC maintainers. The remainder is of the block is defined by the registered application. Visit the registration page if you would like to register an ID for your application with FLAC.
PADDING: This block allows for an arbitrary amount of padding. The contents of a PADDING block have no meaning. This block is useful when it is known that metadata will be edited after encoding; the user can instruct the encoder to reserve a PADDING block of sufficient size so that when metadata is added, it will simply overwrite the padding (which is relatively quick) instead of having to insert it into the right place in the existing file (which would normally require rewriting the entire file).
SEEKTABLE: This is an optional block for storing seek points. It is possible to seek to any given sample in a FLAC stream without a seek table, but the delay can be unpredictable since the bitrate may vary widely within a stream. By adding seek points to a stream, this delay can be significantly reduced. Each seek point takes 18 bytes, so 1% resolution within a stream adds less than 2k. There can be only one SEEKTABLE in a stream, but the table can have any number of seek points. There is also a special 'placeholder' seekpoint which will be ignored by decoders but which can be used to reserve space for future seek point insertion.
Just after the above description, there's also the specification of the format of each of those blocks. The link also says
All numbers used in a FLAC bitstream are integers; there are no floating-point representations. All numbers are big-endian coded. All numbers are unsigned unless otherwise specified.
So, what are you missing? You say
I'd like a non-hacky approach to reading Ogg files.
Why re-write a library to do that when they already exist?

Writing more to a file than just plain text

I have always been able to read and write basic text files in C++, but so far no one has discussed much more than that.
My question is this:
If developing a file type by myself for use by an application I also create, how would I go about writing the data to a file and preserve the layout, formatting, etc.? Are there any standards, or does it just depend on the creativity of the programmer?
You basically have to come up with your own file format and write binary data.
You can also serialize your object model and write the output to a file, but that's usually less efficient.
Better to use an existing database, or use xml (or other) for simple needs. If you want to write a file in a format that already exists, find a library that supports it.
You have to know the binary file format for the file you are trying to create. Consider Joel's post on this topic: the 97-2003 File Format is a 349 page spec.
Nearly all the time, to do something like that, you use an API, to avoid the grunt work. Be careful however, because trial and error and figuring out "what works" by trial and error can result in an upgrade of the program breaking your code. Plus you have to take into account other operating systems, minor version differences, patches, etc.
There are a number of standards of course. The likely one to use is some flavor of xml since there are libraries and tools that already exist to help you work with it, but nothing is stopping you from inventing your own.
Well you could store the data in a format you could read, but which maintained the integrity of your data (XML or JSON for instance).
Or (shudder) you could come up with your own propriatory binary format, and use that.
you would go at it exactly the same way as you would a text file. writing your data byte by byte, encoded in such a way that when you read the file you know what you are reading.
for a spreadsheet application you could even use a text format (OOXML, OpenDocument) to store presentation and content information.
Or you could define binary datastructures and write that directly to the file.
the choice between text or binary format depends on the application. for a configuration file you may prefer a text file which can be modified outside your app, for a database you will most likely choose a binary format for performance reasons.
See wotsit.org for information on file formats for various file types. Example: You can figure out exactly how to write out a .BMP file and how it is composed.
Writing to a database can be done by using a wrapper class in your language, mainly passing it SQL commands.
If you create a binary file , you can write any file to it . The only drawback is that you have to know exactly where it starts and where it ends .
Use xml (something open, descriptive, and validatable), and stick with the text. There are standards for this sort of thing as well, including ODF
You can open the file as binary, instead of text (how one does this depends somewhat on the platform), from there you can write the data directly out to disk. The only real caveat to this is endianess, which can become an issue when moving the files from one architecture to another (x86 to PPC for instance).
Writing binary data to disk is really no harder than writing text, and really, your creativity is key for how you store the data.
The general problem is usually referred to as serialization of your application state and in your case with a source/target of a file in whatever format makes sense for you. These days the preferred input/output format is XML, and you may want to look into the existing standards in this field. The problem then becomes how do I map from the state of my system to the particular schema. Boost has a serialization framework that you may want to check out.
/Allan
There are a variety of approaches you can take, but in general you'll want some sort of serialization library. BOOST::Serialization, or Google's Protocal Buffers are a good example of these. The basic idea is that you have memory structures (classes and objects) that represent your data, and you want to write that data to a file in a way that can be used to reconstruct those structures again.
If you're hesitant to use a library, you can do it all manually, but realize that you can end up writing a lot of redundant code, or developing your own library. See fopen, fread, fwrite and fclose for a starting point.
A typical binary file format for custom data is an "indexed file format" consisting of
-------
|index|
-------
|data |
-------
Where the index contains records "pointing" to the data.
The index consists of records containing an offset and a size. The offset tells you where in the file the data is stored and the size tells you the size of the data at that offset (i.e. the number of bytes to read).
typedef struct {
size_t offset
size_t size
} Index
typedef struct {
int ID
char First[20]
char Last[20]
char *RandomInfo
} Data
Suppose you want to store 50 records in the file you would create 50 indices and 50 data structures. The 50 index structures would be written to the file first, followed by the 50 data structures.
To read the file you would read in the 50 index structures, then from the data in the read-in index structures you could tell where to "seek" to read the data records.
Look up (fopen, fread, fwrite, fclose, ftell) for functions to read/write the data.
(Sorry my semicolon key doesn't work)
You usually use a third party library for these things. For example, you would link in a database library for say Oracle that would allow you to talk to the database. Because the underlying file type, ( i.e. Excel spreadsheet vs Openoffice, Oracle vs MySQL, etc. ) differ these libraries abstract away your need to care how the file is constructed.
Hope that helps you find what you're looking for!
1985 called, and said they have some help IFF you are willing to read up. The interchange file format is still in use today and provides some basic metadata around binary files, such as RIFF or WAV audio. (Unfortunately, TIFF is a false friend.) It allegedly even inspired PNG, so it can't be that bad.

Resources