FAT32: root directory entries - c

We are building a fat32 filesystem manipulation tool in C and are currently trying to access all the entries in the root directory (situated right after the two FAT tables).
The first question is : Are all the root directory entries contiguous in the data region ? If not, given the first entry, how can we access the next entry ?
Does it have anything to do with the tags "low cluster / high cluster" or do we need to look in the FAT table for it (root directory) ?
Basically, we have the "equation" that leads us to the data region. Based on that, we point on the cluster, but after that, we don't really know how to find the next entry in the Root Directory.
This might seem confusing, but if you need pieces of code or more information, I will provide them.
Thank you in advance.

FAT (also FAT32) directory entries are 32bytes and appear in a sequential order.
To store long file names an entry could need multiples of 32 bytes.
About how L(ong)F(ile)N(ames) are marked (from wikipedia):
Long File Names (LFN) are stored on a FAT file system using a trick—adding (possibly multiple) additional entries into the directory before the normal file entry. The additional entries are marked with the Volume Label, System, Hidden, and Read Only attributes (yielding 0x0F), which is a combination that is not expected in the MS-DOS environment, and therefore ignored by MS-DOS programs and third-party utilities. (ff)
Referring your second question (from wikepedia):
[...] VFAT LFN entries always have the cluster value at 0x1A set to 0x0000 and the length entry at 0x1C is never 0x00000000 [...]

Related

What is the real size of file?

How it is possible that text file has size, which is equal to number of chars inside? For example in file.txt you have string "abc" and size of it is 3 bytes. Everything fine, but what with file icon, filename and file informations? Where these data has been stored?
I checked it on Windows, but at Unix systems situation is probalby the same.
When the file is written to disk, it is by means of low level system call like write() and operating systems know exactly how many bytes they write in a given file on a disk. This information, as well as several others (creation and modification date, ownership, etc) is written with the file.
In linux (and generally unix), it is by means of an inodethat fully describes the file. Informations stored in these inodes are:
* access mode
* ids of user and group that owns the file
* size in bytes
* date of creation, modification and access
* list of disk blocks containing file data
These are more or less the informations that are displayed by ls -l
You can also see inode number of each file with ls -i
You can find here additional details on inodes.
Other informations are coded differently. Names, for instance are only in special files describing a directory, not in the inode. A directory is indeed a list that associate a name with an inode.
Icons are generally defined system wide and the association of an icon with a file is done with either filename (and file extension) or with a file "type" that is written in the "inode" (or its equivalent in other OS).
Disks allocate space in blocks. Blocks historically were 512 bytes but that has increased over the years so that 4K is common. Your file size will always be a multiple of the block size.
Most file systems (and Windoze does this) allocate disk space in clusters. A cluster is a number of adjacent blocks. Your file size then will always be a multiple of the block size times the cluster factor. This is usually the size of the file as counted by the operating system.
This all depends upon the disk format and the operating system:
Everything fine, but what with file icon, filename and file informations? Where these data has been stored?
The file information (date, owner, etc.) are usually in some kind of master file table. Sometimes this table will have extensions where information can be stored. Security information is often store in such overflows.
A rationally designed file system will have "A" filename stored in the header. File names are also stored in directories and a file can have multiple names if it is linked to multiple directories. The header file name is used to restored the file in case of corruption.
The location of an icon is entirely system specific and can be done in many ways. In the case of executable files, they are often stored in the file itself. They can also be hidden files in the same directory.

C - Storing a large group of files as a single resource

Please forgive me if there is a glaringly obvious answer to this question; I haven't found it because I'm not entire sure what I'm looking for. It may well be this duplicates a question I haven't found; sorry.
I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly (not that I anticipate it being distributed I'm looking to package my own work for convenience).
In my own opinion it would be more convenient if the file library was stored in a single file that remained accessible to the application for example alongside /usr/bin/APPLICATION or in the most appropriate location; accessed by the executable when required.
I searched for questions similar and found suggestions that indicated two possible options Resource Files which appear to be native to Windows and Including files at compile. The first question leads to an answer similar to the second and doesn't answer the question relating to the existence of resource files for linux executables. It (like the second) looks at including the datafile in the compilation process. This is not so useful as if I only want to update my resources I'm forced to recompile the entire application (the media is dynamically added).
QUESTION: Is there a way to store a variety of file types in one single file accessible to an executable in linux, and if so how would you implement this?
My thoughts on this initially were to create a .zip or .gz file which might also offer compression as an added bonus but I have no idea how (or if it is even possible) to access data within such a file on the fly. I'm equally uncertain if there is a specific file type or library that offers a more suitable solution. Also I know virtually nothing about .dat files could these be used in this context on a linux system?
I do not understand why you would use a single file at all. Considering the added complexity (and increased chance of bugs creeping in) of file extraction and the associated overheads, I do not see how it would be "more convenient".
I have a C executable that uses text, audio, video, icons and a variety of different file types.
So do many other Linux applications. The normal approach, when using package management, is to put the architecture independent data (icons, audio, video, and so on) for application /usr/bin/YOURAPP in /usr/share/YOURAPP/, and architecture dependent data (like helper binaries) in /usr/lib/YOURAPP. It is extremely common for the latter two to be full directory trees, sometimes quite deep and wide.
For locally compiled stuff, it is common to put these in /usr/local/bin/YOURAPP, /usr/local/share/YOURAPP/, and /usr/local/share/YOURAPP/ instead, just to avoid confusing the package manager. (If you check ./configure scripts or read Makefiles, this is the chief purpose of the PREFIX variable they support.)
It is also common for the /usr/bin/YOURAPP to be a simple shell script, setting environment variables, or checking for user-specific overrides (from $HOME/.YOURAPP/), ending up with exec /usr/lib/YOURAPP/YOURAPP.bin [parameters...], which replaces the shell with the actual binary executable without leaving the shell in memory.
As an example, /usr/share/octave/ on my machine contains a total of 138 directories (in a hierarchy of up to 7 directories deep) and 1463 files; about ten megabytes of "stuff" all told. LibreOffice, Eagle, Fritzing, and KiCAD take hundreds of megabytes there each, so Octave is not an extreme example in any way either.
You have several alternatives (TODO: add more ;)):
You can read some archiver file format specifications, writting code to read/write to those archivers, and waste your time doing so.
You can invent a dirty, simple file format, for example ("dsa" stands for "Dirty and Simple Archiver"):
#include <stdint.h>
// Located at the beginning of the file
struct DSAHeader {
char magic[3]; // Shall be (char[]) { 'D', 'S', 'A' }
unsigned char endianness; // The rest of the file is translated according to this field. 0 means little-endian, 1 means big-endian.
unsigned char checksum[16]; // MD5 sum of the whole file. (when calculating checksums, this field is psuedo-filled with zeros).
uint32_t fileCount;
uint32_t stringTableOffset; // A table containing the files' names.
};
// A dsaHeader.fileCount-sized array of DSAInodeHeader follows the DSAHeader.
struct DSANodeHeader {
unsigned char type; // 0 means directory, 1 means regular file.
uint32_t parentOffset; // Pointer to the parent directory, or zero if the node is in the root.
uint32_t offset; // The node's type-dependent header starts here.
uint32_t nodeSize; // In bytes for files, and in number of entries for directories.
uint32_t dataOffset; // The file's data starts at this offset for files, and a pointer to the first DSADirectoryEntryHeader for directories.
uint32_t filenameOffset; // Relative to the string table.
};
typedef uint32_t DSADirectoryEntryHeader; // Offset to the entry's DSANodeHeader
The "string table" is a contiguous sequence of null-terminated character strings.
This format is greatly simple (and portable ;)). And, as a bonus, if you want (de)compression, you can use something like Zip, BZ2, or XZ to (de)compress your file (those programs/formats are archiver-agnostic, i.e, not dependent on tar, as commonly believed).
As last last (or first?) resort, you may use an existent library/API for manipulating archivers and compressed file formats.
Edit: Added support for directories :).
I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly.
Considering the added complexity of associated differrent file types alongwith folder structure large and deep and required installed with application. Adding a single resources file would be difficult or would say near to immpossible to trace changes in case if you want to change resources dynamically. Certainly, adding resources to executable file is not an option as it will be increase the size of executable file and needed frequent re-complation in case of update of resources.
After giving consideration on all aspects of your project it seems to me the solution would be using INI file. INI would be stored at definate location and other resources location should be prived in INI File. As with INI you can store the locations of resources, hash keys and sizes easily and would easy check the changes or update the resources.
Since you are using already compressed versions of File type and thus General Zipping algos would not work as the rate would be very low. Thus recommend to use 7z algos for compression. From various algo I would suggest to opt of xz zipping algo as it is currently used by many opensource project to compress the binaries and decrease the size.
Foreach file compression its crc32 or hash value should also included in INI file to check the validity of data transfered.
Lets say you have:
top-level-folder/
|
- your-linux-executable
- icon-files-folder/
- image-files-folder/
- other-folders/
- other-files
Do this (inside top-level-folder)
tar zcvf my-package.tgz top-level-folder
To expand, do this:
tar zxvf my-package.tgz

FAT System Identification of free space and structure of entry files?

Been seaching google for a good explanation for how FAT systems identify free space and the structure of FAT Entry files.
Alot of the explanations ive found are quite hard to follow can anyone help brief sum these up?
i understand that clusters are marked as unused but is this within the root directory or data region? and is the information on clusters status just marked in a table?
I haven't managed to gain any knowledge on the structure of the entry files either, just that they use chains to keep the clusters together
Anyone help?
A file system can be thought of having three (3) types of data: file data, file meta-data and file system meta-data. File data is file or directory contents. File meta-data is that which tells us where the file data is stored on the disk. File system meta-data tells us how the file system allocates the blocks used in the file system.
The FAT file system however does not keep the lines so clear cut. Its disk structures often blur these distinctions.
The File Allocation Table (FAT) itself blurs the lines of the file meta-data and file system meta-data. That is, the FAT entries identify both the cluster number of the where the next cluster of file (or directory) data can be found as well as indicating to the file system whether the cluster identified by the index into the FAT is available (or not). As you indicated in your question, this forms a chain. A special marker (the specific value escapes my memory) indicates that the cluster identified by the index into the FAT is the last cluster in the chain.
Directory entries in a FAT based file system are both file data and file meta-data. They read like files with their entries being the "file data". However, their entries are also interpreted as file meta-data, for they contain the file attributes (permissions, file size, and the starting cluster number--which is an index into the FAT).
The root directory is a special directory on a FAT file system. If memory serves, it does not have either a "." nor a ".." entry. On FAT12 and FAT16 systems, the size of the root directory is specified when the disk is formatted and is thus of fixed size--however, its clusters are still marked in the FAT. On FAT32, the root directory size is not set at format time and can grow. The starting cluster of the root directory is stored in a special field in one of the file system meta-data structures (as I'm going by memory the name of this structure eludes me).
Hope this helps.
Here is a fairly long article that has lots of information about fat file systems.
It should provide all the details you need.
http://en.wikipedia.org/wiki/File_Allocation_Table

Changing inode behaviour

I am trying to modify the ext3 file system. Basically I want to ensure that the inode for a file is saved in the same (or adjacent) block as the file that it stores metadata for. Hopefully this should help disk access performance
I grabbed the kernel source, compiled it, read a bunch about inodes and looked the inode.c file in the fs subdirectory. However, I am just not sure how I can ensure that any new file being created, and the inode for this file, can be saved in the same or adjacent blocks. Any help or pointers to further readings would be appreciated. Thanks!
Interesting idea.
I'm not deeply familiar with ext3, but I can give you some general pointers.
Currently ext3 stores inodes in predetermined places. Each block group has its own inode table, an array of inodes. So when you have an inode number (i.e., as the result of looking up a filename in a directory), you can find the corresponding inode on disk by using the inode number first to select the correct block group and then to index into that block group's inode table.
If you want to put the inodes next to the corresponding file data, you'll need a new scheme for finding an inode on disk. If you're willing to dedicate a block for each inode, then one possible scheme would be to allocate a new block every time you need an inode and then use the block number as the inode number. This might have the benefit that for small files you could store the data in that same block.
To make something like this happen, creating a new file (i.e., allocating an inode) would have to work very differently than in the current ext3 file system. Instead of using a bitmap to find an unused, pre-allocated and pre-initialized inode, you would have to allocate an empty block and initialize it yourself. So, you'll probably want to look at how the file system allocates blocks when it's writing to a file, then mimic that for allocating an inode.
An alternative scheme would be to store the inode inside the directory. So you save an I/O not because the inode is next to its data, but because when you lookup the filename you also read the inode. This was done back in the 90s as an experiment in BSD's FFS file system, and was written up in an excellent USENIX Paper. Those ideas never made it into FFS, or into any other main stream file system that I'm aware of, so it might be interesting to see how they work in ext3.
Regardless of whether you pursue one of these schemes or come up with something of your own, you'll also have to modify mke2fs to initialize the file system on disk in a way that your new file system variant will understand.
Good luck! It sounds like a fun project.
Kudos for getting into file system design!
First, a bit of engineering advice before you get too deep into hacking: make a copy of the ext3 tree and rename the file system to something else. I've found that when introducing experimental changes into a file system, you really don't want it to be used for your main system. Your system should still boot even if you introduce a bug that randomly loses files (it will eventually happen). You'll also need to branch the ext3 userspace tools to work with your new system.
Second, go get a copy of Understanding the Linux Kernel, 3 ed. by Bovet and Cesati. It presents an organized view of kernel subsystems, and I've found its explanations to be worthwhile. It's written for an older kernel (2.6.x for some x < 15; I forget exactly), but it's still accurate in many places. Read through its descriptions of file systems. I believe it covers ext3.
Third, about your actual project, you aren't proposing a simple modification to ext3. That file system has a pretty straightforward way of mapping an inode number to a disk block. You'll need to find a new way of doing this mapping. I would not anticipate any changes to the rest of ext3. Solving this challenge may be one of the key design points of your architecture. Note that keeping around a big array of inode -> disk block maps doesn't solve your problem: it's probably no better than existing ext3.

The FAT, Linux, and NTFS file systems

I heard that the NTFS file system is basically a b-tree. Is that true? What about the other file systems? What kind of trees are they?
Also, how is FAT32 different from FAT16?
What kind of tree are the FAT file systems using?
FAT (FAT12, FAT16, and FAT32) do not use a tree of any kind. Two interesting data structures are used, in addition to a block of data describing the partition itself. Full details at the level required to write a compatible implementation in an embedded system are available from Microsoft and third parties. Wikipedia has a decent article as an alternative starting point that also includes a lot of the history of how it got the way it is.
Since the original question was about the use of trees, I'll provide a quick summary of what little data structure is actually in a FAT file system. Refer to the above references for accurate details and for history.
The set of files in each directory is stored in a simple list, initially in the order the files were created. Deletion is done by marking an entry as deleted, so a subsequent file creation might re-use that slot. Each entry in the list is a fixed size struct, and is just large enough to hold the classic 8.3 file name along with the flag bits, size, dates, and the starting cluster number. Long file names (which also includes international character support) is done by using extra directory entry slots to hold the long name alongside the original 8.3 slot that holds all the rest of the file attributes.
Each file on the disk is stored in a sequence of clusters, where each cluster is a fixed number of adjacent disk blocks. Each directory (except the root directory of a disk) is just like a file, and can grow as needed by allocating additional clusters.
Clusters are managed by the (misnamed) File Allocation Table from which the file system gets its common name. This table is a packed array of slots, one for each cluster in the disk partition. The name FAT12 implies that each slot is 12 bits wide, FAT16 slots are 16 bits, and FAT32 slots are 32 bits. The slot stores code values for empty, last, and bad clusters, or the cluster number of the next cluster of the file. In this way, the actual content of a file is represented as a linked list of clusters called a chain.
Larger disks require wider FAT entries and/or larger allocation units. FAT12 is essentially only found on floppy disks where its upper bound of 4K clusters makes sense for media that was never much more than 1MB in size. FAT16 and FAT32 are both commonly found on thumb drives and flash cards. The choice of FAT size there depends partly on the intended application.
Access to the content of a particular file is straightforward. From its directory entry you learn its total size in bytes and its first cluster number. From the cluster number, you can immediately calculate the address of the first logical disk block. From the FAT indexed by cluster number, you find each allocated cluster in the chain assigned to that file.
Discovery of free space suitable for storage of a new file or extending an existing file is not as easy. The FAT file system simply marks free clusters with a code value. Finding one or more free clusters requires searching the FAT.
Locating the directory entry for a file is not fast either since the directories are not ordered, requiring a linear time search through the directory for the desired file. Note that long file names increase the search time by occupying multiple directory entries for each file with a long name.
FAT still has the advantage that it is simple enough to implement that it can be done in small microprocessors so that data interchange between even small embedded systems and PCs can be done in a cost effective way. I suspect that its quirks and oddities will be with us for a long time as a result.
ext3 and ext4 use "H-trees", which are apparently a specialized form of B-tree.
BTRFS uses B-trees (B-Tree File System).
ReiserFS uses B+trees, which are apparently what NTFS uses.
By the way, if you search for these on Wikipedia, it's all listed in the info box on the right side under "Directory contents".
Here is a nice chart on FAT16 vs FAT32.
The numerals in the names FAT16 and
FAT32 refer to the number of bits
required for a file allocation table
entry.
FAT16 uses a 16-bit file allocation
table entry (2 16 allocation units).
Windows 2000 reserves the first 4 bits
of a FAT32 file allocation table
entry, which means FAT32 has a maximum
of 2 28 allocation units. However,
this number is capped at 32 GB by the
Windows 2000 format utilities.
http://technet.microsoft.com/en-us/library/cc940351.aspx
FAT32 uses 32bit numbers to store cluster numbers. It supports larger disks and files up to 4 GiB in size.
As far as I understand the topic, FAT uses File Allocation Tables which are used to store data about status on disk. It appears that it doesn't use trees. I could be wrong though.

Resources