Size of a file in the FAT system

I know that in more recent file systems such as ext, the inode contains the actual size of the file, so to learn a file's size I only need to read its metadata.
How is this done in the FAT system, since there are no inodes? Does the OS need to walk all the blocks that contain the file and sum their sizes?

If we look at the on-disk layout, we can see that each directory entry holds both a Starting Cluster and a File Size in Bytes.
FAT32 directory entries contain the same fields.
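So the file's size lives right in its directory entry; no chain walk is needed to learn it. For reference, here is a sketch of the 32-byte on-disk directory entry, following the field layout in Microsoft's FAT specification (the struct and field names are my own):

    #include <stdint.h>

    /* 32-byte FAT directory entry, packed to match the on-disk bytes. */
    #pragma pack(push, 1)
    struct fat_dirent {
        uint8_t  name[11];          /* 8.3 name, space-padded           */
        uint8_t  attr;              /* read-only/hidden/system/... bits */
        uint8_t  nt_reserved;
        uint8_t  crt_time_tenth;
        uint16_t crt_time;
        uint16_t crt_date;
        uint16_t last_access_date;
        uint16_t first_cluster_hi;  /* high 16 bits, FAT32 only         */
        uint16_t write_time;
        uint16_t write_date;
        uint16_t first_cluster_lo;  /* low 16 bits of starting cluster  */
        uint32_t file_size;         /* the size, in bytes               */
    };
    #pragma pack(pop)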

Related

What is the real size of a file?

How is it possible that a text file's size equals the number of characters in it? For example, if file.txt contains the string "abc", its size is 3 bytes. Fine, but what about the file's icon, its name, and the rest of the file information? Where is that data stored?
I checked this on Windows, but the situation is probably the same on Unix systems.
When the file is written to disk, it is by means of a low-level system call like write(), and the operating system knows exactly how many bytes it has written to a given file on disk. This information, along with several other pieces (creation and modification dates, ownership, etc.), is stored with the file.
In Linux (and Unix generally), this is done by means of an inode that fully describes the file. The information stored in an inode includes:
* access mode
* IDs of the user and group that own the file
* size in bytes
* dates of creation, modification, and access
* list of the disk blocks containing the file's data
This is more or less the information displayed by ls -l.
You can also see the inode number of each file with ls -i.
You can find additional details on inodes here.
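For instance, all of this metadata can be read without touching the file's data at all. A minimal sketch using the POSIX stat() call:

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;
        /* st_size comes straight from the inode; no data blocks are read */
        if (stat("file.txt", &st) == 0)
            printf("%lld bytes\n", (long long)st.st_size);
        return 0;
    }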
Other information is encoded differently. Names, for instance, live only in the special files that describe directories, not in the inode. A directory is indeed a list that associates names with inodes.
Icons are generally defined system-wide, and the association of an icon with a file is made either via the file name (and extension) or via a file "type" recorded in the inode (or its equivalent in other operating systems).
Disks allocate space in blocks. Blocks were historically 512 bytes, but sizes have increased over the years and 4K blocks are now common. The space a file occupies on disk is always a multiple of the block size.
Most file systems (Windows included) allocate disk space in clusters, where a cluster is a number of adjacent blocks. The space allocated to a file is then always a multiple of the block size times the cluster factor; for example, a 1-byte file on a volume with 4 KB clusters still occupies a full 4 KB cluster. This allocated space is the file's size "on disk", as opposed to the byte count the operating system records in the file's metadata.
This all depends upon the disk format and the operating system:
Everything fine, but what with file icon, filename and file informations? Where these data has been stored?
The file information (date, owner, etc.) is usually kept in some kind of master file table. Sometimes this table has extensions where additional information can be stored; security information is often stored in such overflow areas.
A rationally designed file system will also store a file name in the file header. File names are additionally stored in directories, and a file can have multiple names if it is linked into multiple directories. The header file name is used to restore the file in case of corruption.
The location of an icon is entirely system specific and can be done in many ways. In the case of executable files, they are often stored in the file itself. They can also be hidden files in the same directory.

FAT system: identification of free space and structure of directory entries?

I've been searching Google for a good explanation of how FAT systems identify free space, and of the structure of FAT directory entries.
A lot of the explanations I've found are quite hard to follow; can anyone briefly sum these up?
I understand that clusters are marked as unused, but is this done within the root directory or the data region? And is the information on cluster status just recorded in a table?
I haven't managed to learn anything about the structure of the entries either, just that chains are used to keep a file's clusters together.
Can anyone help?
A file system can be thought of as having three types of data: file data, file meta-data, and file system meta-data. File data is the file or directory contents. File meta-data tells us where the file data is stored on the disk. File system meta-data tells us how the file system allocates the blocks it uses.
The FAT file system, however, does not keep these lines so clear-cut; its disk structures often blur the distinctions.
The File Allocation Table (FAT) itself blurs the line between file meta-data and file system meta-data. That is, a FAT entry identifies both the cluster number where the next cluster of file (or directory) data can be found and whether the cluster identified by that index into the FAT is available or not. As you indicated in your question, this forms a chain. A special end-of-chain marker (0xFF8-0xFFF on FAT12, 0xFFF8-0xFFFF on FAT16, 0x0FFFFFF8-0x0FFFFFFF on FAT32) indicates that the cluster identified by the index into the FAT is the last cluster in the chain.
Directory entries in a FAT-based file system are both file data and file meta-data. They read like files, with their entries being the "file data". However, those entries are also interpreted as file meta-data, for they contain the file attributes (read-only/hidden/system flags, the file size, and the starting cluster number, which is an index into the FAT).
The root directory is a special directory on a FAT file system: it has neither a "." nor a ".." entry. On FAT12 and FAT16 systems, the size of the root directory is specified when the disk is formatted and is thus fixed; it occupies a dedicated region between the FATs and the data area rather than ordinary data clusters. On FAT32, the root directory size is not set at format time and can grow; its starting cluster is stored in a special field of the file system meta-data (the BPB_RootClus field in the boot sector's BIOS Parameter Block).
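To make the chain idea concrete, here is a minimal sketch of following a FAT16 chain, assuming the whole table has already been read into memory (walk_chain and visit are names I made up):

    #include <stdint.h>

    /* On FAT16: 0x0000 marks a free cluster, 0xFFF7 a bad one, and
     * 0xFFF8-0xFFFF the end of a chain; data clusters are numbered
     * 2 through 0xFFEF. */
    static void walk_chain(const uint16_t *fat, uint16_t first_cluster,
                           void (*visit)(uint16_t cluster))
    {
        uint16_t c = first_cluster;
        while (c >= 0x0002 && c <= 0xFFEF) {  /* stop on free/bad/EOC */
            visit(c);
            c = fat[c];  /* each entry names the next cluster */
        }
    }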
Hope this helps.
Here is a fairly long article that has lots of information about FAT file systems.
It should provide all the details you need.
http://en.wikipedia.org/wiki/File_Allocation_Table

Pack files using C so that they can be unpacked to the original files

I have to pack a few files in such a way that, at some later stage, I can unpack them back into the original files using a C program. Please suggest an approach.
I suppose the explanation for wanting to write your own implementation might be curiosity.
Whether you add compression or not, if you simply want to store files in an archive, similar to the tar command, then you have a few possible approaches.
One of the fundamental choices you have to make is: how do you demarcate the boundaries of the packed files within the archive? Using a special delimiter character is not a great idea, because the packed files could contain any character to begin with.
To keep track of the end of files, you can use the length of the file in bytes. For example, you could, for each file:
Write to the archive the '\0' terminated C-string which names the packed file.
Write to the archive an off64_t which gives the length, in bytes, of the packed file.
Write to the archive the actual bytes (if any) of the packed file.
(Optional) Write to the archive a checksum or CRC of the packed file.
Repeatedly perform this for each file, concatenating the results with no intervening characters.
Finally, when no files remain, write an empty C-string: a single zero byte.
The unpacking process is:
Read the '\0'-terminated C-string which names this packed file.
If the name is empty, assert that we have read the entire archive, then exit.
Read the off64_t which gives the length of the packed file.
Read as many bytes as the packed file length from the archive and write to the newly-created unpacked file.
Again, repeat these steps until step (2) concludes the program.
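Here is a minimal sketch of the packing side under this scheme. pack_one is a hypothetical helper, a plain uint64_t stands in for off64_t, and the optional checksum step is omitted:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Append one file to an open archive: '\0'-terminated name, then
     * a 64-bit byte count, then the raw bytes.  Returns 0 on success.
     * (ftell limits this sketch to files that fit in a long.) */
    static int pack_one(FILE *archive, const char *name)
    {
        FILE *in = fopen(name, "rb");
        if (!in)
            return -1;

        fwrite(name, 1, strlen(name) + 1, archive); /* name + its '\0' */

        fseek(in, 0, SEEK_END);
        uint64_t len = (uint64_t)ftell(in);
        rewind(in);
        fwrite(&len, sizeof len, 1, archive);       /* length in bytes */

        char buf[4096];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, archive);             /* the file's data */

        fclose(in);
        return 0;
    }

    /* After the last file, fputc('\0', archive) writes the empty
     * name that terminates the archive. */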
This design, in which file names alternate with file data, is workable, but it has some drawbacks. The essential problem is that the data structure isn't designed for random access. To get at the information about a file in the "middle" of the archive, a program has to process all the preceding files. The program can call lseek64 to skip over file data it doesn't need, but it must still read at least each file name and each file length: the length is needed to skip over the file data, and the name, as I arranged the data, must be read in order to locate the length field.
So this is inefficient. Even if the file names did not have to be read in order to reach the file sizes, the fact that the file details are sprinkled throughout the archive means that reading the index data requires accessing several separate ranges of the disk.
A better approach might be to write a "block" of index data to the front of the file. This data structure might be something like:
The size of the first file in the archive.
The name of the first file in the archive.
The position, in bytes, within this archive, where the "first file" may be located as a contiguous block of bytes.
The size of the second file in the archive...
And the data in the index might repeat until, again, a file with empty name marks the end of the index.
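One way such an index record could be laid out, using a fixed-size name field for simplicity (a real format would more likely store the name's length and let it vary):

    #include <stdint.h>

    /* Hypothetical index record, one per packed file, stored in a
     * block at the front of the archive. */
    struct index_entry {
        uint64_t size;      /* size of the packed file in bytes        */
        char     name[256]; /* '\0'-terminated name; empty = end       */
        uint64_t offset;    /* byte position of the file's data block  */
    };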
Having an index like this is nice, but presents a difficulty: when the user wishes to append a file to the archive, the index might need to grow in size. This could change the locations of the packed files within the archive -- the archive program may need to move them around to make room for a bigger index.
The file structure can get more and more complex in order to serve all these different needs. For example, the index can be designed so that it is always allocated out of what the file system considers a "page" (the amount the OS reads or writes from the disk as a minimum-size granule), and if the index needs to grow, discontiguous "index pages" are chained together by file-position data leading from one index page to another. (Like a linked list, but on disk.) The complexity can go on and on.
A fast solution would be to take advantage of an external library like zlib (usage example: http://zlib.net/zlib_how.html ) and use it for compression.
If you want to dig deeper into the topic of compression, have a look at the different lossless compression algorithms and further hints at Wikipedia - Data compression.
I wrote a tar-like program a couple of days ago; here is my implementation (I hope you can get some ideas from it):
Each file is stored in the archive with a "header", which looks like:
<file-type,file-path,file-size,file-mode>
In file-type I used 0 for files and 1 for directories (this way you can recreate the directory tree).
For example, the header of a file named foo.txt of size 245 bytes with mode 0755 (on Unix; see chmod) will look like:
<0,foo.txt,245,0755>
then the file contents follow.
This way, the first character of the archive is always a <. You then parse the comma-separated list (first possible bug) and extract the file type, the path, the size (which you use to read the next size bytes from the archive, avoiding the "special character bug" pointed out by Heath Hunnicutt), and the mode of the file (say you have a binary file and want it to be executable when extracted: you chmod it with the original file mode).
About the first possible bug: a comma is not commonly used in a file name, but it is probably safer to use another separator, or to wrap the path in a pair of quotation marks; the parser must then be aware of them and ignore any comma inside the quotes.
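A minimal sketch of parsing such a header with sscanf, assuming no comma or '>' appears in the path, i.e. ignoring the quoting just discussed (parse_header is a name I made up):

    #include <stdio.h>

    /* Parse "<type,path,size,mode>", e.g. "<0,foo.txt,245,0755>".
     * The caller must pass a path buffer of at least 256 bytes;
     * %255[^,] stops at the next comma, and %o reads octal.
     * Returns 1 if all four fields were matched. */
    static int parse_header(const char *hdr, int *type, char *path,
                            long *size, unsigned *mode)
    {
        return sscanf(hdr, "<%d,%255[^,],%ld,%o>",
                      type, path, size, mode) == 4;
    }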
For writing and reading files in C, see fgetc and fputc from stdio.h.
To get file information, chmod files, and walk the directory tree, see stat and chmod from sys/stat.h and ftw from ftw.h (probably Linux/Unix only, since these wrap system calls).
Hope it helps! (If you need some code I can post some snippets; the header parsing is probably the hardest part.)

How does the length of a filename affect remaining storage space on a disk?

I realize this is filesystem dependent. In particular I am thinking about the EXT series of file systems. I don't fully understand how inodes affect disk space and how the filename itself is stored. It's difficult to get relevant search results for this question too. That's why I'm asking here. On linux, the maximum file name length is usually 255 or 256 characters. When the file system is created, is that amount of space "reserved" for each and every file name? In other words, is disk storage not affected by the actual file name because the maximum is already used? Or is it more complicated than that?
Suppose I have a file named "joe.txt" and rename it to "joe2.txt". Has the amount of available disk space decreased after this? What about longer names like "joe_version.txt" or "joe_original_version_with_bug_that_Jim_solved.txt"? I am worried about thresholds at 8, 16, 32, 64, etc. characters. I will be storing millions of images. I have never bothered to worry about such an issue before, so I'm not completely sure how this works.
Although EXT is the only filesystem I'm using, discussing FAT and others might be useful to somebody else that has a similar question.
On Linux (or more generally, Unix-type filesystems), file names are stored in directories, which are essentially lists of (filename, inode number) mappings for the files they contain. On ext2/3/4 each directory entry is a variable-length record holding the inode number, the record length, the name length, and the name itself, padded to a 4-byte boundary. NAME_MAX (255 on Linux) is only an upper limit on the length of a name, not a per-name reservation.
So, to answer your question, no space is reserved for file names when the file system is created, and when you create a file its name consumes directory space roughly in proportion to its length, rounded up to a 4-byte boundary; renaming "joe.txt" to "joe2.txt" typically costs you nothing measurable. Moreover, space for the directory itself is, at least on ext2/3/4, allocated at disk-block granularity (4 KB, unless you're doing something very unusual) as needed. That is, a directory takes up at minimum one 4 KB block (plus an entry in its parent directory), and if the list of (filename, inode) pairs doesn't fit into that block, another 4 KB block is allocated to continue the list, and so forth (ext2/3 use an indirect-block scheme, whereas ext4 uses extents).
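For reference, the on-disk directory entry looks roughly like this (it mirrors struct ext2_dir_entry_2 from the kernel's ext2 headers):

    #include <stdint.h>

    /* Variable-length ext2/3/4 directory entry: rec_len gives the
     * total size of this record, so a longer name means a longer
     * record, not a NAME_MAX-sized reservation. */
    struct ext2_dir_entry_2 {
        uint32_t inode;     /* inode number (0 = unused entry)   */
        uint16_t rec_len;   /* total length of this entry        */
        uint8_t  name_len;  /* length of the name, in bytes      */
        uint8_t  file_type; /* regular, directory, symlink, ...  */
        char     name[];    /* the name; not null-terminated     */
    };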
FAT16 pre-allocates: its root directory is given a fixed number of entry slots at format time, so names stored there consume no extra disk space.
FAT32 uses a work-around (the VFAT long-file-name scheme) to provide long filenames; as a filename grows longer, additional directory entries are required to store the extra characters, and a directory is stored like a regular file, so this can consume additional disk space. However, the smallest allocation is one cluster, so unless the extra filename storage pushes the directory past a cluster boundary, no additional disk space is consumed beyond what you could otherwise have used.
I'm not offhand familiar with how filenames are handled in the UNIX type filesystems.

The FAT, Linux, and NTFS file systems

I heard that the NTFS file system is basically a B-tree. Is that true? What about the other file systems? What kinds of trees do they use?
Also, how is FAT32 different from FAT16?
What kind of tree do the FAT file systems use?
FAT (FAT12, FAT16, and FAT32) does not use a tree of any kind. Two interesting data structures are used, in addition to a block of data describing the partition itself. Full details, at the level required to write a compatible implementation in an embedded system, are available from Microsoft and from third parties. Wikipedia has a decent article as an alternative starting point that also covers much of the history of how the format got the way it is.
Since the original question was about the use of trees, I'll provide a quick summary of what little data structure is actually in a FAT file system. Refer to the above references for accurate details and for history.
The set of files in each directory is stored in a simple list, initially in the order the files were created. Deletion is done by marking an entry as deleted, so a subsequent file creation might reuse that slot. Each entry in the list is a fixed-size struct, just large enough to hold the classic 8.3 file name along with the flag bits, size, dates, and starting cluster number. Long file names (which also bring international character support) are handled by using extra directory entry slots to hold the long name alongside the original 8.3 slot that holds all the rest of the file attributes.
Each file on the disk is stored in a sequence of clusters, where each cluster is a fixed number of adjacent disk blocks. Each directory (except the root directory of a disk) is just like a file, and can grow as needed by allocating additional clusters.
Clusters are managed by the (misnamed) File Allocation Table from which the file system gets its common name. This table is a packed array of slots, one for each cluster in the disk partition. The name FAT12 implies that each slot is 12 bits wide, FAT16 slots are 16 bits, and FAT32 slots are 32 bits. The slot stores code values for empty, last, and bad clusters, or the cluster number of the next cluster of the file. In this way, the actual content of a file is represented as a linked list of clusters called a chain.
Larger disks require wider FAT entries and/or larger allocation units. FAT12 is essentially only found on floppy disks where its upper bound of 4K clusters makes sense for media that was never much more than 1MB in size. FAT16 and FAT32 are both commonly found on thumb drives and flash cards. The choice of FAT size there depends partly on the intended application.
Access to the content of a particular file is straightforward. From its directory entry you learn its total size in bytes and its first cluster number. From the cluster number, you can immediately calculate the address of the first logical disk block. From the FAT indexed by cluster number, you find each allocated cluster in the chain assigned to that file.
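The cluster-to-sector arithmetic behind that calculation is a one-liner; a sketch, assuming the first sector of the data region has already been computed from the boot-sector fields (data clusters are numbered starting at 2):

    #include <stdint.h>

    /* First sector of a given data cluster. */
    static uint32_t cluster_to_lba(uint32_t cluster,
                                   uint32_t data_region_start_lba,
                                   uint32_t sectors_per_cluster)
    {
        return data_region_start_lba
             + (cluster - 2) * sectors_per_cluster;
    }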
Discovery of free space suitable for storage of a new file or extending an existing file is not as easy. The FAT file system simply marks free clusters with a code value. Finding one or more free clusters requires searching the FAT.
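A sketch of that search for FAT16, again with the table in memory; real implementations usually remember where the previous search left off instead of rescanning from the start every time:

    #include <stdint.h>

    /* Linear scan for the first free cluster (FAT16 marks free
     * clusters with 0).  `fat` must have total_clusters + 2 entries,
     * since numbering starts at 2.  Returns 0 (an invalid cluster
     * number) if the volume is full. */
    static uint16_t find_free_cluster(const uint16_t *fat,
                                      uint32_t total_clusters)
    {
        for (uint32_t c = 2; c < total_clusters + 2; c++)
            if (fat[c] == 0x0000)
                return (uint16_t)c;
        return 0;
    }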
Locating the directory entry for a file is not fast either since the directories are not ordered, requiring a linear time search through the directory for the desired file. Note that long file names increase the search time by occupying multiple directory entries for each file with a long name.
FAT still has the advantage that it is simple enough to implement on small microprocessors, so that data interchange between even small embedded systems and PCs can be done in a cost-effective way. I suspect that its quirks and oddities will be with us for a long time as a result.
ext3 and ext4 use "HTree" indexes for directories, which are apparently a specialized form of B-tree.
BTRFS uses B-trees (B-Tree File System).
ReiserFS uses B+trees, which are apparently what NTFS uses.
By the way, if you search for these on Wikipedia, it's all listed in the info box on the right side under "Directory contents".
Here is a nice chart on FAT16 vs FAT32.
The numerals in the names FAT16 and FAT32 refer to the number of bits required for a file allocation table entry.
FAT16 uses a 16-bit file allocation table entry (2^16 allocation units).
Windows 2000 reserves the first 4 bits of a FAT32 file allocation table entry, which means FAT32 has a maximum of 2^28 allocation units. However, this number is capped at 32 GB by the Windows 2000 format utilities.
http://technet.microsoft.com/en-us/library/cc940351.aspx
FAT32 uses 32-bit numbers to store cluster numbers (28 of the bits are actually used). It supports larger disks and files up to 4 GiB in size.
As far as I understand the topic, FAT uses the File Allocation Table to record the allocation status of clusters on disk. It appears that it doesn't use trees. I could be wrong, though.
