How does the length of a filename affect remaining storage space on a disk? - filesystems

How does the length of a filename affect remaining storage space on a disk?
I realize this is filesystem dependent. In particular I am thinking about the EXT series of file systems. I don't fully understand how inodes affect disk space and how the filename itself is stored. It's difficult to get relevant search results for this question too. That's why I'm asking here. On linux, the maximum file name length is usually 255 or 256 characters. When the file system is created, is that amount of space "reserved" for each and every file name? In other words, is disk storage not affected by the actual file name because the maximum is already used? Or is it more complicated than that?
Suppose, I have a file named "joe.txt" and rename it to "joe2.txt". Has the amount of available disk space decreased after this? What about longer names like "joe_version.txt" or "joe_original_version_with_bug_that_Jim_solved.txt"? I am worried about thresholds at 8, 16, 32, 64, etc characters. I will be storing millions of images. I have never bothered to worry about such an issue before so I'm not completely sure how this works.
Although EXT is the only filesystem I'm using, discussing FAT and others might be useful to somebody else that has a similar question.

On Linux (or more generally, Unix type filesystems) file names are stored in directory entry inodes, which contain a list of (filename, inode number) mappings for each file in the directory. My understanding is that for each filename there is reserved space for NAME_MAX characters. And indeed, on Linux NAME_MAX is 255.
So, to answer you question, when the file system is created there is no space reserved for file names, but once you create a file NAME_MAX bytes are reserved for the name. Moreover, for the directory inode, my understanding is that at least on ext2/3/4 space is allocated in disk block (4 KB, unless you're doing something very strange) granularity as needed. I.e. a directory takes up at minimum 4 KB (plus an entry in the parent directory inode), and if the list of (filename, inode) pairs doesn't fit into that 4 KB (minus other overhead, e.g. directory permissions), it allocates a new 4 KB block to continue the list, and so forth (ext2/3 uses an indirect block scheme, whereas ext4 uses extents).

FAT16 pre-allocates.
FAT32 uses a work-around to provide long filenames; as the filename becomes longer, additional directory file blocks are required to store the extra characters - and a directory file is a regular file, so this consumes additional disk space. However, the smallest allocation is one cluster, so unless the additional filename store exceeds the cluster boundary, no additional disk space is consumed from what you could otherwise have used.
I'm not offhand familiar with how filenames are handled in the UNIX type filesystems.

Related

What is bigger a .txt file with many lines or many .txt file of one line each

I have a folder with 1716 .txt files of one line each (~6300 characters per line)
if I merge all the files in one and create a .txt file with 1716 lines, that file will have bigger, smaller or the same size?
Depends on how you want to look at it. If you only want to know about the characters stored, then it is the same amount of characters in the one file as in the 1716 individual files, so "same size" in that sense. In terms of disk allocation, most operating systems support a fixed disk block or cluster size, making the minimum amount of disk space a file can take up equal to that disk block / cluster size. As a result, the disk space one file with 1716 lines would take up could be either smaller or equal to the 1716 individual .txt files. If you have many files that are potentially wasting a cluster by not fully populating it, the problem is mitigated with one file containing all of the characters, but likely only wasting one cluster.
This information is based on some quick searches about filesystems I had to make to refresh my memory, and from an Operating Systems course I took recently.
Inode structure: https://en.wikipedia.org/wiki/Inode
Data cluster: https://en.wikipedia.org/wiki/Data_cluster
ECE344- Operating Systems (University of Toronto)

What is the real size of file?

How it is possible that text file has size, which is equal to number of chars inside? For example in file.txt you have string "abc" and size of it is 3 bytes. Everything fine, but what with file icon, filename and file informations? Where these data has been stored?
I checked it on Windows, but at Unix systems situation is probalby the same.
When the file is written to disk, it is by means of low level system call like write() and operating systems know exactly how many bytes they write in a given file on a disk. This information, as well as several others (creation and modification date, ownership, etc) is written with the file.
In linux (and generally unix), it is by means of an inodethat fully describes the file. Informations stored in these inodes are:
* access mode
* ids of user and group that owns the file
* size in bytes
* date of creation, modification and access
* list of disk blocks containing file data
These are more or less the informations that are displayed by ls -l
You can also see inode number of each file with ls -i
You can find here additional details on inodes.
Other informations are coded differently. Names, for instance are only in special files describing a directory, not in the inode. A directory is indeed a list that associate a name with an inode.
Icons are generally defined system wide and the association of an icon with a file is done with either filename (and file extension) or with a file "type" that is written in the "inode" (or its equivalent in other OS).
Disks allocate space in blocks. Blocks historically were 512 bytes but that has increased over the years so that 4K is common. Your file size will always be a multiple of the block size.
Most file systems (and Windoze does this) allocate disk space in clusters. A cluster is a number of adjacent blocks. Your file size then will always be a multiple of the block size times the cluster factor. This is usually the size of the file as counted by the operating system.
This all depends upon the disk format and the operating system:
Everything fine, but what with file icon, filename and file informations? Where these data has been stored?
The file information (date, owner, etc.) are usually in some kind of master file table. Sometimes this table will have extensions where information can be stored. Security information is often store in such overflows.
A rationally designed file system will have "A" filename stored in the header. File names are also stored in directories and a file can have multiple names if it is linked to multiple directories. The header file name is used to restored the file in case of corruption.
The location of an icon is entirely system specific and can be done in many ways. In the case of executable files, they are often stored in the file itself. They can also be hidden files in the same directory.

How to truncate a file in FAT32 file system without zero padding in C?

I have 2TB HDD contains a single partition of FAT32 file system. While truncating a file to larger size say 100MB or 200MB using ftruncate(), it takes time of 5 to 10 seconds to do zero padding. Is there any way of file truncation that take a less time or do it without zero padding?
No. Ftruncate specifically adds zeros to the end according to the GNU C library page. It also says you could try and just use truncate, and it will try to fill it with holes instead, but this isn't supported on all machines so you might not be able to do that.
You could try and hack the file system by opening the inode and changing the value to make the File system think the file is larger, but this is a security issue.

The FAT, Linux, and NTFS file systems

I heard that the NTFS file system is basically a b-tree. Is that true? What about the other file systems? What kind of trees are they?
Also, how is FAT32 different from FAT16?
What kind of tree are the FAT file systems using?
FAT (FAT12, FAT16, and FAT32) do not use a tree of any kind. Two interesting data structures are used, in addition to a block of data describing the partition itself. Full details at the level required to write a compatible implementation in an embedded system are available from Microsoft and third parties. Wikipedia has a decent article as an alternative starting point that also includes a lot of the history of how it got the way it is.
Since the original question was about the use of trees, I'll provide a quick summary of what little data structure is actually in a FAT file system. Refer to the above references for accurate details and for history.
The set of files in each directory is stored in a simple list, initially in the order the files were created. Deletion is done by marking an entry as deleted, so a subsequent file creation might re-use that slot. Each entry in the list is a fixed size struct, and is just large enough to hold the classic 8.3 file name along with the flag bits, size, dates, and the starting cluster number. Long file names (which also includes international character support) is done by using extra directory entry slots to hold the long name alongside the original 8.3 slot that holds all the rest of the file attributes.
Each file on the disk is stored in a sequence of clusters, where each cluster is a fixed number of adjacent disk blocks. Each directory (except the root directory of a disk) is just like a file, and can grow as needed by allocating additional clusters.
Clusters are managed by the (misnamed) File Allocation Table from which the file system gets its common name. This table is a packed array of slots, one for each cluster in the disk partition. The name FAT12 implies that each slot is 12 bits wide, FAT16 slots are 16 bits, and FAT32 slots are 32 bits. The slot stores code values for empty, last, and bad clusters, or the cluster number of the next cluster of the file. In this way, the actual content of a file is represented as a linked list of clusters called a chain.
Larger disks require wider FAT entries and/or larger allocation units. FAT12 is essentially only found on floppy disks where its upper bound of 4K clusters makes sense for media that was never much more than 1MB in size. FAT16 and FAT32 are both commonly found on thumb drives and flash cards. The choice of FAT size there depends partly on the intended application.
Access to the content of a particular file is straightforward. From its directory entry you learn its total size in bytes and its first cluster number. From the cluster number, you can immediately calculate the address of the first logical disk block. From the FAT indexed by cluster number, you find each allocated cluster in the chain assigned to that file.
Discovery of free space suitable for storage of a new file or extending an existing file is not as easy. The FAT file system simply marks free clusters with a code value. Finding one or more free clusters requires searching the FAT.
Locating the directory entry for a file is not fast either since the directories are not ordered, requiring a linear time search through the directory for the desired file. Note that long file names increase the search time by occupying multiple directory entries for each file with a long name.
FAT still has the advantage that it is simple enough to implement that it can be done in small microprocessors so that data interchange between even small embedded systems and PCs can be done in a cost effective way. I suspect that its quirks and oddities will be with us for a long time as a result.
ext3 and ext4 use "H-trees", which are apparently a specialized form of B-tree.
BTRFS uses B-trees (B-Tree File System).
ReiserFS uses B+trees, which are apparently what NTFS uses.
By the way, if you search for these on Wikipedia, it's all listed in the info box on the right side under "Directory contents".
Here is a nice chart on FAT16 vs FAT32.
The numerals in the names FAT16 and
FAT32 refer to the number of bits
required for a file allocation table
entry.
FAT16 uses a 16-bit file allocation
table entry (2 16 allocation units).
Windows 2000 reserves the first 4 bits
of a FAT32 file allocation table
entry, which means FAT32 has a maximum
of 2 28 allocation units. However,
this number is capped at 32 GB by the
Windows 2000 format utilities.
http://technet.microsoft.com/en-us/library/cc940351.aspx
FAT32 uses 32bit numbers to store cluster numbers. It supports larger disks and files up to 4 GiB in size.
As far as I understand the topic, FAT uses File Allocation Tables which are used to store data about status on disk. It appears that it doesn't use trees. I could be wrong though.

Basic concepts in file system implementation

I am a unclear about file system implementation. Specifically (Operating Systems - Tannenbaum (Edition 3), Page 275) states "The first word of each block is used as a pointer to the next one. The rest of block is data".
Can anyone please explain to me the hierarchy of the division here? Like, each disk partition contains blocks, blocks contain words? and so on...
I don't have the book in front of me, but I'm suspect that quoted sentence isn't really talking about files, directories, or other file system structures. (Note that a partition isn't a file system concept, generally). I think your quoted sentence is really just pointing out something about how the data structures stored in disk blocks are chained together. It means just what it says. Each block (usually 4k, but maybe just 512B) looks very roughly like this:
+------------------+------------- . . . . --------------+
| next blk pointer | another 4k - 4 or 8 bytes of stuff |
+------------------+------------- . . . . --------------+
The stuff after the next block pointer depends on what's stored in this particular block. From just the sentence given, I can't tell how the code figures that out.
With regard to file system structures:
A disk is an array of sectors, almost always 512B in size. Internally, disks are built of platters, which are the spinning disk-shaped things covered in rust, and each platter is divided up into many concentric tracks. However, these details are entirely hidden from the operating system by the ATA or SCSI disk interface hardware.
The operating system divides the array of sectors up into partitions. Partitions are contiguous ranges of sectors, and partitions don't overlap. (In fact this is allowed on some operating systems, but it's just confusing to think about.)
So, a partition is also an array of sectors.
So far, the file system isn't really in the picture yet. Most file systems are built within a partition. The file system usually has the following concepts. (The names I'm using are those from the unix tradition, but other operating systems will have similar ideas.)
At some fixed location on the partition is the superblock. The superblock is the root of all the file system data structures, and contains enough information to point to all the other entities. (In fact, there are usually multiple superblocks scattered across the partition as a simple form of fault tolerance.)
The fundamental concept of the file system is the inode, said "eye-node". Inodes represent the various types of objects that make up the file system, the most important being plain files and directories. An inode might be it's own block, but some file system pack multiple inodes into a single block. Inodes can point to a set of data blocks that make up the actual contents of the file or directory. How the data blocks for a file is organized and indexed on disk is one of the key tasks of a file system. For a directory, the data blocks hold information about files and subdirectories contained within the directory, and for a plain file, the data blocks hold the contents of the file.
Data blocks are the bulk of the blocks on the partition. Some are allocated to various inodes (ie, to directories and files), while others are free. Another key file system task is allocating free data blocks as data is written to files, and freeing data blocks from files when they are truncated or deleted.
There are many many variations on all of these concepts, and I'm sure there are file systems where what I've said above doesn't line up with reality very well. However, with the above, you should be in a position to reason about how file systems do their job, and understand, at least a bit, the differences you run across in any specific file system.
I don't know the context of this sentence, but it appears to be describing a linked list of blocks. Generally speaking, a "block" is a small number of bytes (usually a power of two). It might be 4096 bytes, it might be 512 bytes, it depends. Hard drives are designed to retrieve data a block at a time; if you want to get the 1234567th byte, you'll have to get the entire block it's in. A "word" is much smaller and refers to a single number. It may be as low as 2 bytes (16-bit) or as high as 8 bytes (64-bit); again, it depends on the filesystem.
Of course, blocks and words isn't all there is to filesystems. Filesystems typically implement a B-tree of some sort to make lookups fast (it won't have to search the whole filesystem to find a file, just walk down the tree). In a filesystem B-tree, each node is stored in a block. Many filesystems use a variant of the B-tree called a B+-tree, which connects the leaves together with links to make traversal faster. The structure described here might be describing the leaves of a B+-tree, or it might be describing a chain of blocks used to store a single large file.
In summary, a disk is like a giant array of bytes which can be broken down into words, which are usually 2-8 bytes, and blocks, which are usually 512-4096 bytes. There are other ways to break it down, such as heads, cylinders, sectors, etc.. On top of these primitives, higher-level index structures are implemented. By understanding the constraints a filesystem developer needs to satisfy (emulate a tree of files efficiently by storing/retrieving blocks at a time), filesystem design should be quite intuitive.
Tracks >> Blocks >> Sectors >> Words >> Bytes >> Nibbles >> Bits
Tracks are concentric rings from inside to the outside of the disk platter.
Each track is divided into slices called sectors.
A block is a group of sectors (1, 2, 4, 8, 16, etc). The bigger the drive, the more sectors that a block will hold.
A word is the number of bits a CPU can handle at once (16-bit, 32-bit, 64-bit, etc), and in your example, stores the address (or perhaps offset) of the next block.
Bytes contain nibbles and bits. 1 Byte = 2 Nibbles; 1 Nibble = 4 Bits.

Resources