I am in the middle of writing some software in C that recursively lists all files in a given directory and now I need to work out the internal fragmentation.
I have spent a long time researching this and have found out that internal fragmentation on ext2 only occurs in the last block. I know that, in theory, you should be able to get the first and last block addresses from an inode number, but I have no idea how.
I have looked into stat(), fcntl() and all sorts of ways. How do I get the last block address from an inode number?
I have also figured out that once I have the address of the last block, I can check how much free space is in that block, and this will give me the internal fragmentation.
I know that there is a get_inode and a get_block command but have no idea apart from that!
I don't think you can get at the addresses of disk blocks via the regular system calls such as stat(). You would probably have to find the raw inode on disk (which means accessing the raw disk, and requires elevated privileges) and process the data from there.
Classically, you'd find direct blocks, indirect blocks, double-indirect blocks and a triple-indirect block for a file. However, the relevant file system type is about as dead as the dodo (I don't think I've seen that file system type this millennium), so that's unlikely to be much help now.
There might be a non-standard system call to get at the information, but I doubt it.
Maybe you are overcomplicating this: the internal fragmentation can be roughly calculated by taking the file size modulo the block size; if the remainder is non-zero, the wasted space in the last block is the block size minus that remainder.
But this is only valid if the file is a "classic" one - with sparse files or files holding a lot of "other information" (such as huge ACLs or extended attributes) there might be a difference. (I don't know where those are stored, but I could imagine file systems storing them in the last block, effectively (but invisibly) reducing the internal fragmentation.)
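A minimal sketch of that calculation, assuming the st_blksize value reported by stat() matches the filesystem block size (it usually does on ext2; statvfs()'s f_bsize is an alternative source). Sparse and empty files are not handled specially here:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }

    /* Wasted space in the last block; assumes st_blksize == FS block size. */
    long long remainder = st.st_size % st.st_blksize;
    long long wasted = (remainder == 0) ? 0 : (long long)st.st_blksize - remainder;

    printf("size %lld, block size %lld, internal fragmentation %lld bytes\n",
           (long long)st.st_size, (long long)st.st_blksize, wasted);
    return 0;
}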
Related
It's widely known that in general, when you delete a file on disk on most (all?) modern OSes, the bytes of that file aren't removed, but the space is 'freed' and not overwritten with other data until it's used for another write operation.
I'm also aware that on UNIX-like systems, I can read the bytes on disk directly from its representation in the filesystem at /dev/whatever.
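To make that concrete, the kind of raw read I mean is something like this minimal sketch (the device path is just a placeholder, and reading a raw device normally requires elevated privileges):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal sketch: read the first 4 KiB straight off a raw device node.
 * "/dev/sdX" is a placeholder; on macOS it would be something like
 * /dev/rdisk0. Requires elevated privileges. */
int main(void)
{
    unsigned char buf[4096];
    int fd = open("/dev/sdX", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    ssize_t n = read(fd, buf, sizeof buf);
    if (n < 0)
        perror("read");
    else
        printf("read %zd raw bytes\n", n);
    close(fd);
    return 0;
}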
However, /dev/whatever returns all the bytes of everything on the disk, including files that still 'exist' in the user-facing sense. What I'd really like to do is identify and read only the bytes on disk that remain after the deletion of some resource, i.e. the data that still exists in the 'free' space of the disk. (I'm aware that file recovery and digital forensics tools exist which can recover these files, but for my purposes I need to do something slightly closer to the metal: I'm interested in getting a bytestream of the data remaining in the empty space of the disk, with no additional structure.)
Therefore, is there any way I could access (for instance) the allocated and unallocated ranges of disk space programmatically, then read the bytes corresponding to the unallocated ranges from disk? I'm pretty agnostic when it comes to programming language; I'm assuming this is going to involve low-level APIs callable from a little C program, or something similar.
I'm assuming this is OS/filesystem dependent. I'm on Mac OS X and APFS if this helps, but I would appreciate tips for any combination of OS/FS, as I'm eventually going to port this project to other platforms.
Any tips or insight much appreciated! Thank you.
I'm learning about in-kernel data transfer between two file descriptors in Linux and came across something I cannot understand. Here is a quote from the copy_file_range man page:
copy_file_range() gives filesystems an opportunity to implement "copy acceleration" techniques, such as the use of reflinks (i.e., two or more i-nodes that share pointers to the same copy-on-write disk blocks) or server-side-copy
I used to think of index nodes as something that is returned by the stat/statx syscalls. The st_ino type is typedef'd here as
typedef unsigned long __kernel_ulong_t;
So what does "two or more i-nodes that share pointers to the same copy-on-write disk blocks" actually mean?
As I understand it, the fact that copy_file_range() does not need to pass the data through user space means the kernel doesn't have to load the data from the disk at all (it still might, but it doesn't have to), and this allows further optimization by pushing the operation down the file-system stack. This covers the case of server-side copy over NFS.
The actual answer about the other optimization starts with an intro to how files are stored; you may skip it if you already know that.
There are 3 layers in how files are stored in a typical Linux FS:
The file entry in some directory (which is itself a file containing a list of such entries). Such an entry essentially maps a file name to an inode. It does so by storing the inode number, aka st_ino, which is effectively a pointer to the inode in some table.
The inode, which contains some shared (see below) metadata (such as what stat returns) and some pointer(s) to the data block(s) that store the actual file contents.
The actual data blocks
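As a small illustration of the first two layers, here is a sketch (assuming a Linux/glibc environment): the directory entry exposes only a name and an inode number, while stat() returns the metadata stored in the inode itself.

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Sketch: list directory entries (layer 1: name -> inode number) and the
 * per-inode metadata (layer 2) for the current directory. */
int main(void)
{
    DIR *dir = opendir(".");
    if (!dir) {
        perror("opendir");
        return 1;
    }

    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        struct stat st;
        if (stat(de->d_name, &st) == 0)
            printf("%-20s d_ino=%llu st_ino=%llu links=%llu size=%lld\n",
                   de->d_name,
                   (unsigned long long)de->d_ino,
                   (unsigned long long)st.st_ino,
                   (unsigned long long)st.st_nlink,
                   (long long)st.st_size);
    }
    closedir(dir);
    return 0;
}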
So, for example, a hard link is a record in some directory that points to the same inode as the "original" file (and increments the "link counter" inside the inode). This means that only the file names (and possibly the directories) are different; all the rest of the data and metadata is shared between hard links. Note that creating a hard link is a very fast way to copy a file. The only drawback is that both files are now bound to share their contents forever, so this is not a true copy. But if we used some copy-on-write method to fix the "write" part, it would work very nicely. This is what some FSes (such as Btrfs) support via reflinks.
The idea of this copy-on-write trick is that you create a new inode with new appropriate metadata but still share the same data blocks. You also add cross-references between the two inodes in the "invisible" part of the inode metadata so they know they share the data blocks. Obviously this operation is very fast compared to real copying. And again, as long as the files are only read, everything works perfectly. But unlike with hard links, we can treat writes as independent as well: when some write is performed, the FS checks whether the file (or rather the inode) is really the only owner of the data blocks, and if not, copies the data before writing to it. Depending on the FS implementation, it can copy the whole file on the first write, or it can store more detailed metadata and copy only the blocks that have to be modified, still sharing the rest between the files. In the latter case, a block might not need to be copied at all if the write covers it entirely.
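On Linux this is exposed to user space directly: filesystems that support reflinks (such as Btrfs and XFS) implement the FICLONE ioctl. A minimal sketch, with made-up file names:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Sketch: ask the filesystem to make "clone.bin" a reflink of "orig.bin".
 * Fails (e.g. EOPNOTSUPP) on filesystems that don't support reflinks. */
int main(void)
{
    int src = open("orig.bin", O_RDONLY);
    int dst = open("clone.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(dst, FICLONE, src) != 0) {
        perror("ioctl(FICLONE)");
        return 1;
    }
    puts("reflinked: data blocks are now shared copy-on-write");
    close(src);
    close(dst);
    return 0;
}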
So the simplest trick copy_file_range() can do is to check whether the whole file is actually being copied and, if so, perform the reflink trick described above (provided the FS supports it).
Some more advanced optimizations are also possible if the FS supports more detailed metadata on data blocks. Assume you copy the first N bytes from the start of the file into a new file. Then the FS can just share the starting blocks and probably has to copy only the last block, which is not fully covered by the copy.
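From user space the call itself looks roughly like this (a sketch, assuming Linux with glibc 2.27 or later; whether a reflink or server-side copy is actually used depends on the filesystem):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: copy src to dst entirely inside the kernel. On a reflink-capable
 * filesystem the kernel may satisfy this without copying any data blocks. */
int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int in = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;

    if (in < 0 || out < 0 || fstat(in, &st) != 0) {
        perror("open/fstat");
        return 1;
    }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (n < 0) {
            perror("copy_file_range");
            return 1;
        }
        if (n == 0)            /* unexpected EOF */
            break;
        remaining -= n;
    }
    close(in);
    close(out);
    return 0;
}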
If you're here:
https://github.com/torvalds/linux/blob/master/fs/ext4/file.c#L360
You have access to these two structs inside the ext4_file_mmap function:
struct file *file, struct vm_area_struct *vma
I am changing the implementation of this function for dax mode so that the page tables get entirely filled out for the file the moment you call mmap (to see how much better performance we get by not taking any page faults).
I have managed to get the following done so far (assuming I have access to the two structs that ext4_file_mmap has access to):
// vm_area_struct defined in /include/linux/mm_types.h : 284
// file defined in /include/linux/fs.h : 848
loff_t file_size = file_inode(file)->i_size;
unsigned long start_va = vma->vm_start;
Now, the difficulty lies here. How do I get the physical addresses (blocks? Not sure if dax uses blocks) associated with this file?
I have spent the last couple of days staring at the Linux source code, trying to make sense of stuff, and boy have I been successful.
Any help, hint, or suggestion is greatly appreciated!
Thanks!
Some updates: when you mmap a file in dax mode, you don't fetch anything into memory. The device, in this case PMEM, is byte-addressable and gives DDR-like latencies, so it's accessed directly (no DRAM in between). Certain PTEs lead to accesses of this PMEM device instead of memory.
First of all, mmap supports the MAP_POPULATE flag specifically to avoid page faults. In principle it may be that it does not work with dax, but that's unlikely.
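For reference, from user space that looks roughly like this (a sketch; whether MAP_POPULATE prefaults a DAX mapping the way you want would need to be verified):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: map a file and ask the kernel to prefault the page tables up
 * front (MAP_POPULATE) so that later accesses take no page faults. */
int main(int argc, char *argv[])
{
    if (argc != 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) {
        perror("open/fstat");
        return 1;
    }

    void *p = mmap(NULL, st.st_size, PROT_READ,
                   MAP_SHARED | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touch the mapping; ideally no faults are taken here. */
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += ((unsigned char *)p)[i];
    printf("checksum: %lu\n", sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}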
Second of all, it seems you don't have any measurements of the current state of affairs. Just "changing something and checking the difference" is a fundamentally wrong approach; in particular, the actual bottleneck may be removed as an unintended consequence of the change, and the win will end up being misattributed. You can start by using 'perf' to get basic numbers and generating flame graphs ( http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html ). If you do a lot of I/O over a small range, page faults should have a negligible effect.
Instead of storing references to the next nodes in a table, why couldn't FAT just store them like a conventional linked list, that is, with a next pointer kept alongside the data?
This is due to alignment. FAT (and just about any other file system) stores file data in one or more whole sectors of the underlying storage. Because the underlying storage can only read and write whole sectors, such allocation allows efficient access to the contents of a file.
Issues with interleaving
When a program wants to store something in a file, it provides a buffer, say 1 MB of data to store. Now if the file's data sectors also have to keep pointers to their next sector, this pointer information needs to be interleaved with the actual user data. So the file system would have to build another buffer (of slightly more than the provided 1 MB), copy some of the user data and the corresponding next pointer for each output sector, and hand this new buffer to the storage. This would be somewhat inefficient. Unless the file system always stores file data in new sectors (and most usually don't), rewriting these next pointers would also be redundant.
The bigger problem would be when a read operation is attempted on the file. Files would now work like tape devices: with only the location of the first sector known from the file's primary metadata, in order to reach sector 1000 the file system would need to read all sectors before it in order: read sector 0, find the address of sector 1 from the loaded next pointer, read sector 1, and so on. With typical seek times of around 10 ms per random I/O (assuming a hard disk drive), reaching sector 1000 would take about 10 seconds. Even if the sectors are sequentially ordered, while the file system driver processes sector N's data the disk head will be flying over the next sector, and by the time the read for sector N+1 is issued it may be too late, requiring the disk to rotate a full revolution (8.3 ms for a 7200 RPM drive) before the next sector can be read. The on-disk cache can and will help with that, though.
Writing a single sector is usually an atomic operation (this depends on the hardware): reading back the sector after a power failure returns either its old content or the new one, without intermediate states. Database applications usually need to know which writes will be atomic. If the file system interleaved file data and metadata in the same sectors, it would need to report a smaller size than the actual sector size to the application: for example, instead of, say, 512 bytes it might need to report 504. But it can't do that, because applications usually assume the sector size is a power of 2. Furthermore, a file stored on such a file system would very likely be unusable if copied to another file system with a different reported sector size.
Better approaches
The FAT format is better because all the next pointers are stored in adjacent sectors. For FAT12, FAT16 and not-too-large FAT32 volumes the entire table is small enough to fit in memory. FAT still records the blocks of a file as a linked list, so to get efficient random access an implementation needs to cache the chain per file. On large enough volumes (which can hold large enough files) such a cache may no longer fit in memory.
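In other words, once the table is in memory, following a file's chain is just repeated array lookups rather than one disk read per link. A simplified FAT16-style sketch with made-up values:

#include <stdint.h>
#include <stdio.h>

#define FAT_EOC 0xFFF8   /* FAT16 end-of-chain markers start here */

/* Sketch: the FAT is an array where fat[n] holds the number of the cluster
 * that follows cluster n in the file, or an end-of-chain mark. */
static uint16_t fat[8] = {
    [0] = 0, [1] = 0,        /* clusters 0 and 1 are reserved in FAT */
    [2] = 3, [3] = 5,        /* one file's chain: 2 -> 3 -> 5 */
    [5] = 0xFFFF,            /* end of chain */
};

int main(void)
{
    uint16_t cluster = 2;    /* first cluster, taken from the directory entry */

    while (cluster < FAT_EOC) {
        printf("data lives in cluster %u\n", cluster);
        cluster = fat[cluster];   /* the "next pointer" is an array lookup */
    }
    return 0;
}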
ext3 uses direct and indirect blocks. This simple format avoids the preprocessing that FAT requires and gets by with only a minimal number of additional reads per I/O when indirect blocks are needed. These additional reads are cached by the operating system, so their overhead is often negligible.
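A rough sketch of how such a lookup decides which pointer level to follow for a given logical block index within a file (assuming the classic 12 direct pointers and 1024 pointers per indirect block, i.e. 4 KiB blocks holding 4-byte block numbers; an illustration, not the actual ext3 code):

#include <stdio.h>

#define NDIRECT   12UL      /* direct pointers in the inode */
#define PER_BLOCK 1024UL    /* block numbers per indirect block */

/* Which pointer level is used to reach logical block i of a file? */
static const char *lookup_level(unsigned long i)
{
    if (i < NDIRECT)
        return "direct";
    i -= NDIRECT;
    if (i < PER_BLOCK)
        return "single indirect";
    i -= PER_BLOCK;
    if (i < PER_BLOCK * PER_BLOCK)
        return "double indirect";
    return "triple indirect";
}

int main(void)
{
    unsigned long samples[] = { 0, 11, 12, 1035, 1036, 2000000 };

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("logical block %lu -> %s\n", samples[i], lookup_level(samples[i]));
    return 0;
}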
Other variants are also possible and used by various file systems.
Random notes
For the sake of completeness, some hard disk drives can be formatted with slightly larger sector sizes (say, 520 bytes) so that the file system can pack 512 bytes of file data together with several bytes of metadata in the same sector. Yet, because of the above, I don't believe anyone has used such formats to store the address of the file's next sector. These additional bytes can be put to better use: additional checksums and timestamping come to mind. The timestamping, I believe, is used to improve the performance of some RAID systems. Still, such usage is rare, and most software can't work with these formats at all.
Some file systems can save the content of small enough files in the file metadata directly without occupying distinct sectors. ReiserFS has the controversial tail packing. This is not important here: large files still benefit from having proper mapping to storage sectors.
Any modern OS requires much more from its file system than a pointer to the next data block: attributes (encryption, compression, hidden, ...), security descriptors (ACL entries), support for different hardware, buffering. This is just a tiny fraction of the functionality that any good file system provides.
Have a look at the file system article on Wikipedia to learn what else any modern file system does.
If we ignore the detail of FAT12 sharing a byte between two entries to pack 12 bits into 1.5 bytes, then we can concentrate on the deeper meaning of the question.
It turns out that the FAT system is equivalent to a linked list with the following differences (illustrated by the sketch after the list):
The "next" pointer is located in an array (the FAT) instead of being appended or prepended to the actual data
The value written in "next" is an integer instead of the more familiar memory address of the next node.
The nodes are not allocated dynamically but are represented by another array. That array is the entire data area of the hard drive.
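Here is a minimal sketch of that equivalence in C: a linked list whose nodes live in an array and whose "next" values are array indices (a toy model, not actual FAT code):

#include <stdio.h>

#define NNODES 8
#define END   (-1)

/* Sketch: a linked list stored as two parallel arrays. data[] plays the
 * role of the disk's data area, next[] plays the role of the FAT. */
static char data[NNODES] = { 'F', 'x', 'A', 'T', 'x', '!', 'x', 'x' };
static int  next[NNODES] = {  2,  END,  3,   5,  END, END, END, END };

int main(void)
{
    /* The "directory entry" only records where the chain starts. */
    for (int i = 0; i != END; i = next[i])
        putchar(data[i]);
    putchar('\n');   /* prints "FAT!" */
    return 0;
}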
One fascinating exercise we were assigned as part of our software engineering education was to convert an application using memory pointers into an equivalent application using integer values. The rationale was that some processors (the PDP-11, or another PDP-xx) would perform integer arithmetic much faster than memory pointer operations, or maybe even forbade arithmetic on pointers.
I have a program which reads all the file system's file/directory names, sizes, etc. and populates a tree data structure with them. Once this is done, it generates a report.
I want to write my program so that it collects and then reports this data using memory in the most efficient way, without exceeding my heap space.
I worry that if the file system has a lot of files and directories, the program will consume a lot of memory and might eventually run out (malloc() will start to fail).
Ultimately this is genuine memory consumption; are there any methods/techniques to overcome this?
You could employ the Flyweight Design Pattern for each folder node.
http://en.wikipedia.org/wiki/Flyweight_pattern
Instead of storing the full path for each item, you could store pointers to shared partial paths (the folder names). The full paths could then be easily reconstructed when needed.
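A minimal sketch of that idea (with hypothetical structure names): each node stores only its own name plus a pointer to its parent, and the full path is rebuilt on demand instead of being stored per node.

#include <stdio.h>

/* Sketch: each node stores only its own name component and a pointer to
 * its parent; full paths are reconstructed on demand. */
struct node {
    const char  *name;    /* this component only, e.g. "bin" */
    struct node *parent;  /* NULL for the root */
};

static void print_path(const struct node *n)
{
    if (n->parent) {
        print_path(n->parent);
        putchar('/');
    }
    fputs(n->name, stdout);
}

int main(void)
{
    struct node root = { "", NULL };      /* empty name so the path starts with '/' */
    struct node usr  = { "usr", &root };
    struct node bin  = { "bin", &usr };
    struct node ls   = { "ls", &bin };

    print_path(&ls);   /* prints "/usr/bin/ls" */
    putchar('\n');
    return 0;
}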
It also depends on what you need for your report. Do you need to hold all the information in memory during construction, or could you just accumulate some of the space count variables as you traverse the tree?
Perhaps using valgrind or Boehm's garbage collector could help you (at least on Linux).