I'm learning about in-kernel data transfer between two file descriptors in Linux and came across something I cannot understand. Here is a quote from the copy_file_range manpage:
copy_file_range() gives filesystems an opportunity to implement "copy
acceleration" techniques, such as the use of reflinks (i.e., two or
more i-nodes that share pointers to the same copy-on-write disk
blocks) or server-side-copy
I used to think of index nodes as something returned by the stat/statx syscalls. The st_ino type is typedef'd in the kernel headers as
typedef unsigned long __kernel_ulong_t;
So what does "two or more i-nodes that share pointers to the same copy-on-write disk blocks" even mean?
As I understand it, the fact that copy_file_range does not need to pass the data through user space means the kernel doesn't have to load the data from the disk at all (it still might, but it doesn't have to), and this allows further optimization by pushing the operation down the file-system stack. This covers the case of server-side copy over NFS.
The actual answer about the other optimizations starts with an intro into how files are stored; you may skip it if you already know that.
There are 3 layers in how files are stored in a typical Linux FS:
The file entry in some directory (which is itself a file containing a list of such entries). Such an entry essentially maps a file name to some inode. It does so by storing the inode number (aka st_ino), which is effectively a pointer to the inode in some table.
The inode, which contains some shared (see below) metadata (such as what stat returns) and some pointer(s) to the data block(s) that store the actual file contents.
The actual data blocks.
So, for example, a hard link is a record in some directory that points to the same inode as the "original" file (and increments the "link counter" inside the inode). This means that only the file names (and possibly the directories) are different; all the rest of the data and metadata is shared between hard links. Note that creating a hard link is a very fast way to "copy" a file. The only drawback is that both files are now bound to share their contents forever, so this is not a true copy. But if we used some copy-on-write method to fix the "write" part, it would work very nicely. This is what some FSes (such as Btrfs) support via reflinks.
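You can see the directory-entry/inode split with stat itself. A minimal sketch (the file names a.txt and b.txt are made up; assume b.txt was created as a hard link of a.txt, e.g. with ln a.txt b.txt): both names report the same st_ino and a link count of 2.

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    const char *names[] = { "a.txt", "b.txt" };
    for (int i = 0; i < 2; i++) {
        struct stat st;
        if (stat(names[i], &st) == -1) {
            perror(names[i]);
            return 1;
        }
        /* Both directory entries point at the same inode, so st_ino matches
         * and st_nlink is 2 for each of them. */
        printf("%s: inode %lu, links %lu, size %lld\n",
               names[i], (unsigned long)st.st_ino,
               (unsigned long)st.st_nlink, (long long)st.st_size);
    }
    return 0;
}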
The idea of this copy-on-write trick is that you can create a new inode with its own appropriate metadata but still share the same data blocks. You also add cross-references between the two inodes in the "invisible" part of the inode metadata so they know they share the data blocks. Obviously this operation is very fast compared to real copying. And again, as long as the files are only read, everything works perfectly. But unlike with hard links, we can treat writes as independent as well. When some write is performed, the FS checks whether the file (or rather the inode) is really the sole owner of the data blocks, and if not, copies the data before writing to it. Depending on the FS implementation, it can copy the whole file on the first write, or it can store more detailed metadata and copy only the blocks that have to be modified, still sharing the rest between the files. In the latter case, a block might not need to be copied at all if the write covers the whole block.
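On filesystems that support reflinks (such as Btrfs or XFS), user space can request such a clone explicitly with the FICLONE ioctl; this is roughly what cp --reflink does. A minimal sketch with made-up file names:

#include <fcntl.h>
#include <linux/fs.h>     /* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("big.dat", O_RDONLY);
    int dst = open("clone.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src == -1 || dst == -1) { perror("open"); return 1; }

    /* Ask the FS to make clone.dat share big.dat's data blocks (copy-on-write).
     * Fails with EOPNOTSUPP or EXDEV where reflinks are not supported. */
    if (ioctl(dst, FICLONE, src) == -1) {
        perror("ioctl(FICLONE)");
        return 1;
    }
    close(src);
    close(dst);
    return 0;
}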
So the simplest trick copy_file_range() can do is to check whether the whole file is actually being copied and, if so, perform the reflink trick described above (provided the FS supports it).
Some more advanced optimizations are also possible if the FS supports more detailed metadata on data blocks. Assume you copy the first N bytes from the start of the file into a new file. Then the FS can just share the starting blocks and probably has to copy only the last one, which is not fully covered by the copy.
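For reference, here is a minimal sketch of calling copy_file_range() itself (glibc 2.27 or later; the file names are made up). Whether the kernel uses a reflink, a server-side copy, or a plain in-kernel copy is up to the filesystem:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int in = open("src.bin", O_RDONLY);
    int out = open("dst.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in == -1 || out == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(in, &st) == -1) { perror("fstat"); return 1; }

    /* Copy the whole file without the data ever entering user space.
     * A single call may copy less than requested, so loop. */
    off_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (n == -1) { perror("copy_file_range"); return 1; }
        remaining -= n;
    }
    close(in);
    close(out);
    return 0;
}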
Related
Instead of storing references to next nodes in a table, why couldn't it be just stored like a conventional linked list, that is, with a next pointer?
This is due to alignment. FAT (and just about any other file system) stores file data in one or more whole sectors of the underlying storage. Because the underlying storage can only read and write whole sectors, such allocation allows efficient access to the contents of a file.
Issues with interleaving
When a program wants to store something in a file, it provides a buffer, say 1 MB of data to store. Now, if the file's data sectors also had to keep next pointers to their next sector, this pointer information would need to be interleaved with the actual user data. So the file system would need to build another buffer (of slightly more than the provided 1 MB), copy some of the user data and the corresponding next pointer into it for each output sector, and give this new buffer to the storage. This would be somewhat inefficient. And unless the file system always writes file data to new sectors (and most usually don't), rewriting these next pointers would also be redundant.
The bigger problem would be when a read operation is attempted on the file. Files would now work like tape devices: with only the location of the first sector known from the file's primary metadata, in order to reach sector 1000, the file system would need to read all sectors before it in order: read sector 0, find the address of sector 1 from the loaded next pointer, read sector 1, and so on. With typical seek times of around 10 ms per random I/O (assuming a hard disk drive), reaching sector 1000 would take about 10 seconds. Even if the sectors are sequentially ordered, while the file system driver processes sector N's data, the disk head will be flying over the next sector, and when the read for sector N+1 is issued it may be too late, requiring the disk to rotate an entire revolution (8.3 ms for a 7200 RPM drive) before it can read that sector. The on-disk cache can and will help with that, though.
Writing a single sector is usually an atomic operation (depending on the hardware): reading back the sector after a power failure returns either its old content or the new one, without intermediate states. Database applications usually need to know which writes will be atomic. If the file system interleaved file data and metadata in the same sectors, it would need to report a smaller size than the actual sector size to the application: for example, instead of 512 bytes it might need to report 504. But it can't do that, because applications usually assume the sector size is a power of 2. Furthermore, a file stored on such a filesystem would very likely be unusable if copied to another file system with a different reported sector size.
Better approaches
The FAT format is better because all the next pointers are stored in adjacent sectors. For FAT12, FAT16 and not-very-large FAT32 volumes, the entire table is small enough to fit in memory. FAT still records the blocks of a file as a linked list, so to have efficient random access an implementation needs to cache the chain per file. On large enough volumes (which can hold large enough files), such a cache may no longer fit in memory.
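A rough sketch of that per-file chain cache, assuming the FAT has already been loaded into memory as an array of "next cluster" values (the names fat and FAT_EOC and the 16-bit entry width are illustrative, not any particular on-disk layout):

#include <stdint.h>
#include <stdlib.h>

#define FAT_EOC 0xFFFF   /* illustrative end-of-chain marker */

/* Walk the FAT chain starting at first_cluster and return an array of the
 * file's cluster numbers in order, so that cluster i of the file can later
 * be found in O(1) instead of i table lookups. */
uint16_t *cache_chain(const uint16_t *fat, uint16_t first_cluster, size_t *out_len)
{
    size_t cap = 16, len = 0;
    uint16_t *chain = malloc(cap * sizeof *chain);
    if (!chain)
        return NULL;

    for (uint16_t c = first_cluster; c != FAT_EOC; c = fat[c]) {
        if (len == cap) {
            cap *= 2;
            uint16_t *tmp = realloc(chain, cap * sizeof *chain);
            if (!tmp) { free(chain); return NULL; }
            chain = tmp;
        }
        chain[len++] = c;
    }
    *out_len = len;
    return chain;
}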
ext3 uses direct and indirect blocks. This simple format avoids the preprocessing that FAT requires and gets by with only a minimal number of additional reads per I/O when indirect blocks are needed. These additional reads are cached by the operating system, so their overhead is often negligible.
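A simplified sketch of that lookup, assuming an ext2/ext3-style inode with 12 direct pointers and one single-indirect block (double- and triple-indirect levels omitted; all names here are illustrative):

#include <stdint.h>

#define NDIRECT      12
#define BLOCK_SIZE   4096
#define PTRS_PER_BLK (BLOCK_SIZE / sizeof(uint32_t))

/* Hypothetical ext2/ext3-like inode: 12 direct pointers plus one
 * single-indirect block. */
struct toy_inode {
    uint32_t direct[NDIRECT];
    uint32_t single_indirect;
};

/* Map a logical block index within the file to an on-disk block number.
 * read_block() stands in for "read this block number from the disk".
 * At most one extra read (the indirect block) is needed per lookup. */
uint32_t block_for(const struct toy_inode *ino, uint32_t logical,
                   void (*read_block)(uint32_t blockno, void *buf))
{
    if (logical < NDIRECT)
        return ino->direct[logical];

    uint32_t ptrs[PTRS_PER_BLK];
    read_block(ino->single_indirect, ptrs);
    return ptrs[logical - NDIRECT];   /* caller must stay within range */
}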
Other variants are also possible and used by various file systems.
Random notes
For the sake of completeness, some hard disk drives can be formatted with slightly larger sector sizes (say, 520 bytes) so that the file system can pack 512 bytes of file data together with several bytes of metadata in the same sector. Yet, because of the above, I don't believe anyone has used such formats to store the address of the file's next sector. These additional bytes can be put to better use: additional checksums and timestamping come to mind. The timestamping, I believe, is used to improve the performance of some RAID systems. Still, such usage is rare, and most software can't work with such drives at all.
Some file systems can save the content of small enough files in the file metadata directly without occupying distinct sectors. ReiserFS has the controversial tail packing. This is not important here: large files still benefit from having proper mapping to storage sectors.
Any modern OS requires much more from its file system than a pointer to the next data block: attributes (encryption, compression, hidden, ...), security descriptors (ACL entries), support for different hardware, buffering. This is just a tiny fraction of the functionality that any good file system provides.
Have a look at the file system article on Wikipedia to learn what else any modern file system does.
If we ignore the detail of FAT12 sharing a byte between two entries to pack 12 bits into 1.5 bytes, then we can concentrate on the deeper meaning of the question.
It turns out that the FAT scheme is equivalent to a linked list, with the following differences (a small sketch follows the list):
The "next" pointer is located in an array (the FAT) instead of being appended or prepended to the actual data
The value written in "next" is an integer instead of the more familiar memory address of the next node.
The nodes are not allocated dynamically but are represented by another array: the entire data area of the hard drive.
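In C terms, the equivalence looks roughly like this (illustrative declarations only; the real FAT entry width is 12, 16 or 28 bits depending on the variant):

#include <stdint.h>

#define SECTOR_SIZE 512
#define NCLUSTERS   4096      /* illustrative volume size */

/* A conventional in-memory linked list: the link lives next to the data. */
struct node {
    struct node  *next;                 /* memory address of the next node */
    unsigned char data[SECTOR_SIZE];
};

/* The FAT arrangement: links in one array, data in another, and "next" is
 * an integer index into those arrays instead of a memory address. */
uint16_t      fat[NCLUSTERS];                     /* fat[i] = next cluster after i */
unsigned char clusters[NCLUSTERS][SECTOR_SIZE];   /* the data area of the drive    */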
One fascinating exercise we were assigned as part of our software engineering education was to convert an application using memory pointers into an equivalent application using integer values. The rationale was that some processors (the PDP-11, or another PDP-xx) would perform integer arithmetic much faster than memory pointer operations, or maybe even forbade pointer arithmetic altogether.
Prepending to a large file is difficult, since it requires pushing all other characters forward. However, could it be done by manipulating the inode as follows?
Allocate a new block on disk and fill it with your prepend data.
Tweak the inode to tell it your new block is now the first block, and to bump the former first block to the second block position, the former second block to the third position, and so on.
I realize this still requires bumping blocks forward, but it should be more efficient than having to use a temp file.
I also realize the new first block will be a "short" block (not all the data in the block is part of the file), since your prepend data is unlikely to be exactly the same size as a block.
Or, if inode blocks are simply linked, it would require very little work to do the above.
NOTE: my last experience directly manipulating disk data was with a Commodore 1541, so my knowledge may be a bit out of date...
Modern-day operating systems should not allow a user to do that, as inode data structures are specific to the underlying file system.
If your file system/operating system supports it, you could make your file a sparse file by prepending empty data at the beginning, and then writing to the sparse blocks. In theory, this should give you what you want.
YMMV, I'm just throwing around ideas. ;)
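Linux does in fact have a mechanism along these lines, although it is not what the answer above describes: fallocate() with FALLOC_FL_INSERT_RANGE inserts a hole at the start of the file by shifting the existing blocks (supported on ext4 and XFS; the offset and length must be multiples of the filesystem block size). A hedged sketch, with a made-up file name and an assumed 4096-byte block size:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>   /* FALLOC_FL_INSERT_RANGE */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "new header\n";
    const off_t blksz = 4096;   /* must match the filesystem block size */

    int fd = open("big.log", O_RDWR);
    if (fd == -1) { perror("open"); return 1; }

    /* Shift the whole file right by one block, creating a hole at offset 0. */
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 0, blksz) == -1) {
        perror("fallocate");
        return 1;
    }
    /* Fill as much of the hole as we actually need; the rest stays sparse,
     * which is exactly the "short block" caveat from the question. */
    if (pwrite(fd, msg, strlen(msg), 0) == -1) { perror("pwrite"); return 1; }

    close(fd);
    return 0;
}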
This could work! Yes, userland programs should not be mucking around with inodes. Yes, it necessarily depends on whatever scheme is used to track blocks by whatever file systems implement this function. None of this is a reason to reject the proposal out of hand.
Here is how it could work.
For the sake of illustration, suppose we have an inode that tracks blocks with an array of direct pointers to data blocks. Further suppose that the inode carries a starting-offset and an ending-offset that apply to the first and last blocks respectively, so you can have less-than-full blocks at both the beginning and the end of a file.
Now, suppose you want to prepend data. It would go something like this.
IF (new data will fit into unused space in first data block)
    write the new data to the beginning of the first data block
    update the starting-offset
    return success indication to caller
try to allocate a new data block
IF (block allocation failed)
    return failure indication to caller
shift all existing data block pointers down by one
write the ID of the newly-allocated data block into the first slot of the array
write as much data as will fit into the second block (the old first block)
write the rest of data into the newly-allocated data block, shifted to the end
starting-offset := (data block size - length of data in first block)
return success indication to caller
In C, when opening a file with
FILE *fin;
fin=fopen("file.bin","rb");
I only have a pointer to a FILE structure. Where is the actual FILE struct allocated on a Windows machine? And does it contain all the necessary information for accessing the file?
My aim is to dump the whole data segment to disk and then to reload the dumped file back to the beginning of the data segment. The code that reloads the dumped file is placed in a separate function. This way, the fin pointer is local and on the stack, and thus is not overwritten on reload. But the FILE struct itself is not local. I take care not to overwrite the memory region of size sizeof(FILE) that starts at the address in fin.
The
fread(DataSegStart,1,szTillFin,fin);
fread(dummy,1,sizeof(FILE),fin);
fread(DataSegAfterFin,1,szFinTillEnd,fin);
operations complete successfully, but I get an assertion failure on
fclose(fin)
Do I overwrite some other necessary file data other than in the FILE struct?
The actual instance of the FILE structure exists within the standard library. Typically the standard library allocates some number of FILE structures, which may or may not be a fixed number. When you call fopen(), it returns a pointer to one of those structures.
The data within the FILE structure likely contains pointers to other things such as buffers. You're unlikely to be able to save and restore those structures to disk without some really deep integration with your standard library implementation.
You may be interested in something like CryoPID which does process save and restore at a different level.
It seems like you're trying to do something dangerous and unlikely to work.
fopen allocates a FILE structure and initializes it. fclose releases it. How it allocates it and what it puts in it is implementation dependent. It could contain a pointer to another piece of memory, which is also allocated somewhere (since it's buffered I/O, I guess it does allocate a buffer somewhere).
Writing code that relies on the internals of fopen is dangerous, most likely won't work, and surely won't be stable and portable.
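A minimal sketch of the portable alternative implied by these answers: persist the things you can re-create the stream from (the path and the current offset), not the FILE object itself. The helper names are made up for illustration:

#include <stdio.h>

/* Save only what is needed to re-create the stream later. */
long save_position(FILE *f)
{
    return ftell(f);            /* -1 on error */
}

/* Re-open the file and seek back to the saved offset after the reload. */
FILE *restore_position(const char *path, long offset)
{
    FILE *f = fopen(path, "rb");
    if (f != NULL && fseek(f, offset, SEEK_SET) != 0) {
        fclose(f);
        return NULL;
    }
    return f;
}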
Well, you have a pointer to a FILE object, so technically you know where it is, but you should be aware that FILE is deliberately an opaque type. You shouldn't need to know what it contains; you just need to know that you can pass it to functions that know about it to perform certain actions. FILE may not even be a complete type, so sizeof(FILE) might not be meaningful, and the object might contain pointers to other structures. Simply avoiding overwriting the FILE object itself is unlikely to be sufficient to keep you from corrupting the program when you write over most of its memory.
FILE is defined in stdio.h. It contains all the information about the open stream but, looking at the code you show, I think you misunderstand its purpose. It is created by the C library, working with the operating system, which fills the FILE with information about the open file; that information is not contained in the file itself.
I am in the middle of writing some software in C that recursively lists all files in a given directory and now I need to work out the internal fragmentation.
I have spent a long time researching this and have found out that the internal fragmentation on ext2 only occurs in the last block. I know that, in theory, you should be able to get the first and last block addresses from an inode number, but I have no idea how.
I have looked into stat(), fcntl() and all sorts of ways. How do I get the last block address from an inode number?
I have also figured out that once I have the address of the last block, I can test how much free space is in that block, and this will give me the internal fragmentation.
I know that there is a get_inode and a get_block command but have no idea apart from that!
I don't think you can get at the addresses of disk blocks via regular system calls such as stat(). You would probably have to find the raw inode on disk (which means accessing the raw disk and requires elevated privileges) and process the data from there.
Classically, you'd find direct blocks, indirect blocks, double-indirect blocks and a triple-indirect block for a file. However, the relevant file system type is about as dead as the dodo (I don't think I've seen that file system type this millennium), so that's unlikely to be much help now.
There might be a non-standard system call to get at the information, but I doubt it.
Maybe you are overcomplicating this, but roughly, the internal fragmentation can be calculated from the file size and the block size: take the file size modulo the block size, and the slack in the last block is the block size minus that remainder (unless the remainder is zero).
But this is only valid if the file is a "classic" one: with sparse files or files holding a lot of "other information" (such as huge ACLs or extended attributes), there might be a difference. (I don't know where those are stored, but I could imagine file systems storing them in the last block, effectively, but unnoticeably, reducing the internal fragmentation.)
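A minimal sketch of that calculation using only stat(), assuming a non-sparse file and that the preferred I/O block size reported in st_blksize matches the filesystem's allocation block size:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    struct stat st;
    if (stat(argv[1], &st) == -1) {
        perror("stat");
        return 1;
    }

    /* Slack in the last block: block size minus the part of it actually used. */
    long long used_in_last = st.st_size % st.st_blksize;
    long long slack = (used_in_last == 0) ? 0 : st.st_blksize - used_in_last;

    printf("size %lld, block size %ld, internal fragmentation %lld bytes\n",
           (long long)st.st_size, (long)st.st_blksize, slack);
    return 0;
}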
There's no way in any operating system I'm aware of for a program to prepend data to a file efficiently.
And yet, this doesn't seem difficult -- the file system can add another extent to a file descriptor if needed.
So the question is, why don't operating systems implement this (rather trivial) operation?
I don't think it's as easy as you suggest. It's true that the file-system could allocate a new block, store the prepended data in it, change the file pointer to point to that block and then chain the rest of the file from that block. Just like adding a node to the front of a linked list, right?
But what happens when (as is probably the case) the prepended data doesn't fill the assigned block? I don't imagine that many filesystems have a mechanism for chaining partial blocks, but even if they do, it would result in huge inefficiencies. You'd end up with a file consisting of mostly empty blocks, and you'd have to read and write the entire file to defragment it. You might as well just do the read-and-write operation up front when you're prepending in the first place.
In prepending or appending data to a file, there is always an issue of allocating space. Appending additional space to a file is much easier than prepending because of file descriptors pointing to the beginning of a file's stream. If you want to append to a file, the file descriptor need not be changed, just the size of the file and the allocated memory. If you want to prepend to a file, a new file descriptor must be immediately instantiated, either to write the prepended data to first or to store the location of the data being prepended while the original is updated.
Making a new file descriptor can be tricky, as you must also update any references pointing to it. This is why it is easy for an OS to implement appending data and slightly harder to implement prepending.