What is the most efficient way to copy many files programmatically? - c

Once upon a time long ago, we had a bash script that works out a list of files that need to be copied based on some criteria (basically like a filtered version of cp -rf).
This was too slow and was replaced by a C++ program.
What the C++ program does is essentially:
foreach file
read entire file into buffer
write entire file
The program uses Posix calls open(), read() and write() to avoid buffering and other overheads vs iostream and fopen, fread & fwrite.
Is it possible to improve on this?
Notes:
I am assuming these are not sparse files
I am assuming GNU/Linux
I am not assuming a particular filesystem is available
I am not assuming prior knowledge of whether the source and destination are on the same disk.
I am not assuming prior knowledge of the kind of disk, SSD, HDD maybe even NFS or sshfs.
We can assume the source files are on the same disk as each other.
We can assume the destination files will also be on the same disk as each other.
We cannot assume whether the source and destinations are on the same disk or or not.
I think the answer is yes but it is quite nuanced.
Copying speed is of course limited by disk IO not CPU.
But how can we be sure to optimise our use of disk IO?
Maybe the disk has the equivalent of multiple read or write heads available? (perhaps an SSD?)
In which case performing multiple copies in parallel will help.
Can we determine and exploit this somehow?
This is surely well trod territory so rather than re-invent the wheel straight away (though that is always fun) it would be nice to hear what others have tried or would recommend.
Otherwise I will try various things and answer my own question sometime in the distant future.
This is what my evolving answer looks like so far...
If the source and destination are different physical disks then
we can at least read and write at the same time with something like:
writer thread
read from write queue
write file
reader thread
foreach file
read file
queue write on writer thread
If the source and destination are on the same physical disk and we happen to be on a filesystem
with copy on write semantics (like xfs or btrfs) we can potentially avoid actually copying the file at all.
This is apparently called "reflinking".
The cp command supports this using --reflink=auto.
See also:
https://www.reddit.com/r/btrfs/comments/721rxp/eli5_how_does_copyonwrite_and_deduplication_work/
https://unix.stackexchange.com/questions/80351/why-is-cp-reflink-auto-not-the-default-behaviour
From this question
and https://github.com/coreutils/coreutils/blob/master/src/copy.c
it looks as if this is done using an ioctl as in:
ioctl (dest_fd, FICLONE, src_fd);
So a quick win is probably:
try FICLONE on first file.
If it succeeds then:
foreach file
srcFD = open(src);
destFD = open(dest);
ioctl(destFD,FICLONE,srcFD);
else
do it the other way - perhaps in parallel
In terms of low-level system APIs we have:
copy_file_range
ioctl FICLONE
sendfile
I am not clear when to choose one over the other except that copy_file_range is not safe to use with some filesystems notably procfs.
This answer gives some advice and suggests sendfile() is intended for sockets but in fact this is only true for kernels before 2.6.33.
https://www.reddit.com/r/kernel/comments/4b5czd/what_is_the_difference_between_splice_sendfile/
copy_file_range() is useful for copying one file to another (within
the same filesystem) without actually copying anything until either
file is modified (copy-on-write or COW).
splice() only works if one of the file descriptors refer to a pipe. So
you can use for e.g. socket-to-pipe or pipe-to-file without copying
the data into userspace. But you can't do file-to-file copies with it.
sendfile() only works if the source file descriptor refers to
something that can be mmap()ed (i.e. mostly normal files) and before
2.6.33 the destination must be a socket.
There is also a suggestion in a comment that reading multiple files then writing multiple files will result in better performance.
This could use some explanation.
My guess is that it tries to exploit the heuristic that the source files and destination files will be close together on the disk.
I think the parallel reader and writer thread version could perhaps do the same.
The problem with such a design is it cannot exploit any performance gain from the low level system copy APIs.

The general answer is: Measure before trying another strategy.
For HDD this is probably your answer: https://unix.stackexchange.com/questions/124527/speed-up-copying-1000000-small-files

Ultimately I did not determine the "most efficient" way but I did end up with a solution that was sufficiently fast for my needs.
generate a list of files to copy and store it
copy files in parallel using openMP
#pragma omp parallel for
for (auto iter = filesToCopy.begin(); iter < filesToCopy.end(); ++iter)
{
copyFile(*iter);
}
copy each file using copy_file_range()
falling back to using splice() with a pipe() when compiling for old platforms not supporting copy_file_range().
Reflinking, as supported by copy_file_range(), to avoid copying at all when the source and destination are on the same filesystem is a massive win.

Related

pread/pwrite, buffers and disk cache

If my code does something like fd = open("/dev/sdXY", ...) and pwrite(fd, ...)/pread(fd, ...), do the I/O operations skip the buffers or disk cache? Suppose /dev/sdXY is a unmounted, formatted disk partition (ext4, ufs, etc.).
I ask that because there is a need to grant contiguous file storage in an application I'm working on and I read that the only way to achieve it is doing something like what I described. However, I may remove the need for contiguous storage if that would lead in lost of buffers, disk cache or some other useful feature.
I'm also confused about if I would need to re-implement low level stuff since the partition would already be formatted with a file system. I read that would be the case for RAW disks/partitions. I already know it will be needed to handle which blocks are free or in use, files and folders structures, etc., I'm already working on that.
Another question: I have only seen something about buffers when reading about fopen()/fread()/fwrite() and C++'s file streams. Is it right that only these streams and the f* family of functions have some kind of buffer, unlike open/write/read/pwrite/pread/etc? Is this buffer the same as disk cache or something different?
A last one: Is HDD cache handled by its own drive or by file system (ext4, ufs, etc.)?
The simple answer is 'it depends'. What's hard is characterizing what it depends on.
Simply using open() doesn't avoid the kernel disk buffer pool. To do that, you need special options (O_DIRECT) on Linux. However, using open() does avoid using hidden application buffers; you get to choose where the data is read from or written to without any intermediate copies. By contrast, the f* family of functions do have a 'hidden' application buffer; the data is frequently read into an I/O buffer associated with the FILE * file stream, and then copied into your application buffers.
If your /dev/sdXY device is already formatted with a file system but you want to ensure contiguous file storage for a file, you are going to have to replicate a significant portion of the file system driver to ensure you allocate the space correctly. It is unlikely to be a sensible use of your time or energy. Yes, you would need to reimplement all sorts of low-level disk space management — it would be entirely non-trivial. Further, the implementation for ext4 would be quite different from the implementation for ufs, etc — so you'd really have your work cut out for you.

Is there a better way to manage file pointer in C?

Is it better to use fopen() and fclose() at the beginning and end of every function that use that file, or is it better to pass the file pointer to every of these function ? Or even to set the file pointer as an element of the struct the file is related to.
I have two projects going on and each one use one method (because I thought about passing the file pointer after I began the first one).
When I say better, I mean in term of speed and/or readability. What's best practice ?
Thank you !
It depends. You certainly should document what function is fopen(3)-ing a FILE handle and what function is expecting to fclose(3) it.
You might put the FILE* in a struct but you should have a convention about who and when should the file be read and/or written and closed.
Be aware that opened files are some expansive resources in a process (=your running program). BTW, it is also operating system and file system specific. And FILE handles are buffered, see fflush(3) & setvbuf(3)
On small systems, the maximal number of fopen-ed files handles could be as small as a few dozens. On a current Linux desktop, a process could have a few thousand opened file descriptors (which the internal FILE is keeping, with its buffers). In any case, it is a rather precious and scare resource (on Linux, you might limit it with setrlimit(2))
Be aware that disk IO is very slow w.r.t. CPU.

How do I make files save for concurrent C access?

I have several C-programs, which are accessing (read: fprintf/ write fopen) at the same time different files on the file system. What is the best way to do this concurrent access save? should I write some sort of file locks (and whats the best way to do this?) or are there any better reading methods (preferably in the C99 standard lib, additional dependencies would be a problem)? or should I use something like SQLite?
edit:
I am using Linux as operating system.
edit:
I don't really want to write with different processes in same files, I'm dealing with a legacy monolith code, which saves intermediate steps in files for recycling. I want a way to speed the calculations up by running several calculations at the same time, which have the same intermediate results.
You could use fcntl() with F_SETLK or F_SETLKW:
struct flock lock;
...
fcntl( fd, F_SETLKW, &lock );
See more from man page fcntl(3) or this article.
You can make sure that your files do not get corrupted on concurrent writes from multiple threads/processes by using copy-on-the-write technique:
A writer opens the file it would like to update for reading.
The writer creates a new file with a unique name (mkostemps) and copies the original file into the copy.
The writer modifies the copy.
The writer renames the copy to the original name using rename. This happens atomically, so that users of the file either see the old version of it or the new, but never a partially updated file.
See Things UNIX can do atomically for more details.

mmap( ) vs read( )

I'm writing a bulk ID3 tag editor in C. ID3 tags are usually at the beginning of an mp3 encoded file, although older (version 1) tags are at the end. The app is designed to accept a directory and frame ID list from the command line, then recurse the directory structure updating all the ID3 tags it finds. The user may additionally choose to remove all older (version 1) tags. Another option is to simply display the current tags, without performing an update. The directory might contain 2 files or 2 million. If the user means to update the files, I was planning to load the entire file into memory, perform the updates, then save it (the file may be renamed as well). However, if the user only means to print the current ID3 tags, then loading the entire file seems excessive. After all the file could be 200mb.
I've read through this thread, which was insightful - mmap() vs. reading blocks
So my question is, what the most efficient way to go about this -- read(), mmap() or some combination? Design ideas welcome.
Edit: It's my understanding that mmap essentially delegates loading a file into memory, to the virtual memory subsystem. It seems to me, the VMM would be highly optimized on most systems as it's critical for system performance.
It really depends on what you're trying to do. If all you need to do is hop to a known offset and read out a small tag, read() may be faster (mmap() has to do some rather complex internal accounting). If you are planning on copying out all 200mb of the MP3, however, or scanning it for some tag that may appear at an unknown offset, then mmap() is likely a faster approach.
For example, if you need to shift the entire file down a few hundred bytes in order to insert an ID3 tag, one simple approach would be to expand the file with ftruncate(), mmap the file, then memmove() the contents down a bit. This, however, will destroy the file if your program crashes while it's running. You could also copy the contents of the file into a new file - this is another place where mmap() really shines; you can simply mmap() the old file, then copy all of its data into the new file with a single write().
In short, mmap() is great if you're doing a large amount of IO in terms of total bytes transferred; this is because it reduces the number of copies needed, and can significantly reduce the number of kernel entries needed for reading cached data. However mmap() requires a minimum of two trips into the kernel (three if you clean up the mapping when you're done!) and does some complex internal kernel accounting, and so the fixed overhead can be high.
read() on the other hand involves an extra memory-to-memory copy, and can thus be inefficient for large I/O operations, but is simple, and so the fixed overhead is relatively low. In short, use mmap() for large bulk I/O, and read() or pread() for one-off, small I/Os.
Don't bother with mmap unless your code is CPU bound, specifically due to lots small reads and writes. mmap may sound nice, but it isn't the awesome why isn't everyone using this alternative it looks like.
Given that you're recursing through potentially large directory structures, your bottleneck will be directory IO and concurrency. mmap is not going to help.
Update0
Reading the linked to question finds this answer that supports my experience:
mmap() vs. reading blocks
If you're not normally going to be streaming the file in and then processing it, but rather hopping around (like read the tags at the front and then jump to the end, etc.) then I would use mmap simply because your code will be cleaner and easier to maintain treating the file as a large buffer without having to actually manage the the buffering and paging yourself.
As has been mentioned, if you're processing a lot of data disk I/O is likely going to dominate your processing anyway. mmap may be faster than read, but for reasonable implementations, it's likely not THAT much faster, especially on todays hardware which has continually got faster and faster while disk drives have been stuck at 7200 and 10,000 RPM for years and years.
So, go with mmap and make your code easy and neat.
I don't know if standard POSIX functions reside inside what you are allowed or you will to use for the development but think about these two functions:
int ftruncate(int fildes, off_t length);
int truncate(const char *path, off_t length);
defined in unistd.h, which can be used to truncate a file up to a specified length. In this way you could easily
find where ID3 tags frame begins (don't know if you can compute it easily by just reading the header of the MP3 file but I guess yes)
save the offset
close the file
truncate the file with the provided function
open the file in append binary mode and write new tags
I'm not sure about the performance, you should test this method, but it should load much less things inside ram while providing a senseful way of doing it.

Direct access to hard disk with no FS from C program on Linux

I want to access the whole hard disk directly from a C program. There's no FS on it and never's gonna be one.
I just want to open /dev/sda (for example) and do I/O at the block/sector level of the disk.
I'm planning to write some programs for learning C programming in the Linux environment (I know C language, Python, Perl and Java) but lack confidence with the Linux environment.
For my learning purposes I'm thinking about playing with kyoto-cabinet and saving the value corresponding to the computed hash directly into a "block/sector" of the hard disk, recording the pair: "hash, block/sector reference" into a kyoto-cabinet hash database file.
I don't know if this is feasible using standard C I/O functions or otherwise I'd have to write a "device driver" or something like...
As mentioned elsewhere, under *NIX systems, block devices like /dev/sda can be accessed as plain files. Note that if file system is mounted from the device, opening it as file for writing would fail.
If you want to play with block devices, I would advise to first use the loop device, which presents a plain file as a block device. For example:
dd if=/dev/zero of=./loop_file_10MB bs=1024 count=10K
losetup /dev/loop0 $PWD/loop_file_10MB
After that, /dev/loop0 would behave as if it was a block device, but all information written would be stored in the file.
As device files for drives (e.g. /dev/sda) are block devices, this means you can open, seek and use the file almost like a normal file.
Yes, as others have noted, you can simply open the block device.
However, it's a really good idea to do IO (writes anyway) on block boundaries and whole blocks. You can use something like pread() and pwrite() to do these IO, or mmap some or all of the device.
There are a bunch of ioctls which can be used, see "man sd" for some more info. They don't seem to all be documented in the same place.
In linux/fs.h BLKROSET and a bunch of other ioctls are defined, you have to look around to find out how to use them. You can do useful things like find out how big the device is, and what the block size is.
The source code of the util-linux-ng package is your friend, it contains examples.

Resources