The fallocate(2) manpage in the Linux Programmer's Manual states:
If the FALLOC_FL_UNSHARE flag is specified in mode, shared file data extents will be made private to the file to guarantee that a subsequent write will not fail due to lack of space. Typically, this will be done by performing a copy-on-write operation on all shared data in the file. This flag may not be supported by all filesystems.
That's cool, but… How do I create shared file data extents in the first place?
Shared data extents are created when the underlying filesystem supports reflinks (for example, XFS and Btrfs) and you perform a cp with the --reflink flag or use the FICLONERANGE operation described in ioctl_ficlonerange(2).
Looking at the kernel code, I see FALLOC_FL_UNSHARE_RANGE being handled only in the case of XFS, so this flag to fallocate may work only on XFS as of now.
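For illustration, here is a hedged sketch (not from the manpage) that creates shared extents by cloning a whole file with FICLONERANGE and then unshares the clone again with FALLOC_FL_UNSHARE_RANGE. The file names are placeholders, error handling is minimal, and both files must live on the same reflink-capable filesystem (e.g. XFS or Btrfs):

// Sketch: clone a whole file with FICLONERANGE (creating shared extents),
// then unshare the destination's extents again with FALLOC_FL_UNSHARE_RANGE.
// Needs _GNU_SOURCE when compiled as C (for fallocate()).
#define _GNU_SOURCE 1
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/falloc.h>   // FALLOC_FL_UNSHARE_RANGE
#include <linux/fs.h>       // FICLONERANGE, struct file_clone_range

int main(void)
{
    int src = open("src.dat", O_RDONLY);
    int dst = open("clone.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(src, &st);

    struct file_clone_range fcr;
    fcr.src_fd = src;
    fcr.src_offset = 0;
    fcr.src_length = 0;      // 0 means "clone up to EOF"
    fcr.dest_offset = 0;
    if (ioctl(dst, FICLONERANGE, &fcr) < 0)
        perror("FICLONERANGE");            // fails on non-reflink filesystems

    // Later, guarantee that writes to the clone cannot fail for lack of space:
    if (fallocate(dst, FALLOC_FL_UNSHARE_RANGE, 0, st.st_size) < 0)
        perror("FALLOC_FL_UNSHARE_RANGE"); // may be unsupported (e.g. not XFS)

    close(src);
    close(dst);
    return 0;
}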
Related
Once upon a time long ago, we had a bash script that worked out a list of files that needed to be copied based on some criteria (basically a filtered version of cp -rf).
This was too slow and was replaced by a C++ program.
What the C++ program does is essentially:
foreach file
    read entire file into buffer
    write entire file
The program uses the POSIX calls open(), read() and write() to avoid the buffering and other overheads of iostream and fopen()/fread()/fwrite().
Is it possible to improve on this?
Notes:
I am assuming these are not sparse files
I am assuming GNU/Linux
I am not assuming a particular filesystem is available
I am not assuming prior knowledge of whether the source and destination are on the same disk.
I am not assuming prior knowledge of the kind of disk: SSD, HDD, maybe even NFS or sshfs.
We can assume the source files are on the same disk as each other.
We can assume the destination files will also be on the same disk as each other.
We cannot assume whether the source and destination are on the same disk or not.
I think the answer is yes but it is quite nuanced.
Copying speed is of course limited by disk I/O, not CPU.
But how can we be sure to optimise our use of disk IO?
Maybe the disk has the equivalent of multiple read or write heads available? (perhaps an SSD?)
In which case performing multiple copies in parallel will help.
Can we determine and exploit this somehow?
This is surely well-trodden territory, so rather than reinvent the wheel straight away (though that is always fun) it would be nice to hear what others have tried or would recommend.
Otherwise I will try various things and answer my own question sometime in the distant future.
This is what my evolving answer looks like so far...
If the source and destination are different physical disks then
we can at least read and write at the same time with something like:
writer thread
    read from write queue
    write file

reader thread
    foreach file
        read file
        queue write on writer thread
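Below is a minimal sketch of that pipeline, assuming C++17, one writer thread and an unbounded in-memory queue (a real implementation would bound it). The CopyJob type and copy_all() function are names of mine, not from the original program; it uses the same POSIX open()/read()/write() calls as above.

// Sketch: the calling thread reads files, a single writer thread writes them,
// so reads from the source disk and writes to the destination disk overlap.
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct CopyJob {                 // hypothetical job type
    std::string dest;
    std::vector<char> data;
};

static std::queue<CopyJob> jobs;
static std::mutex m;
static std::condition_variable cv;
static bool done = false;

static void writer_loop()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !jobs.empty() || done; });
        if (jobs.empty() && done) return;
        CopyJob job = std::move(jobs.front());
        jobs.pop();
        lock.unlock();
        int fd = open(job.dest.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) continue;                 // real code should report errors
        size_t off = 0;
        while (off < job.data.size()) {
            ssize_t n = write(fd, job.data.data() + off, job.data.size() - off);
            if (n <= 0) break;
            off += static_cast<size_t>(n);
        }
        close(fd);
    }
}

void copy_all(const std::vector<std::pair<std::string, std::string>>& files)
{
    std::thread writer(writer_loop);
    for (const auto& [src, dest] : files) {   // reader side
        int fd = open(src.c_str(), O_RDONLY);
        if (fd < 0) continue;
        struct stat st{};
        fstat(fd, &st);
        std::vector<char> buf(static_cast<size_t>(st.st_size));
        size_t off = 0;
        while (off < buf.size()) {
            ssize_t n = read(fd, buf.data() + off, buf.size() - off);
            if (n <= 0) break;
            off += static_cast<size_t>(n);
        }
        close(fd);
        {
            std::lock_guard<std::mutex> lock(m);
            jobs.push(CopyJob{dest, std::move(buf)});
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_one();
    writer.join();
}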
If the source and destination are on the same physical disk and we happen to be on a filesystem with copy-on-write semantics (like XFS or Btrfs) we can potentially avoid actually copying the file at all.
This is apparently called "reflinking".
The cp command supports this using --reflink=auto.
See also:
https://www.reddit.com/r/btrfs/comments/721rxp/eli5_how_does_copyonwrite_and_deduplication_work/
https://unix.stackexchange.com/questions/80351/why-is-cp-reflink-auto-not-the-default-behaviour
From this question
and https://github.com/coreutils/coreutils/blob/master/src/copy.c
it looks as if this is done using an ioctl as in:
ioctl(dest_fd, FICLONE, src_fd);
So a quick win is probably:
try FICLONE on the first file
if it succeeds then:
    foreach file
        srcFD = open(src);
        destFD = open(dest);
        ioctl(destFD, FICLONE, srcFD);
else:
    do it the other way - perhaps in parallel
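A hedged sketch of that quick win follows; the try_reflink_copy() helper is a name I made up, and it simply reports whether FICLONE worked so the caller can fall back to a conventional copy (typical failures are EOPNOTSUPP on non-reflink filesystems and EXDEV across filesystems).

// Sketch: try FICLONE; if the filesystem cannot reflink, tell the caller.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   // FICLONE

// Returns 0 on success, non-zero if the caller should fall back to a normal copy.
int try_reflink_copy(const char* src, const char* dest)
{
    int srcFD = open(src, O_RDONLY);
    if (srcFD < 0) return -1;
    int destFD = open(dest, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (destFD < 0) { close(srcFD); return -1; }
    int rc = ioctl(destFD, FICLONE, srcFD);   // EOPNOTSUPP/EXDEV => no reflink
    close(srcFD);
    close(destFD);
    return rc;
}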
In terms of low-level system APIs we have:
copy_file_range
ioctl FICLONE
sendfile
I am not clear on when to choose one over the other, except that copy_file_range is not safe to use with some filesystems, notably procfs.
This answer gives some advice and suggests that sendfile() is intended for sockets, but in fact that is only true for kernels before 2.6.33.
https://www.reddit.com/r/kernel/comments/4b5czd/what_is_the_difference_between_splice_sendfile/
copy_file_range() is useful for copying one file to another (within
the same filesystem) without actually copying anything until either
file is modified (copy-on-write or COW).
splice() only works if one of the file descriptors refers to a pipe. So
you can use it for e.g. socket-to-pipe or pipe-to-file transfers without
copying the data into userspace. But you can't do file-to-file copies with it.
sendfile() only works if the source file descriptor refers to
something that can be mmap()ed (i.e. mostly normal files) and before
2.6.33 the destination must be a socket.
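For completeness, here is a sketch of a whole-file copy with sendfile(), which is valid for file-to-file copies on kernels 2.6.33 and later (regular files assumed, minimal error handling; the sendfile_copy() name is mine):

// Sketch: whole-file copy with sendfile(2); file-to-file works on >= 2.6.33.
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int sendfile_copy(const char* src, const char* dest)
{
    int in = open(src, O_RDONLY);
    if (in < 0) return -1;
    struct stat st{};
    if (fstat(in, &st) < 0) { close(in); return -1; }
    int out = open(dest, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { close(in); return -1; }
    off_t remaining = st.st_size;
    while (remaining > 0) {
        // NULL offset: sendfile uses and advances the input file's offset
        ssize_t n = sendfile(out, in, NULL, static_cast<size_t>(remaining));
        if (n <= 0) { close(in); close(out); return -1; }
        remaining -= n;
    }
    close(in);
    close(out);
    return 0;
}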
There is also a suggestion in a comment that reading multiple files then writing multiple files will result in better performance.
This could use some explanation.
My guess is that it tries to exploit the heuristic that the source files and destination files will be close together on the disk.
I think the parallel reader and writer thread version could perhaps do the same.
The problem with such a design is that it cannot exploit any performance gain from the low-level system copy APIs.
The general answer is: Measure before trying another strategy.
For HDD this is probably your answer: https://unix.stackexchange.com/questions/124527/speed-up-copying-1000000-small-files
Ultimately I did not determine the "most efficient" way but I did end up with a solution that was sufficiently fast for my needs.
generate a list of files to copy and store it
copy the files in parallel using OpenMP
// filesToCopy is assumed to be a random-access container such as std::vector
#pragma omp parallel for
for (auto iter = filesToCopy.begin(); iter < filesToCopy.end(); ++iter)
{
    copyFile(*iter);   // one file per iteration, distributed across threads
}
copy each file using copy_file_range()
falling back to using splice() with a pipe() when compiling for older platforms that do not support copy_file_range() (a sketch of such a copyFile() follows below).
Reflinking, as supported by copy_file_range(), avoids copying anything at all when the source and destination are on the same filesystem, and is a massive win.
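Here is a hedged sketch of what such a copyFile() might look like; it is my reconstruction, not the code from the original program, and it takes explicit source and destination paths rather than a single list entry.

// Sketch: copy one file with copy_file_range(), falling back to splice()
// through a pipe if copy_file_range() is unavailable or unsupported.
// Needs _GNU_SOURCE when compiled as C; glibc >= 2.27 for copy_file_range().
#define _GNU_SOURCE 1
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

static int splice_fallback(int in, int out, off_t remaining)
{
    int pipefd[2];
    if (pipe(pipefd) < 0) return -1;
    while (remaining > 0) {
        // file -> pipe -> file, without copying through user space
        ssize_t n = splice(in, NULL, pipefd[1], NULL, 64 * 1024, 0);
        if (n <= 0) break;
        ssize_t written = 0;
        while (written < n) {
            ssize_t w = splice(pipefd[0], NULL, out, NULL,
                               static_cast<size_t>(n - written), 0);
            if (w <= 0) { close(pipefd[0]); close(pipefd[1]); return -1; }
            written += w;
        }
        remaining -= n;
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return remaining == 0 ? 0 : -1;
}

int copyFile(const char* src, const char* dest)
{
    int in = open(src, O_RDONLY);
    if (in < 0) return -1;
    struct stat st{};
    fstat(in, &st);
    int out = open(dest, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { close(in); return -1; }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t n = copy_file_range(in, NULL, out, NULL,
                                    static_cast<size_t>(remaining), 0);
        if (n <= 0) break;   // e.g. ENOSYS/EXDEV/EINVAL: fall back below
        remaining -= n;
    }
    int rc = 0;
    if (remaining > 0)       // copy_file_range() could not finish the job
        rc = splice_fallback(in, out, remaining);

    close(in);
    close(out);
    return rc;
}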
I need to know the creation datetime of the filesystem on a disk (in a Linux machine) with C. I would like to avoid using shell commands, such as
tune2fs -l /dev/sdb2 | grep 'Filesystem created:'
and make a parser.
Thanks
From a program coded in C (or in any language capable of calling C routines) you would use the stat(2) system call to query the timestamps of a given file (or directory); with recent kernels and some file systems, the statx(2) call can additionally report a file's creation ("birth") time. Of course, commands like ls(1) or stat(1) internally use that same stat(2) system call.
There is no standard, filesystem-neutral way to get the creation time of a given file system. That information is not always kept; I guess that FAT filesystems, or distributed file systems such as NFS, don't keep it.
You might use stat(2) on the mount point of that file system.
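One hedged approach along those lines is statx(2) with STATX_BTIME on the mount point: the birth time of the filesystem's root directory is often (but not guaranteed to be) the time the filesystem was created. "/mnt/data" below is a placeholder path; the statx() wrapper needs Linux >= 4.11 and glibc >= 2.28, and not every filesystem fills in stx_btime.

// Sketch: ask for the birth time of a filesystem's mount point with statx(2).
#define _GNU_SOURCE 1
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct statx stx;
    if (statx(AT_FDCWD, "/mnt/data", 0, STATX_BTIME, &stx) != 0) {
        perror("statx");
        return 1;
    }
    if (stx.stx_mask & STATX_BTIME)
        printf("birth time: %lld\n", (long long)stx.stx_btime.tv_sec);
    else
        puts("this filesystem does not report a birth time");
    return 0;
}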
The statfs(2) system call retrieves some filesystem information, but does not give any time stamps.
For ext4 file systems, see ext4(5) and use proc(5). You might parse /proc/mounts and some /proc/fs/ext4/*/ directory. The pseudofiles in /proc/ can be parsed quickly and usually won't involve physical disk IO.
You could also work at the ext2/3/4 disk partition level, on an unmounted ext[234] file system, with a library from (or programs from) e2fsprogs. You should not access (even just read) a disk partition containing a file system while that file system is mounted.
(your question should give some motivation and context)
Usually, shared memory is implemented using portions of on-disk files mapped into processes' address spaces. Whenever a memory access occurs on the shared region, the filesystem is involved to write changes on the disk which is a great overhead. Typically, a call to open() returns a file descriptor which is passed to mmap() to create the file's memory map. shm_open, apparently, works in the same way. It returns a file descriptor which can even be used with regular file operations (e.g. ftruncate, lseek, etc.). We do specify a string as a parameter to shm_open but, unlike open(), it is not the name of a real file on the visible filesystem (a mounted HDD, flash drive, SSD, etc.). The same string name can be used by totally unrelated processes to map the same region into their address spaces.
So, what is the string parameter passed to shm_open, and what does shm_open create/open? Is it a file on some temporary filesystem (/tmp) which is eventually used by many processes to create the shared region (well, I think it has to be some kind of file since it returns a file descriptor)? Or is it some kind of mysterious and hidden filesystem backed by the kernel?
People say shm_open is faster than fopen because no disk operations are involved, so the theory I suggest is that the kernel uses an invisible RAM-based filesystem to implement shared memory with shm_open!
Usually, shared memory is implemented using portions of on-disk files mapped into processes' address spaces.
This is generally false, at least on a desktop or laptop running a recent Linux distribution, with some reasonable amount of RAM (e.g. 8Gbytes at least).
So, the disk is not relevant. You could use shm_open without any swap. See shm_overview(7), and notice that /dev/shm/ is generally a tmpfs-mounted file system, so it doesn't use any disk. See tmpfs(5). And tmpfs doesn't use the disk (unless you reach thrashing conditions, which is unlikely) since it works in virtual memory.
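For illustration, a minimal sketch of the usual sequence; the name "/my_region" is arbitrary, and any process that passes the same name maps the same object (which appears under /dev/shm/ on Linux). Link with -lrt on older glibc.

// Sketch: create and map a POSIX shared memory object backed by tmpfs.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/my_region", O_CREAT | O_RDWR, 0600); // /dev/shm/my_region
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    char *p = (char *)mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from shared memory");  // visible to any process mapping "/my_region"

    munmap(p, 4096);
    close(fd);
    shm_unlink("/my_region");               // remove the name when done
    return 0;
}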
the filesystem is involved to write changes on the disk which is a great overhead.
This is usually wrong. On most systems, recently written files are in the page cache, which does not reach the disk quickly (BTW, that is why the shutdown procedure needs to call sync(2) which is rarely used otherwise...).
BTW, on most desktops and laptops, it is easy to observe: the hard disk has an LED, and you won't see it blinking when using shm_open and related calls. You could also use proc(5) (notably /proc/diskstats, etc.) to query the kernel about its disk activity.
Usually, shared memory is implemented using portions of on-disk files mapped into processes' address spaces. Whenever a memory access occurs on the shared region, the filesystem is involved to write changes on the disk which is a great overhead.
That seems rather presumptuous, and not entirely correct. Substantially all machines that implement shared memory regions (in the IPC sense) have virtual memory units by which they support the feature. There may or may not be any persistent storage backing any particular shared memory segment, or any part of it. Only the part, if any, that is paged out needs to be backed by such storage.
shm_open, apparently, works in the same way. It returns a file descriptor which can even be used with regular file operations (e.g. ftruncate, lseek, etc.).
That shm_open() has an interface modeled on that of open(), and that it returns a file descriptor that can meaningfully be used with certain general-purpose I/O functions, do not imply that shm_open() "works in the same way" in any broader sense. Pretty much all system resources are represented to processes as files. This affords a simpler overall system interface, but it does not imply any commonality of the underlying resources beyond the fact that they can be manipulated via the same functions -- to the extent that indeed they can.
So, what is the string parameter passed to shm_open, and what does shm_open create/open?
The parameter is a string identifying the shared memory segment. You already knew that, but you seem to think there's more to it than that. There isn't, at least not at the level (POSIX) at which the shm_open interface is specified. The identifier is meaningful primarily to the kernel. Different implementations handle the details differently.
Is it a file on some temporary filesystem (/tmp) which is eventually used by many processes to create the shared region
Could be, but probably isn't. Any filesystem interface provided for it is likely (but not certain) to be a virtual filesystem, not actual, accessible files on disk. Persistent storage, if used, is likely to be provided out of the system's swap space.
(well, I think it has to be some kind of file since it returns a file descriptor)?
Such a conclusion is unwarranted. Sockets and pipes are represented via file descriptors, too, but they don't have corresponding accessible files.
Or is it some kind of mysterious and hidden filesystem backed by the kernel?
That's probably a better conception, though again, there might not be any persistent storage at all. To the extent that there is any, however, it is likely to be part of the system's swap space, which is not all that mysterious.
How do I get the most recently accessed file in Linux?
I used the stat() call and checked st_atime, but it is not updated when I open and read the file.
You can check if your filesystem is mounted with the noatime or relatime option:
greek0#orest:/home/greek0$ cat /proc/mounts
/dev/md0 / ext3 rw,noatime,errors=remount-ro,data=ordered 0 0
...
These mount options are often used because they increase filesystem performance. Without them, every single read of a file turns into a write to the disk (for updating the atime).
In general, you can't rely on atime to have any useful meaning on most computers.
If it's Ok to only detect accesses to files that happen while your program is running, you can look into inotify. It provides a method to be notified of currently ongoing filesystem accesses.
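A minimal sketch, assuming you only need to watch a single directory; "/some/dir" is a placeholder path:

// Sketch: report files in a directory as they are opened/accessed, via inotify(7).
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    int fd = inotify_init1(0);
    if (fd < 0) { perror("inotify_init1"); return 1; }
    if (inotify_add_watch(fd, "/some/dir", IN_ACCESS | IN_OPEN) < 0) {
        perror("inotify_add_watch");
        return 1;
    }

    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {                       // blocks until something in /some/dir is touched
        ssize_t len = read(fd, buf, sizeof buf);
        if (len <= 0) break;
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->len > 0)
                printf("accessed: %s\n", ev->name);
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    close(fd);
    return 0;
}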
If that doesn't satisfy your requirements, I'm afraid you're out of luck.
POSIX compliance is a standard that is followed by many companies.
I have a few questions around this area:
1. Do all file systems need to be POSIX compliant?
2. Are applications also required to be POSIX compliant?
3. Are there any non-POSIX filesystems?
In the area of "requires POSIX filesystem semantics" what is typically meant is:
allows hierarchical file names and resolution (., .., ...)
supports at least close-to-open semantics
umask/Unix permissions, the three file times (atime, mtime, ctime)
8-bit byte support
supports atomic renames on the same filesystem
fsync() durability guarantees/limitations (including fsync() of directories)
supports multi-user protection (a resized file returns zero bytes, not the previous content)
rename and delete open files (Windows does not allow that)
file names supporting all bytes besides '/' and '\0'
Sometimes it also means symlink/hardlink support, as well as file names and 32-bit file pointers (at minimum). In some cases it is also used to refer to specific API features like fcntl() locking, mmap(), truncate() or AIO.
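To illustrate the atomic-rename and fsync() items in that list, here is a sketch of the classic write-temp-file-then-rename idiom that POSIX-style filesystems are expected to support. The paths and the replace_atomically() helper are placeholders of mine; the temp file must be on the same filesystem as the target for rename() to be atomic.

// Sketch: atomically replace a file's contents via a fsync'd temp file.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int replace_atomically(const char *dir, const char *tmp, const char *target,
                       const char *data)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, strlen(data)) < 0) { close(fd); return -1; }
    if (fsync(fd) < 0) { close(fd); return -1; }   // data durable before rename
    close(fd);

    if (rename(tmp, target) < 0) return -1;         // atomic on the same filesystem

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);    // fsync the directory so the
    if (dfd >= 0) { fsync(dfd); close(dfd); }       // rename itself is durable
    return 0;
}

Usage would be something like replace_atomically(".", "config.txt.tmp", "config.txt", "new contents\n").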
When I think about POSIX compliance for distributed file systems, I use the general standard that a distributed file system is POSIX compliant if multiple processes running on different nodes see the same behavior as if they were running on the same node using a local file system. This basically has two implications:
If the system has multiple buffer-caches, it needs to ensure cache consistency.
Various mechanisms to do so include locks and leases. An example of incorrect behavior in this case would be a writer who writes successfully on one node but then a reader on a different node receives old data.
Note however that if the writer/reader are independently racing one another, there is no correct defined behavior because they do not know which operation will occur first. But if they are coordinating with each other via some mechanism like messaging, then it would be incorrect if the writer completes (especially if it issues a sync call), sends a message to the reader which is successfully received by the reader, and then the reader reads and gets stale data.
If data is striped across multiple data servers, reads and writes that span multiple stripes must be atomic.
For example, when a reader reads across stripes at the same time as a writer writes across those same stripes, then the reader should either receive all stripes as they were before the write or all stripes as they were after the write. Incorrect behavior would be for the reader to receive some old and some new.
Contrary to the above, this behavior must work correctly even when the writer/reader are racing.
Although my examples were reads/writes to a single file, correct behavior also includes write/writes to a single file as well as read/writes and write/writes to the hierarchical namespace via calls such as stat/readdir/mkdir/unlink/etc.
Answering your questions in a very objective way:
1. Do all file systems need to be POSIX compliant?
Actually, no. POSIX defines standards for operating systems in general. They are good to have, but not really required.
2. Are applications also required to be POSIX compliant?
No.
3. Are there any non-POSIX filesystems?
HDFS (the Hadoop Distributed File System) is one example.