Read from disk after write instead of cache - c

Here's the task I'm trying to perform on a linux host, with a C program:
Write random data to the disk, call fsync() to flush the data to disk, then read back what was written from the disk to ensure the disk controller wrote the data correctly. The problem I am running into is that reads appear to be answered from the OS cache rather than from the device itself. Here's what I've already tried:
1. O_DIRECT (a gigantic pain in the butt, abandoned)
2. posix_fadvise(fd,0,0,POSIX_FADV_DONTNEED)
3. posix_fadvise(fd,0,0,POSIX_FADV_NOREUSE)
4. O_SYNC
5. O_ASYNC
In every case, iostat shows 0 rrqm/s and thousands of write requests. I could be a woefully uninformed Linux user, but my belief is that if no read I/Os show up in iostat, the reads are being answered by the OS cache instead of the device itself.
"Why not use iozone or iometer, or any of the billions of other tools that already stress disks?" Well, to be honest, if HP-UX's HAZARD worked on anything except HP-UX, I would, but nothing else comes close to what hazard can do, so I'm making my own.

You need to do the equivalent of the following shell commands:
sync # Instruct all data to get flushed to disk
echo 3 > /proc/sys/vm/drop_caches # Instruct VM system to clear caches
and then try reading the file again.
One way to do it from C would be something approximating:
sync();
int fd = open("/proc/sys/vm/drop_caches", O_WRONLY|O_TRUNC);  /* requires root */
write(fd, "3\n", 2);
close(fd);
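
Putting the pieces together, a minimal sketch of the whole write/flush/drop/verify cycle might look like this; the test path, block size and error handling are illustrative, and dropping the caches only works when running as root:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(void)
{
    char wbuf[BLOCK_SIZE], rbuf[BLOCK_SIZE];
    for (size_t i = 0; i < sizeof wbuf; i++)
        wbuf[i] = rand() & 0xff;                      /* random test pattern */

    /* 1. Write the pattern and flush it to the device. */
    int fd = open("/tmp/testfile", O_RDWR | O_CREAT | O_TRUNC, 0600);  /* illustrative path */
    if (fd < 0 || write(fd, wbuf, sizeof wbuf) != (ssize_t)sizeof wbuf || fsync(fd) < 0) {
        perror("write/fsync");
        return 1;
    }

    /* 2. Drop the page cache so the next read must come from the device (needs root). */
    sync();
    int dc = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (dc >= 0) {
        write(dc, "3\n", 2);
        close(dc);
    }

    /* 3. Read the data back and compare. */
    if (lseek(fd, 0, SEEK_SET) < 0 || read(fd, rbuf, sizeof rbuf) != (ssize_t)sizeof rbuf) {
        perror("read");
        return 1;
    }
    printf("verify: %s\n", memcmp(wbuf, rbuf, sizeof rbuf) == 0 ? "OK" : "MISMATCH");
    close(fd);
    return 0;
}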

You should not go through the file system to test a disk. You should read and write the raw partitions (e.g. /dev/sdc5).
On most current Linux systems and hardware, disks have a SMART interface. You should use it; see smartmontools and study its source code. (I guess there are some ioctl(2) calls related to that.)

Related

What is the most efficient way to copy many files programmatically?

Once upon a time long ago, we had a bash script that worked out a list of files that needed to be copied based on some criteria (basically like a filtered version of cp -rf).
This was too slow and was replaced by a C++ program.
What the C++ program does is essentially:
foreach file
    read entire file into buffer
    write entire file
The program uses the POSIX calls open(), read() and write() to avoid the buffering and other overheads of iostream and fopen(), fread() & fwrite().
Is it possible to improve on this?
Notes:
I am assuming these are not sparse files
I am assuming GNU/Linux
I am not assuming a particular filesystem is available
I am not assuming prior knowledge of whether the source and destination are on the same disk.
I am not assuming prior knowledge of the kind of disk: SSD, HDD, maybe even NFS or sshfs.
We can assume the source files are on the same disk as each other.
We can assume the destination files will also be on the same disk as each other.
We cannot assume whether the source and destinations are on the same disk or not.
I think the answer is yes but it is quite nuanced.
Copying speed is of course limited by disk I/O, not CPU.
But how can we be sure to optimise our use of disk IO?
Maybe the disk has the equivalent of multiple read or write heads available? (perhaps an SSD?)
In which case performing multiple copies in parallel will help.
Can we determine and exploit this somehow?
This is surely well-trodden territory, so rather than re-invent the wheel straight away (though that is always fun) it would be nice to hear what others have tried or would recommend.
Otherwise I will try various things and answer my own question sometime in the distant future.
This is what my evolving answer looks like so far...
If the source and destination are different physical disks then
we can at least read and write at the same time with something like:
writer thread:
    read from write queue
    write file
reader thread:
    foreach file
        read file
        queue write on writer thread
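A minimal pthreads sketch of that split, with the main thread acting as the reader and one writer thread draining the queue (compile with -pthread; the file list, the LIFO queue and the lack of error handling are purely illustrative):

/* Sketch: the main thread is the reader (read file, queue write); one
   writer thread drains the queue and writes the files out. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct job { char *dest; char *data; long len; struct job *next; };

static struct job *head = NULL;              /* simple LIFO write queue */
static int done = 0;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void *writer(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (!head && !done)
            pthread_cond_wait(&cv, &mu);
        struct job *j = head;
        if (j) head = j->next;
        pthread_mutex_unlock(&mu);
        if (!j) return NULL;                 /* queue drained and reader finished */

        FILE *out = fopen(j->dest, "wb");    /* write file */
        if (out) { fwrite(j->data, 1, j->len, out); fclose(out); }
        free(j->data); free(j->dest); free(j);
    }
}

int main(void)
{
    const char *srcs[]  = { "a.dat", "b.dat" };            /* illustrative file list */
    const char *dests[] = { "copy_a.dat", "copy_b.dat" };

    pthread_t tid;
    pthread_create(&tid, NULL, writer, NULL);

    for (int i = 0; i < 2; i++) {                          /* reader: foreach file */
        FILE *in = fopen(srcs[i], "rb");
        if (!in) continue;
        fseek(in, 0, SEEK_END);
        long len = ftell(in);
        rewind(in);
        struct job *j = malloc(sizeof *j);                 /* read file */
        j->dest = strdup(dests[i]);
        j->data = malloc(len);
        j->len = (long)fread(j->data, 1, len, in);
        fclose(in);

        pthread_mutex_lock(&mu);                           /* queue write on writer thread */
        j->next = head;
        head = j;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mu);
    }

    pthread_mutex_lock(&mu);
    done = 1;                                              /* no more jobs are coming */
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mu);
    pthread_join(tid, NULL);
    return 0;
}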
If the source and destination are on the same physical disk and we happen to be on a filesystem
with copy on write semantics (like xfs or btrfs) we can potentially avoid actually copying the file at all.
This is apparently called "reflinking".
The cp command supports this using --reflink=auto.
See also:
https://www.reddit.com/r/btrfs/comments/721rxp/eli5_how_does_copyonwrite_and_deduplication_work/
https://unix.stackexchange.com/questions/80351/why-is-cp-reflink-auto-not-the-default-behaviour
From this question
and https://github.com/coreutils/coreutils/blob/master/src/copy.c
it looks as if this is done using an ioctl as in:
ioctl(dest_fd, FICLONE, src_fd);
So a quick win is probably:
try FICLONE on the first file.
If it succeeds then:
    foreach file
        srcFD = open(src);
        destFD = open(dest);
        ioctl(destFD, FICLONE, srcFD);
else:
    do it the other way - perhaps in parallel
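
A hedged C sketch of the FICLONE step for a single source/destination pair (paths, modes and error handling are illustrative; FICLONE comes from linux/fs.h and only works when both files are on the same reflink-capable filesystem such as btrfs or XFS):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>        /* FICLONE */

/* Returns 0 on success, -1 if the clone failed (e.g. EOPNOTSUPP or EXDEV),
   in which case the caller should fall back to a byte copy. */
int clone_file(const char *src, const char *dest)
{
    int srcfd = open(src, O_RDONLY);
    int destfd = open(dest, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int ret = -1;
    if (srcfd >= 0 && destfd >= 0)
        ret = ioctl(destfd, FICLONE, srcfd);
    if (srcfd >= 0) close(srcfd);
    if (destfd >= 0) close(destfd);
    return ret;
}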
In terms of low-level system APIs we have:
copy_file_range
ioctl FICLONE
sendfile
I am not clear when to choose one over the other, except that copy_file_range() is not safe to use on some filesystems, notably procfs.
This answer gives some advice and suggests sendfile() is intended for sockets, but in fact this is only true for kernels before 2.6.33.
https://www.reddit.com/r/kernel/comments/4b5czd/what_is_the_difference_between_splice_sendfile/
copy_file_range() is useful for copying one file to another (within
the same filesystem) without actually copying anything until either
file is modified (copy-on-write or COW).
splice() only works if one of the file descriptors refers to a pipe. So
you can use it for e.g. socket-to-pipe or pipe-to-file without copying
the data into userspace. But you can't do file-to-file copies with it.
sendfile() only works if the source file descriptor refers to
something that can be mmap()ed (i.e. mostly normal files) and before
2.6.33 the destination must be a socket.
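
For reference, a hedged sketch of a whole-file copy with copy_file_range(), which needs a reasonably recent kernel (4.5+) and glibc (2.27+); the paths, mode handling and error handling are illustrative:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src to dest with copy_file_range(), looping because the call may
   copy fewer bytes than requested. */
int copy_range(const char *src, const char *dest)
{
    int in = open(src, O_RDONLY);
    if (in < 0) return -1;
    struct stat st;
    fstat(in, &st);
    int out = open(dest, O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 0777);
    if (out < 0) { close(in); return -1; }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (n <= 0) { close(in); close(out); return -1; }   /* caller can fall back to read()/write() */
        remaining -= n;
    }
    close(in);
    close(out);
    return 0;
}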
There is also a suggestion in a comment that reading multiple files then writing multiple files will result in better performance.
This could use some explanation.
My guess is that it tries to exploit the heuristic that the source files and destination files will be close together on the disk.
I think the parallel reader and writer thread version could perhaps do the same.
The problem with such a design is that it cannot exploit any performance gain from the low-level system copy APIs.
The general answer is: Measure before trying another strategy.
For HDD this is probably your answer: https://unix.stackexchange.com/questions/124527/speed-up-copying-1000000-small-files
Ultimately I did not determine the "most efficient" way but I did end up with a solution that was sufficiently fast for my needs.
generate a list of files to copy and store it
copy files in parallel using OpenMP
#pragma omp parallel for
for (auto iter = filesToCopy.begin(); iter < filesToCopy.end(); ++iter)
{
    copyFile(*iter);
}
copy each file using copy_file_range(), falling back to splice() with a pipe() when compiling for old platforms that do not support copy_file_range() (see the sketch below).
Reflinking, which copy_file_range() can take advantage of when the source and destination are on the same filesystem, avoids copying at all and is a massive win.
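
A hedged sketch of the splice()-through-a-pipe fallback mentioned above (Linux-specific; descriptor handling and error handling are illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Copy up to 'len' bytes from in_fd to out_fd by splicing through a pipe,
   avoiding a round trip through a userspace buffer. */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    ssize_t copied = 0;
    while (len > 0) {
        /* file -> pipe */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (n <= 0)
            break;
        /* pipe -> file; splice may move less than n in one go */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, left, SPLICE_F_MOVE);
            if (m <= 0) { close(p[0]); close(p[1]); return -1; }
            left -= m;
        }
        copied += n;
        len -= n;
    }
    close(p[0]);
    close(p[1]);
    return copied;
}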

AIO in C on Unix - aio_fsync usage

I can't understand what this function aio_fsync does. I've read man pages and even googled but can't find an understandable definition. Can you explain it in a simple way, preferably with an example?
aio_fsync is just the asynchronous version of fsync; when either has completed, all data has been written back to the physical drive media.
Note 1: aio_fsync() simply starts the request; the fsync()-like operation is not finished until the request is completed, similar to the other aio_* calls.
Note 2: only the aio_* operations already queued when aio_fsync() is called are included.
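A hedged example of that pattern: queue an aio_write(), then an aio_fsync() that covers it, and wait for the sync request to complete. The filename and buffer are illustrative, and on older glibc you may need to link with -lrt:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    static char buf[4096] = "some data";

    /* Queue an asynchronous write. */
    struct aiocb write_cb = {0};
    write_cb.aio_fildes = fd;
    write_cb.aio_buf    = buf;
    write_cb.aio_nbytes = sizeof buf;
    write_cb.aio_offset = 0;
    aio_write(&write_cb);

    /* Queue an fsync covering the aio operations already queued on fd. */
    struct aiocb sync_cb = {0};
    sync_cb.aio_fildes = fd;
    aio_fsync(O_SYNC, &sync_cb);

    /* Wait until the fsync request itself has completed. */
    const struct aiocb *list[1] = { &sync_cb };
    while (aio_error(&sync_cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    printf("aio_fsync result: %d\n", (int)aio_return(&sync_cb));
    close(fd);
    return 0;
}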
As your comment mentioned, if you don't use fsync or aio_fsync, the data will still appear in the file after your program ends. However, if the machine was abruptly powered off, it would very likely not be there.
This is because when you write to a file, the OS actually writes to the page cache, which is a copy of disk sectors kept in RAM, not to the disk itself. Of course, even before it is written back to the disk, you can still see the data in RAM. When you call fsync() or aio_fsync() it will ensure that write()s, aio_write()s, etc. to all parts of that file are written back to the physical disk, not just RAM.
If you never call fsync(), etc., the OS will eventually write the data back to the drive whenever it has spare time to do it, and an orderly OS shutdown should do it as well.
I would say you should usually not worry about manually calling these unless you need to ensure that your data, say a log record, is flushed to the physical disk and needs to be more likely to survive an abrupt system crash. Clearly database engines do this for transactions and journals.
However, there are other reasons the data may not survive a crash, and it is very complex to ensure absolute consistency in the face of failures. So if your application does not absolutely need it, then it is perfectly reasonable to let the OS manage this for you. For example, if the compiler's .o output ended up incomplete or corrupt because you power-cycled the machine in the middle of a compile or shortly after, it would not surprise anyone - you would just restart the build.

How to make sure data integrity after sync/fsync/syncfs to portable device

Based on the sync manual page, there is no guarantee the disk will flush its cache after calling sync:
"According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.) "
And the fsync manual makes no mention of this at all.
Are there ways to make sure all writes to the disk, especially on a portable device (USB), have finished after calling sync? I have encountered cases where data and metadata had not been fully written to the disk after calling sync/fsync.
I am curious how "Safely remove device" in windows/linux knows that all data has been fully written by the device itself.
For Unix-ish systems:
Unmount the USB-device's partitions using the umount command or the umount() system call.
Doing
blockdev --flushbufs <device>
might flush the device's buffers, but it does not keep anybody from accessing the device again and refilling them.
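From C, the equivalent of blockdev --flushbufs is (as far as I know) the BLKFLSBUF ioctl from linux/fs.h; a minimal sketch, with the device path being an illustrative assumption:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>        /* BLKFLSBUF */

int main(void)
{
    int fd = open("/dev/sdb", O_RDONLY);     /* illustrative device; needs root */
    if (fd < 0 || ioctl(fd, BLKFLSBUF, 0) < 0)
        perror("BLKFLSBUF");
    if (fd >= 0)
        close(fd);
    return 0;
}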
Also there is this kernel interface in the /proc file system:
/proc/sys/vm/drop_caches
which can be used to flush different buffers:
Verbatim from https://www.kernel.org/doc/Documentation/sysctl/vm.txt
[...]
To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
[...]
At least in principle, this is a Linux bug. The specification for sync functions is that the data is fully written to permanent storage; leaving it in a hardware cache is not conforming.
I'm not sure what the correct workaround is, but you can probably strace the hdparm utility running with the -F option (I think that's the right one) to see what it's doing (or read the source, but strace is a lot easier).

What posix_fadvise() args for sequential file write?

I am working on an application which sequentially writes a large file (and does not read at all), and I would like to use posix_fadvise() to optimize the filesystem behavior.
The function description in the manpage suggests that the most appropriate strategy would be POSIX_FADV_SEQUENTIAL. However, the description of the Linux implementation casts doubt on that:
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely.
As I'm only writing data (overwriting files possibly too), I don't expect any readahead. Should I then stick with my POSIX_FADV_SEQUENTIAL or rather use POSIX_FADV_RANDOM to disable it?
How about other options, such as POSIX_FADV_NOREUSE? Or maybe do not use posix_fadvise() for writing at all?
Most of the posix_fadvise() flags (e.g. POSIX_FADV_SEQUENTIAL and POSIX_FADV_RANDOM) are hints about readahead rather than writing.
There's some advice from Linus here and here about getting good sequential write performance. The idea is to break the file into large-ish (8MB) windows, then loop around doing:
Write out window N with write();
Request asynchronous write-out of window N with sync_file_range(..., SYNC_FILE_RANGE_WRITE)
Wait for the write-out of window N-1 to complete with sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER)
Drop window N-1 from the pagecache with posix_fadvise(..., POSIX_FADV_DONTNEED)
This way you never have more than two windows worth of data in the page cache, but you still get the kernel writing out part of the pagecache to disk while you fill the next part.
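A hedged sketch of that loop in C (the window size and the caller-supplied buffer are illustrative, and errors are not checked):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define WINDOW (8 * 1024 * 1024)    /* 8 MB windows, as suggested */

/* Write 'len' bytes from 'buf' to fd in windows, keeping at most two
   windows' worth of dirty data in the page cache at a time. */
void windowed_write(int fd, const char *buf, size_t len)
{
    off_t offset = 0;
    while (len > 0) {
        size_t chunk = len < WINDOW ? len : WINDOW;
        write(fd, buf + offset, chunk);                 /* write window N */

        /* start asynchronous write-out of window N */
        sync_file_range(fd, offset, chunk, SYNC_FILE_RANGE_WRITE);

        if (offset >= WINDOW) {
            /* wait for window N-1 to hit the disk, then drop it from the cache */
            sync_file_range(fd, offset - WINDOW, WINDOW,
                            SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);
            posix_fadvise(fd, offset - WINDOW, WINDOW, POSIX_FADV_DONTNEED);
        }
        offset += chunk;
        len -= chunk;
    }
}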
It all depends on the temporal locality of your data. If your application won't need the data soon after it was written, then you can go with POSIX_FADV_NOREUSE to avoid writing to the buffer cache (in a similar way as the O_DIRECT flag from open()).
As far as writes go, I think you can just rely on the OS's disk I/O scheduler to do the right thing.
You should keep in mind that while posix_fadvise() is there specifically to give the kernel hints about future file usage patterns, the kernel also has other data to help it out.
If you don't open the file for reading then it would only need to read blocks in when they were partially written. If you were to truncate the file to 0 then it doesn't even have to do that (you said that you were overwriting).

Reading a sector on the boot disk

This is a continuation of my question about reading the superblock.
Let's say I want to target the HFS+ file system in Mac OS X. How could I read sector 2 of the boot disk? As far as I know Unix only provides system calls to read from files, which are never stored at that location.
Does this require either 1) the program to run kernel mode, or 2) the program to be written in Assembly? I would prefer to avoid either of these restrictions, particularly the latter.
I've done this myself on the Mac, see my disk editor tool: http://apps.tempel.org/iBored
You'd open the drive using /dev/diskN or /dev/rdiskN (N is a disk index number starting from 0). Then you can use lseek() (make sure to use the 64-bit range version!) and read/write calls on the opened file.
Also, use the shell command "ls /dev/disk*" to see which drives exist currently. And note that the drives also exist with an "sM" suffix, where M is the partition number. That way, you can also read partitions directly.
Or, you could just use the shell tool "xxd" or "dd" to read data and then use their output. Might be easier.
You'll not be able to read your root disk and other internal disks unless you run as root, though. You may be able to access other drives as long as they were mounted by the user or have their permissions disabled. But you may also need to unmount the drive's volumes first; look for the unmount command of the "diskutil" shell tool.
Hope this helps.
Update 2017: On OS X 10.11 and later, System Integrity Protection (SIP) may also prevent you from directly accessing the disk sectors.
In Linux, you can read from the special device file /dev/sda, assuming the hard drive you want to read is the first one. You need to be root to read this file. To read sector 2, you just seek to offset 2*SECTOR_SIZE and read in SECTOR_SIZE bytes.
I don't know if this device file is available on OS X. Check for interestingly named files under /dev such as /dev/sda or /dev/hda.
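A hedged illustration of the Linux case described above (the device path and the 512-byte sector size are assumptions; many modern disks use 4096-byte sectors, and this must run as root):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SECTOR_SIZE 512        /* assumed; check the drive's real logical sector size */

int main(void)
{
    unsigned char sector[SECTOR_SIZE];
    int fd = open("/dev/sda", O_RDONLY);            /* first disk; needs root */
    if (fd < 0) { perror("open"); return 1; }

    /* read sector 2: seek to 2*SECTOR_SIZE and read SECTOR_SIZE bytes */
    if (pread(fd, sector, SECTOR_SIZE, 2 * SECTOR_SIZE) != SECTOR_SIZE) {
        perror("pread");
        return 1;
    }
    for (int i = 0; i < 16; i++)                    /* dump the first 16 bytes */
        printf("%02x ", sector[i]);
    printf("\n");
    close(fd);
    return 0;
}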
I was also going to suggest hitting the /dev/ device file for the volume, but you might want to contact Amit Singh who has written an hfsdebug utility and has probably done just what you want to do.
How does this work in terms of permissions? Wouldn't reading from /dev/... be insecure since if you read far enough you would be able to read files for which you do not have read access?
