POSIX guaranteeing write to disk - c

As I understand it, if I want to synchronise data to the storage device I can use fsync() to flush all the OS output caches... but apparently it doesn't actually guarantee this, despite what the documentation suggests, and the data may never be written to the disk!
This is not very good for many purposes because it can lead to data corruption. How do I use the POSIX libraries (In a portable way if possible) to guarantee that the data has been written (as far as possible) and prevent data corruption?
There is fdatasync() but it is not implemented on OSX, so is there a better and more portable way, or does one have to implement different code on different systems? I'm also not sure if fdatasync() is good enough.
Of course, in the worst-case scenario I could forget about all this and use a database library that provides ACID guarantees to store the data. I don't want that.
Also I'm interested in how to ensure truncate and rename operations have definitely completed.
Thanks!

You are looking for sync. There is both a program called sync and a system call called sync (man 1 sync and man 2 sync respectively):
#include <unistd.h>
void sync(void);
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
So it will ensure that all pending operations (truncates, renames etc) are in fact written to the disk.
fsync does not claim to flush all output caches, but instead claims to flush all changes to a particular file descriptor to disk. It explicitly does not ensure that the directory entry is updated (in which case a call to fsync on a filedescriptor for the directory is needed).
fdatasync is even more useless for this purpose, as it will not flush file metadata and will instead just ensure that the data in the file is flushed.
It is a good idea to trust the manpages. I won't say there are not mistakes, but they tend to be extremely accurate.
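To make the combination described above concrete, here is a minimal sketch (the function name and paths are illustrative, not from the question) of the pattern: fsync the file itself, perform the rename, then fsync the containing directory so that the updated directory entry is durable too:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Persist a rename: flush the file's contents first, then flush the
 * directory so the new directory entry itself reaches the disk.
 * Returns 0 on success, -1 on error. */
int durable_rename(const char *oldpath, const char *newpath,
                   const char *dirpath)
{
    int fd = open(oldpath, O_RDONLY);
    if (fd < 0) return -1;
    if (fsync(fd) < 0) { close(fd); return -1; }   /* file data -> disk */
    close(fd);

    if (rename(oldpath, newpath) < 0) return -1;

    int dirfd = open(dirpath, O_RDONLY);
    if (dirfd < 0) return -1;
    if (fsync(dirfd) < 0) { close(dirfd); return -1; } /* dir entry -> disk */
    close(dirfd);
    return 0;
}
```

Note that fsync on a directory file descriptor is exactly the "fsync on a filedescriptor for the directory" mentioned above; it works on Linux, though behaviour on other systems may vary.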

Related

Is write/fwrite guaranteed to be sequential?

Is the data written via write (or fwrite) guaranteed to be persisted to the disk in a sequential manner? In particular, in relation to fault tolerance: if the system should fail during the write, will it behave as though the first bytes were written first and writing stopped mid-stream (as opposed to random blocks being written)?
Also, are sequential calls to write/fwrite guaranteed to be sequential? According to POSIX I find only that a call to read is guaranteed to consider a previous write.
I'm asking as I'm creating a fault tolerant data store that persists to disks. My logical order of writing is such that faults won't ruin the data, but if the logical order isn't being obeyed I have a problem.
Note: I'm not asking if persistence is guaranteed. Only that if my calls to write do eventually persist they obey the order in which I actually write.
The POSIX docs for write() state that "If the O_DSYNC bit has been set, write I/O operations on the file descriptor shall complete as defined by synchronized I/O data integrity completion". Presumably, if the O_DSYNC bit isn't set, then the synchronization of I/O data integrity completion is unspecified. POSIX also says that "This volume of POSIX.1-2008 is also silent about any effects of application-level caching (such as that done by stdio)", so I think there is no guarantee for fwrite().
I am not an expert, but I might know enough to point you in the right direction:
The most disastrous case is if you lose power, so that is the only one worth considering.
Start with a file with X bytes of meaningful content, and a header that indicates it.
Write Y bytes of meaningful content somewhere that won't invalidate X.
Call fsync (slow!).
Update the header (probably has to be less than your disk's block size).
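The steps above can be sketched roughly as follows. The file layout (an 8-byte length header followed by data) and the function name are hypothetical, chosen only to show the ordering of writes and fsyncs:

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Append-then-publish: write new content past the valid region,
 * make it durable, then update a small header that "publishes"
 * the new valid length. */
int append_record(int fd, const void *buf, uint32_t len, uint64_t old_end)
{
    /* 1. Write Y bytes of new content beyond the valid X bytes,
     *    leaving the header (and the old data) untouched. */
    if (pwrite(fd, buf, len, (off_t)(sizeof(uint64_t) + old_end)) != (ssize_t)len)
        return -1;
    /* 2. Make the new data durable before publishing it. */
    if (fsync(fd) < 0) return -1;
    /* 3. Update the header: a single 8-byte length, well under a
     *    disk block, so the update is effectively atomic. */
    uint64_t new_end = old_end + len;
    if (pwrite(fd, &new_end, sizeof new_end, 0) != (ssize_t)sizeof new_end)
        return -1;
    /* 4. Persist the header too. */
    return fsync(fd);
}
```

If power is lost before step 4, the header still describes the old X bytes, which were never overwritten, so the file stays consistent.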
I don't know if changing the length of a file is safe. I don't know how much depends on the filesystem mount mode, except that any "safe" mode is probably completely unusable for systems that need even a modest level of performance.
Keep in mind that on some systems the fsync call lies and just returns without doing anything safely. You can tell because it returns quickly. For this reason, you need to make pretty large transactions (i.e. much larger than application-level transactions).
Keep in mind that the kind of people who solve this problem in the real world get paid high 6 figures at least. The best answer for the rest of us is either "just send the data to postgres and let it deal with it." or "accept that we might have to lose data and revert to an hourly backup."
No, in general as far as POSIX and reality are concerned, filesystems do not provide such guarantee. Order of persistence (in which disk makes them permanent on the platter) is not dictated by order in which syscalls were made, or position within the file, or order of sectors on disk. Filesystems retain the data to be written in memory for several seconds, hoarding as much as possible, and later send them to disk in batches, in whatever order they seem fit. And regardless of how kernel sends it to disk, the disk itself can reorder writes on its own due to NCQ.
Filesystems have means of ensuring some ordering for safety. Barriers were used in the past, and explicit flushes and FUA requests nowadays. There is a good article on LWN about that. But those are used by filesystems, not applications.
I strongly suggest reading the article about Application Level Consistency. Not sure how relevant to you but it shows many behaviors that developers wrongly assumed in the past.
The answer from o11c describes a good way to do it.
Yes, so long as we are not talking about adding in the complexity of multithreading. It will be in the same order on disk, for what makes it to disk. It buffers to memory and dumps that memory to disk when it fills up, or you close the file.

How to make sure data integrity after sync/fsync/syncfs to portable device

Based on sync manual page, there is no guarantee the disc will flush its cache after calling sync:
"According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.) "
And in the fsync manual there is no mention of this.
Are there ways to make sure that all writes to the disc, especially to a portable device (USB), have finished after calling sync? I have encountered cases where data and metadata had not been fully written to the disc after calling sync/fsync.
I am curious how "Safely remove device" in windows/linux knows that all data has been fully written by the device itself.
For IXish systems:
Unmount the USB-device's partitions using the umount command or the umount() system call.
Doing
blockdev --flushbufs
might flush the device's buffer, but does not keep anybody from accessing the device again and refilling the buffers.
Also there is this kernel interface in the /proc file system:
/proc/sys/vm/drop_caches
which can be used to flush different buffers:
Verbatim from https://www.kernel.org/doc/Documentation/sysctl/vm.txt
[...]
To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
[...]
At least in principle, this is a Linux bug. The specification for sync functions is that the data is fully written to permanent storage; leaving it in a hardware cache is not conforming.
I'm not sure what the correct workaround is, but you can probably strace the hdparm utility running with the -F option (I think that's the right one) to see what it's doing (or read the source, but strace is a lot easier).

any reason to call fsync before a call to fstat

I have a piece of legacy code that issues a call to fsync before a call to fstat to determine the size of the target file. (Specifically, the code only accesses st_size from the stat struct.)
Having looked at the docs, I don't believe this is a necessary call, but I wanted the expert's opinions.
On a correctly implemented filesystem, issuing a call to fsync or fdatasync should not affect the result of any subsequent stat/fstat/lstat call. Its only effect should be that any unflushed writes and, in the case of fsync, any modified metadata are committed to permanent storage. stat and its variants will work just fine with cached writes, regardless of whether the actual data has made to permanent storage or not.
That said, whether fsync is needed in the piece of code that you are studying is a matter of semantics and depends on how the result of fstat is used. For example:
If it is used due to the misconception that fsync needs to be called to be able to get current metadata with stat, then you can probably remove it.
If it is used to e.g. write some sort of checkpointing data, then it is not exactly irrelevant, although the call order might need to be reversed - for a growing file the checkpointing data would need to indicate portions of the file that have certainly made it to permanent storage, so it would make sense to call fstat, then call fsync, and then write the checkpoint information.
If it is used as some sort of UI progress monitor for an I/O bound operation, it may make sense to display the amount of data that has been actually committed to disk. In that case, though, the precision of the monitor is non-critical, so the call order may not matter that much.
So, how is the result of fstat used in your case?
Disclaimer: there may be filesystem implementations out there, e.g. networked/distributed ones where calling fsync may update the local client metadata cache for a file. In that case that fsync call may indeed improve the reliability of the code. If that is the case, however, then you probably have worse problems than just a little performance issue...

When does the actual write() take place in C?

What really happens when write() system call is executed?
Let's say I have a program which writes certain data into a file using the write() function call. Now the C library has its own internal buffer, and the OS has its own buffer too.
What interaction takes place between these buffers ?
Is it that when the C library buffer fills up completely, it writes to the OS buffer, and when the OS buffer fills up completely, the actual write to the file is done?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.
The write() system call (in fact, all system calls) is nothing more than a contract between the application program and the OS.
for "normal" files, the write() only puts the data on a buffer, and marks that buffer as "dirty"
at some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced by fsync()
this is done by the .write() "method" in the mounted-filesystem-table
and this will invoke the hardware's .write() method. (which could involve another level of buffering, such as DMA)
modern hard disks have their own buffers, which may or may not have actually been written to the physical disk, even if the OS->controller told them to.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() does depend on the nature of the file; for network sockets, write(fd,buff,size) can return before size bytes have been sent (write will return the number of characters sent). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (e.g. waiting for Nagle), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.
As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed it's possible for a power failure or something to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media; then you need to redesign your code to avoid this requirement, or accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).
write() writes data to operating system, making it visible for all processes (if it is something which can be read by other processes). How operating system buffers it, or when it gets written permanently to disk, that is very library, OS, system configuration and file system specific. However, sync() can be used to force buffers to be flushed.
What is guaranteed is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.
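A small sketch of that read-after-write guarantee (the path and function name are illustrative): no sync call is needed for a subsequent read to see the data, because the kernel serves it from its own buffers:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* A read() that provably follows a completed write() must see the
 * written data, even with no fsync/sync in between. Returns 0 if
 * the data read back matches what was written. */
int write_then_read(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, "visible", 7) != 7) { close(fd); return -1; }

    char buf[8] = {0};
    if (pread(fd, buf, 7, 0) != 7) { close(fd); return -1; }
    close(fd);
    return strcmp(buf, "visible") == 0 ? 0 : -1;
}
```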
OS dependant, see man 2 sync and (on Linux) the discussion in man 8 sync.
Years ago operating systems were supposed to implement an 'elevator algorithm' to schedule writes to disk. The idea would be to minimize the disk writing head movement, which would allow a good throughput for several processes accessing the disk at the same time.
Since you're asking about UNIX, keep in mind that a file might actually live on a mounted FTP server, for example. Likewise, entries under /dev and /proc are not files on the HDD either.
Also, on Linux, data is not written to the hard drive directly; instead there is a background process that flushes all pending writes every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.

fsync vs write system call

I would like to ask a fundamental question about when it is useful to use a system call like fsync. I am a beginner and I was always under the impression that write is enough to write to a file, and samples that use write actually do write to the file in the end.
So what is the purpose of a system call like fsync?
Just to provide some background, I am using Berkeley DB library version 5.1.19 and there is a lot of talk around the cost of fsync() vs just writing. That is the reason I am wondering.
Think of it as a layer of buffering.
If you're familiar with the standard C calls like fopen and fprintf, you should already be aware of buffering happening within the C runtime library itself.
The way to flush those buffers is with fflush which ensures that the information is handed from the C runtime library to the OS (or surrounding environment).
However, just because the OS has it, doesn't mean it's on the disk. It could get buffered within the OS as well.
That's what fsync takes care of, ensuring that the stuff in the OS buffers is written physically to the disk.
You may typically see this sort of operation in logging libraries:
fprintf (myFileHandle, "something\n"); // output it
fflush (myFileHandle); // flush to OS
fsync (fileno (myFileHandle)); // flush to disk
fileno is a function which gives you the underlying int file descriptor for a given FILE* file handle, and fsync on the descriptor does the final level of flushing.
Now that is a relatively expensive operation since the disk write is usually considerably slower than in-memory transfers.
As well as logging libraries, one other use case may be useful for this behaviour. Let me see if I can remember what it was. Yes, that's it. Databases! Just like Berzerkely DB. Where you want to ensure the data is on the disk, a rather useful feature for meeting ACID requirements :-)
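The flush chain from the logging example above can be wrapped in a small helper; this is just a sketch of the two levels involved:

```c
#include <stdio.h>
#include <unistd.h>

/* Flush both buffering layers: the stdio buffer into the OS,
 * then the OS buffers onto the disk. Returns 0 on success. */
int flush_all(FILE *fp)
{
    if (fflush(fp) != 0)       /* C runtime buffer -> OS */
        return -1;
    return fsync(fileno(fp));  /* OS buffers -> disk */
}
```

Since fsync is the expensive step, a logging library would typically call flush_all only at important boundaries (e.g. after each complete record) rather than after every fprintf.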
