fsync vs write system call - filesystems

I would like to ask a fundamental question about when is it useful to use a system call like fsync. I am beginner and i was always under the impression that write is enough to write to a file, and samples that use write actually write to the file at the end.
So what is the purpose of a system call like fsync?
Just to provide some background i am using Berkeley DB library version 5.1.19 and there is a lot of talk around the cost of fsync() vs just writing. That is the reason i am wondering.

Think of it as a layer of buffering.
If you're familiar with the standard C calls like fopen and fprintf, you should already be aware of buffering happening within the C runtime library itself.
The way to flush those buffers is with fflush which ensures that the information is handed from the C runtime library to the OS (or surrounding environment).
However, just because the OS has it, doesn't mean it's on the disk. It could get buffered within the OS as well.
That's what fsync takes care of, ensuring that the stuff in the OS buffers is written physically to the disk.
You may typically see this sort of operation in logging libraries:
fprintf (myFileHandle, "something\n"); // output it
fflush (myFileHandle); // flush to OS
fsync (fileno (myFileHandle)); // flush to disk
fileno is a function which gives you the underlying int file descriptor for a given FILE* file handle, and fsync on the descriptor does the final level of flushing.
Now that is a relatively expensive operation since the disk write is usually considerably slower than in-memory transfers.
As well as logging libraries, one other use case may be useful for this behaviour. Let me see if I can remember what it was. Yes, that's it. Databases! Just like Berzerkely DB. Where you want to ensure the data is on the disk, a rather useful feature for meeting ACID requirements :-)

Related

Should fsync be used after each fclose?

On a Linux (Ubuntu Platform) device I use a file to save mission critical data.
From time to time (once in about 10,000 cases), the file gets corrupted for unspecified reasons.
In particular, the file is truncated (instead of some kbyte it has only about 100 bytes).
Now, in the sequence of the software
the file is opened,
modified and
closed.
Immediately after that, the file might be opened again (4), and something else is being done.
Up to now I didn't notice, that fflush (which is called upon fclose) doesn't write to the file system, but only to an intermediate buffer. Could it be, that the time between 3) and 4) is too short and the change from 2) is not yet written to disc, so when I reopen with 4) I get a truncated file which, when it is closed again leads to permanent loss of those data?
Should I use fsync() in that case after each file write?
What do I have to consider for power outages? It is not unlikely that the data corruption is related to power down.
fwrite is writing to an internal buffer first, then sometimes (at fflush or fclose or when the buffer is full) calling the OS function write.
The OS is also doing some buffering and writes to the device might get delayed.
fsync is assuring that the OS is writing its buffers to the device.
In your case where you open-write-close you don't need to fsync. The OS knows which parts of the file are not yet written to the device. So if a second process wants to read the file the OS knows that it has the file content in memory and will not read the file content from the device.
Of course when thinking about power outage it might (depending on the circumstances) be a good idea to fsync to be sure that the file content is written to the device (which as Andrew points out, does not necessarily mean that the content is written to disc, because the device itself might do buffering).
Up to now I didn't notice, that fflush (which is called upon fclose) doesn't write to the file system, but only in an intermediate buffer. Could it be, that the time between 3) and 4) is too short and the change from 2) is not yet written to disc, so when I reopen with 4) I get a truncated file which, when it is closed again leads to permanent loss of those data?
No. A system that behaved that way would be unusable.
Should I use fsync() in that case after each file write?
No, that will just slow things down.
What do I have to consider for power outtages? It is not unlikeley, that the data corruption is related to power down.
Use a filesystem that's resistant to such corruption. Possibly even consider using a safer modification algorithm such as writing out a new version of the file with a different name, syncing, and then renaming it on top of the existing file.
If what you're doing is something like this:
FILE *f = fopen("filename", "w");
while(...) {
fwrite(data, n, m, f);
}
fclose(f);
Then what can happen is that another process can open the file while it's being written (between the open and write system calls that the C library runs behind the scenes, or between separate write calls). Then they would see only a partially written file.
The workaround to that is to write the file with another name, and rename() it over the actual filename. The downside is that you need double the amount of space.
If you are sure the opening of the file happens only after the write, then that cannot happen. But then there has to be some syncronization between the writer and reader so that the latter does not start reading too early.
fsync() tells the system to write the changes to the actual storage, which is a bit of an oddball within the other POSIX system calls, since I think nothing is specified of a system if it crashes, and that's the only situation where it matters if some data is stored on the actual storage, and not in some cache. Even with fsync() it's still possible for the storage hardware to cache the data, or for an unrelated corruption to trash the file system when the system crashes.
If you're happy to let the OS do its job, and don't need to think about crashes, you can ignore fsync() completely and just let the data be written when the OS sees fit. If you do care about crashes, you have to look more closely into what guarantees the filesystem makes (or doesn't). E.g. at least at some point, the ext* developers pretty much demanded applications to do an fsync() on the containing directory, too.

Understanding low level file routines

I am going through Mark Burgess's "The GNU C Programming Tutorial". I have come across the following information:
Even though low-level fle routines do not use buffering, and once you call write, your data can be read from the file immediately, it may take up to a minute before your data is physically written to disk. (Page:142)
Firstly, is "it may take up to a minute(some time) before your data is written to disk" true?
Secondly, when low level file routines are not using buffering why will the delay take place?
There are two places where I/O buffering can occur (at least — it could be more than just two).
One is in the application; the standard I/O functions using FILE * use buffered I/O unless you use setvbuf() to prevent it.
The other is in the kernel. Disk I/O normally goes into the kernel buffer pool, and eventually gets written by the kernel to disk. There are ways around that (O_DIRECT on Linux; raw devices on classic Unix; etc). The key point is that the write() system call normally writes to he kernel buffer pool. The kernel takes responsibility for ensuring that the data is written to disk safely and correctly (journalling, …).
The kernel doesn't write everything to disk immediately because (a) you may add more changes to the data, (b) other people may need to read or write the data, (c) the disk drive may be busy writing something else at the other end of its 1 TiB of storage and it will take time to get the write head in position to take your data, and it would be better for the overall performance of the system if it scheduled other work before writing your changed buffer to disk. It will get written to disk. It is just not defined when, and it could be fractions of a second or multiple seconds or longer, though most often it will not take minutes for the data to be written to disk.
These days, there could also be buffering in the RAID controllers, and maybe in the individual disks inside the RAID setup, and maybe there's network buffering too if it is a remotely-mounted file system. Those add extra levels of buffering.
The read() and write() and related low-level I/O functions do not have any client-side (application) buffering — unlike the standard C I/O functions.
A file is said to be buffered, when its contents are not outputted or inputted directly. Instead, the file's bytes are written to a temporary buffer in memory.
For example, if you are reading from a file, you are reading from the buffer. Once you have read all the characters in the buffer, it is replenished with new bytes from the file. The reason for this indirectness, is that a memory read is much faster than a hard disk read.
The calls read and write are low-level, and do not perform buffering. The stdio.h calls like getc and putc, do use buffering. These higher-level APIs only call the low level ones, when the buffer must be replenished.
Writing to the hard drive is much slower than writing to RAM. When you write to a drive it writes to memory, but doesn't always write to the disk immediately. The data might not be written to disk until that part of memory needs to be overwritten to make room for something else. This is called a Write-Back cache.

POSIX guaranteeing write to disk

As I understand it, if I want to synchronise data to the storage device I can use fsync() to supposedly flush all the OS output caches... but apparently it doesn't guarantee this at all, unlike the documentation tries to deceive you, and the data may not be written to the disk!
This is not very good for many purposes because it can lead to data corruption. How do I use the POSIX libraries (In a portable way if possible) to guarantee that the data has been written (as far as possible) and prevent data corruption?
There is fdatasync() but it is not implemented on OSX, so is there a better and more portable way, or does one have to implement different code on different systems? I'm also not sure if fdatasync() is good enough.
Of-course, in the worst case scenario I could forget about this and use a redundant database library that uses ACID to store the data. I don't want that.
Also I'm interested in how to ensure truncate and rename operations have definitely completed.
Thanks!
You are looking for sync. There is both a program called sync and a system call called sync (man 1 sync and man 2 sync respectively):
#include <unistd.h>
void sync(void);
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
So it will ensure that all pending operations (truncates, renames etc) are in fact written to the disk.
fsync does not claim to flush all output caches, but instead claims to flush all changes to a particular file descriptor to disk. It explicitly does not ensure that the directory entry is updated (in which case a call to fsync on a filedescriptor for the directory is needed).
fsyncdata is even more useless as it will not flush file metadata and instead will just ensure that the data in the file is flushed.
It is a good idea to trust the manpages. I won't say there are not mistakes, but they tend to be extremely accurate.

When does actual write() takes place in C?

What really happens when write() system call is executed?
Lets say I have a program which writes certain data into a file using write() function call. Now C library has its own internal buffer and OS too has its own buffer.
What interaction takes place between these buffers ?
Is it like when C library buffer gets filled completely, it writes to OS buffer and when OS buffer gets filled completely, then the actual write is done on the file?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.
The write() system call (in fact all system calls) are nothing more that a contract between the application program and the OS.
for "normal" files, the write() only puts the data on a buffer, and marks that buffer as "dirty"
at some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced by fsync()
this is done by the .write() "method" in the mounted-filesystem-table
and this will invoke the hardware's .write() method. (which could involve another level of buffering, such as DMA)
modern hard disks have there own buffers, which may or may not have actually been written to the physical disk, even if the OS->controller told them to.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() does depend on the nature of the file; for network sockets the write(fd,buff,size) can return before size bytes have been sent(write will return the number of characters sent). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (eg waiting for Nagle ...), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.
As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed it's possible for a power failure or something to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media; then you need to redesign your code to avoid this requirement, or accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).
write() writes data to operating system, making it visible for all processes (if it is something which can be read by other processes). How operating system buffers it, or when it gets written permanently to disk, that is very library, OS, system configuration and file system specific. However, sync() can be used to force buffers to be flushed.
What is quaranteed, is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.
OS dependant, see man 2 sync and (on Linux) the discussion in man 8 sync.
Years ago operating systems were supposed to implement an 'elevator algorithm' to schedule writes to disk. The idea would be to minimize the disk writing head movement, which would allow a good throughput for several processes accessing the disk at the same time.
Since you're asking for UNIX, you must keep in mind that a file might actually be on an FTP server, which you have mounted, as an example. For example files /dev and /proc are not files on the HDD, as well.
Also, on Linux data is not written to the hard drive directly, instead there is a polling process, that flushes all pending writes every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.

flush without sync

From what I've read, flush pushes data into the OS buffers and sync makes sure that data goes down to the storage media. So, if you want to be sure that data is actually written to disk, you need to do a flush followed by a sync. So, are there any cases where you want to call flush but not sync?
You only want to fflush if you're using stdio's FILE *. This writes a user space buffer to the kernel.
The other answers seem to be missing fdatasync. This is the system call you want to flush a specific file descriptor to disk.
When you fflush, you flush the buffer of one file to disk (unless you give NULL, in which case it flushes all open files). http://www.manpagez.com/man/3/fflush/
When you sync, you flush all the buffers to disk. http://www.manpagez.com/man/2/sync/
The most important thing that you should notice is that fflush is a standard function, while sync is a system call provided by the operating system (Linux for example).
So basically, if you are writing portable program, you in fact never use sync.
Yes, lots. Most programs most of the time would not bother to call any of the various sync operations; flushing the data into the kernel buffer pool as you close the file is sufficient. This is doubly true if you're using a journalled file system.
Note that flushing is a higher level operation than the read() or similar system calls. It is used by the C <stdio.h> library, or the C++ <iostream> library. The system calls inherently flush the data to the kernel buffer pool (or direct to disk if you're using direct I/O or something similar).
Note, too, that on POSIX-like systems, you can arrange for data sync etc by setting flags on the open() system call (O_SYNC, O_DSYNC, O_RSYNC), or subsequently via fcntl().
Just to clarify, fflush() applies only when using the FILE interface of UNIX that buffers writes at the application level. In case the normal write() call is used, fflush() makes little sense.
Having said that, I can think of two situations where you would like to call fflush() but not sync:
You want to make sure that the data will eventually make it to disk even though the application crashes.
Force to screen the data that the application has written to standard output so far.
The second case is the most common use I have seen and it is usually required if the printf() call does not end with a new line character ('\n').

Resources