When we call close(<fd>), does it automatically do fsync() to sync to the physical media?
It does not. Calling close() DOES NOT guarantee that contents are on the disk as the OS may have deferred the writes.
As a side note, always check the return value of close(). It will let you know of any deferred errors up to that point. And if you want to ensure that the contents are on the disk always call fsync() and check its return value as well.
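The advice above can be sketched roughly as follows. This is only a minimal illustration, assuming a POSIX system; save_file() is a hypothetical helper name, and the file mode and open flags are arbitrary choices made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* save_file() is a hypothetical helper, not a standard function.
 * Returns 0 only if write(), fsync() and close() all succeed, -1 otherwise. */
int save_file(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* For brevity, a short write is treated as a failure here. */
    ssize_t n = write(fd, buf, len);
    if (n < 0 || (size_t)n != len || fsync(fd) < 0) {
        int saved = errno;      /* keep the first error; close() may overwrite it */
        close(fd);
        errno = saved;
        return -1;
    }

    /* close() can still report a deferred error, so its result is returned. */
    return close(fd);
}
```

A caller would treat any non-zero return as "the data may not be on disk" and react accordingly, for example by rewriting the file.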
One thing to keep in mind is what the backing store is. There are devices that may do internal write deferring and content can be lost in some cases (although newer storage media devices typically have super capacitors to prevent this, or ways to disable this feature).
No.
From man 2 close
A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel defers writes. It
is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is
physically stored use fsync(2). (It will depend on the disk hardware at this point.)
From man 2 close:
A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel defers writes. It is not
common for a file system to flush the buffers when the stream is
closed. If you need to be sure that the data is physically stored use
fsync(2). (It will depend on the disk hardware at this point.)
To answer your question, NO, close() does not guarantee fsync()
close only closes the file descriptor for the process and removes any record locks associated with the process.
When I issue write(), my data goes to some kernel space buffers. The actual commit to physical layer ("phy-commit") is (likely) deferred, until.. (exactly until what events?)
When I issue a close() for a file descriptor, then
If [...], the resources associated with the open file description are freed
Does it mean releasing (freeing) those kernel buffers which contained my data? What will happen to my precious data, contained in those buffers? Will be lost?
How to prevent that loss?
Via fsync()? It requests an explicit phy-commit. I suppose that this happens either immediately (synchronous call) or is deferred only "for a short time" and queued to precede subsequent operations, at least destructive ones.
But I do not quite want immediate or urgent phy-commit. Only (retaining my data and) not to forget doing phy-commit later in some time.
From man fclose:
The fclose() function [...] closes the underlying file descriptor.
...
fclose() flushes only the user-space buffers provided by the C library. To ensure that the data is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).
It may suggest that fsync does not have to precede close (or fclose, which contains a close), but can (even have to) come after it. So the close() cannot be very destructive...
Does it mean releasing (freeing) those kernel buffers which contained my data? What will happen to my precious data, contained in those buffers? Will be lost?
No. The kernel buffers will not be freed before the kernel writes the data to the underlying file. So there won't be any data loss (unless something goes really wrong, such as a power outage).
Whether that data will be written to the physical file immediately is another question. It may depend on the filesystem (which may be buffering) and/or on any hardware caching as well.
As far as your user program is concerned, a successful close() call can be considered as successful write to the file.
It may suggest that fsync does not have to precede close (or fclose, which contains a close), but can (even have to) come after it. So the close() cannot be very destructive...
After a call to close(), the state of the file descriptor is left unspecified by POSIX (regardless of whether close() succeeded). So you are not allowed to call fsync(fd); after calling close().
See: POSIX/UNIX: How to reliably close a file descriptor.
And no, it doesn't suggest that close() can be destructive. It says that the C library does its own buffering in user space, which fclose() flushes to the kernel, and that you need fsync() to get the data from the kernel buffers to disk (and now we are in the same position as described above).
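For stdio streams, the two layers mentioned in the man fclose excerpt can be flushed in order before the stream is closed. The following is a minimal sketch, assuming a POSIX system; fclose_durably() is a hypothetical name chosen for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: flush a stdio stream through both buffering layers,
 * then close it. Returns 0 on success, -1 if any step failed.
 * The stream is closed in every case. */
int fclose_durably(FILE *fp)
{
    int rc = 0;

    if (fflush(fp) != 0)                 /* C library buffer -> kernel buffers */
        rc = -1;
    else if (fsync(fileno(fp)) != 0)     /* kernel buffers -> storage device */
        rc = -1;

    if (fclose(fp) != 0)                 /* also closes the underlying descriptor */
        rc = -1;

    return rc;
}
```

The order matters: fsync() is called while the descriptor is still open, because after fclose() the descriptor may no longer be used.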
Scenario:
Task code (error checking omitted):
// open, write and close
fd = open(name, O_WRONLY);
write(fd, buf, len);
close(fd);
< more code here **not** issuing read/writes to name but maybe open()ing it >
// open again and fsync
fd = open(name, O_RDONLY);
fsync(fd);
No more tasks accessing name concurrently in the system.
Is it defined, and more important, will it sync possible outstanding writes on the inode referred by name? ie, will I read back buf from the file after the fsync?
From POSIX http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html I would say it seems legit ...
Thanks.
Edit may 18:
Thanks for the answers and research. I took this question (in 2016) to one of the extfs lead developers (Ted) and got this answer: "It's not guaranteed by Posix, but in practice it should work on most
file systems, including ext4. The key wording in the Posix specification is:
The fsync() function shall request that all data for the open file
^^^^^^^^^^^^^^^^^
descriptor named by fildes is to be transferred to the storage device
^^^^^^^^^^^^^^^^^^^^^^^^^^
associated with the file described by fildes.
It does not say "all data for the file described by fildes...." it
says "all data for the open file descriptor". So technically data
written by another file descriptor is not guaranteed to be synced to
disk.
In practice, file systems don't track dirty data by which fd it came in
on, so you don't need to worry. And an OS which writes more than what
is strictly required is standards compliant, and so that's what you
will find in general, even if it isn't guaranteed."
This is less specific than "exact same durability guarantees" but is quite authoritative, even though it may be outdated.
What I was trying to do was a 'sync' command that worked on single files.
Like fsync /some/file without having to sync the whole filesystem, to use it in shell scripts for example.
For a few years now, GNU coreutils 'sync' has worked on single files and does exactly this (open/fsync). Commit: https://github.com/coreutils/coreutils/commit/8b2bf5295f353016d4f5e6a2317d55b6a8e7fd00
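The open/fsync idea can be sketched in a few lines of C. This is not the coreutils implementation, only an illustration of the pattern; sync_path() is a hypothetical name, and as the quoted answer says, POSIX does not guarantee that data written through an earlier descriptor is covered.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open an existing file just to call fsync() on it.
 * Returns 0 on success, -1 on failure with errno set. */
int sync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    int saved = errno;          /* remember fsync()'s error, if any */

    if (close(fd) < 0 && rc == 0)
        return -1;              /* fsync() was fine but close() failed */

    errno = saved;
    return rc;
}
```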
No, close()+re-open()+fsync() does not provide the same guarantees as fsync()+close().
Source: I took this question to the linux-fsdevel mailing list and got the answer:
Does a sequence of close()/re-open()/fsync() provide the same durability
guarantees as fsync()/close()?
The short answer is no, the latter provides a better guaranty.
The longer answer is that durability guarantees depends on kernel version,
because situation has been changing in v4.13, v4.14 and now again in
v4.17-rc and stable kernels.
Further relevant links are:
https://wiki.postgresql.org/wiki/Fsync_Errors ("fsyncgate")
Mailing list entry PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Writing programs to cope with I/O errors causing lost writes on Linux from the same author
In particular, the latter links describe how
after closing an FD, you lose all ways to enforce durability
after an fsync() fails, you cannot call fsync() again in the hope that now your data would be written
you must re-do/confirm all writing work if that happens
The current (2017) specification of POSIX fsync()
recognizes a base functionality and an optional functionality:
The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.
[SIO] ⌦ If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion. ⌫
If _POSIX_SYNCHRONIZED_IO is not defined by the implementation, then your reopened file descriptor has no unwritten data to be transferred to the storage device, so the fsync() call is effectively a no-op.
If _POSIX_SYNCHRONIZED_IO is defined by the implementation, then fsync() on your reopened file descriptor will ensure that all data written on any file descriptor associated with the file is transferred to the storage device.
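Whether the option is available can be checked from a program. The sketch below assumes the usual POSIX convention for option macros (a value greater than zero means "always supported", otherwise the answer is looked up at run time with sysconf()); has_synchronized_io() is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Hypothetical helper: returns 1 if the Synchronized I/O option is
 * reported as supported, 0 otherwise. */
int has_synchronized_io(void)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
    return 1;                                   /* known at compile time */
#else
    return sysconf(_SC_SYNCHRONIZED_IO) > 0;    /* ask at run time */
#endif
}
```

This only reports what the implementation claims; it says nothing about what the file system or the disk hardware actually does.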
The section of the standard on Conformance has information about options and option groups.
The Definitions section has definitions 382..387 which defines aspects of Synchronized I/O and Synchronous I/O (yes, they're different — beware open file descriptors and open file descriptions, too).
The section on Realtime defers to the Definitions section for what synchronized I/O means.
It defines:
3.382 Synchronized Input and Output
A determinism and robustness improvement mechanism to enhance the data input and output mechanisms, so that an application can ensure that the data being manipulated is physically present on secondary mass storage devices.
3.383 Synchronized I/O Completion
The state of an I/O operation that has either been successfully transferred or diagnosed as unsuccessful.
3.384 Synchronized I/O Data Integrity Completion
For read, when the operation has been completed or diagnosed if unsuccessful. The read is complete only when an image of the data has been successfully transferred to the requesting process. If there were any pending write requests affecting the data to be read at the time that the synchronized read operation was requested, these write requests are successfully transferred prior to reading the data.
For write, when the operation has been completed or diagnosed if unsuccessful. The write is complete only when the data specified in the write request is successfully transferred and all file system information required to retrieve the data is successfully transferred.
File attributes that are not necessary for data retrieval (access time, modification time, status change time) need not be successfully transferred prior to returning to the calling process.
3.385 Synchronized I/O File Integrity Completion
Identical to a synchronized I/O data integrity completion with the addition that all file attributes relative to the I/O operation (including access time, modification time, status change time) are successfully transferred prior to returning to the calling process.
3.386 Synchronized I/O Operation
An I/O operation performed on a file that provides the application assurance of the integrity of its data and files.
3.387 Synchronous I/O Operation
An I/O operation that causes the thread requesting the I/O to be blocked from further use of the processor until that I/O operation completes.
Note:
A synchronous I/O operation does not imply synchronized I/O data integrity completion or synchronized I/O file integrity completion.
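The difference between data integrity completion and file integrity completion corresponds to the choice between fdatasync() and fsync(). A minimal sketch, assuming a system that provides fdatasync() (Linux does); flush_fd() is a hypothetical name:

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Hypothetical helper.
 * need_attributes == 0: fdatasync(), i.e. the data plus whatever metadata
 *                       is required to retrieve it (data integrity).
 * need_attributes != 0: fsync(), i.e. additionally all file attributes
 *                       such as timestamps (file integrity).
 * Returns 0 on success, -1 on failure with errno set. */
int flush_fd(int fd, int need_attributes)
{
    return need_attributes ? fsync(fd) : fdatasync(fd);
}
```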
It is not 100% clear whether the 'all currently queued I/O operations associated with the file indicated by [the] file descriptor' applies across processes.
Conceptually, I think it should, but the wording isn't there in black and white (or black on pale yellow). It certainly should apply to any open file descriptors in the current process referring to the same file. It is not clear that it would apply to the previously opened (and closed) file descriptor in the current process. If it applies across all processes, then it should include the queued I/O from the current process. If it does not apply across all processes, it is possible that it does not.
In view of this and the rationale notes for fsync(), it is by far safest to assume that the fsync() operation has no effect on the queued operations associated with the closed file descriptor. If you want fsync() to be effective, call it before you close the file descriptor.
What really happens when write() system call is executed?
Let's say I have a program which writes certain data into a file using the write() function call. Now, the C library has its own internal buffer, and the OS has its own buffer too.
What interaction takes place between these buffers ?
Is it like when C library buffer gets filled completely, it writes to OS buffer and when OS buffer gets filled completely, then the actual write is done on the file?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.
The write() system call (in fact, every system call) is nothing more than a contract between the application program and the OS.
For "normal" files, write() only puts the data into a buffer and marks that buffer as "dirty".
At some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced by fsync().
This is done by the .write() "method" in the mounted-filesystem table,
and this will invoke the hardware's .write() method (which could involve another level of buffering, such as DMA).
Modern hard disks have their own buffers, which may or may not have actually been written to the physical disk, even if the OS->controller told them to.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() depends on the nature of the file; for network sockets, write(fd,buff,size) can return before size bytes have been sent (write will return the number of bytes sent). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (e.g. waiting for Nagle ...), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.
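Because write() may accept fewer bytes than requested, callers commonly wrap it in a loop. The following is a minimal sketch, assuming a blocking descriptor; write_all() is a hypothetical name, and it only guarantees that the kernel accepted the bytes, not that they reached a disk or the peer.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes have been
 * accepted by the kernel. Returns len on success, -1 on error (errno set). */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;
        }
        p += n;                 /* skip the part that was accepted */
        left -= (size_t)n;
    }
    return (ssize_t)len;
}
```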
As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed it's possible for a power failure or something to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media; then you need to redesign your code to avoid this requirement, or accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).
write() writes data to the operating system, making it visible to all processes (if it is something which can be read by other processes). How the operating system buffers it, or when it gets written permanently to disk, is very library, OS, system configuration and file system specific. However, sync() can be used to force buffers to be flushed.
What is guaranteed is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.
OS dependent, see man 2 sync and (on Linux) the discussion in man 8 sync.
Years ago operating systems were supposed to implement an 'elevator algorithm' to schedule writes to disk. The idea would be to minimize the disk writing head movement, which would allow a good throughput for several processes accessing the disk at the same time.
Since you're asking about UNIX, keep in mind that a file might actually live on, for example, an FTP server that you have mounted. Likewise, the files under /dev and /proc are not files on the HDD.
Also, on Linux data is not written to the hard drive directly; instead, there is a polling process that flushes all pending writes every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.
Suppose I write a block to a file descriptor without doing fsync and then read the same block from the same descriptor some time later. Is it guaranteed that I will receive the same information?
The program is single-threaded and no other process will access the file at any time.
Yes, it is guaranteed by the operating system.
Even if the modifications have not made it to disk yet, the OS uses its buffer cache to reflect file modifications and guarantees atomicity level for reads and writes, to ALL processes. So not only your process, but any other process, would be able to see the changes.
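This can be checked with a small experiment: write a block, read it back through the same descriptor without any fsync(), and compare. A minimal sketch, assuming a descriptor opened for reading and writing on a regular file; readback_matches() is a hypothetical name and the 4096-byte limit is an arbitrary choice for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes at offset off, read them back
 * without calling fsync(), and compare.
 * Returns 1 if the data matches, 0 if it differs, -1 on I/O error
 * or if len exceeds the scratch buffer. */
int readback_matches(int fd, const void *buf, size_t len, off_t off)
{
    char tmp[4096];

    if (len > sizeof tmp)
        return -1;
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    if (pread(fd, tmp, len, off) != (ssize_t)len)
        return -1;

    return memcmp(tmp, buf, len) == 0;
}
```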
As to fsync(), it only instructs the operating system to do its best to flush the contents to disk. See also fdatasync().
Also, I suggest you use two file descriptors: one for reading, another for writing.
fsync() synchronizes cache and disk. Since the data is already in the cache, it will be read from there instead of from disk.
When you write to a file descriptor, the data is stored in ram caches and buffers before being sent to disk. So as long as you don't close the descriptor, you can access the data you just wrote. If you close the descriptor, the file contents must be put to disk either by flushing it yourself or waiting for the OS to do it for efficiency, BUT if you want to be assured to access the just written data on disk after opening a new FD, you MUST flush to disk with fsync().
I know that when you call fwrite or fprintf or rather any other function that writes to a file, the contents aren't immediately flushed to the disk, but buffered in the memory.
Firstly, where does the OS manage these buffers, and how? Secondly, if you write to a file and later read back the content you wrote, and assuming that the OS didn't flush the contents between the write and the read, how does it know that it has to serve the read from the buffer? How does it handle this situation?
The reason I want to know this is that I'm interested in implementing my own buffering scheme in user space, rather than in kernel space as done by the OS. That is, writes to a file would be buffered in user space and the actual write would only occur at a certain point. Consequently, I also need to handle situations where read is called for content that is still in the buffer. Is it possible to do all this in user space?
Firstly, where do the OS manage these buffers and how
The functions fwrite and fprintf use stdio buffers which already are completely in userspace. The buffers are (likely) static arrays or perhaps malloced memory.
how it knows that it has to return the read from the buffer
It doesn't, so the changes aren't seen. Nothing actually happens to a file until the underlying system call (write) is called (and even then - read on).
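The effect can be observed by comparing the file size the kernel reports before and after fflush(). This is a minimal sketch, assuming a regular file and a message shorter than the stdio buffer; kernel_size() and demo_stdio_buffering() are hypothetical names, and full buffering is requested explicitly so that the behaviour does not depend on the library's default.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* Size of the file as the kernel currently sees it, or -1 on error. */
long kernel_size(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_size : -1;
}

/* Hypothetical demo: write msg through stdio and record the
 * kernel-visible file size before and after fflush().
 * Returns 0 on success, -1 on error. */
int demo_stdio_buffering(const char *path, const char *msg,
                         long *before, long *after)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    setvbuf(fp, NULL, _IOFBF, BUFSIZ);   /* request full buffering */

    if (fputs(msg, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    *before = kernel_size(path);         /* msg is still in the stdio buffer */

    if (fflush(fp) != 0) {               /* hand the buffer to write() */
        fclose(fp);
        return -1;
    }
    *after = kernel_size(path);          /* now the kernel knows about it */

    return fclose(fp);
}
```

With a short message, one would expect the first size to still be zero and the second to equal the message length.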
Is it possible to do all this in user-space
No, it's not possible. The good news is that the kernel already has buffers, so every write you do isn't actually translated into an actual write to the file. It is postponed and executed later. If in the meantime somebody tries to read from the file, the kernel is smart enough to serve them from the buffer.
Bits from TLPI:
When working with disk files, the read() and write() system calls
don’t directly initiate disk access. Instead, they simply copy data
between a user-space buffer and a buffer in the kernel buffer cache.
When performing I/O on a disk file, a successful return from write()
doesn’t guarantee that the data has been transferred to disk, because
the kernel performs buffering of disk I/O in order to reduce disk
activity and expedite write() calls.
At some later point, the kernel writes (flushes) its buffer to the
disk.
If, in the interim, another process attempts to read these bytes of
the file, then the kernel automatically supplies the data from the
buffer cache, rather than from (the outdated contents of) the file.
So you may want to find out about sync and fsync.
Multiple levels of buffering are generally bad. The reason stdio buffers are useful is that they minimize the number of system calls performed. If system calls were cheaper, nobody would use stdio buffers anymore.