Calling fsync(2) after close(2)

Scenario:
Task code (error checking omitted):
// open, write and close
fd = open(name, O_WRONLY | O_CREAT, 0644);
write(fd, buf, len);
close(fd);
< more code here **not** issuing reads/writes to name but maybe open()ing it >
// open again and fsync
fd = open(name, O_RDONLY);
fsync(fd);
No more tasks accessing name concurrently in the system.
Is it defined, and more importantly, will it sync possible outstanding writes on the inode referred to by name? I.e., will I read back buf from the file after the fsync?
From POSIX http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html I would say it seems legit ...
Thanks.
Edit May 18:
Thanks for the answers and research. I took this question (in 2016) to one of the extfs lead developers (Ted) and got this answer: "It's not guaranteed by Posix, but in practice it should work on most file systems, including ext4. The key wording in the Posix specification is:
The fsync() function shall request that all data **for the open file descriptor** named by fildes is to be transferred to the storage device associated with the file described by fildes.
It does not say "all data for the file described by fildes ..."; it says "all data for the open file descriptor". So technically data written by another file descriptor is not guaranteed to be synced to disk.
In practice, file systems don't track dirty data by which fd it came in on, so you don't need to worry. And an OS which writes more than what is strictly required is standards compliant, and so that's what you will find in general, even if it isn't guaranteed."
This is less specific than "exact same durability guarantees", but it is quite authoritative, even though it may be outdated.
What I was trying to do was a 'sync' command that worked on single files.
Like fsync /some/file without having to sync the whole filesystem, to use it in shell scripts for example.
Now (since a few years ago) GNU coreutils 'sync' works on single files and does exactly this (open/fsync). Commit: https://github.com/coreutils/coreutils/commit/8b2bf5295f353016d4f5e6a2317d55b6a8e7fd00
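For illustration, here is a minimal sketch of such a single-file sync utility (my own example, not the coreutils implementation; it assumes a Linux/POSIX environment):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    /* Open read-only; on Linux fsync() is accepted on a read-only descriptor. */
    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }
    if (fsync(fd) == -1) { perror("fsync"); close(fd); return 1; }
    if (close(fd) == -1) { perror("close"); return 1; }
    return 0;
}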

No, close()+re-open()+fsync() does not provide the same guarantees as fsync()+close().
Source: I took this question to the linux-fsdevel mailing list and got the answer:
Does a sequence of close()/re-open()/fsync() provide the same durability
guarantees as fsync()/close()?
The short answer is no, the latter provides a better guarantee.
The longer answer is that durability guarantees depend on the kernel version,
because the situation has been changing in v4.13, v4.14 and now again in
v4.17-rc and stable kernels.
Further relevant links are:
https://wiki.postgresql.org/wiki/Fsync_Errors ("fsyncgate")
The mailing list entry "PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS"
"Writing programs to cope with I/O errors causing lost writes on Linux", from the same author
In particular, the latter links describe how, after closing an FD, you lose all ways to enforce durability; how, after an fsync() fails, you cannot call fsync() again in the hope that your data would now be written; and how you must re-do/confirm all of the writing work if that happens.
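To make that discipline concrete, here is a minimal sketch (my own illustration, not from the mailing list): fsync() before close(), check both return values, and treat a failed fsync() as data that must be rewritten rather than as something to retry:

#include <fcntl.h>
#include <unistd.h>

/* Returns 0 on success. On any failure the caller must assume the data
 * did not reach stable storage and must redo the write from its own copy. */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len) {  /* treat a short write as failure here */
        close(fd);
        return -1;
    }
    if (fsync(fd) == -1) {   /* do NOT retry fsync(); the dirty pages may already have been dropped */
        close(fd);
        return -1;
    }
    return close(fd);        /* close() itself can also report errors */
}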

The current (2017) specification of POSIX fsync()
recognizes a base functionality and an optional functionality:
The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.
[SIO] If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion.
If _POSIX_SYNCHRONIZED_IO is not defined by the implementation, then your reopened file descriptor has no unwritten data to be transferred to the storage device, so the fsync() call is effectively a no-op.
If _POSIX_SYNCHRONIZED_IO is defined by the implementation, then your reopened file descriptor will ensure that all data written on any file descriptor associated with the file is transferred to the storage device.
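As an aside, you can ask the implementation whether the Synchronized I/O option is supported, both at compile time via the feature-test macro and at run time via sysconf(). A small, purely illustrative sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
    printf("_POSIX_SYNCHRONIZED_IO defined at compile time as %ld\n",
           (long)_POSIX_SYNCHRONIZED_IO);
#else
    printf("_POSIX_SYNCHRONIZED_IO not defined (or not > 0) at compile time\n");
#endif
    /* Run-time check; -1 means the option is not supported. */
    printf("sysconf(_SC_SYNCHRONIZED_IO) = %ld\n", sysconf(_SC_SYNCHRONIZED_IO));
    return 0;
}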
The section of the standard on Conformance has information about options and option groups.
The Definitions section has definitions 382..387 which define aspects of Synchronized I/O and Synchronous I/O (yes, they're different; beware open file descriptors and open file descriptions, too).
The section on Realtime defers to the Definitions section for what synchronized I/O means.
It defines:
3.382 Synchronized Input and Output
A determinism and robustness improvement mechanism to enhance the data input and output mechanisms, so that an application can ensure that the data being manipulated is physically present on secondary mass storage devices.
3.383 Synchronized I/O Completion
The state of an I/O operation that has either been successfully transferred or diagnosed as unsuccessful.
3.384 Synchronized I/O Data Integrity Completion
For read, when the operation has been completed or diagnosed if unsuccessful. The read is complete only when an image of the data has been successfully transferred to the requesting process. If there were any pending write requests affecting the data to be read at the time that the synchronized read operation was requested, these write requests are successfully transferred prior to reading the data.
For write, when the operation has been completed or diagnosed if unsuccessful. The write is complete only when the data specified in the write request is successfully transferred and all file system information required to retrieve the data is successfully transferred.
File attributes that are not necessary for data retrieval (access time, modification time, status change time) need not be successfully transferred prior to returning to the calling process.
3.385 Synchronized I/O File Integrity Completion
Identical to a synchronized I/O data integrity completion with the addition that all file attributes relative to the I/O operation (including access time, modification time, status change time) are successfully transferred prior to returning to the calling process.
3.386 Synchronized I/O Operation
An I/O operation performed on a file that provides the application assurance of the integrity of its data and files.
3.387 Synchronous I/O Operation
An I/O operation that causes the thread requesting the I/O to be blocked from further use of the processor until that I/O operation completes.
Note:
A synchronous I/O operation does not imply synchronized I/O data integrity completion or synchronized I/O file integrity completion.
It is not 100% clear whether the 'all currently queued I/O operations associated with the file indicated by [the] file descriptor' applies across processes.
Conceptually, I think it should, but the wording isn't there in black and white (or black on pale yellow). It certainly should apply to any open file descriptors in the current process referring to the same file. It is not clear that it would apply to the previously opened (and closed) file descriptor in the current process. If it applies across all processes, then it should include any I/O still queued from the current process. If it only applies within the current process, it is possible that it does not cover the writes made through the earlier, now-closed descriptor.
In view of this and the rationale notes for fsync(), it is by far safest to assume that the fsync() operation has no effect on the queued operations associated with the closed file descriptor. If you want fsync() to be effective, call it before you close the file descriptor.

Related

Blocked vs unblocked I/O buffering?

I was reading a bit about z/OS's concepts of blocked IO. It states:
Blocked I/O is an extension to the ISO standard. For files opened in block format, z/OS® XL C/C++ reads and writes one block at a time. If you try to write more data to a block than the block can hold, the data is truncated. For blocked I/O, z/OS XL C/C++ allows only the use of fread() and fwrite() to read and write to files.
Then it goes to say:
The fflush() function has no effect for blocked I/O files.
However, in another article, it says:
For terminals, because I/O is always unblocked, line buffering is
equivalent to full buffering.
For record I/O files, buffering is
meaningful only for blocked files or for record I/O files in z/OS UNIX
file system using full buffering. For unblocked files, the buffer is
full after every write and is therefore written immediately, leaving
nothing to flush. For blocked files or fully-buffered UNIX file system
files, however, the buffer can contain one or more records that have
not been flushed and that require a flush operation for them to go to
the system.
For blocked I/O files, buffering is always meaningless.
I'm extremely confused by all this. If I/O is unblocked, how would line buffering be equivalent to full buffering? Why wouldn't a flush make a difference for blocked I/O? In addition, what does it mean that blocked I/O causes buffering to be always meaningless? Any intuition about what's happening here with blocked vs. unblocked I/O, and how it plays into the effects of buffering, would be much appreciated.
My take on what you provided is that this refers to blocked I/O for MVS datasets, which is different from files stored in Unix System Services HFS / ZFS, and different from terminal I/O.
I'm extremely confused by all this. If I/O is unblocked, how would
line buffering be equivalent to full buffering?
I think you're referring to the note about terminal I/O, which indicates that a line is a record and is the same as the block size, so every record is a full block of data. Which is to say LRECL = BLKSIZE, i.e. one record per block, so it's not buffered; or rather, the buffer is the record.
Why wouldn't flush make a difference in block I/O?
Where there is more than one record per block, fflush will not write a block until the block is full. I suspect it has to do with the I/O implementation in z/OS, which predated C on the platform, so a design decision was made not to cause different behaviours for different languages in how I/O is conducted.
In addition, what does it mean that blocked I/O cause buffering to be always meaningless?
Again, z/OS writes full blocks except for the last block in a file which may be short because it does not contain enough records for a full block.
There was a lot of history in z/OS before C came to the platform and z/OS goes to great lengths to provide consistency.

Is there an official document that marks the read/write functions as thread-safe?

The man pages of read/write don't mention anything about their thread-safety.
According to this link, I understood that these functions are thread-safe, but that comment does not cite an official document.
On the other hand, according to this link, which says:
The read() function shall attempt to read nbyte bytes
from the file associated with the open file descriptor,
fildes, into the buffer pointed to by buf.
The behavior of multiple concurrent reads on the same pipe, FIFO, or
terminal device is unspecified.
I concluded the read function is not thread safe.
I am so confused now. Please point me to an official document about the thread-safety of these functions.
I tested these functions with a pipe and there wasn't any problem (of course, I know I can't draw any firm conclusion from testing a few examples).
thanks in advance:)
The thread safe versions of read and write are pread and pwrite:
pread(2)
The pread() and pwrite() system calls are especially useful in
multithreaded applications. They allow multiple threads to perform
I/O on the same file descriptor without being affected by changes to
the file offset by other threads.
When two threads write() at the same time, the order is not specified (i.e., which write call completes first), therefore the behaviour is unspecified (without synchronization).
read() and write() are not strictly thread-safe, and there is no documentation that says they are, as the location where the data is read from or written to can be modified by another thread.
Per the POSIX read documentation (note the bolded parts):
The read() function shall attempt to read nbyte bytes from the file associated with the open file descriptor, fildes, into the buffer pointed to by buf. The behavior of multiple concurrent reads on the same pipe, FIFO, or terminal device is unspecified.
That's the part you noticed - but that does not cover all possible types of file descriptors, such as regular files. It only applies to "pipe[s], FIFO[s]" and "terminal device[s]". This part covers almost everything else (weird things like "files" in /proc that are generated on the fly by the kernel are, well, weird and highly implementation-specific):
On files that support seeking (for example, a regular file), the read() shall start at a position in the file given by the file offset associated with fildes. The file offset shall be incremented by the number of bytes actually read.
Since the "file offset associated with fildes" is subject to modification from other threads in the process, the following code is not guaranteed to return the same results even given the exact same file contents and inputs for fd, offset, buffer, and bytes:
lseek( fd, offset, SEEK_SET );
read( fd, buffer, bytes );
Since both read() and write() depend upon a state (the current file offset) that can be modified at any moment by another thread, they are not thread-safe.
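The usual fix is to pass the offset explicitly with pread()/pwrite(), so no shared file-offset state is involved. A small sketch (my own illustration; the file name is just an example):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file will do */
    if (fd == -1) { perror("open"); return 1; }

    char buffer[16];
    /* pread() takes the offset as an argument and never touches the shared
     * file offset, so concurrent calls on the same fd do not race the way
     * a lseek()+read() pair does. */
    ssize_t n = pread(fd, buffer, sizeof buffer, 0);
    if (n == -1) { perror("pread"); close(fd); return 1; }

    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}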
On some embedded file systems, or really old desktop systems that weren't designed to facilitate multitasking support (e.g. MS-DOS 3.0), an attempt to perform an fread() on one file while an fread() is being performed on another file may result in arbitrary system corruption.
Any modern operating system and language runtime will guarantee that such corruption won't occur as a result of operations performed on unrelated files, or when independent file descriptors are used to access the same file in ways that do not modify it. Functions like fread() and fwrite() will be thread-safe when used in that fashion.
The act of reading data from a disk file does not modify it, but reading data from many kinds of stream will modify them by removing data. If two threads both perform actions that modify the same stream, such actions may interfere with each other in unspecified ways even if such modifications are performed by fread() operations.
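If you do need a group of stdio operations on one stream to appear atomic with respect to other threads, POSIX provides flockfile()/funlockfile() to hold the stream's internal lock across several calls. A brief, illustrative sketch:

#include <stdio.h>

/* Write one complete log line without other threads' output being
 * interleaved between the individual fputs()/fputc() calls. */
void log_line(FILE *stream, const char *tag, const char *msg)
{
    flockfile(stream);        /* take the stream's internal lock */
    fputs(tag, stream);
    fputs(": ", stream);
    fputs(msg, stream);
    fputc('\n', stream);
    funlockfile(stream);      /* release it */
}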

What is the difference between fsync and syncfs?

What is the difference between fsync and syncfs ?
int syncfs(int fd);
int fsync(int fd);
The manpage for fsync says the following:
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
The manpage for syncfs says the following:
sync() causes all buffered modifications to file metadata and data to be written to the underlying filesystems.
syncfs() is like sync(), but synchronizes just the filesystem containing file referred to by the open file descriptor fd.
To me both seem equal: they synchronize the file referred to by the file descriptor and the associated metadata.
First, fsync() (and sync()) are POSIX-standard functions while syncfs() is Linux-only.
So availability is one big difference.
From the POSIX standard for fsync():
The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage
device associated with the file described by fildes. The nature of
the transfer is implementation-defined. The fsync() function shall
not return until the system has completed that action or until an
error is detected.
Note that it's just a request.
From the POSIX standard for sync():
The sync() function shall cause all information in memory that
updates file systems to be scheduled for writing out to all file
systems.
The writing, although scheduled, is not necessarily complete upon
return from sync().
Again, that's not something guaranteed to happen.
The Linux man page for syncfs() (and sync()) states
sync() causes all pending modifications to filesystem metadata and
cached file data to be written to the underlying filesystems.
syncfs() is like sync(), but synchronizes just the filesystem
containing file referred to by the open file descriptor fd.
Note that when the function returns is unspecified.
The Linux man page for fsync() states:
fsync() transfers ("flushes") all modified in-core data of (i.e.,
modified buffer cache pages for) the file referred to by the file
descriptor fd to the disk device (or other permanent storage device)
so that all changed information can be retrieved even if the system
crashes or is rebooted. This includes writing through or flushing a
disk cache if present. The call blocks until the device reports that
the transfer has completed.
As well as flushing the file data, fsync() also flushes the metadata
information associated with the file (see inode(7)).
Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk. For that an
explicit fsync() on a file descriptor for the directory is also
needed.
Note that the guarantees Linux provides for fsync() are much stronger than those provided for sync() or syncfs(), and by POSIX for both fsync() and sync().
In summary:
POSIX fsync(): "please write data for this file to disk"
POSIX sync(): "write all data to disk when you get around to it"
Linux sync(): "write all data to disk (when you get around to it?)"
Linux syncfs(): "write all data for the filesystem associated with this file to disk (when you get around to it?)"
Linux fsync(): "write all data and metadata for this file to disk, and don't return until you do"
Note that the Linux man page doesn't specify when sync() and syncfs() return.
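For concreteness, a small sketch contrasting the two calls (my own illustration; it assumes Linux with glibc 2.14 or later for syncfs(), and the path is just an example):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/example.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) { perror("open"); return 1; }

    if (write(fd, "hello\n", 6) != 6) { perror("write"); return 1; }

    /* fsync(): push this one file's data and metadata to stable storage. */
    if (fsync(fd) == -1) perror("fsync");

    /* syncfs(): flush the whole filesystem that contains this file
     * (Linux-specific; covers everything on that filesystem, not just fd). */
    if (syncfs(fd) == -1) perror("syncfs");

    close(fd);
    return 0;
}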
I think the current answer is not complete. For Linux:
According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing is
done. However Linux waits for I/O completions, and thus sync() or
syncfs() provide the same guarantees as fsync called on every file
in the system or filesystem respectively.
and
Before version 1.3.20 Linux did not wait for I/O to complete before
returning.
This is mentioned on the sync(2) page in the "notes" and "bugs" sections.

How to prevent data loss when closing a file descriptor?

When I issue write(), my data goes to some kernel-space buffers. The actual commit to the physical layer ("phy-commit") is (likely) deferred until... (exactly which events?)
When I issue a close() for a file descriptor, then
If [...], the resources associated with the open file description are freed
Does it mean releasing (freeing) those kernel buffers which contained my data? What will happen to my precious data, contained in those buffers? Will it be lost?
How to prevent that loss?
Via fsync()? It requests an explicit phy-commit. I suppose it happens either immediately (a synchronous call) or is deferred only "for a short time" and queued so that it precedes subsequent operations, at least destructive ones.
But I do not really want an immediate or urgent phy-commit. I only want my data retained, and the phy-commit not forgotten but done at some later time.
From man fclose:
The fclose() function [...] closes the underlying file descriptor.
...
fclose() flushes only the user-space buffers provided by the C library. To ensure that the data is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).
It may suggest that fsync does not have to precede close (or fclose, which includes a close), but can (or even must) come after it. So close() cannot be very destructive ...
Does it mean releasing (freeing) those kernel buffers which contained my data? What will happen to my precious data, contained in those buffers? Will be lost?
No. The kernel buffers will not be freed before the kernel writes the data to the underlying file. So, there won't be any data loss (unless something goes really wrong, such as a power outage).
Whether that data will be immediately written to the physical medium is another question. It may depend on the filesystem (which may be buffering) and/or on hardware caching as well.
As far as your user program is concerned, a successful close() call can be considered a successful write to the file.
It may suggest that fsync does not have to precede close (or fclose, which contains a close), but can (even have to) come after it. So the close() cannot be very destructive...
After a call to close(), the state of the file descriptor is left unspecified by POSIX (regardless of whether close() succeeded). So, you are not allowed to use fsync(fd); after calling close().
See: POSIX/UNIX: How to reliably close a file descriptor.
And no, it doesn't suggest close() can be destructive. It suggests that the C library may be doing its own buffering in user space, that fflush()/fclose() only pushes that data to the kernel, and that you additionally need fsync() to flush the kernel buffers to disk (and then we are back in the same position as described before).
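The resulting stdio pattern, as a minimal sketch (illustrative only, error handling kept simple):

#include <stdio.h>
#include <unistd.h>

int save(const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    if (fputs(text, fp) == EOF) {            /* data goes into the stdio (user-space) buffer */
        fclose(fp);
        return -1;
    }
    if (fflush(fp) == EOF) {                 /* stdio buffer -> kernel page cache */
        fclose(fp);
        return -1;
    }
    if (fsync(fileno(fp)) == -1) {           /* kernel page cache -> storage device */
        fclose(fp);
        return -1;
    }
    return fclose(fp);                       /* only now is it safe to close */
}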

How does the standard specify atomic writes to a regular file (not a pipe or FIFO)?

The POSIX standard specifies that writes of less than PIPE_BUF bytes to a pipe or FIFO are guaranteed to be atomic, that is, our write doesn't get mixed with other processes' writes. But I failed to find out what the standard says about regular files. Is a write of less than PIPE_BUF bytes to a regular file also guaranteed to be atomic, or does a regular file even have such a limit? I mean, a pipe has a capacity, so that when a write to the pipe goes beyond its capacity the kernel will put the writer to sleep and other processes will get a chance to write, but a regular file doesn't seem to need such a limitation, am I right?
What I'm doing is having several processes write log entries to one file. Of course, with O_APPEND set.
Quote from http://pubs.opengroup.org/onlinepubs/9699919799/toc.htm (Single UNIX Specification, Version 4, 2010 Edition):
This volume of POSIX.1-2008 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control.
The specification does address the semantics of writes in the presence of multiple readers, but as you can see from the above, the behaviour for multiple concurrent writers is not defined by the specification.
Note that the above talks about regular files. For pipes and FIFOs the PIPE_BUF semantics apply: concurrent writes are guaranteed to be indivisible up to PIPE_BUF bytes.
Write requests to a pipe or FIFO shall be handled in the same way as a regular file with the following exceptions:
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
For real file systems the situation is complex. Some local file systems may enforce atomic writes up to arbitrary sizes (memory limit) by locking a file handle during writing, some might not (I tried to look at ext4 logic, but lost track somewhere around http://lxr.linux.no/linux+v3.5.3/fs/jbd2/transaction.c#L147).
For non-local file systems the result is more or less up for grabs. Just don't try concurrent writing on a networked file system without some form of explicit locking (unless you're positively, absolutely sure about the semantics of the network file system you're using).
BTW, O_APPEND guarantees that all writes by different processes go to the end of the file. However, as the SUS quote above notes, if the writes are really concurrent (occurring at the same time), then the behavior is undefined. On earlier uniprocessor and non-preemptive UNIXes this didn't really matter, as a call to write(2) completed before someone else got a chance to write...
This question could be answered definitively for a specific combination of operating system (Linux?) and file system (ext4?). A general answer? As the SUS reads: "undefined behavior".
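To illustrate the O_APPEND approach under the practical (not POSIX-guaranteed) assumption that each record is emitted in a single write() call, here is a sketch; the file name is just an example:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Every process opens the same log with O_APPEND; the kernel positions
     * each write() at end-of-file atomically with the write itself. */
    int fd = open("/tmp/shared.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) { perror("open"); return 1; }

    char record[128];
    int len = snprintf(record, sizeof record, "pid %ld: something happened\n",
                       (long)getpid());

    /* Emit the whole record in ONE write() call; splitting it across several
     * calls would allow interleaving with other processes' output. */
    if (write(fd, record, (size_t)len) != len)
        perror("write");

    close(fd);
    return 0;
}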
I think this is useful to you: "the data written by writev() is written as a single block that is not intermingled with output from writes in other processes", so you can use writev
Several writers to a file may mix up things. But files opened with O_APPEND are appended atomically per write access.
If you want to keep to the C stdio interface instead of the lower-level one, fopen the file with "a" or "a+" (which map to O_APPEND), set up a buffer large enough that a record never has to be written out in pieces, and use fflush to force the write when you are done building each record (fsync is then only needed for durability, since it does not flush the stdio buffer). I'm not sure this is guaranteed by POSIX (C says nothing about it).
There is the ultimate solution to all questions of atomicity: a mutex. Wrap your writes to the log file in a mutex and all will be done atomically.
A simpler solution might be to use the GLOG libraries from Google. A fantastic logging system, far better than anything I ever dreamed up, free, not-GPL, and atomic.
One way to interleave them safely would be to have all writers lock the file, write, and unlock.
Functions that can be used for locking are flock(), lockf(), and fcntl().
Beware that ALL writers must lock (and they should all use the same mechanism to do the locking) or one that doesn't bother getting a lock could still write at the same time as another that holds a lock.
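A sketch of that locking approach using flock() (illustrative only; all cooperating writers must use the same mechanism, and the fd is assumed to have been opened with O_APPEND):

#include <sys/file.h>
#include <unistd.h>

int append_record(int fd, const char *record, size_t len)
{
    if (flock(fd, LOCK_EX) == -1)            /* block until we hold the exclusive lock */
        return -1;

    ssize_t written = write(fd, record, len);

    flock(fd, LOCK_UN);                      /* release so other writers can proceed */
    return (written == (ssize_t)len) ? 0 : -1;
}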

Resources