When I issue write(), my data goes to some kernel space buffers. The actual commit to physical layer ("phy-commit") is (likely) deferred, until.. (exactly until what events?)
When I issue a close() for a file descriptor, then
If [...], the resources associated with the open file description are freed
Does it mean releasing (freeing) those kernel buffers which contained my data? What will happen to my precious data, contained in those buffers? Will be lost?
How to prevent that loss?
Via fsync()? It requests an explicite phy-commit. I suppose that either immediately (synchronous call) or deferred only "for short time" and queued to precede subsequent operations, at least destructive ones.
But I do not quite want immediate or urgent phy-commit. Only (retaining my data and) not to forget doing phy-commit later in some time.
From man fclose:
The fclose() function [...] closes the underlying file descriptor.
...
fclose() flushes only the user-space buffers provided by the C library. To ensure that the data is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).
It may suggest that fsync does not have to precede close (or fclose, which contains a close), but can (even have to) come after it. So the close() cannot be very destructive...
Does it mean releasing (freeing) those kernel buffers which contained my data? What will happen to my precious data, contained in those buffers? Will be lost?
No. The kernel buffers will not freed before it writes the data to the underlying file. So, there won't be any data loss (unless something goes really wrong - such as power outage to the system).
Whether the that data will be immediately written to the physical file is another question. It may dependent on the filesystem (which may be buffering) and/or any hardware caching as well.
As far as your user program is concerned, a successful close() call can be considered as successful write to the file.
It may suggest that fsync does not have to precede close (or fclose, which contains a close), but can (even have to) come after it. So the close() cannot be very destructive...
After a call to close(), the state of the file descriptor is left unspecified by POSIX (regardless of whether close() succeeded). So, you are not allowed to use fsync(fd); after the calling close().
See: POSIX/UNIX: How to reliably close a file descriptor.
And no, it doesn't suggest close() can be destructive. It suggests that the C library may be doing its own buffering in the user and suggests to use fsync() to flush it to the kernel (and now, we are in the same position as said before).
Related
Assuming I have opened dev/poll as mDevPoll, is it safe for me to call code like this
struct pollfd tmp_pfd;
tmp_pfd.fd = fd;
tmp_pfd.events = POLLIN;
// Write pollfd to /dev/poll
write(mDevPoll, &tmp_pfd, sizeof(struct pollfd));
...simultaneously from multiple threads, or do I need to add my own synchronisation primitive around mDevPoll?
Solaris 10 claims to be POSIX compliant. The write() function is not among the handful of system interfaces that POSIX permits to be non-thread-safe, so we can conclude that that on Solaris 10, it is safe in a general sense to call write() simultaneously from two or more threads.
POSIX also designates write() among those functions whose effects are atomic relative to each other when they operate on regular files or symbolic links. Specifically, it says that
If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them.
If your writes were directed to a regular file then that would be sufficient to conclude that your proposed multi-thread actions are safe, in the sense that they would not interfere with one another, and the data written in one call would not be commingled with that written by a different call in any thread. Unfortunately, /dev/poll is not a regular file, so that does not apply directly to you.
You should also be aware that write() is not in general required to transfer the full number of bytes specified in a single call. For general purposes, one must therefore be prepared to transfer the desired bytes over multiple calls, by using a loop. Solaris may provide applicable guarantees beyond those expressed by POSIX, perhaps specific to the destination device, but absent such guarantees it is conceivable that one of your threads performs a partial write, and the next write is performed by a different thread. That very likely would not produce the results you want or expect.
It's not safe in theory, even though write() is completely thread-safe (barring implementation bugs...). Per the POSIX write() standard (emphasis mine):
.
The write() function shall attempt to write nbyte bytes from the
buffer pointed to by buf to the file associated with the open file
descriptor, fildes.
...
RETURN VALUE
Upon successful completion, these functions shall return the number of bytes actually written ...
There is no guarantee that you won't get a partial write(), so even if each individual write() call is atomic, it's not necessarily complete, so you could still get interleaved data because it may take more than one call to write() to completely write all data.
In practice, if you're only doing relatively small write() calls, you will likely never see a partial write(), with "small" and "likely" being indeterminate values dependent on your implementation.
I've routinely delivered code that uses unlocked single write() calls on regular files opened with O_APPEND in order to improve the performance of logging - build a log entry then write() the entire entry with one call. I've never seen a partial or interleaved write() result over almost a couple of decades of doing that on Linux and Solaris systems, even when many processes write to the same log file. But then again, it's a text log file and if a partial or interleaved write() does happen there would be no real damage done or even data lost.
In this case, though, you're "writing" a handful of bytes to a kernel structure. You can dig through the Solaris /dev/poll kernel driver source code at Illumos.org and see how likely a partial write() is. I'd suspect it's practically impossible - because I just went back and looked at the multiplatform poll class that I wrote for my company's software library a decade ago. On Solaris it uses /dev/poll and unlocked write() calls from multiple threads. And it's been working fine for a decade...
Solaris /dev/pool Device Driver Source Code Analysis
The (Open)Solaris source code can be found here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/devpoll.c#628
The dpwrite() function is the code in the /dev/poll driver that actually performs the "write" operation. I use quotes because it's not really a write operation at all - data isn't transferred as much as the data in the kernel that represents the set of file descriptors being polled is updated.
Data is copied from user space into kernel space - to a memory buffer obtained with kmem_alloc(). I don't see any possible way that can be a partial copy. Either the allocation succeeds or it doesn't. The code can get interrupted before doing anything, as it wait for exclusive write() access to the kernel structures.
After that, the last return call is at the end - and if there's no error, the entire call is marked successful, or the entire call fails on any error:
995 if (error == 0) {
996 /*
997 * The state of uio_resid is updated only after the pollcache
998 * is successfully modified.
999 */
1000 uioskip(uiop, copysize);
1001 }
1002 return (error);
1003}
If you dig through Solaris kernel code, you'll see that uio_resid is what ends up being the value returned by write() after a successful call.
So the call certainly appears to be all-or-nothing. While there appear to be ways for the code to return an error on a file descriptor after successfully processing an earlier descriptor when multiple descriptors are passed in, the code doesn't appear to return any partial success indications.
If you're only processing one file descriptor at a time, I'd say the /dev/poll write() operation is completely thread-safe, and it's almost certainly thread-safe for "writing" updates to multiple file descriptors as there's no apparent way for the driver to return a partial write() result.
This question already has answers here:
Does Linux guarantee the contents of a file is flushed to disc after close()?
(9 answers)
Closed 9 years ago.
When we call close(<fd>), does it automatically do fsync() to sync to the physical media?
It does not. Calling close() DOES NOT guarantee that contents are on the disk as the OS may have deferred the writes.
As a side note, always check the return value of close(). It will let you know of any deferred errors up to that point. And if you want to ensure that the contents are on the disk always call fsync() and check its return value as well.
One thing to keep in mind is what the backing store is. There are devices that may do internal write deferring and content can be lost in some cases (although newer storage media devices typically have super capacitors to prevent this, or ways to disable this feature).
No.
From man 2 close
A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel defers writes. It
is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is
physically stored use fsync(2). (It will depend on the disk hardware at this point.)
From man 2 close:
A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel defers writes. It is not
common for a file system to flush the buffers when the stream is
closed. If you need to be sure that the data is physically stored use
fsync(2). (It will depend on the disk hard-ware at this point.)
To answer your question, NO, close() does not guarantee fsync()
close only closes the file descriptor for the process and removes any record locks associated with the process.
What really happens when write() system call is executed?
Lets say I have a program which writes certain data into a file using write() function call. Now C library has its own internal buffer and OS too has its own buffer.
What interaction takes place between these buffers ?
Is it like when C library buffer gets filled completely, it writes to OS buffer and when OS buffer gets filled completely, then the actual write is done on the file?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.
The write() system call (in fact all system calls) are nothing more that a contract between the application program and the OS.
for "normal" files, the write() only puts the data on a buffer, and marks that buffer as "dirty"
at some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced by fsync()
this is done by the .write() "method" in the mounted-filesystem-table
and this will invoke the hardware's .write() method. (which could involve another level of buffering, such as DMA)
modern hard disks have there own buffers, which may or may not have actually been written to the physical disk, even if the OS->controller told them to.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() does depend on the nature of the file; for network sockets the write(fd,buff,size) can return before size bytes have been sent(write will return the number of characters sent). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (eg waiting for Nagle ...), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.
As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed it's possible for a power failure or something to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media; then you need to redesign your code to avoid this requirement, or accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).
write() writes data to operating system, making it visible for all processes (if it is something which can be read by other processes). How operating system buffers it, or when it gets written permanently to disk, that is very library, OS, system configuration and file system specific. However, sync() can be used to force buffers to be flushed.
What is quaranteed, is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.
OS dependant, see man 2 sync and (on Linux) the discussion in man 8 sync.
Years ago operating systems were supposed to implement an 'elevator algorithm' to schedule writes to disk. The idea would be to minimize the disk writing head movement, which would allow a good throughput for several processes accessing the disk at the same time.
Since you're asking for UNIX, you must keep in mind that a file might actually be on an FTP server, which you have mounted, as an example. For example files /dev and /proc are not files on the HDD, as well.
Also, on Linux data is not written to the hard drive directly, instead there is a polling process, that flushes all pending writes every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.
From what I've read, flush pushes data into the OS buffers and sync makes sure that data goes down to the storage media. So, if you want to be sure that data is actually written to disk, you need to do a flush followed by a sync. So, are there any cases where you want to call flush but not sync?
You only want to fflush if you're using stdio's FILE *. This writes a user space buffer to the kernel.
The other answers seem to be missing fdatasync. This is the system call you want to flush a specific file descriptor to disk.
When you fflush, you flush the buffer of one file to disk (unless you give NULL, in which case it flushes all open files). http://www.manpagez.com/man/3/fflush/
When you sync, you flush all the buffers to disk. http://www.manpagez.com/man/2/sync/
The most important thing that you should notice is that fflush is a standard function, while sync is a system call provided by the operating system (Linux for example).
So basically, if you are writing portable program, you in fact never use sync.
Yes, lots. Most programs most of the time would not bother to call any of the various sync operations; flushing the data into the kernel buffer pool as you close the file is sufficient. This is doubly true if you're using a journalled file system.
Note that flushing is a higher level operation than the read() or similar system calls. It is used by the C <stdio.h> library, or the C++ <iostream> library. The system calls inherently flush the data to the kernel buffer pool (or direct to disk if you're using direct I/O or something similar).
Note, too, that on POSIX-like systems, you can arrange for data sync etc by setting flags on the open() system call (O_SYNC, O_DSYNC, O_RSYNC), or subsequently via fcntl().
Just to clarify, fflush() applies only when using the FILE interface of UNIX that buffers writes at the application level. In case the normal write() call is used, fflush() makes little sense.
Having said that, I can think of two situations where you would like to call fflush() but not sync:
You want to make sure that the data will eventually make it to disk even though the application crashes.
Force to screen the data that the application has written to standard output so far.
The second case is the most common use I have seen and it is usually required if the printf() call does not end with a new line character ('\n').
I know that when you call fwrite or fprintf or rather any other function that writes to a file, the contents aren't immediately flushed to the disk, but buffered in the memory.
Firstly, where do the OS manage these buffers and how. Secondly, if you do the write to a file and later read in the content you wrote and assuming that the OS didn't flushed the contents between the time you wrote and read, how it knows that it has to return the read from the buffer? How does it handle this situation.
The reason I want to know this is that I'm interested in implementing my own buffering scheme in user-space, rather than kernel space as done by OS. That is, write to a file would be buffered in user-space and the actual write will only occur at a certain point. Consquently I also need to handle situations where read is called for the content that is still in the buffer. Is it possible to do all this in user-space.
Firstly, where do the OS manage these buffers and how
The functions fwrite and fprintf use stdio buffers which already are completely in userspace. The buffers are (likely) static arrays or perhaps malloced memory.
how it knows that it has to return the read from the buffer
It doesn't, so the changes aren't seen. Nothing actually happens to a file until the underlying system call (write) is called (and even then - read on).
Is it possible to do all this in user-space
No, it's not possible. The good news is that the kernel already has buffers so every write you do isn't atually translated into an actual write to the file. It is postponed and executed later. If in the meantime somebody tries to read from the file, the kernel is smart enough to serve him from the buffer.
Bits from TLPI:
When working with disk files, the read() and write() system calls
don’t directly ini- tiate disk access. Instead, they simply copy data
between a user-space buffer and a buffer in the kernel buffer cache.
When performing I/O on a disk file, a successful return from write()
doesn’t guarantee that the data has been transferred to disk, because
the kernel performs buffering of disk I/O in order to reduce disk
activity and expedite write() calls.
At some later point, the kernel writes (flushes) its buffer to the
disk.
If, in the interim, another process attempts to read these bytes of
the file, then the kernel automatically supplies the data from the
buffer cache, rather than from (the outdated contents of) the file.
So you may want to find out about sync and fsync.
Multiple levels of buffering are generally bad. The reason stdio buffers are useful is that they minimize the number of system calls performed. If a system call would be cheaper nobody would use stdio buffers anymore.