This question already has answers here:
Atomicity of `write(2)` to a local filesystem
(4 answers)
Closed 4 years ago.
I was reading the APUE(Advanced Programming in the UNIX Environment), and come across this question when I see $3.11:
if (lseek(fd, 0L, 2) < 0) /* position to EOF */
err_sys("lseek error");
if (write(fd, buf, 100) != 100) /* and write */
err_sys("write error")
APUE says:
This works fine for a single process, but problems arise if multiple processes use this technique to append to the same file. .......The problem here is that our logical operation of ‘‘position to the end of file and
write’’ requires two separate function calls (as we’ve shown it). Any operation that requires more than one function call cannot be atomic, as there is always the possibility that the kernel might temporarily suspend the process between the two function calls.
It just says cpu will switch between function calls between lseek and write, I want to know if it will also switch in half write operation? Or rather, is write atomic? If threadA writes "aaaaa", threadB writes "bbbbb", will the result be "aabbbbbaaa"?
What's more,after that APUE says pread and pwrite are all atomic operations, does that mean these functions use mutex or lock internally to be atomic?
To call the Posix semantics "atomic" is perhaps an oversimplification. Posix requires that reads and writes occur in some order:
Writes can be serialized with respect to other reads and writes. If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. A similar requirement applies to multiple write operations to the same file position. This is needed to guarantee the propagation of data from write() calls to subsequent read() calls. (from the Rationale section of the Posix specification for pwrite and write)
The atomicity guarantee mentioned in APUE refers to the use of the O_APPEND flag, which forces writes to be performed at the end of the file:
If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.
With respect to pread and pwrite, APUE says (correctly, of course) that these interfaces allow the application to seek and perform I/O atomically; in other words, that the I/O operation will occur at the specified file position regardless of what any other process does. (Because the position is specified in the call itself, and does not affect the persistent file position.)
The Posix sequencing guarantee is as follows (from the Description of the write() and pwrite() functions):
After a write() to a regular file has successfully returned:
Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
Any subsequent successful write() to the same byte position in the file shall overwrite that file data.
As mentioned in the Rationale, this wording does guarantee that two simultaneous write calls (even in different unrelated processes) will not interleave data, because if data were interleaved during a write which will eventually succeed the second guarantee would be impossible to provide. How this is accomplished is up to the implementation.
It must be noted that not all filesystems conform to Posix, and modular OS design, which allows multiple filesystems to coexist in a single installation, make it impossible for the kernel itself to provide guarantees about write which apply to all available filesystems. Network filesystems are particularly prone to data races (and local mutexes won't help much either), as is mentioned as well by Posix (at the end of the paragraph quoted from the Rationale):
This requirement is particularly significant for networked file systems, where some caching schemes violate these semantics.
The first guarantee (about subsequent reads) requires some bookkeeping in the filesystem, because data which has been successfully "written" to a kernel buffer but not yet synched to disk must be made transparently available to processes reading from that file. This also requires some internal locking of kernel metadata.
Since writing to regular files is typically accomplished via kernel buffers and actually synching the data to the physical storage device is definitely not atomic, the locks necessary to provide these guarantee don't have to be very long-lasting. But they must be done inside the filesystem because nothing in the Posix wording limits the guarantees to simultaneous writes within a single threaded process.
Within a multithreaded process, Posix does require read(), write(), pread() and pwrite() to be atomic when they operate on regular files (or symbolic links). See Thread Interactions with Regular File Operations for a complete list of interfaces which must obey this requirement.
In Linux there are blocking and non-blocking system calls. The write is an example of blocking system call, which means the execution thread will be blocked until the write completes. So once the user process called write, it can not execute anything else until the system call is complete. So from user thread perspective it will behave like atomic [although at kernel level lot many things can happen and kernel execution of system call can be interrupted many times].
Related
Suppose two different processes open the same file independently, and so have different entries in the Open file table (system-wide). But they refer to the same i-node entry.
As the file descriptors refer to the different entries in the Open file table (system-wide), then they may have different file offset. Will be there any chance for race condition during write as the file offset is different? And how does the kernel avoid it?
Book: The Linux Programming Interface; Page no. 95; Chapter-5 (File I/O: Further details); Section 5.4
(I'm assuming because you used write() that the question refers to POSIX systems.)
Each write() operation is supposed to be fully atomic, assuming a POSIX system (presumed from the use of write()).
Per POSIX 7's 2.9.7 Thread Interactions with Regular File Operations:
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2017 when they operate on
regular files or symbolic links:
chmod()
chown()
close()
creat()
dup2()
fchmod()
fchmodat()
fchown()
fchownat()
fcntl()
fstat()
fstatat()
ftruncate()
lchown()
link()
linkat()
lseek()
lstat()
open()
openat()
pread()
read()
readlink()
readlinkat()
readv()
pwrite()
rename()
renameat()
stat()
symlink()
symlinkat()
truncate()
unlink()
unlinkat()
utime()
utimensat()
utimes()
write()
writev()
If two threads each call one of these functions, each call shall
either see all of the specified effects of the other call, or none of
them. The requirement on the close() function shall also apply
whenever a file descriptor is successfully closed, however caused (for
example, as a consequence of calling close(), calling dup2(), or of
process termination).
But pay particular attention to the specification for write() (bolding mine):
The write() function shall attempt to write nbyte bytes ...
POSIX says that write() calls to a file shall be atomic. POSIX does not say that the write() calls will be complete. Here's a Linux bug report where a signal was interrupting a write() that was partially complete. Note the explanation:
Now this is perfectly valid behavior as far as spec (POSIX, SUS,...) is concerned (please correct me if I'm missing something). So I'd say the program is incorrect. But OTOH I agree that this was not possible before a50527b1 and we don't want to break userspace. I'd hate to revert that commit since it allows us to interrupt processes doing large writes (especially when something goes wrong) but if you explain to us why this behavior is a problem for you then I guess I'll have to revert it.
That's all but admitting that there's a POSIX requirement for write() calls to be atomic, if not complete, with an offer to revert back to earlier behavior where the write() calls apparently were all also complete in this same circumstance.
Note, though, there are lots of file systems out there that don't conform to POSIX standards.
As the file descriptors refer to the different entries in the Open file table (system-wide), then they may have different file offset. Will be there any chance for race condition during write as the file offset is different?
Any write() in Linux can return a short count, for example due to a signal being delivered to an userspace handler. For simplicity, let's ignore that, and only consider what happens to the successfully written data.
There are two scenarios:
The regions written to do not overlap.
(For example, one process writes 100 bytes starting at offset 23, and another writes 50 bytes starting at offset 200.)
There is no race condition in this case.
The regions written to do overlap.
(For example, one process writes 100 bytes starting at offset 50, and another writes 10 bytes starting at offset 70.)
There is a race condition. It is impossible to predict (without advisory locks etc.) the order in which the data gets updated.
Depending on the target filesystem, and if the writes are large enough (so that paging effects can be observed), the two writes may even be "mixed" (in page-sized chunks) in Linux on some filesystems on machines with more than one hardware thread, even though POSIX says this shouldn't happen.
Normally, writes go through the Linux page cache. It is possible for one of the processes to have opened the file with O_DIRECT | O_SYNC, bypassing the page cache. In that case, there are many additional corner cases that can occur. Specifically, even if you use a shared clock source, and can show that the normal/page-cached write completed before the direct write call was made, it may still be possible for the page-cached write to overwrite the direct write contents.
And how does the kernel avoid it?
It doesn't. Why should it? POSIX says each write is atomic, but there is no practical way to avoid a race condition relying on that alone (and get consistent and expected results).
Userspace programs have at least four different methods to avoid such races:
Advisory file locks on the entire open file using the flock() interface.
Advisory file locks on the entire open file using the lockf() interface. In Linux, these are just shorthand for placing/removing fcntl() advisory locks on the entire file.
Advisory record locks on the file using the fcntl() interface. This works even across shared volumes, as long as the file server is configured to support file locking.
Obtaining an exclusive lease on the open file using the fcntl() interface.
Advisory file locks are like street lights: they are intended for co-operating processes to easily determine who gets to go when. However, they do not stop any other process from actually ignoring the "lock" and accessing the file.
File leases are a mechanism, where one or more processes can get a read lease at the same time on the same file, but only one process can get a write lease and only when that process is the only one having the file open. When granted, the write lease (or exclusive lease) means that if any other process tries to open the same file, the lease owner process is notified by a signal (that you can control using the fcntl() interface), and has a configured time (typically 45 seconds; see man 5 proc and /proc/sys/fs/lease-break-time, in seconds) to relinguish the lease. The opener is blocked in the kernel until the lease is downgraded or the lease break time passes, in which case the kernel breaks the lease.
This allows the lease holder to postpone the opening for a short while.
However, the lease holder cannot block the opening, and cannot e.g. replace the file with a decoy one; the opener already has a hold on the inode, and the lease break time is just a grace period for cleanup work.
Technically, a fifth method would be mandatory file locking, but aside from the kernel use wrt. executed binaries, they're not used, and are actually buggy in Linux anyway. In Linux, inodes are only locked against modification when that inode is being executed as a binary by the kernel. (You can still rename or delete the original file, and create a new one, so that any subsequent execs will execute the modified/new data. Attempts to modify a file that is being executed as a binary file will fail with error EBUSY.)
The POSIX specification for fcntl() states:
All locks associated with a file for a given process shall be removed when a file descriptor for that file is closed by that process or the process holding that file descriptor terminates.
Is this operation of unlocking the file segment locks that were held by a terminated process atomic per-file? In other words, if a process had locked byte segments B1..B2 and B3..B4 of a file but did not unlock the segments before terminating, when the system gets around to unlocking them, are segments B1..B2 and B3..B4 both unlocked before another fcntl() operation to lock a segment of the file can succeed? If not atomic per-file, does the order in which these file segments are unlocked by the system depend on the order in which the file segments were originally acquired?
The specification for fcntl() does not say, but perhaps there is a general provision in the POSIX specification that mandates a deterministic order on operations to clean up after a process that exits uncleanly or crashes.
There's a partial answer in section 2.9.7, Thread Interactions with Regular File Operations, of the POSIX specification:
All of the functions chmod(), close(), fchmod(), fcntl(), fstat(), ftruncate(), lseek(), open(), read(), readlink(), stat(), symlink(), and write() shall be atomic with respect to each other in the effects specified in IEEE Std 1003.1-2001 when they operate on regular files. If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them.
So, for a regular file, if a thread of a process holds locks on segments of a file and calls close() on the last file descriptor associated with the file, then the effects of close() (including removing all outstanding locks on the file that are held by the process) are atomic with respect to the effects of a call to fcntl() by a thread of another process to lock a segment of the file.
The specification for exit() states:
These functions shall terminate the calling process with the following consequences:
All of the file descriptors, directory streams[, conversion descriptors, and message catalog descriptors] open in the calling process shall be closed.
...
Presumably, open file descriptors are closed as if by appropriate calls to close(), but unfortunately the specification does not say how open file descriptors are "closed".
The 2004 specification seems even more vague when it comes to the steps of abnormal process termination. The only thing I could find is the documentation for abort(). At least with the 2008 specification, there is a section titled Consequences of Process Termination on the page for _Exit(). The wording, though, is still:
All of the file descriptors, directory streams, conversion descriptors, and message catalog descriptors open in the calling process shall be closed.
UPDATE: I just opened issue 0000498 in the Austin Group Defect Tracker.
I don't think the POSIX specification stipulates whether the releasing of locks is atomic or not, so you should assume that it behaves as inconveniently as possible for you. If you need them to be atomic, they aren't; if you need them to be handled separately, they're atomic; if you don't care, some machines will do it one way and other machines the other way. So, write your code so that it doesn't matter.
I'm not sure how you'd write code to detect the problem.
In practice, I expect that the locks would be released atomically, but the standard doesn't say, so you should not assume.
Does the OS handle it correctly?
Or will I have to call flock()?
Although the OS won't crash, and the filesystem won't be corrupted, calls to write() are NOT guarenteed to be atomic, unless the file descriptor in question is a pipe, and the amount of data to be written is PIPE_MAX bytes or less. The relevant part of the standard:
An attempt to write to a pipe or FIFO has several major characteristics:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
[...]
As such, in principle, you must lock with simultaneous writers, or your written data may get mixed up and out of order (even within the same write) or you may have multiple writes overwriting each other. However, there is an exception - if you pass O_APPEND, your writes will be effectively atomic:
If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.
Although this is not necessarily atomic with respect to non-O_APPEND writes, or simultaneous reads, if all writers use O_APPEND, and you synchronize somehow before doing a read, you should be okay.
write (and writev, too) guarantee atomicity.
Which means if two threads or processes write simultaneously, you do not have a guarantee which one writes first. But you do have the guarantee that anything that is in one syscall will not be intermingled with data from the other one.
Insofar it will always work correctly, but not necessarily in the way you expect (if you assume that process A comes before process B).
Of course the kernel will handle it correctly, for the kernel’s idea of correctness — which is by definition correct.
If you have a set of coöperating flockers, then you can use the kernel to queue everyone up. But remember that flock has nothing to do with I/O: it will not stop someone else from writing the file. It will at most only interfere with other flockers.
Yes of course it will work correctly. It won't crash the OS or the process.
Whether it makes any sense, depends on the way the application(s) are written an what the file's purpose is.
If the file is opened by all processes as append-only, each process (notionally) does an atomic seek-to-end before each write; these are guaranteed not to overwrite each others' data (but of course, the order is nondeterministic).
In any case, if you use a library which potentially splits a single logical write into several write syscalls, expect trouble.
write(), writev(), read(), readv() can generate partial writes/reads where the amount of data transferred is smaller than what was requested.
Quoting the Linux man page for writev():
Note that is not an error for a successful call to transfer fewer bytes than requested
Quoting the POSIX man page:
If write() is interrupted by a signal after it successfully writes some data, it shall return the number of bytes written.
AFAIU, O_APPEND does not help in this regard because it does not prevent partial writes: it only ensures that whatever data is written is appended at the end of the file.
See this bug report from the Linux kernel:
A process is writing a messages to the file. [...] the writes [...] can be split in two. [...] So if the signal arrives [...] the write is interrupted. [...] this is perfectly valid behavior as far as spec (POSIX, SUS,...) is concerned
FIFOs and PIPE writes smaller than PIPE_MAX however are guaranteed to be atomic.
Is there a problem with using pread on the same file descriptor from 2 or more different threads at the same time?
pread itself is thread-safe, since it is not on the list of unsafe functions. So it is safe to call it.
The real question is: what happens if you read from the same file concurrently (not necessarily from two threads, but also from two processes).
Regarding this, the specification says:
The behavior of multiple concurrent reads on the same pipe, FIFO, or terminal device is unspecified.
Note that it doesn't mention ordinary files. This bit relates only to read anyway, because pread cannot be used on unseekable files.
I/O is intended to be atomic to ordinary files and pipes and FIFOs.
But this is from the non-normative section, so your OS might do it differently. E.g., if you read from two threads and there is a concurrent write, you might get different pieces of the write in your two read buffers. But this kind of problem is not specific to multithreading.
Also nice to know that in some cases
read() shall block the calling thread
Not the process, just the thread. And
A thread that has blocked shall not prevent any unblocked thread [...] from eventually making forward progress
As we are using same fd, we have to bind a lock otherwise there will be mix of data from the two pread on the file descriptor.
Hence yes there is a problem in doing this
http://linux.die.net/man/2/pread
I'm not 100% sure but I think that the file descriptor structure itself isn't thread safe, so two concurrent changes to it would corrupt it. You need some kind of locking.
Is writing to stdout using printf thread-safe on Linux? What about using the lower-level write command?
It's not specified by the C standard -- it depends on your implementation of the C standard library. In fact, the C standard doesn't even mention threads at all, since certain systems (e.g. embedded systems) don't have multithreading.
In the GNU implementation (glibc), most of the higher-level functions in stdio that deal with FILE* objects are thread-safe. The ones that aren't usually have unlocked in their names (e.g. getc_unlocked(3)). However, the thread safety is at a per-function call level: if you make multiple calls to printf(3), for example, each of those calls is guaranteed to output atomically, but other threads might print things out between your calls to printf(). If you want to ensure that a sequence of I/O calls gets output atomically, you can surround them with a pair of flockfile(3)/funlockfile(3) calls to lock the FILE handle. Note that these functions are reentrant, so you can safely call printf() in between them, and that won't result in deadlock even thought printf() itself makes a call to flockfile().
The low-level I/O calls such as write(2) should be thread-safe, but I'm not 100% sure of that - write() makes a system call into the kernel to perform I/O. How exactly this happens depends on what kernel you're using. It might be the sysenter instruction, or the int (interrupt) instruction on older systems. Once inside the kernel, it's up to the kernel to make sure that the I/O is thread-safe. In a test I just did with the Darwin Kernel Version 8.11.1, write(2) appears to be thread-safe.
Whether you'd call it "thread-safe" depends on your definition of thread-safe. POSIX requires stdio functions to use locking, so your program will not crash, corrupt the FILE object states, etc. if you use printf simultaneously from multiple threads. However, all stdio operations are formally specified in terms of repeated calls to fgetc and fputc, so there is no larger-scale atomicity guaranteed. That is to say, if threads 1 and 2 try to print "Hello\n" and "Goodbye\n" at the same time, there's no guarantee that the output will be either "Hello\nGoodbye\n" or "Goodbye\nHello\n". It could just as well be "HGelolodboy\ne\n". In practice, most implementations will acquire a single lock for the entire higher-level write call simply because it's more efficient, but your program should not assume so. There may be corner cases where this is not done; for instance an implementation could probably entirely omit locking on unbuffered streams.
Edit: The above text about atomicity is incorrect. POSIX guarantees all stdio operations are atomic, but the guarantee is hidden in the documentation for flockfile: http://pubs.opengroup.org/onlinepubs/9699919799/functions/flockfile.html
All functions that reference ( FILE *) objects shall behave as if they use flockfile() and funlockfile() internally to obtain ownership of these ( FILE *) objects.
You can use the flockfile, ftrylockfile, and funlockfile functions yourself to achieve larger-than-single-function-call atomic writes.
They are both thread-safe to the point that your application won't crash if multiple threads call them on the same file descriptor. However, without some application-level locking, whatever is written could be interleaved.
C got a new standard since this question was asked (and last answered).
C11 now comes with multithreading support and addresses multithreaded behavior of streams:
§7.21.2 Streams
¶7 Each stream has an associated lock that is used to prevent data races when multiple threads of execution access a stream, and to restrict the interleaving of stream operations performed by multiple threads. Only one thread may hold this lock at a time. The lock is reentrant: a single thread may hold the lock multiple times at a given time.
¶8 All functions that read, write, position, or query the position of a stream lock the stream before accessing it. They release the lock associated with the stream when the access is complete.
So, an implementation with C11 threads must guarantee that using printf is thread-safe.
Whether atomicity (as in no interleaving1) is guaranteed, wasn't that clear to me at a first glance, because the standard spoke of restricting interleaving, as opposed to preventing, which it mandated for data races.
I lean towards it being guaranteed. The standard speaks of restricting interleaving, as some interleaving that doesn't change the outcome is still allowed to happen; e.g. fwrite some bytes, fseek back some more and fwrite till the original offset, so that both fwrites are back-to-back. The implementation is free to reorder these 2 fwrites and merge them into a single write.
1: See the strike-through text in R..'s answer for an example.
It's thread-safe; printf should be reentrant, and you won't cause any strangeness or corruption in your program.
You can't guarantee that your output from one thread won't start half way through the output from another thread. If you care about that you need to develop your own locked output code to prevent multiple access.