Atomicity of `write(2)` to a local filesystem

Apparently POSIX states that
Either a file descriptor or a stream is called a "handle" on the
open file description to which it refers; an open file description
may have several handles. […] All activity by the application
affecting the file offset on the first handle shall be suspended
until it again becomes the active file handle. […] The handles need
not be in the same process for these rules to apply.
-- POSIX.1-2008
and
If two threads each call [the write() function], each call shall
either see all of the specified effects of the other call, or none
of them.
-- POSIX.1-2008
My understanding of this is that when the first process issues a
write(handle, data1, size1) and the second process issues
write(handle, data2, size2), the writes can occur in either order,
but each of data1 and data2 must end up in the file pristine and contiguous.
But running the following code gives me unexpected results.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>
static void
die(char *s)
{
    perror(s);
    abort();
}

int
main(void)
{
    unsigned char buffer[3];
    char *filename = "/tmp/atomic-write.log";
    int fd, i, j;
    pid_t pid;

    unlink(filename);
    /* XXX Adding O_APPEND to the flags cures it. Why? */
    fd = open(filename, O_CREAT|O_WRONLY/*|O_APPEND*/, 0644);
    if (fd < 0)
        die("open failed");
    for (i = 0; i < 10; i++) {
        pid = fork();
        if (pid < 0)
            die("fork failed");
        else if (! pid) {
            j = 3 + i % (sizeof(buffer) - 2);
            memset(buffer, i % 26 + 'A', sizeof(buffer));
            buffer[0] = '-';
            buffer[j - 1] = '\n';
            for (i = 0; i < 1000; i++)
                if (write(fd, buffer, j) != j)
                    die("write failed");
            exit(0);
        }
    }
    while (wait(NULL) != -1)
        /* NOOP */;
    exit(0);
}
I tried running this on Linux and Mac OS X 10.7.4, and using grep -a
'^[^-]\|^..*-' /tmp/atomic-write.log shows that some writes are not
contiguous or overlap (Linux) or are plain corrupted (Mac OS X).
Adding the flag O_APPEND in the open(2) call fixes this
problem. Nice, but I do not understand why. POSIX says
O_APPEND
If set, the file offset shall be set to the end of the file prior to each write.
but this is not the problem here. My sample program never calls
lseek(2), and the processes share the same open file description and
thus the same file offset.
I have already read similar questions on Stackoverflow but they still
do not fully answer my question.
Atomic write on file from two process does not specifically
address the case where the processes share the same file description
(as opposed to the same file).
How does one programmatically determine if “write” system call is atomic on a particular file? says that
The write call as defined in POSIX has no atomicity guarantee at all.
But as cited above it does have some. And what’s more,
O_APPEND seems to trigger this atomicity guarantee although it seems
to me that this guarantee should be present even without O_APPEND.
Can you explain this behaviour further?

man 2 write on my system sums it up nicely:
Note that not all file systems are POSIX conforming.
Here is a quote from a recent discussion on the ext4 mailing list:
Currently concurrent reads/writes are atomic only wrt individual pages,
however are not on the system call. This may cause read() to return data
mixed from several different writes, which I do not think it is good
approach. We might argue that application doing this is broken, but
actually this is something we can easily do on filesystem level without
significant performance issues, so we can be consistent. Also POSIX
mentions this as well and XFS filesystem already has this feature.
This is a clear indication that ext4 -- to name just one modern filesystem -- doesn't conform to POSIX.1-2008 in this respect.

Edit: Updated Aug 2017 with latest changes in OS behaviours.
Firstly, O_APPEND or the equivalent FILE_APPEND_DATA on Windows means that increments of the maximum file extent (the file "length") are atomic under concurrent writers. This is guaranteed by POSIX, and Linux, FreeBSD, OS X and Windows all implement it correctly. Samba also implements it correctly; NFS before v5 does not, as it lacks the wire-format capability to append atomically. So if you open your file append-only, concurrent writes will not tear with respect to one another on any major OS unless NFS is involved.
This says nothing about whether reads will ever see a torn write though, and on that POSIX says the following about atomicity of read() and write() to regular files:
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2008 when they operate on
regular files or symbolic links ... [many functions] ... read() ...
write() ... If two threads each call one of these functions, each call
shall either see all of the specified effects of the other call, or
none of them. [Source]
and
Writes can be serialized with respect to other reads and writes. If a
read() of file data can be proven (by any means) to occur after a
write() of the data, it must reflect that write(), even if the calls
are made by different processes. [Source]
but conversely:
This volume of POSIX.1-2008 does not specify behavior of concurrent
writes to a file from multiple processes. Applications should use some
form of concurrency control. [Source]
A safe interpretation of all three of these requirements would suggest that all writes overlapping an extent in the same file must be serialised with respect to one another and to reads such that torn writes never appear to readers.
A less safe, but still allowed interpretation could be that reads and writes only serialise with each other between threads inside the same process, and between processes writes are serialised with respect to reads only (i.e. there is sequentially consistent i/o ordering between threads in a process, but between processes i/o is only acquire-release).
So how do popular OSes and filesystems perform on this? As the author of the proposed Boost.AFIO, an asynchronous filesystem and file i/o C++ library, I decided to write an empirical tester. The results are as follows, for many threads in a single process.
No O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = 1 byte until and including 10.0.10240, from 10.0.14393 at least 1Mb, probably infinite as per the POSIX spec.
Linux 4.2.6 with ext4: update atomicity = 1 byte
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.
O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = until and including 10.0.10240 up to 4096 bytes only if page aligned, otherwise 512 bytes if FILE_FLAG_WRITE_THROUGH off, else 64 bytes. Note that this atomicity is probably a feature of PCIe DMA rather than designed in. Since 10.0.14393, at least 1Mb, probably infinite as per the POSIX spec.
Linux 4.2.6 with ext4: update atomicity = at least 1Mb, probably infinite as per the POSIX spec. Note that earlier Linuxes with ext4 definitely did not exceed 4096 bytes, XFS certainly used to have custom locking but it looks like recent Linux has finally fixed this problem in ext4.
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.
So in summary, FreeBSD with ZFS and very recent Windows with NTFS are POSIX conforming. Very recent Linux with ext4 is POSIX conforming only with O_DIRECT.
You can see the raw empirical test results at https://github.com/ned14/afio/tree/master/programs/fs-probe. Note we test for torn offsets only on 512 byte multiples, so I cannot say if a partial update of a 512 byte sector would tear during the read-modify-write cycle.
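For reference, here is a minimal sketch of the core operation such a probe performs: one page-sized O_DIRECT write from a suitably aligned buffer. The 4096-byte size/alignment and the file path are assumptions for illustration; real O_DIRECT alignment requirements are filesystem- and kernel-specific.
#define _GNU_SOURCE            /* O_DIRECT is a GNU extension on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd;

    /* O_DIRECT requires buffer, offset and length to be aligned */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    memset(buf, 'x', 4096);

    fd = open("/tmp/odirect-probe", O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, buf, 4096) != 4096)
        perror("write");
    close(fd);
    free(buf);
    return 0;
}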

Some misinterpretation of what the standard mandates here comes from the use of processes vs. threads, and what that means for the "handle" situation you're talking about. In particular, you missed this part:
Handles can be created or destroyed by explicit user action, without affecting the underlying open file description. Some of the ways to create them include fcntl(), dup(), fdopen(), fileno(), and fork(). They can be destroyed by at least fclose(), close(), and the exec functions. [ ... ] Note that after a fork(), two handles exist where one existed before.
from the POSIX spec section you quote above. The reference to "create [ handles using ] fork" isn't elaborated on further in this section, but the spec for fork() adds a little detail:
The child process shall have its own copy of the parent's file descriptors. Each of the child's file descriptors shall refer to the same open file description with the corresponding file descriptor of the parent.
The relevant bits here are:
the child has copies of the parent's file descriptors
the child's copies refer to the same "thing" that the parent can access via said fds
file descriptors and file descriptions are not the same thing; in particular, a file descriptor is a handle in the above sense.
This is what the first quote refers to when it says "fork() creates [ ... ] handles" - they're created as copies, and therefore, from that point on, detached, and no longer updated in lockstep.
In your example program, every child process gets its very own copy, which starts in the same state, but after the act of copying, these file descriptors / handles have become independent instances, and therefore the writes race with each other. This is perfectly acceptable as far as the standard is concerned, because write() only guarantees:
On a regular file or other file capable of seeking, the actual writing of data shall proceed from the position in the file indicated by the file offset associated with fildes. Before successful return from write(), the file offset shall be incremented by the number of bytes actually written.
This means that while they all start the write at the same offset (because the fd copy was initialized that way), they might, even on success, all write different amounts (there's no guarantee by the standard that a write request of N bytes will write exactly N bytes; a call can succeed for anything 0 <= actual <= N), and because the ordering of the writes is unspecified, the whole example program above therefore has unspecified results. Even if the total requested amount is written, the standard above only says that the file offset is incremented - it does not say that it is atomically (once only) incremented, nor that the actual writing of the data happens in an atomic fashion.
One thing is guaranteed, though - you should never see anything in the file that was not either there before any of the writes, or that did not come from data written by one of the writes. If you do, that would be corruption, and a bug in the filesystem implementation. What you've observed above might well be exactly that ... if the final results can't be explained by re-ordering of parts of the writes.
The use of O_APPEND fixes this, because using that, again - see write(), does:
If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.
which is the "prior to" / "no intervening" serializing behaviour that you seek.
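To make that concrete, here is a minimal sketch of the questioner's scenario with the cure applied (the log path and record format are illustrative, not from the question): each child writes whole records through an O_APPEND descriptor, so every record lands atomically at the then-current end of file.
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* O_APPEND: the offset moves to EOF atomically before each write */
    int fd = open("/tmp/append-demo.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    int i;

    if (fd < 0)
        return 1;
    for (i = 0; i < 4; i++) {
        if (fork() == 0) {
            char line[32];
            int n = snprintf(line, sizeof line, "-child %d was here\n", i);
            write(fd, line, n);        /* one complete record per call */
            _exit(0);
        }
    }
    while (wait(NULL) != -1)
        ;                              /* reap all children */
    return 0;
}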
The use of threads would change the behaviour partially - because threads, on creation, do not receive copies of the file descriptors / handles but operate on the actual (shared) one, threads would not (necessarily) all start writing at the same offset. But the possibility of partial-write success still means that you may see interleaving in ways you might not want to see. Yet it would possibly still be fully standards-conformant.
Moral: Do not count on a POSIX/UNIX standard being restrictive by default. The specifications are deliberately relaxed in the common case, and require you as the programmer to be explicit about your intent.

You're misinterpreting the first part of the spec you cited:
Either a file descriptor or a stream is called a "handle" on the open file description to which it refers; an open file description may have several handles. […] All activity by the application affecting the file offset on the first handle shall be suspended until it again becomes the active file handle. […] The handles need not be in the same process for these rules to apply.
This does not place any requirements on the implementation to handle concurrent access. Instead, it places requirements on an application not to make concurrent access, even from different processes, if you want well-defined ordering of the output and side effects.
The only time atomicity is guaranteed is for pipes when the write size fits in PIPE_BUF.
By the way, even if the call to write were atomic for ordinary files, except in the case of writes to pipes that fit in PIPE_BUF, write can always return with a partial write (i.e. having written fewer than the requested number of bytes). This smaller-than-requested write would then be atomic, but it wouldn't help the situation at all with regards to atomicity of the entire operation (your application would have to re-call write to finish).
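The re-call that this answer mentions is usually packaged as a small retry loop. Here is a sketch of one (a common idiom, not something the standard mandates); note that looping necessarily gives up atomicity of the record as a whole, since each iteration is an independent write():
#include <errno.h>
#include <unistd.h>

/* Write all of buf, retrying after partial writes and EINTR.
 * Returns 0 on success, -1 on error (errno set by write()). */
static int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;

    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        p += n;                 /* skip past the bytes already written */
        count -= n;
    }
    return 0;
}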

Is there an official document that marks the read/write functions as thread-safe?

The man pages of read/write don't mention anything about their thread-safety.
According to this link,
I understood that these functions are thread-safe, but that comment does not point to an official document.
On the other hand, according to this link, which says:
The read() function shall attempt to read nbyte bytes
from the file associated with the open file descriptor,
fildes, into the buffer pointed to by buf.
The behavior of multiple concurrent reads on the same pipe, FIFO, or
terminal device is unspecified.
I concluded the read function is not thread-safe.
I am confused now. Please send me a link to an official document about the thread-safety of these functions.
I tested these functions with a pipe and there wasn't any problem (of course, I know I can't establish any certain result just by testing some examples).
Thanks in advance :)
The thread safe versions of read and write are pread and pwrite:
pread(2)
The pread() and pwrite() system calls are especially useful in
multithreaded applications. They allow multiple threads to perform
I/O on the same file descriptor without being affected by changes to
the file offset by other threads.
When two threads write() at the same time, the order in which the write calls complete is not specified, therefore the behaviour is unspecified (without synchronization).
read() and write() are not strictly thread-safe, and there is no documentation that says they are, as the location where the data is read from or written to can be modified by another thread.
Per the POSIX read documentation (note the bolded parts):
The read() function shall attempt to read nbyte bytes from the file associated with the open file descriptor, fildes, into the buffer pointed to by buf. The behavior of multiple concurrent reads on the same pipe, FIFO, or terminal device is unspecified.
That's the part you noticed - but that does not cover all possible types of file descriptors, such as regular files. It only applies to "pipe[s], FIFO[s]" and "terminal device[s]". This part covers almost everything else (weird things like "files" in /proc that are generated on the fly by the kernel are, well, weird and highly implementation-specific):
On files that support seeking (for example, a regular file), the read() shall start at a position in the file given by the file offset associated with fildes. The file offset shall be incremented by the number of bytes actually read.
Since the "file offset associated with fildes" is subject to modification from other threads in the process, the following code is not guaranteed to return the same results even given the exact same file contents and inputs for fd, offset, buffer, and bytes:
lseek( fd, offset, SEEK_SET );
read( fd, buffer, bytes );
Since both read() and write() depend upon a state (the current file offset) that can be modified at any moment by another thread, they are not thread-safe.
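By contrast, pread() takes the offset as an explicit parameter and neither consults nor updates the shared file offset, so the equivalent thread-safe read is a single call (a sketch, reusing fd, offset, buffer and bytes from the snippet above):
/* No shared state: the offset travels with the call, so another
 * thread's lseek()/read()/write() cannot perturb this read. */
pread( fd, buffer, bytes, offset );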
On some embedded file systems, or really old desktop systems that weren't designed to facilitate multitasking support (e.g. MS-DOS 3.0), an attempt to perform an fread() on one file while an fread() is being performed on another file may result in arbitrary system corruption.
Any modern operating system and language runtime will guarantee that such corruption won't occur as a result of operations performed on unrelated files, or when independent file descriptors are used to access the same file in ways that do not modify it. Functions like fread() and fwrite() will be thread-safe when used in that fashion.
The act of reading data from a disk file does not modify it, but reading data from many kinds of stream will modify them by removing data. If two threads both perform actions that modify the same stream, such actions may interfere with each other in unspecified ways even if such modifications are performed by fread() operations.

Will multi-threaded write() calls be interleaved?

If I have two threads, thread0 and thread1.
thread0 does:
const char *msg = "thread0 => 0000000000\n";
write(fd, msg, strlen(msg));
thread1 does:
const char *msg = "thread1 => 111111111\n";
write(fd, msg, strlen(msg));
Will the output interleave? E.g.
thread0 => 000000111
thread1 => 111111000
First, note that your question is "Will data be interleaved?", not "Are write() calls [required to be] atomic?" Those are different questions...
"TL;DR" summary:
write() to a pipe or FIFO less than or equal to PIPE_BUF bytes won't be interleaved
write() calls to anything else will fall somewhere in the range from "probably won't be interleaved" to "won't ever be interleaved", with the majority of implementations in the "almost certainly won't be interleaved" to "won't ever be interleaved" range.
Full Answer
If you're writing to a pipe or FIFO, your data will not be interleaved at all for write() calls for PIPE_BUF or less bytes.
Per the POSIX standard for write() (note the bolded part):
RATIONALE
...
An attempt to write to a pipe or FIFO has several major characteristics:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process.
This is useful when there are multiple writers sending data to a
single reader. Applications need to know how large a write request can
be expected to be performed atomically. This maximum is called
{PIPE_BUF}. This volume of POSIX.1-2008 does not say whether write
requests for more than {PIPE_BUF} bytes are atomic, but requires that
writes of {PIPE_BUF} or fewer bytes shall be atomic.
...
Applicability of POSIX standards to Windows systems, however, is debatable at best.
So, for pipes or FIFOs, data won't be interleaved up to PIPE_BUF bytes.
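Here is a minimal sketch of what that guarantee buys you (record contents are illustrative): two forked writers share one pipe, and because each record is far below PIPE_BUF (which POSIX requires to be at least 512 bytes), the reader never sees records interleaved.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    int w;

    if (pipe(pfd) != 0)
        return 1;
    for (w = 0; w < 2; w++) {
        if (fork() == 0) {
            char rec[64];
            int n = snprintf(rec, sizeof rec, "writer %d says hello\n", w);
            /* n is far below PIPE_BUF, so this write cannot be
             * interleaved with the other writer's record */
            write(pfd[1], rec, n);
            _exit(0);
        }
    }
    close(pfd[1]);                       /* parent keeps only the read end */
    char buf[256];
    ssize_t n;
    while ((n = read(pfd[0], buf, sizeof buf)) > 0)
        fwrite(buf, 1, n, stdout);       /* whole records, in some order */
    return 0;
}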
How does that apply to files?
First, file append operations have to be atomic. Per that same POSIX standard (again, note the bolded part):
If the O_APPEND flag of the file status flags is set, the file offset
shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation.
Also see Is file append atomic in UNIX?
So how does that apply to non-append write() calls?
Commonality of implementation. See the Linux read/write syscall implementations for an example. (Note that the "problem" is handed directly to the VFS implementation, though, so the answer might also be "It might very well depend on your file system...")
Most implementations of the write() system call inside the kernel are going to use the same code to do the actual data write for both append mode and "normal" write() calls - and for pwrite() calls, too. The only difference will be the source of the offset used - for "normal" write() calls the offset used will be the current file offset. For append write() calls the offset used will be the current end of the file. For pwrite() calls the offset used will be supplied by the caller (except that Linux is broken - it uses the current file size instead of the supplied offset parameter as the target offset for pwrite() calls on files opened in append mode. See the "BUGS" section of the Linux pwrite() man page.)
So appending data has to be atomic, and that same code will almost certainly be used for non-append write() calls in all implementations.
But the "write operation" in the append-must-be-atomic requirement is allowed to return less than the total number of bytes requested:
The write() function shall attempt to write nbyte bytes ...
Partial write() results are allowed even in append operations. But even then, the data that does get written must be written atomically.
What are the odds of a partial write()? That depends on what you're writing to. I've never seen a partial write() result to a file outside of the disk filling up or an actual hardware failure. Or even a partial read() result. I can't see any way for a write() operation that has all its data on a single page in kernel memory resulting in a partial write() in anything other than a disk full or hardware failure situation.
If you look at Is file append atomic in UNIX? again, you'll see that actual testing shows that append write() operations are in fact atomic.
So the answer to "Will multi-threaded write() calls be interleaved?" is, "No, the data will almost certainly not be interleaved for writes that are at or under 4KB (page size) as long as the data does not cross a page boundary in kernel space." And even crossing a page boundary probably doesn't change the odds all that much.
If you're writing small chunks of data, it depends on your willingness to deal with the almost-certain-to-never-happen-but-it-might-anyway result of interleaved data. If it's a text log file, I'd opine that it won't matter anyway.
And note that it's not likely to be any faster to use multiple threads to write to the same file - the kernel is likely going to lock things and effectively single-thread the actual write() calls anyway to ensure it can meet the atomicity requirements of writing to a pipe and appending to a file.

Is write() safe to be called from multiple threads simultaneously?

Assuming I have opened /dev/poll as mDevPoll, is it safe for me to call code like this
struct pollfd tmp_pfd;
tmp_pfd.fd = fd;
tmp_pfd.events = POLLIN;
// Write pollfd to /dev/poll
write(mDevPoll, &tmp_pfd, sizeof(struct pollfd));
...simultaneously from multiple threads, or do I need to add my own synchronisation primitive around mDevPoll?
Solaris 10 claims to be POSIX compliant. The write() function is not among the handful of system interfaces that POSIX permits to be non-thread-safe, so we can conclude that that on Solaris 10, it is safe in a general sense to call write() simultaneously from two or more threads.
POSIX also designates write() among those functions whose effects are atomic relative to each other when they operate on regular files or symbolic links. Specifically, it says that
If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them.
If your writes were directed to a regular file then that would be sufficient to conclude that your proposed multi-thread actions are safe, in the sense that they would not interfere with one another, and the data written in one call would not be commingled with that written by a different call in any thread. Unfortunately, /dev/poll is not a regular file, so that does not apply directly to you.
You should also be aware that write() is not in general required to transfer the full number of bytes specified in a single call. For general purposes, one must therefore be prepared to transfer the desired bytes over multiple calls, by using a loop. Solaris may provide applicable guarantees beyond those expressed by POSIX, perhaps specific to the destination device, but absent such guarantees it is conceivable that one of your threads performs a partial write, and the next write is performed by a different thread. That very likely would not produce the results you want or expect.
It's not safe in theory, even though write() is completely thread-safe (barring implementation bugs...). Per the POSIX write() standard (emphasis mine):
The write() function shall attempt to write nbyte bytes from the
buffer pointed to by buf to the file associated with the open file
descriptor, fildes.
...
RETURN VALUE
Upon successful completion, these functions shall return the number of bytes actually written ...
There is no guarantee that you won't get a partial write(), so even if each individual write() call is atomic, it's not necessarily complete, so you could still get interleaved data because it may take more than one call to write() to completely write all data.
In practice, if you're only doing relatively small write() calls, you will likely never see a partial write(), with "small" and "likely" being indeterminate values dependent on your implementation.
I've routinely delivered code that uses unlocked single write() calls on regular files opened with O_APPEND in order to improve the performance of logging - build a log entry then write() the entire entry with one call. I've never seen a partial or interleaved write() result over almost a couple of decades of doing that on Linux and Solaris systems, even when many processes write to the same log file. But then again, it's a text log file and if a partial or interleaved write() does happen there would be no real damage done or even data lost.
In this case, though, you're "writing" a handful of bytes to a kernel structure. You can dig through the Solaris /dev/poll kernel driver source code at Illumos.org and see how likely a partial write() is. I'd suspect it's practically impossible - because I just went back and looked at the multiplatform poll class that I wrote for my company's software library a decade ago. On Solaris it uses /dev/poll and unlocked write() calls from multiple threads. And it's been working fine for a decade...
Solaris /dev/poll Device Driver Source Code Analysis
The (Open)Solaris source code can be found here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/devpoll.c#628
The dpwrite() function is the code in the /dev/poll driver that actually performs the "write" operation. I use quotes because it's not really a write operation at all - data isn't transferred so much as the kernel data that represents the set of file descriptors being polled is updated.
Data is copied from user space into kernel space - to a memory buffer obtained with kmem_alloc(). I don't see any possible way that can be a partial copy. Either the allocation succeeds or it doesn't. The code can get interrupted before doing anything, as it waits for exclusive write() access to the kernel structures.
After that, the last return call is at the end - and if there's no error, the entire call is marked successful, or the entire call fails on any error:
995 if (error == 0) {
996 /*
997 * The state of uio_resid is updated only after the pollcache
998 * is successfully modified.
999 */
1000 uioskip(uiop, copysize);
1001 }
1002 return (error);
1003 }
If you dig through Solaris kernel code, you'll see that uio_resid is what ends up being the value returned by write() after a successful call.
So the call certainly appears to be all-or-nothing. While there appear to be ways for the code to return an error on a file descriptor after successfully processing an earlier descriptor when multiple descriptors are passed in, the code doesn't appear to return any partial success indications.
If you're only processing one file descriptor at a time, I'd say the /dev/poll write() operation is completely thread-safe, and it's almost certainly thread-safe for "writing" updates to multiple file descriptors as there's no apparent way for the driver to return a partial write() result.

How does the standard specify atomic write() to a regular file (not a pipe or FIFO)?

The POSIX standard specifies that a write of less than PIPE_BUF bytes to a pipe or FIFO is guaranteed to be atomic, that is, our write doesn't get mixed with other processes' writes. But I failed to find out what the standard specifies about regular files. Is it true that when we write less than PIPE_BUF bytes it is also guaranteed to be atomic? I want to know whether a regular file has such a limit at all. I mean, a pipe has a capacity, so that when a write to the pipe goes beyond its capacity, the kernel puts the writer to sleep and other processes get a chance to write, but a regular file doesn't seem to need such a limitation, am I right?
What I'm doing is several processes generate log to a file. Of course, with O_APPEND set.
Quote from http://pubs.opengroup.org/onlinepubs/9699919799/toc.htm (Single UNIX Specification, Version 4, 2010 Edition):
This volume of POSIX.1-2008 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control.
The specification does address the semantics of writes in the case of multiple readers, but as you can see from the above, the behaviour for multiple, concurrent writers is not defined by the specification.
Note that the above talks about files. For pipes and FIFOs the PIPE_BUF semantics apply: concurrent writes are guaranteed to be non-divisible up to PIPE_BUF bytes.
Write requests to a pipe or FIFO shall be handled in the same way as a regular file with the following exceptions:
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
For real file systems the situation is complex. Some local file systems may enforce atomic writes up to arbitrary sizes (memory limit) by locking a file handle during writing, some might not (I tried to look at ext4 logic, but lost track somewhere around http://lxr.linux.no/linux+v3.5.3/fs/jbd2/transaction.c#L147).
For non-local file systems the result is more or less up for grabs. Just don't try concurrent writing on a networked file system without some form of explicit locking (or unless you're positively, absolutely sure about the semantics of the network file system you're using).
BTW, O_APPEND guarantees that all writes by different processes go to the end of the file. However, as the SUS quote above notes, if the writes are really concurrent (occurring at the same time), then the behavior is undefined. On earlier uniprocessor and non-pre-emptive UNIXes this didn't really matter, as a call to write(2) completed before anyone else got a chance to write...
This question could be answered definitively for a specific combination of operating system (Linux?) and file system (ext4?). A general answer? As the SUS reads -- "undefined behavior".
I think this is useful to you: "the data written by writev() is written as a single block that is not intermingled with output from writes in other processes", so you can use writev().
Several writers to a file may mix things up. But files opened with O_APPEND have each write access appended atomically.
If you want to keep to the C stdio interface instead of the lower-level one, fopen the file with "a" or "a+" (which map to O_APPEND), set up a buffer large enough that there is no need to write inside your records, and use fsync to force the write when you are done building them. I'm not sure it is guaranteed by POSIX (C says nothing about that).
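A sketch of that stdio recipe (the path and buffer size are assumptions). One note: it is fflush() that drains the stdio buffer into the underlying write(2); fsync() additionally forces the data to stable storage.
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/tmp/stdio-append.log", "a");  /* "a" maps to O_APPEND */
    if (f == NULL)
        return 1;
    /* Buffer comfortably larger than any record, so stdio never
     * flushes in the middle of one */
    setvbuf(f, NULL, _IOFBF, 8192);
    fprintf(f, "event=%d status=%s\n", 42, "ok");
    fflush(f);      /* hand the whole record to write(2) in one go */
    fclose(f);
    return 0;
}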
There is the ultimate solution to all questions of atomicity: a mutex. Wrap your writes to the log file in a mutex and all will be done atomically.
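A sketch of that approach for threads within a single process (across processes you would need a process-shared mutex or the file locks discussed below; the names here are illustrative):
#include <pthread.h>
#include <string.h>
#include <unistd.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialise complete records; a production version would also loop
 * on partial writes. */
static void log_line(int fd, const char *line)
{
    pthread_mutex_lock(&log_lock);
    write(fd, line, strlen(line));
    pthread_mutex_unlock(&log_lock);
}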
A simpler solution might be to use the GLOG libraries from Google. A fantastic logging system, far better than anything I ever dreamed up, free, not-GPL, and atomic.
One way to interleave them safely would be to have all writers lock the file, write, and unlock.
Functions that can be used for locking are flock(), lockf(), and fcntl().
Beware that ALL writers must lock (and they should all use the same mechanism to do the locking) or one that doesn't bother getting a lock could still write at the same time as another that holds a lock.
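A sketch of that pattern using flock() (not in POSIX, but available on Linux and the BSDs; lockf() or fcntl() byte-range locks are the portable alternatives):
#include <string.h>
#include <sys/file.h>   /* flock() */
#include <unistd.h>

/* Every writer must call this; one unlocked writer defeats the scheme. */
static int locked_write(int fd, const char *buf, size_t len)
{
    ssize_t n;

    if (flock(fd, LOCK_EX) != 0)     /* block until we hold the lock */
        return -1;
    n = write(fd, buf, len);
    flock(fd, LOCK_UN);
    return (n == (ssize_t)len) ? 0 : -1;
}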
