How to (f)sync a directory under Linux in C

I have a C application under Linux. I'm renaming some files with rename(...).
How can I ensure that the rename is persistently written to the underlying disk?
With a file I can do something like:
FILE * f = fopen("foo","w");
...
fflush(f);
fsync(fileno(f));
fclose(f);
How can I fsync (or similar) a directory after a rename() in c?

This is how you can do what you want:
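#include <fcntl.h>
#include <unistd.h>
int fd = open("/path/to/dir", O_RDONLY); /* open the directory itself */
fsync(fd); /* flush its entries, including the rename, to disk */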
Don't forget to close the fd file descriptor when no longer needed of course.
Contrary to some misconceptions, the atomicity of rename() does not guarantee the file will be persisted to disk. The atomicity guarantee only ensures that the metadata in the file system buffers is in a consistent state but not that it has been persisted to disk.
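Putting the pieces above together, a minimal sketch of a rename followed by a directory fsync might look like the following. The helper names `fsync_dir` and `durable_rename` are made up for illustration, and the sketch assumes that both paths live in the same directory, which the caller passes in as `dir_path`.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: flush a directory's entries to disk.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int fsync_dir(const char *dir_path)
{
    int fd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Hypothetical helper: rename old_path to new_path, then fsync the
 * directory that contains them. Assumes both paths are in dir_path. */
int durable_rename(const char *old_path, const char *new_path,
                   const char *dir_path)
{
    if (rename(old_path, new_path) < 0)
        return -1;
    return fsync_dir(dir_path);
}
```

If the old and new names are in different directories, both directories would presumably need the same treatment.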

rename() is atomic (on linux), so I don't think you need to worry about that
Atomicity is typically guaranteed in operations involving filename handling ; for example, for rename, “specification requires that the action of the function be atomic” – that is, when renaming a file from the old name to the new one, at no circumstances should you ever see the two files at the same time.
a power outage in the middle of a rename() operation shall not leave the filesystem in a “weird” state, with the filename being unreachable because its metadata has been corrupted. (ie. either the operation is lost, or the operation is committed.)
Source
So I think you only need to worry about the return value.
If you really want to be safe, fsync() also flushes metadata (on Linux), so you could fsync both the file and its directory to be sure they are present on the disk.

According to the manual, when rename() returns, either the rename has taken effect (return value 0) or an error occurred (return value -1, with errno set to indicate what went wrong).
If you want the system to flush pending modifications after the rename, you can call syncfs(). Note that it flushes the whole filesystem containing the file, not only this file:
int fd = open(new_name, O_RDONLY);
syncfs(fd);

Related

Atomicity of `write(2)` on a file opened with the `O_APPEND` flag [duplicate]

Apparently POSIX states that
Either a file descriptor or a stream is called a "handle" on the
open file description to which it refers; an open file description
may have several handles. […] All activity by the application
affecting the file offset on the first handle shall be suspended
until it again becomes the active file handle. […] The handles need
not be in the same process for these rules to apply.
-- POSIX.1-2008
and
If two threads each call [the write() function], each call shall
either see all of the specified effects of the other call, or none
of them.
-- POSIX.1-2008
My understanding of this is that when the first process issues a
write(handle, data1, size1) and the second process issues
write(handle, data2, size2), the writes can occur in any order but
the data1 and data2 must be both pristine and contiguous.
But running the following code gives me unexpected results.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

static void die(const char *s)
{
    perror(s);
    abort();
}

int main(void)
{
    unsigned char buffer[3];
    const char *filename = "/tmp/atomic-write.log";
    int fd, i, j;
    pid_t pid;

    unlink(filename);
    /* XXX Adding O_APPEND to the flags cures it. Why? */
    fd = open(filename, O_CREAT|O_WRONLY/*|O_APPEND*/, 0644);
    if (fd < 0)
        die("open failed");
    for (i = 0; i < 10; i++) {
        pid = fork();
        if (pid < 0)
            die("fork failed");
        else if (! pid) {
            /* child: write 1000 records of the form "-X\n" */
            j = 3 + i % (sizeof(buffer) - 2);
            memset(buffer, i % 26 + 'A', sizeof(buffer));
            buffer[0] = '-';
            buffer[j - 1] = '\n';
            for (i = 0; i < 1000; i++)
                if (write(fd, buffer, j) != j)
                    die("write failed");
            exit(0);
        }
    }
    /* parent: wait for all children */
    while (wait(NULL) != -1)
        /* NOOP */;
    exit(0);
}
I tried running this on Linux and Mac OS X 10.7.4 and using grep -a
'^[^-]\|^..*-' /tmp/atomic-write.log shows that some writes are not
contiguous or overlap (Linux) or plain corrupted (Mac OS X).
Adding the flag O_APPEND in the open(2) call fixes this
problem. Nice, but I do not understand why. POSIX says
O_APPEND
If set, the file offset shall be set to the end of the file prior to each write.
but this is not the problem here. My sample program never does
lseek(2) but share the same file description and thus same file
offset.
I have already read similar questions on Stackoverflow but they still
do not fully answer my question.
Atomic write on file from two process does not specifically
address the case where the processes share the same file description
(as opposed to the same file).
How does one programmatically determine if “write” system call is atomic on a particular file? says that
The write call as defined in POSIX has no atomicity guarantee at all.
But as cited above it does have some. And what’s more,
O_APPEND seems to trigger this atomicity guarantee although it seems
to me that this guarantee should be present even without O_APPEND.
Can you explain this behaviour further?
man 2 write on my system sums it up nicely:
Note that not all file systems are POSIX conforming.
Here is a quote from a recent discussion on the ext4 mailing list:
Currently concurrent reads/writes are atomic only wrt individual pages,
however are not on the system call. This may cause read() to return data
mixed from several different writes, which I do not think it is good
approach. We might argue that application doing this is broken, but
actually this is something we can easily do on filesystem level without
significant performance issues, so we can be consistent. Also POSIX
mentions this as well and XFS filesystem already has this feature.
This is a clear indication that ext4 -- to name just one modern filesystem -- doesn't conform to POSIX.1-2008 in this respect.
Edit: Updated Aug 2017 with latest changes in OS behaviours.
Firstly, O_APPEND or the equivalent FILE_APPEND_DATA on Windows means that increments of the maximum file extent (file "length") are atomic under concurrent writers. This is guaranteed by POSIX, and Linux, FreeBSD, OS X and Windows all implement it correctly. Samba also implements it correctly; NFS before v5 does not, as it lacks the wire format capability to append atomically. So if you open your file with append-only, concurrent writes will not tear with respect to one another on any major OS unless NFS is involved.
This says nothing about whether reads will ever see a torn write though, and on that POSIX says the following about atomicity of read() and write() to regular files:
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2008 when they operate on
regular files or symbolic links ... [many functions] ... read() ...
write() ... If two threads each call one of these functions, each call
shall either see all of the specified effects of the other call, or
none of them. [Source]
and
Writes can be serialized with respect to other reads and writes. If a
read() of file data can be proven (by any means) to occur after a
write() of the data, it must reflect that write(), even if the calls
are made by different processes. [Source]
but conversely:
This volume of POSIX.1-2008 does not specify behavior of concurrent
writes to a file from multiple processes. Applications should use some
form of concurrency control. [Source]
A safe interpretation of all three of these requirements would suggest that all writes overlapping an extent in the same file must be serialised with respect to one another and to reads such that torn writes never appear to readers.
A less safe, but still allowed interpretation could be that reads and writes only serialise with each other between threads inside the same process, and between processes writes are serialised with respect to reads only (i.e. there is sequentially consistent i/o ordering between threads in a process, but between processes i/o is only acquire-release).
So how do popular OSes and filesystems perform on this? As the author of the proposed Boost.AFIO, an asynchronous filesystem and file i/o C++ library, I decided to write an empirical tester. The results are as follows for many threads in a single process.
No O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = 1 byte until and including 10.0.10240, from 10.0.14393 at least 1Mb, probably infinite as per the POSIX spec.
Linux 4.2.6 with ext4: update atomicity = 1 byte
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.
O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = until and including 10.0.10240 up to 4096 bytes only if page aligned, otherwise 512 bytes if FILE_FLAG_WRITE_THROUGH off, else 64 bytes. Note that this atomicity is probably a feature of PCIe DMA rather than designed in. Since 10.0.14393, at least 1Mb, probably infinite as per the POSIX spec.
Linux 4.2.6 with ext4: update atomicity = at least 1Mb, probably infinite as per the POSIX spec. Note that earlier Linuxes with ext4 definitely did not exceed 4096 bytes, XFS certainly used to have custom locking but it looks like recent Linux has finally fixed this problem in ext4.
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.
So in summary, FreeBSD with ZFS and very recent Windows with NTFS are POSIX conforming. Very recent Linux with ext4 is POSIX conforming only with O_DIRECT.
You can see the raw empirical test results at https://github.com/ned14/afio/tree/master/programs/fs-probe. Note we test for torn offsets only on 512 byte multiples, so I cannot say if a partial update of a 512 byte sector would tear during the read-modify-write cycle.
Some misinterpretation of what the standard mandates here comes from the use of processes vs. threads, and what that means for the "handle" situation you're talking about. In particular, you missed this part:
Handles can be created or destroyed by explicit user action, without affecting the underlying open file description. Some of the ways to create them include fcntl(), dup(), fdopen(), fileno(), and fork(). They can be destroyed by at least fclose(), close(), and the exec functions. [ ... ] Note that after a fork(), two handles exist where one existed before.
from the POSIX spec section you quote above. The reference to "create [ handles using ] fork" isn't elaborated on further in this section, but the spec for fork() adds a little detail:
The child process shall have its own copy of the parent's file descriptors. Each of the child's file descriptors shall refer to the same open file description with the corresponding file descriptor of the parent.
The relevant bits here are:
the child has copies of the parent's file descriptors
the child's copies refer to the same "thing" that the parent can access via said fds
file descriptors and file descriptions are not the same thing; in particular, a file descriptor is a handle in the above sense.
This is what the first quote refers to when it says "fork() creates [ ... ] handles" - they're created as copies, and therefore, from that point on, detached, and no longer updated in lockstep.
In your example program, every child process gets its very own copy which starts at the same state, but after the act of copying, these file descriptors / handles have become independent instances, and therefore the writes race with each other. This is perfectly acceptable regarding the standard, because write() only guarantees:
On a regular file or other file capable of seeking, the actual writing of data shall proceed from the position in the file indicated by the file offset associated with fildes. Before successful return from write(), the file offset shall be incremented by the number of bytes actually written.
This means that while they all start the write at the same offset (because the fd copy was initialized as such) they might, even if successful, all write different amounts (there's no guarantee by the standard that a write request of N bytes will write exactly N bytes; it can succeed for anything 0 <= actual <= N), and because the ordering of the writes is unspecified, the whole example program above has unspecified results. Even if the total requested amount is written, all the standard above says is that the file offset is incremented. It does not say it's atomically (once only) incremented, nor does it say that the actual writing of data will happen in an atomic fashion.
One thing is guaranteed though - you should never see anything in the file that has not either been there before any of the writes, or that had not come from either of the data written by any of the writes. If you do, that'd be corruption, and a bug in the filesystem implementation. What you've observed above might well be that ... if the final results can't be explained by re-ordering of parts of the writes.
The use of O_APPEND fixes this, because using that, again - see write(), does:
If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.
which is the "prior to" / "no intervening" serializing behaviour that you seek.
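As a rough illustration of that serializing behaviour, here is a sketch of the question's experiment with O_APPEND switched on, plus a checker that counts malformed records. The function names `run_appenders` and `count_torn_records` and the fixed 3-byte record format are assumptions made for this example, not part of the original program.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork nproc children that each append nwrites
 * records of the form "-X\n" through one shared O_APPEND descriptor.
 * Returns 0 if every child succeeded, -1 otherwise. */
int run_appenders(const char *path, int nproc, int nwrites)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    int forked = 0, failed = 0;
    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {
            /* child: append its records, then leave without flushing stdio */
            char rec[3] = { '-', (char)('A' + i % 26), '\n' };
            for (int k = 0; k < nwrites; k++)
                if (write(fd, rec, sizeof rec) != (ssize_t)sizeof rec)
                    _exit(1);
            _exit(0);
        }
        forked++;
    }
    close(fd);

    /* parent: reap every child that was actually started */
    for (int i = 0; i < forked; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }
    return failed ? -1 : 0;
}

/* Hypothetical helper: count lines that are not exactly "-X\n".
 * Stores the number of lines read in *total; returns -1 if the file
 * cannot be opened. */
long count_torn_records(const char *path, long *total)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char line[16];
    long bad = 0;
    *total = 0;
    while (fgets(line, sizeof line, f)) {
        (*total)++;
        if (strlen(line) != 3 || line[0] != '-' || line[2] != '\n')
            bad++;
    }
    fclose(f);
    return bad;
}
```

Calling `run_appenders(path, 10, 1000)` and then `count_torn_records(path, &total)` mirrors the question's setup. On a local filesystem that honours the O_APPEND rule quoted above, the checker should report no malformed records.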
The use of threads would change the behaviour partially, because threads, on creation, do not receive copies of the file descriptors / handles but operate on the actual (shared) one. Threads would not (necessarily) all start writing at the same offset. But the option for partial-write-success still means that you may see interleaving in ways you might not want to see. Yet it'd possibly still be fully standards-conformant.
Moral: Do not count on a POSIX/UNIX standard being restrictive by default. The specifications are deliberately relaxed in the common case, and require you as the programmer to be explicit about your intent.
You're misinterpreting the first part of the spec you cited:
Either a file descriptor or a stream is called a "handle" on the open file description to which it refers; an open file description may have several handles. […] All activity by the application affecting the file offset on the first handle shall be suspended until it again becomes the active file handle. […] The handles need not be in the same process for these rules to apply.
This does not place any requirements on the implementation to handle concurrent access. Instead, it places requirements on an application not to make concurrent access, even from different processes, if you want well-defined ordering of the output and side effects.
The only time atomicity is guaranteed is for pipes when the write size fits in PIPE_BUF.
By the way, even if the call to write were atomic for ordinary files, except in the case of writes to pipes that fit in PIPE_BUF, write can always return with a partial write (i.e. having written fewer than the requested number of bytes). This smaller-than-requested write would then be atomic, but it wouldn't help the situation at all with regards to atomicity of the entire operation (your application would have to re-call write to finish).
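The "re-call write to finish" step is usually written as a loop. The sketch below uses a made-up helper name, `write_all`; note that it only guarantees that all bytes are eventually handed to the kernel, not that the sequence of write calls is atomic as a whole.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes have
 * been written. Returns 0 on success, -1 on error (errno is set). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        p += n;                 /* skip the part that was written */
        len -= (size_t)n;
    }
    return 0;
}
```

Another process can still write between two iterations of the loop, which is exactly the situation the answer describes.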

Equivalent of fgetc with Unix file descriptors

The fgetc(3) function takes a FILE * as its input stream. Must I reimplement character-at-a-time input with read(2), or is there a <unistd.h>-style equivalent taking an integer file descriptor instead?
No, there isn't such a thing, and please never do read(fd, &ch, sizeof(char)) (explanations below).
The function read(2) is usually implemented as a system call to the operating system kernel. Although the internal (and funky) details of such a thing shall not be discussed here, the overall idea is that system calls are (usually) not cheap.
It would be inefficient for both the userspace application and the kernel to do a system call just to get a single character from a file descriptor.
For instance, fgetc(3) usually ends up doing some buffering inside the structure of the FILE object. This means that the internal read(2) from fgetc(3) wouldn't just read a single character, but rather it'll try to get more for the sake of efficiency.
Anyway, it's not usually a good idea to mess with such low-level stuff. You can get all the benefits of buffering (and of FILEs overall) by using fdopen(3) to create a FILE object from a file descriptor, as your question appears to imply that you have at hand just a raw file descriptor at the moment.
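A minimal sketch of that approach, assuming you already hold a readable descriptor: wrap it with fdopen(3) and read characters with fgetc(3). The helper name `count_newlines_fd` is hypothetical and only serves as something concrete to do with each character.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: count the newline characters readable from fd,
 * one character at a time, using stdio buffering.
 * Takes ownership of fd: it is closed before returning.
 * Returns the count, or -1 on error. */
long count_newlines_fd(int fd)
{
    FILE *fp = fdopen(fd, "r");
    if (!fp) {
        close(fd);
        return -1;
    }

    long n = 0;
    int ch;                     /* int, not char, so EOF is representable */
    while ((ch = fgetc(fp)) != EOF)
        if (ch == '\n')
            n++;

    int err = ferror(fp);
    fclose(fp);                 /* also closes the underlying descriptor */
    return err ? -1 : n;
}
```

Each fgetc call normally just reads from the FILE's buffer, so the number of read(2) system calls stays small even though the code handles one character at a time.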
If you want to, you can open a file using open():
int fh = open("abc.txt", O_RDONLY); // a mode argument is only needed together with O_CREAT
and then you can use fh in read() calls.

mkstemp() - is it safe to close descriptor and reopen it again?

When generating a temporary file name using mkstemp(), is it safe to immediately call close() on the file descriptor returned by mkstemp(), store the file name generated by mkstemp() somewhere and use it (at a much later time) to open the file again for writing a temporary file? Or will this temporary file name become available again as soon as I call close() on it?
The reason why I'm asking is that I'm wondering why mkstemp() returns a file descriptor at all. If it is safe to close() the descriptor immediately, why does it return a descriptor at all? mkstemp() could close it then on its own and just give me a file name.
No. In between the time when you use mkstemp() to create the file and the time when you reopen it, your adversary may have removed the file you created and put a symlink in its place pointing to somewhere else altogether. This is a TOCTOU — Time of Check, Time of Use — vulnerability which the use of mkstemp() largely avoids, provided you keep the file descriptor open.
Once you close the file descriptor, all bets are off in a sufficiently hostile environment.
Note that even if you keep the file descriptor open, an adversary might remove the file, or rename it, and then create their own file (symlink, directory) in its place. The file descriptor remains valid. You could use stat() to get the information for the name and fstat() to get the information for the file descriptor, and if the two match (st_dev and st_ino fields), then you're probably still OK. If they differ, someone's got at the file; if you rename it, you may be renaming their file rather than the one you created.
While the file originally created by mkstemp() still exists, the name will not be regenerated. In general, successive calls to mkstemp() will create distinct names anyway, but the name is guaranteed to be unique at the moment of creation (see the O_EXCL flag for open()).
And just in case you're wondering, no — there isn't a way to associate a name with a file descriptor (there is no hypothetical int flink(int fd, const char *name) system call). There was a question about that on one of the Stack Exchange sites a while ago, and the answer was definitely negative, with references to the Linux Kernel mailing list and so on. One such question is Is it possible to recreate a file from an opened file descriptor?, but I think there was a more thorough version of the question too.
The mkstemp function specifically uses descriptors instead of filenames to avoid race conditions that are commonly associated with its predecessors such as mktemp. In fact, the "s" in "mkstemp" means "secure", because the race condition can be a source of vulnerability (e.g. if you use the temporary file to store JIT code, for example, and someone guessing/stomping the file before you open it could cause your application to load/run the provided code rather than the code that your program generates).
Once you close the descriptor, nothing prevents another application from writing a file with the same name, so please don't do that. You should retain the descriptor for as long as the temporary file is needed (and close the descriptor once the temporary file is no longer going to be used by your program).

using fwrite as an atomic process on Linux

I am developing C code in a Linux environment. I use fwrite to write some data to some files. The program will run in an environment where power cuts occur often (at least once a day). Therefore, I want to ensure that the file is not updated if a power cut occurs while fwrite is writing data; the file should only be saved once fwrite has finished its job. How can I use fwrite so that it affects the file only once the writing process has finished?
EDIT: I use fopen with wb to discard the previous info in the file and write a new file e.g.
FILE *rtng_p;
rtng_p = fopen("/etc/routing_table", "wb");
fwrite(&user_list, sizeof(struct routing), 40, rtng_p);
and the data is very small, only some bytes long.
First write the file to a temporary path on the same filesystem, like /etc/routing_table.tmp. Then just rename the copy on top of the original file. Renames are guaranteed atomic.
So, the sequence of calls would be, fopen, fwrite, fclose, rename.
In addition to the sequence given in David Schwartz's answer, you could perhaps use advisory locks, e.g. with the flock(2) syscall declared in <sys/file.h> (or maybe lockf(3), i.e. fcntl(2) with F_SETLK ...).
That would mean adding, just after
FILE * fil = fopen("/etc/routing_table.tmp", "wb");
the lines
if (!fil)
{ perror("/etc/routing_table.tmp"); exit(EXIT_FAILURE); }
if (flock(fileno(fil), LOCK_EX))
{ perror("flock LOCK_EX"); exit(EXIT_FAILURE); }
and at the end, you would
if (fflush(fil)) /* flush the file before unlocking it!! */
{ perror("fflush"); exit(EXIT_FAILURE); }
if (flock(fileno(fil), LOCK_UN))
{ perror("flock LOCK_UN"); exit(EXIT_FAILURE); }
if (fclose(fil))
{ perror("fclose"); exit(EXIT_FAILURE); }
if (rename("/etc/routing_table.tmp", "/etc/routing_table"))
{ perror("rename"); exit(EXIT_FAILURE); }
Using such advisory locking would ensure that even if two processes of your program are running, only one would write the file.
But it is overkill probably.
BTW, you seem to be writing binary data in /etc/. I believe that is against the habits or the conventions (see Linux Filesystem Hierarchy, or Linux Standard Base). I expect files under /etc to be textual. Perhaps you want your file under /var/lib?
See also Advanced Linux Programming book online.
There has been a large argument going on in the UNIX/Linux community about whether the open/write/close/rename pattern (as described in David Schwartz's answer) is actually guaranteed to be atomic. Note this conversation is about write and not fwrite!
The primary author of the EXT4 filesystem did not believe that it should be guaranteed according to POSIX and early versions of the filesystem did not treat it as atomic. Eventually he capitulated and made that set of operations atomic as the default behavior for EXT4. The claim was made, however, that user programs should actually be doing open/write/fsync/close/rename.
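Other filesystems may not guarantee atomicity without the fsync, and if EXT4 is mounted with noauto_da_alloc then that guarantee is lost there as well. So if you want to be really safe you should call fsync on the file descriptor after the last write, before the close and the rename (fsync needs an open descriptor, so it cannot come after close). I haven't tried this with fwrite; it should work if you call fflush first and then fsync(fileno(fp)), since fflush only moves the data from the stdio buffer to the kernel.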
See the auto_da_alloc section at https://www.kernel.org/doc/Documentation/filesystems/ext4.txt for more information. Also see an article written by the primary author of EXT4 here: http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
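For the fwrite case from the question, the fopen/fwrite/fflush/fsync/fclose/rename sequence discussed above might be sketched as follows. The helper name `replace_file` and its parameters are assumptions for illustration, and the temporary path is expected to be on the same filesystem as the final path.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write size bytes to tmp_path, force them to disk,
 * then rename tmp_path over final_path.
 * Returns 0 on success, -1 on failure (the temporary file is removed). */
int replace_file(const char *tmp_path, const char *final_path,
                 const void *data, size_t size)
{
    FILE *fp = fopen(tmp_path, "wb");
    if (!fp)
        return -1;

    int ok = fwrite(data, 1, size, fp) == size  /* data -> stdio buffer   */
          && fflush(fp) == 0                    /* stdio buffer -> kernel */
          && fsync(fileno(fp)) == 0;            /* kernel -> disk         */

    if (fclose(fp) != 0)
        ok = 0;

    if (!ok || rename(tmp_path, final_path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```

With this ordering the old contents stay in place until the new file is complete on disk. Making the rename itself survive a power cut would additionally require an fsync on the containing directory, as described in the first question on this page.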

Atomicity of `write(2)` to a local filesystem

Apparently POSIX states that
Either a file descriptor or a stream is called a "handle" on the
open file description to which it refers; an open file description
may have several handles. […] All activity by the application
affecting the file offset on the first handle shall be suspended
until it again becomes the active file handle. […] The handles need
not be in the same process for these rules to apply.
-- POSIX.1-2008
and
If two threads each call [the write() function], each call shall
either see all of the specified effects of the other call, or none
of them.
-- POSIX.1-2008
My understanding of this is that when the first process issues a
write(handle, data1, size1) and the second process issues
write(handle, data2, size2), the writes can occur in any order but
the data1 and data2 must be both pristine and contiguous.
But running the following code gives me unexpected results.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>
die(char *s)
{
perror(s);
abort();
}
main()
{
unsigned char buffer[3];
char *filename = "/tmp/atomic-write.log";
int fd, i, j;
pid_t pid;
unlink(filename);
/* XXX Adding O_APPEND to the flags cures it. Why? */
fd = open(filename, O_CREAT|O_WRONLY/*|O_APPEND*/, 0644);
if (fd < 0)
die("open failed");
for (i = 0; i < 10; i++) {
pid = fork();
if (pid < 0)
die("fork failed");
else if (! pid) {
j = 3 + i % (sizeof(buffer) - 2);
memset(buffer, i % 26 + 'A', sizeof(buffer));
buffer[0] = '-';
buffer[j - 1] = '\n';
for (i = 0; i < 1000; i++)
if (write(fd, buffer, j) != j)
die("write failed");
exit(0);
}
}
while (wait(NULL) != -1)
/* NOOP */;
exit(0);
}
I tried running this on Linux and Mac OS X 10.7.4 and using grep -a
'^[^-]\|^..*-' /tmp/atomic-write.log shows that some writes are not
contiguous or overlap (Linux) or plain corrupted (Mac OS X).
Adding the flag O_APPEND in the open(2) call fixes this
problem. Nice, but I do not understand why. POSIX says
O_APPEND
If set, the file offset shall be set to the end of the file prior to each write.
but this is not the problem here. My sample program never does
lseek(2) but share the same file description and thus same file
offset.
I have already read similar questions on Stackoverflow but they still
do not fully answer my question.
Atomic write on file from two process does not specifically
address the case where the processes share the same file description
(as opposed to the same file).
How does one programmatically determine if “write” system call is atomic on a particular file? says that
The write call as defined in POSIX has no atomicity guarantee at all.
But as cited above it does have some. And what’s more,
O_APPEND seems to trigger this atomicity guarantee although it seems
to me that this guarantee should be present even without O_APPEND.
Can you explain further this behaviour ?
man 2 write on my system sums it up nicely:
Note that not all file systems are POSIX conforming.
Here is a quote from a recent discussion on the ext4 mailing list:
Currently concurrent reads/writes are atomic only wrt individual pages,
however are not on the system call. This may cause read() to return data
mixed from several different writes, which I do not think it is good
approach. We might argue that application doing this is broken, but
actually this is something we can easily do on filesystem level without
significant performance issues, so we can be consistent. Also POSIX
mentions this as well and XFS filesystem already has this feature.
This is a clear indication that ext4 -- to name just one modern filesystem -- doesn't conform to POSIX.1-2008 in this respect.
Edit: Updated Aug 2017 with latest changes in OS behaviours.
Firstly, O_APPEND or the equivalent FILE_APPEND_DATA on Windows means that increments of the maximum file extent (file "length") are atomic under concurrent writers. This is guaranteed by POSIX, and Linux, FreeBSD, OS X and Windows all implement it correctly. Samba also implements it correctly, NFS before v5 does not as it lacks the wire format capability to append atomically. So if you open your file with append-only, concurrent writes will not tear with respect to one another on any major OS unless NFS is involved.
This says nothing about whether reads will ever see a torn write though, and on that POSIX says the following about atomicity of read() and write() to regular files:
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2008 when they operate on
regular files or symbolic links ... [many functions] ... read() ...
write() ... If two threads each call one of these functions, each call
shall either see all of the specified effects of the other call, or
none of them. [Source]
and
Writes can be serialized with respect to other reads and writes. If a
read() of file data can be proven (by any means) to occur after a
write() of the data, it must reflect that write(), even if the calls
are made by different processes. [Source]
but conversely:
This volume of POSIX.1-2008 does not specify behavior of concurrent
writes to a file from multiple processes. Applications should use some
form of concurrency control. [Source]
A safe interpretation of all three of these requirements would suggest that all writes overlapping an extent in the same file must be serialised with respect to one another and to reads such that torn writes never appear to readers.
A less safe, but still allowed interpretation could be that reads and writes only serialise with each other between threads inside the same process, and between processes writes are serialised with respect to reads only (i.e. there is sequentially consistent i/o ordering between threads in a process, but between processes i/o is only acquire-release).
So how do popular OS and filesystems perform on this? As the author of proposed Boost.AFIO an asynchronous filesystem and file i/o C++ library, I decided to write an empirical tester. The results are follows for many threads in a single process.
No O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = 1 byte until and including 10.0.10240, from 10.0.14393 at least 1Mb, probably infinite as per the POSIX spec.
Linux 4.2.6 with ext4: update atomicity = 1 byte
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.
O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = until and including 10.0.10240 up to 4096 bytes only if page aligned, otherwise 512 bytes if FILE_FLAG_WRITE_THROUGH off, else 64 bytes. Note that this atomicity is probably a feature of PCIe DMA rather than designed in. Since 10.0.14393, at least 1Mb, probably infinite as per the POSIX spec.
Linux 4.2.6 with ext4: update atomicity = at least 1Mb, probably infinite as per the POSIX spec. Note that earlier Linuxes with ext4 definitely did not exceed 4096 bytes, XFS certainly used to have custom locking but it looks like recent Linux has finally fixed this problem in ext4.
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.
So in summary, FreeBSD with ZFS and very recent Windows with NTFS is POSIX conforming. Very recent Linux with ext4 is POSIX conforming only with O_DIRECT.
You can see the raw empirical test results at https://github.com/ned14/afio/tree/master/programs/fs-probe. Note we test for torn offsets only on 512 byte multiples, so I cannot say if a partial update of a 512 byte sector would tear during the read-modify-write cycle.
Some misinterpretation of what the standard mandates here comes from the use of processes vs. threads, and what that means for the "handle" situation you're talking about. In particular, you missed this part:
Handles can be created or destroyed by explicit user action, without affecting the underlying open file description. Some of the ways to create them include fcntl(), dup(), fdopen(), fileno(), and fork(). They can be destroyed by at least fclose(), close(), and the exec functions. [ ... ] Note that after a fork(), two handles exist where one existed before.
from the POSIX spec section you quote above. The reference to "create [ handles using ] fork" isn't elaborated on further in this section, but the spec for fork() adds a little detail:
The child process shall have its own copy of the parent's file descriptors. Each of the child's file descriptors shall refer to the same open file description with the corresponding file descriptor of the parent.
The relevant bits here are:
the child has copies of the parent's file descriptors
the child's copies refer to the same "thing" that the parent can access via said fds
file descriptors and file descriptions are not the same thing; in particular, a file descriptor is a handle in the above sense.
This is what the first quote refers to when it says "fork() creates [ ... ] handles" - they're created as copies, and therefore, from that point on, detached, and no longer updated in lockstep.
In your example program, every child process gets its very own copy which starts at the same state, but after the act of copying, these file descriptors / handles have become independent instances, and therefore the writes race with each other. This is perfectly acceptable regarding the standard, because write() only guarantees:
On a regular file or other file capable of seeking, the actual writing of data shall proceed from the position in the file indicated by the file offset associated with fildes. Before successful return from write(), the file offset shall be incremented by the number of bytes actually written.
This means that while they all start the write at the same offset (because the fd copy was initialized as such), they might, even if successful, all write different amounts: the standard does not guarantee that a write request of N bytes will write exactly N bytes; it can succeed for anything 0 <= actual <= N. Because the ordering of the writes is also unspecified, the whole example program above has unspecified results. Even if the total requested amount is written, all the standard says above is that the file offset is incremented. It does not say that it is atomically (once only) incremented, nor does it say that the actual writing of data will happen in an atomic fashion.
One thing is guaranteed though: you should never see anything in the file that was neither there before any of the writes nor part of the data written by one of the writes. If you do, that'd be corruption, and a bug in the filesystem implementation. What you've observed above might well be that ... if the final results can't be explained by re-ordering of parts of the writes.
The use of O_APPEND fixes this because, again per the spec for write():
If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.
which is the "prior to" / "no intervening" serializing behaviour that you seek.
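As a rough illustration of that O_APPEND behaviour, here is a minimal sketch. The helper name open_for_append and the 0644 mode are my own choices for the example, not something from the question.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not from the question): open `path` for appending,
 * creating it if needed. Returns a file descriptor, or -1 with errno set
 * by open(). */
int open_for_append(const char *path)
{
    /* With O_APPEND, each write() on this descriptor first moves the
     * file offset to the current end of file, with no other file
     * modification allowed in between. */
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}
```

Two descriptors obtained from two separate open_for_append() calls on the same (initially empty) file each have their own file offset. Even so, writing "aa" through the first, "bb" through the second and "cc" through the first again should leave the file containing aabbcc on a local filesystem, since every write lands at the end. Without O_APPEND the second descriptor would start writing at offset 0 and overwrite the first two bytes.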
The use of threads would change the behaviour partially, because threads, on creation, do not receive copies of the file descriptors / handles but operate on the actual (shared) ones. Threads would not (necessarily) all start writing at the same offset. But the option for partial-write success still means that you may see interleaving in ways you might not want to see. Yet it'd possibly still be fully standards-conformant.
Moral: Do not count on a POSIX/UNIX standard being restrictive by default. The specifications are deliberately relaxed in the common case, and require you as the programmer to be explicit about your intent.
You're misinterpreting the first part of the spec you cited:
Either a file descriptor or a stream is called a "handle" on the open file description to which it refers; an open file description may have several handles. […] All activity by the application affecting the file offset on the first handle shall be suspended until it again becomes the active file handle. […] The handles need not be in the same process for these rules to apply.
This does not place any requirements on the implementation to handle concurrent access. Instead, it places requirements on an application not to make concurrent access, even from different processes, if you want well-defined ordering of the output and side effects.
The only time atomicity is guaranteed is for pipes when the write size fits in PIPE_BUF.
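If you want to check that limit at run time rather than rely on the PIPE_BUF macro, something like the following sketch could be used. The helper name fits_in_pipe_buf is made up for this example; fpathconf() with _PC_PIPE_BUF is the standard way to query the limit for a given pipe.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: report whether a write of `len` bytes to the pipe
 * `fd` is within the size for which atomicity is guaranteed.
 * Returns 1 if it fits, 0 if it does not, -1 if the limit is unavailable. */
int fits_in_pipe_buf(int fd, size_t len)
{
    long limit = fpathconf(fd, _PC_PIPE_BUF);
    if (limit < 0)
        return -1;              /* error, or no limit reported for this fd */
    return len <= (size_t)limit ? 1 : 0;
}
```

POSIX requires the limit to be at least 512 bytes, so a 512 byte write to a pipe always fits; anything larger depends on the system.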
By the way, even if the call to write were atomic for ordinary files, except in the case of writes to pipes that fit in PIPE_BUF, write can always return with a partial write (i.e. having written fewer than the requested number of bytes). This smaller-than-requested write would then be atomic, but it wouldn't help the situation at all with regards to atomicity of the entire operation (your application would have to re-call write to finish).
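The usual way to "re-call write to finish" is a small loop. This is only a sketch: the name write_all is my own, and it makes the whole buffer eventually get written, but it does not make the operation atomic. Other writers can still interleave between the individual write() calls.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all `len` bytes of
 * `buf` have been written. Returns 0 on success, -1 on error (errno is
 * left as set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;
        }
        p += n;                 /* skip the part that was already written */
        len -= (size_t)n;
    }
    return 0;
}
```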
