Guarantees on the order of operations on a file - filesystems

I'd like to know whether there are any guarantees on the order of operations on a file/file system.
Suppose I have a file foo.dat and I update it as follows:
lseek(fd, pos_a, SEEK_SET);
write(fd, data_a, size_a);   // operation A
lseek(fd, pos_b, SEEK_SET);
write(fd, data_b, size_b);   // operation B
lseek(fd, pos_c, SEEK_SET);
write(fd, data_c, size_c);   // operation C
So I perform updates A, B and C on the file, and some fault may occur - a software crash or, for example, a power failure.
Is there any guarantee that whatever operations do get executed are applied in the same order?
That is, the file should end up with either nothing, or "A", or "A and B", or "A, B and C",
but never a state like "A and C" or "B" alone.
I know I could call fsync(fd) between A and B and again between B and C, but that also
guarantees the data is actually on the file system, which is more than I need.
I care less about losing data than about its consistency.
Does the POSIX standard guarantee that there will be no out-of-order execution?
So:
Are there any such guarantees?
On a POSIX platform?
On Windows?
If not, what guarantees (besides fsync) can I have?

This is what POSIX mandates for write:
After a write() to a regular file has successfully returned:
Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
Any subsequent successful write() to the same byte position in the file shall overwrite that file data.
This does not provide you with a guarantee that your data will hit the disk in that order at all. The implementation can re-order physical writes all it wants, as long as what applications "see" is consistent with the above two statements.
In practice, the kernel, and even the disk subsystem (think SANs for instance) can re-order writes (for performance reasons usually).
So you can't rely on the order of your write calls for consistency. You'll need fsync()/fdatasync() calls.
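For illustration, here is a minimal sketch of enforcing the A, then B, then C ordering from the question (write_step() is a hypothetical helper; pwrite() simply folds the lseek() into the write). Note the trade-off: fdatasync() buys the ordering by also forcing each step to stable storage, which is stronger than the question strictly needs:

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: one update step, pushed to stable storage before the
 * caller issues the next step. A short write is treated as failure for
 * simplicity. */
static int write_step(int fd, const void *buf, size_t len, off_t pos)
{
    if (pwrite(fd, buf, len, pos) != (ssize_t)len)
        return -1;
    return fdatasync(fd);   /* forces durability, not just ordering */
}

Calling write_step() for A, then B, then C means a crash can leave nothing, "A", "A and B" or "A, B and C" on disk, but not "A and C" or "B" alone.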
Interesting email thread on the PostgreSQL mailing list: POSIX file updates. Reading on how databases handle I/O is a great way to learn about this type of issue.
(Sorry, don't know about Windows in this respect.)

Related

Will WriteFile() be atomic if the process is terminated but the system continues running?

If my process is terminated at a random moment but the operating system continues to run properly, will Windows guarantee that individual calls to WriteFile are atomic (a.k.a. all-or-nothing)?
Or can I get partial/torn writes?
Note: I am specifically NOT asking for advice on how to practice defensive coding.
This is strictly a question about the behavior of the Microsoft Windows operating system itself.
To be 100% perfectly crystal clear, we can and explicitly do trust the user code to behave sanely. There is no undefined behavior or anything of the sort. All process terminations are assumed to occur through a well-defined behavior such as unhandled exceptions or calls to TerminateProcess, not memory corruption, etc.
Also, specifically note that there are no C++ destructors to worry about here; this is C.
I hope that puts all the secondary concerns about the user code to rest.
WriteFile is certainly not atomic in the case of your process being terminated while it is executing; it is not even atomic if your process is not being killed.
Also, "all or nothing written" is not even a proper definition of an atomic write. All could be written, but intermingled with an independent write from another process. If writes are guaranteed to be atomic, there must be a guarantee (read as: lock) that this doesn't happen.
Apart from the fact that implementing proper atomicity would be considerable extra trouble with very little to gain for the average everyday user, you can also guess that WriteFile is not atomic from:
The absence of mention in the API documentation. You can bet that this would be prominently mentioned, as it is a really big, distinguishing feature.
The presence of the lpNumberOfBytesWritten parameter. A write might still fail (e.g. disk full) but if the function was guaranteed to be atomic, you would know that it either succeeded or failed, and you already know how many bytes you were going to write, so returning that number is unnecessary.
The presence of TxF. Although TxF does a lot more than just making single writes atomic, it is reasonable to assume that Microsoft wouldn't spend considerable time and money implementing such a beast if "normal" filesystem operations already more or less worked like that anyway.
No other mainstream operating system that I know of gives such a guarantee. Linux does give a sort of atomicity guarantee on writev (but not on write) insofar as your writes will not be intermingled with writes from other processes. But that is not at all the same thing as guaranteeing atomicity in the presence of process termination.
However, overlapped writes on a handle opened with FILE_FLAG_NO_BUFFERING are technically atomic in respect of process termination (but not in respect of failure, such as disk full or in any other respect!). Saying so is admittedly a bit of a sophistry on an implementation detail, not an actual guarantee given by the operating system, but from a certain point of view it's certainly correct to say so.
A process that is performing an unbuffered, overlapped I/O operation cannot be terminated. That is because the OS is doing DMA transfers into that process' address space, which of course means that the process cannot be terminated, since the OS would have to reclaim the physical pages. The OS will therefore refuse to terminate a process while such an I/O operation is running.
You can verify this by firing off a couple of big unbuffered overlapped requests (a few GB) and try killing your process in Task Manager. It will only be killed when the I/O is complete (so, after some seconds). That comes as a big surprise when you see it happen for the first time and don't expect it!
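For the curious, a rough sketch of that experiment (the file name, request size and error handling are assumptions; FILE_FLAG_NO_BUFFERING requires a sector-aligned buffer and transfer size, which a page-aligned 1 GiB allocation from VirtualAlloc satisfies on common disks):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const DWORD size = 1u << 30;   /* 1 GiB unbuffered request */

    HANDLE h = CreateFileA("big_test.bin", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
    void *buf = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE);
    if (h == INVALID_HANDLE_VALUE || buf == NULL)
        return 1;

    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    if (!WriteFile(h, buf, size, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
        return 1;

    puts("I/O in flight - try killing this process in Task Manager now");

    DWORD written = 0;
    GetOverlappedResult(h, &ov, &written, TRUE);   /* wait for completion */
    printf("completed: %lu bytes written\n", (unsigned long)written);

    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return 0;
}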

Atomicity of UNIX read()/write() when sending data to device

When writing directly to a device in /dev, I open a file descriptor and perform a UNIX write() followed by a read(). Can I have multiple threads doing this write()/read() sequence on the same file descriptor, and not get jumbled data if two threads enter the write() function at the same time?
References to std documentation would be immensely helpful. I've not been able to find anything though. Someone has mentioned that such operations are atomic in the kernel, but I am sceptical.
Also, to clarify this is a file in /dev, so any insight as to how far the "file pointer" concept applies here is helpful as well.
File pointers (FILE *fp, for example) are a layer in the user-side code sitting above the function calls (such as write()). Access to fp is controlled by locks in a threaded environment (you can't have two threads modifying the same structure at the same time).
Inside the kernel, I'd expect there to be a lock on the file descriptor (and/or 'open file description') to prevent it being used from two threads at once.
You can look up the POSIX specifications for read() and getchar_unlocked() to find out more about locking etc. - at least for a POSIX-compliant implementation.
Note that POSIX still uses C99. Therefore, it is not cognizant of the C11 thread facilities. The C11 standard does not have read() et al (file I/O using file descriptors), so it says nothing about such system calls. Neither does it provide a getchar_unlocked() or any of its relatives.
Caveat: I have not been in kernels for a while, but this is the way it used to work.
For disk files:
Can you open the file in append mode, writing block sizes <= BLKSIZE?
Small enough block sizes guarantee atomic writes in POSIX environments (actually, the limit may be greater than BLKSIZE... I'm too lazy to hunt around for the alternate symbol).
Append mode guarantees a seek to the end of the file... for devices supporting seeks. Combined with atomic writes, you should be golden.
Each buffer must stand on its own under the assumption that some "foreign" data may follow it.
For ttys:
Append mode makes no sense here. As before, but paying attention to line endings becomes even more important. And this very much does not apply to reads. Codes that ttys treat as control sequences can also trip you up if the sequences, or the modes they enable, are split across blocks.
For other devices:
Can get tricky here. Depends on the device.
I'm going to assume that you are referring to a generic character device (e.g. a tty) since you were not specific. As far as I know, each fd-type operation (e.g. read()/write()) maps directly into a call into the driver.
Therefore, the driver will receive each write()'s data chunk as a whole and not see the next one's data until it is done (e.g. data is queued to be transmitted).
However, if the driver does not consume the entire chunk of data at once (i.e. write() returns less than the specified number of bytes), then there is no guarantee that the thread will be able to write the remainder before another thread does a different write().
Also, as Johnathan Leffler noted, if you use standard I/O with process-level buffering, all bets are off.
Bottom line, if you are using direct fd writes, each write will map directly to one driver function call. From there, it's up to the driver whether the write is atomic.
Edit: wlformyd brings up the question of locking between multiple threads on multiple processors. To my knowledge, there is no locking on an FD and, in fact, that would be ineffective, as multiple FDs could be used to access the same device.
I believe it is up to the driver itself to do locking to prevent contention on internal queues and/or hardware. In that sense, on a multi-processor system, the kernel doesn't prevent multiple simultaneous accesses to a driver's write routine. However, a properly written driver should do the locking to prevent mixing of output between two write calls.
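To illustrate the "one write(), one driver call" point, here is a minimal sketch (send_record() is a hypothetical helper) that emits each complete record with a single write() and reports a short write instead of retrying, since writing the remainder in a second call could be interleaved with another thread's output:

#include <string.h>
#include <unistd.h>

/* Hypothetical helper: one complete record, one write() call. */
static int send_record(int fd, const char *record)
{
    size_t len = strlen(record);
    ssize_t n = write(fd, record, len);
    if (n < 0)
        return -1;                      /* write failed outright */
    return (size_t)n == len ? 0 : 1;    /* 1 = short write, record may be torn */
}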

What is POSIX compliance for a filesystem?

POSIX is a standard that is followed by many companies.
I have a few questions around this area:
1. Do all file systems need to be POSIX compliant?
2. Are applications also required to be POSIX compliant?
3. Are there any non-POSIX filesystems?
In the area of "requires POSIX filesystem semantics" what is typically meant is:
allows hierarchical file names and resolution (., .., ...)
supports at least close-to-open semantics
umask/unix permissions, 3 filetimes
8-bit byte support
supports atomic renames on the same filesystem (see the sketch below)
fsync() durability guarantees/limitations (including fsync() of directories)
supports multi-user protection (a resized/extended file returns zero bytes, not previous content)
rename and delete of open files (Windows does not do that)
file names supporting all bytes besides '/' and '\0'
Sometimes it also means symlink/hardlink support as well as file names and 32-bit file offsets (minimum). In some cases it is also used to refer to specific API features like fcntl() locking, mmap() or truncate() or AIO.
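As an illustration of why the atomic-rename and fsync() points matter to applications, here is the common "write a temporary file, then rename it over the old one" pattern as a sketch (file names, permissions and error handling are simplified assumptions):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: atomically replace "config.dat" in the current directory with new
 * contents. Relies on rename() being atomic within one filesystem and on
 * fsync() pushing the data (and, via the directory fd, the new directory
 * entry) to stable storage. */
static int replace_file(const void *buf, size_t len)
{
    int fd = open("config.dat.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    if (rename("config.dat.tmp", "config.dat") != 0)   /* the atomic step */
        return -1;

    int dfd = open(".", O_RDONLY | O_DIRECTORY);        /* "directory fsync" */
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}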
When I think about POSIX compliance for distributed file systems, I use the general standard that a distributed file system is POSIX compliant if multiple processes running on different nodes see the same behavior as if they were running on the same node using a local file system. This basically has two implications:
If the system has multiple buffer-caches, it needs to ensure cache consistency.
Various mechanisms to do so include locks and leases. An example of incorrect behavior in this case would be a writer who writes successfully on one node but then a reader on a different node receives old data.
Note however that if the writer and reader are independently racing one another, there is no correct defined behavior, because they do not know which operation will occur first. But if they are coordinating with each other via some mechanism like messaging, then it would be incorrect if the writer completes (especially if it issues a sync call), sends a message to the reader which is successfully received, and then the reader reads and gets stale data.
If data is striped across multiple data servers, reads and writes that span multiple stripes must be atomic.
For example, when a reader reads across stripes at the same time as a writer writes across those same stripes, then the reader should either receive all stripes as they were before the write or all stripes as they were after the write. Incorrect behavior would be for the reader to receive some old and some new.
Contrary to the above, this behavior must work correctly even when the writer/reader are racing.
Although my examples were reads/writes to a single file, correct behavior also includes write/writes to a single file as well as read/writes and write/writes to the hierarchical namespace via calls such as stat/readdir/mkdir/unlink/etc.
Answering your questions in a very objective way:
1. Do all file systems need to be POSIX compliant?
Actually, no. POSIX defines standards for operating systems in general. They are good to have, but not really required.
2. Are applications also required to be POSIX compliant?
No.
3. Are there any non-POSIX filesystems?
HDFS (the Hadoop file system) is one example.

How does the standard specify atomic writes to a regular file (not a pipe or FIFO)?

The POSIX standard specifies that writes of less than PIPE_BUF bytes to a pipe or FIFO are guaranteed to be atomic, that is, our write doesn't get mixed with other processes' writes. But I failed to find out what the standard specifies for regular files. Is it true that when we write less than PIPE_BUF bytes to a regular file, the write is also guaranteed to be atomic? I want to know whether regular files have such a limit at all. A pipe has a finite capacity, so when a write exceeds that capacity the kernel puts the writer to sleep and another process gets a chance to write, but a regular file doesn't seem to need such a limitation - am I right?
What I'm doing is having several processes generate log output into one file - with O_APPEND set, of course.
Quote from http://pubs.opengroup.org/onlinepubs/9699919799/toc.htm (Single UNIX Specification, Version 4, 2010 Edition):
This volume of POSIX.1-2008 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control.
The specification does address the semantics of writes with respect to multiple readers, but as you can see from the above, the behaviour of multiple concurrent writers is not defined by the specification.
Note that the above talks about regular files. For pipes and FIFOs the PIPE_BUF semantics apply: concurrent writes are guaranteed to be indivisible up to PIPE_BUF bytes.
Write requests to a pipe or FIFO shall be handled in the same way as a regular file with the following exceptions:
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
For real file systems the situation is complex. Some local file systems may enforce atomic writes up to arbitrary sizes (memory limit) by locking a file handle during writing, some might not (I tried to look at ext4 logic, but lost track somewhere around http://lxr.linux.no/linux+v3.5.3/fs/jbd2/transaction.c#L147).
For non-local file systems the result is more or less up for grabs. Just don't try concurrent writing on a networked file system without some form of explicit locking (unless you're positively, absolutely sure about the semantics of the network file system you're using).
BTW, O_APPEND guarantees that all writes by different processes go to the end of the file. However, as SUS notes above, if the writes are really concurrent (occurring at the same time), then the behavior is undefined. On earlier uniprocessor and non-pre-emptive UNIXes this didn't really matter, as a call to write(2) completed before someone else got a chance to write...
This question could be answered definitively for a specific combination of operating system (Linux?) and file system (ext4?). A general answer? As SUS reads: "undefined behavior".
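A minimal sketch of the O_APPEND approach described in the question (the log path and the lazily-opened static descriptor are assumptions); each record is emitted with exactly one write(), subject to the caveats about truly concurrent writers noted above:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Sketch: every process opens the log with O_APPEND and emits each record
 * in a single write(), so each record is appended in one piece. */
static int log_line(const char *line)
{
    static int fd = -1;
    if (fd < 0)
        fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    return write(fd, line, strlen(line)) < 0 ? -1 : 0;
}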
I think this is useful to you: "the data written by writev() is written as a single block that is not intermingled with output from writes in other processes", so you can use writev().
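For illustration, a minimal writev() sketch that gathers the pieces of one record into a single call (the record layout is an assumption; as noted earlier, the quoted non-intermingling behaviour appears to be a Linux property rather than a POSIX guarantee):

#include <string.h>
#include <sys/uio.h>

/* Sketch: emit a timestamp, a separator and a message as one writev() call
 * so the pieces reach the file together. */
static int log_parts(int fd, const char *stamp, const char *msg)
{
    struct iovec iov[3];
    iov[0].iov_base = (void *)stamp; iov[0].iov_len = strlen(stamp);
    iov[1].iov_base = (void *)" ";   iov[1].iov_len = 1;
    iov[2].iov_base = (void *)msg;   iov[2].iov_len = strlen(msg);
    return writev(fd, iov, 3) < 0 ? -1 : 0;
}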
Several writers to a file may mix things up. But files opened with O_APPEND are appended to atomically per write access.
If you want to keep to the C stdio interface instead of the lower-level one, fopen the file with "a" or "a+" (which map to O_APPEND), set up a buffer large enough that there is no need to write in the middle of your records, and use fflush to force the write when you are done building them. I'm not sure this is guaranteed by POSIX (C says nothing about it).
There is the ultimate solution to all questions of atomicity: a mutex. Wrap your writes to the log file in a mutex and all will be done atomically.
A simpler solution might be to use the GLOG libraries from Google. A fantastic logging system, far better than anything I ever dreamed up, free, not-GPL, and atomic.
One way to interleave them safely would be to have all writers lock the file, write, and unlock.
Functions that can be used for locking are flock(), lockf(), and fcntl().
Beware that ALL writers must lock (and they should all use the same mechanism to do the locking) or one that doesn't bother getting a lock could still write at the same time as another that holds a lock.
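A minimal sketch of that lock/write/unlock pattern using flock() (the descriptor and record are assumptions; lockf() or fcntl() locks could be substituted, as long as every writer uses the same mechanism):

#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* Sketch: take an exclusive advisory lock around each write. This only
 * protects against writers that also take the lock. */
static int locked_append(int fd, const char *record)
{
    if (flock(fd, LOCK_EX) != 0)
        return -1;
    ssize_t n = write(fd, record, strlen(record));
    flock(fd, LOCK_UN);
    return n < 0 ? -1 : 0;
}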

Using fseek/fwrite from multiple processes to write to different areas of a file?

I recently came across a bit of not-well-tested legacy code for writing data that's distributed across multiple processes (these are part of an MPI-based parallel computation) into the same file. Is this actually guaranteed to work?
It goes like this:
All processes open the same file for writing.
Each process calls fseek to seek to a different location within the file. This position may be past the end of the file.
Each process then writes a block of data into the file with fwrite. The seek locations and block sizes are such that these writes completely tile a section of the file -- no gaps, no overlaps.
Is this guaranteed to work, or will it sometimes fail horribly? There is no locking to serialize the writes, and in fact they are likely to be starting from a synchronization point. On the other hand, we can guarantee that they are writing to different file positions, unlike other questions which have had issues with trying to write to the "end of the file" from multiple processes.
It occurs to me that the processes may be on different machines that mount the file via NFS, which I suspect probably answers my question -- but, would it work if the file is local?
I believe this will typically work, but there is no guarantee that I can find. The POSIX specification for fwrite(3) defers to ISO C, and neither standard mentions concurrency.
I suspect it will typically work, but fseek(3) and fwrite(3) are buffered I/O functions, so success will depend on internal details of the library implementation. So: absolutely no guarantee, but various reasons to expect that it will work.
Now, should the program use lseek(2) and write(2), then I believe you could consider the results guaranteed, but it is then restricted to POSIX operating systems.
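For illustration, a sketch of that unbuffered approach using pwrite(2), which combines the seek and the write into one call and does not touch the shared file position (the offset and block-size bookkeeping are assumptions):

#include <sys/types.h>
#include <unistd.h>

/* Sketch: each process writes its own, non-overlapping block at a known
 * offset. pwrite() neither uses nor moves the file position, so processes
 * sharing the file cannot disturb each other's offsets. */
static int write_block(int fd, const void *block, size_t len, off_t offset)
{
    ssize_t n = pwrite(fd, block, len, offset);
    return (n == (ssize_t)len) ? 0 : -1;   /* treat a short write as an error */
}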
One thing seems ... odd ... why would an MPI program decide to share its data via NFS and not the message API? It would seem slower, less portable, more prone to trouble, and generally just a waste of the MPI feature set. It certainly is no more distributed given the reliance on a single NFS server.
