under what conditions are pipe reads atomic? - c

man pipe -s7 documents writing to the pipe very well.
The part important to me being that the write will only ever be partially completed if O_NONBLOCK is set, and write length is greater than PIPE_BUF.
However, nothing is said about the read end.
I am sending structures representing events through my pipe in blocking mode at the write end.
At the read end, I am processing those events (and other things) in an update loop in non-blocking mode.
Since my struct is smaller than PIPE_BUF, will read ALWAYS read a whole number of structs? Or do I need to handle the possibility of only part of my struct being read?
Common sense tells me that read behavior will mirror the documented write behavior, but I would be happier if this was specified.
I'm working on Linux (kernel 3.8, x86_64). But it is important that my code is portable across different UNIX flavors and CPU architectures.
Thanks.
Chris.

The comments are right: read is not atomic. The whole point of atomicity of write is to allow multiple writers without corruption from interleaving data. Multiple readers are much less useful, but even if they were useful, supporting atomic reads would require maintaining packet boundaries in pipes, which do not exist.

Reads from a pipe are not atomic.
From the standard page for read():
"The standard developers considered adding atomicity requirements to a pipe or FIFO, but recognized that due to the nature of pipes and FIFOs there could be no guarantee of atomicity of reads of {PIPE_BUF} or any other size that would be an aid to applications portability."
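Since reads are not atomic, the reader has to be prepared for a struct to arrive in pieces. A minimal sketch of one way to do that follows; `struct event`, `struct event_reader`, `set_nonblocking` and `event_reader_poll` are hypothetical names (the question does not show its struct), and the approach simply keeps the partial bytes between calls until a whole struct has accumulated.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical event type; the original post does not show its struct. */
struct event {
    int type;
    int value;
};

/* Holds the bytes of one event until all of them have arrived. */
struct event_reader {
    unsigned char buf[sizeof(struct event)];
    size_t have; /* bytes of the current event received so far */
};

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try to complete one event from a non-blocking fd.
 * Returns 1 if *out was filled, 0 if more data is needed (nothing to
 * read right now, possibly mid-struct), and -1 on EOF or a real error.
 */
int event_reader_poll(struct event_reader *r, int fd, struct event *out)
{
    while (r->have < sizeof r->buf) {
        ssize_t n = read(fd, r->buf + r->have, sizeof r->buf - r->have);
        if (n > 0) {
            r->have += (size_t)n;
        } else if (n == 0) {
            return -1; /* all writers closed the pipe */
        } else if (errno == EINTR) {
            continue;  /* interrupted by a signal; retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return 0;  /* nothing more for now; partial bytes are kept */
        } else {
            return -1;
        }
    }
    memcpy(out, r->buf, sizeof *out);
    r->have = 0;
    return 1;
}
```

In an update loop, the reader would call `event_reader_poll` repeatedly while it returns 1 and move on to other work as soon as it returns 0.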

Related

Atomicity of UNIX read()/write() when sending data to device

When writing directly to a device in /dev, I open a file descriptor and perform a UNIX write() followed by a read(). Can I have multiple threads doing this write()/read() sequence on the same file descriptor, and not get jumbled data if two threads enter the write() function at the same time?
References to std documentation would be immensely helpful. I've not been able to find anything though. Someone has mentioned that such operations are atomic in the kernel, but I am sceptical.
Also, to clarify this is a file in /dev, so any insight as to how far the "file pointer" concept applies here is helpful as well.
File pointers (FILE *fp, for example) are a layer in the user-side code sitting above the function calls (such as write()). Access to fp is controlled by locks in a threaded environment (you can't have two threads modifying the same structure at the same time).
Inside the kernel, I'd expect there to be a lock on the file descriptor (and/or 'open file description') to prevent it being used from two threads at once.
You can look up the POSIX specification for read() and getchar_unlocked() to find out more about locking etc., at least for a POSIX compliant implementation.
Note that POSIX still uses C99. Therefore, it is not cognizant of the C11 thread facilities. The C11 standard does not have read() et al (file I/O using file descriptors), so it says nothing about such system calls. Neither does it provide a getchar_unlocked() or any of its relatives.
Caveat: I have not been in kernels for a while, but this is the way it used to work.
For disk files:
Can you open the file in append mode, writing block sizes <= BLKSIZE ?
Small enough block sizes guarantee atomic writes in POSIX environments (actually, the limit may be greater than BLKSIZE... I'm too lazy to hunt around for the alternate symbol).
Append guarantees seeks to the end of the file... for devices supporting seeks. Combined with atomic writes, you should be golden.
Each buffer must stand by itself under the assumption some "foreign" information may follow it.
For ttys:
Append mode makes no sense here. As before, but paying attention to line endings becomes even more important. And this very much does not apply to reads. Codes that ttys treat as control sequences can also trip you up if the sequences, or even the modes they enable, are split across blocks.
For other devices:
Can get tricky here. Depends on the device.
I'm going to assume that you are referring to a generic character device (e.g. a tty) since you were not specific. As far as I know, each fd-type operation (e.g. read()/write()) maps directly into a call into the driver.
Therefore, the driver will receive each write()'s data chunk as a whole and not see the next one's data until it is done (e.g. data is queued to be transmitted).
However, if the driver does not consume the entire chunk of data at once (i.e. write() returns less than the specified number of bytes), then there is no guarantee that the thread will be able to write again with the remainder before another thread does a different write().
Also, as Johnathan Leffler noted, if you use standard I/O with process-level buffering, all bets are off.
Bottom line, if you are using direct fd writes, each write will map directly to one driver function call. From there, it's up to the driver whether the write is atomic.
Edit: wlformyd brings up the question of locking between multiple threads on multiple processors. To my knowledge, there is no locking on a FD and, in fact, that would be ineffective as multiple FDs could be used to access the same device.
I believe it is up to the driver itself to do locking to prevent contention to internal queues and/or hardware. In that sense, on a multi-processor system, the kernel doesn't prevent multiple simultaneous access to a driver's write routine. However, a properly written driver should do the locking to prevent mixing of output between two write calls.
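If the driver's behavior is unknown, the threads of one process can serialize their own writes in user space. The following is a minimal sketch, assuming every thread goes through the same helper; `dev_lock` and `locked_write_all` are hypothetical names, and the lock only protects against other threads of the same process that also use it, not against other processes.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* One lock shared by every thread that writes to the device fd. */
static pthread_mutex_t dev_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Write all len bytes while holding dev_lock, retrying short writes so
 * that no other cooperating thread can slip data in between the pieces.
 * Returns 0 on success and -1 if write() fails.
 */
int locked_write_all(int fd, const void *data, size_t len)
{
    const unsigned char *p = data;
    int rc = 0;

    pthread_mutex_lock(&dev_lock);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal; retry */
            rc = -1;
            break;
        }
        p += n;
        len -= (size_t)n;
    }
    pthread_mutex_unlock(&dev_lock);
    return rc;
}
```

A write()/read() request-response pair could be protected the same way by holding the lock across both calls.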

Why can't use C standard I/O with sockets

It's often said that one shouldn't use C standard I/O functions (like fprintf(), fscanf()) when working with sockets.
I can't understand why. I think if the reason was just in their buffered nature, one could just flush the output buffer each time he outputs, right?
Why does everyone use UNIX I/O functions instead? Are there any situations when the use of standard C functions is appropriate and correct?
You can certainly use stdio with sockets. You can even write a program that uses nothing but stdin and stdout, run it from inetd (which provides a socket on STDIN_FILENO and STDOUT_FILENO), and it works even though it doesn't contain any socket code at all.
What you can't do is mix buffered I/O with select or poll because there is no fselect or fpoll working on FILE *'s and you can't even implement one yourself because there's no standard way of querying a FILE * to find out whether its input buffer is empty.
As soon as you need to handle multiple connections, stdio is not good enough.
It's totally fine when you have simple scenario with one socket in blocking mode and your application protocol is text-based.
It quickly becomes a huge pain with more than one or non-blocking socket(s), with any sort of binary encoding, and with any real performance requirements.
I do not know of any direct objection. Most likely this will work fine.
At the same time, I can imagine a platform where fprintf() and fscanf() have their own buffers, sitting above the file descriptor layer. You may not be able to flush these buffers.
It is hard to speak about all possible platforms. This means that it is better to avoid this with sockets.
At the end of the day the app program should solve the app problem. It should not be a compiler/library test.
It's because sockets (TCP sockets, for example) are readable and writable as if they were files or pipes, but this is just an abstraction. The inner workings of a network connection are much more complicated than a local file or pipe.
To start with, reading a file is always "fast": either you get the data or you hit end-of-file. On the other hand, if you expect 500 bytes from a TCP connection and it sends 499 (and the connection is not closed), you may be waiting forever. Writing is the same: it will block once the TCP output buffer is full.
Even the most basic program needs to handle timeouts and disconnection, and all these things interact with FILE's own buffered I/O; not even textbook examples could be expected to work well.

how standard specify atomic write to regular file(not pipe or fifo)?

The POSIX standard specifies that a write of less than PIPE_BUF bytes to a pipe or FIFO is guaranteed to be atomic, that is, our write doesn't mix with other processes' writes. But I failed to find out what the standard specifies for regular files. I assume a write of less than PIPE_BUF bytes is also atomic there, but I want to know: do regular files have such a limit at all? The pipe has a capacity, so when a write goes beyond that capacity the kernel puts the writer to sleep and other processes get a chance to write. A regular file doesn't seem to need such a limit, am I right?
What I'm doing is several processes generate log to a file. Of course, with O_APPEND set.
Quote from http://pubs.opengroup.org/onlinepubs/9699919799/toc.htm (Single UNIX Specification, Version 4, 2010 Edition):
This volume of POSIX.1-2008 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control.
The specification does address the semantics of writes with respect to concurrent readers, but as you can see from the above, the behaviour for multiple concurrent writers is not defined by the specification.
Note that the above talks about files. For pipes and FIFOs the PIPE_BUF semantics apply: concurrent writes are guaranteed to be non-divisible up to PIPE_BUF bytes.
Write requests to a pipe or FIFO shall be handled in the same way as a regular file with the following exceptions:
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
For real file systems the situation is complex. Some local file systems may enforce atomic writes up to arbitrary sizes (memory limit) by locking a file handle during writing, some might not (I tried to look at ext4 logic, but lost track somewhere around http://lxr.linux.no/linux+v3.5.3/fs/jbd2/transaction.c#L147).
For non-local file systems the result is more or less up for grabs. Just don't try concurrent writing on a networked file system without some form of explicit locking (unless you're positively, absolutely sure about the semantics of the network file system you're using).
BTW, O_APPEND guarantees that all writes by different processes go to the end of the file. However, as the SUS quote above notes, if the writes are really concurrent (occurring at the same time), then the behavior is undefined. On earlier uniprocessor and non-pre-emptive UNIXes this didn't really matter, as a call to write(2) completed before someone else got a chance to write...
This question could be answered definitively for a specific combination of operating system (Linux?) and file system (ext4?). A general answer? As SUS reads -- "undefined behavior".
I think this is useful to you: "the data written by writev() is written as a single block that is not intermingled with output from writes in other processes", so you can use writev
Several writers to a file may mix up things. But files opened with O_APPEND are appended atomically per write access.
If you want to keep to the C stdio interface instead of the lower-level one, fopen the file with "a" or "a+" (which map to O_APPEND), set up a buffer large enough that stdio never needs to write in the middle of a record, and use fflush to force the write when you are done building each one. I'm not sure it is guaranteed by POSIX (C says nothing about that).
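The stdio approach might look like the sketch below. `struct logger`, `LOG_BUF_SIZE` and the record format are hypothetical, and as noted above nothing in C or POSIX promises that one flush becomes exactly one write(); that is an implementation detail this sketch merely relies on.

```c
#include <stdio.h>

/* Assumed upper bound on one log record; pick one that fits your data. */
#define LOG_BUF_SIZE 4096

struct logger {
    FILE *f;
    char buf[LOG_BUF_SIZE]; /* must outlive the stream */
};

/* Open path in append mode with a fully buffered stream. */
int log_open(struct logger *lg, const char *path)
{
    lg->f = fopen(path, "a");
    if (lg->f == NULL)
        return -1;
    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(lg->f, lg->buf, _IOFBF, sizeof lg->buf) != 0) {
        fclose(lg->f);
        lg->f = NULL;
        return -1;
    }
    return 0;
}

/* Build one record in the buffer, then hand it to the OS in one flush. */
int log_record(struct logger *lg, const char *who, int value)
{
    if (fprintf(lg->f, "%s: value=%d\n", who, value) < 0)
        return -1;
    return fflush(lg->f) == 0 ? 0 : -1;
}

int log_close(struct logger *lg)
{
    int rc = fclose(lg->f);
    lg->f = NULL;
    return rc == 0 ? 0 : -1;
}
```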
There is the ultimate solution to all questions of atomicity: a mutex. Wrap your writes to the log file in a mutex and all will be done atomically.
A simpler solution might be to use the GLOG libraries from Google. A fantastic logging system, far better than anything I ever dreamed up, free, not-GPL, and atomic.
One way to interleave them safely would be to have all writers lock the file, write, and unlock.
Functions that can be used for locking are flock(), lockf(), and fcntl().
Beware that ALL writers must lock (and they should all use the same mechanism to do the locking) or one that doesn't bother getting a lock could still write at the same time as another that holds a lock.
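The lock, write, unlock sequence could be sketched with fcntl() record locks as follows. `set_lock` and `locked_append` are hypothetical helpers, the fd is assumed to have been opened with O_APPEND, and the lock is advisory, so it only works if every writer uses the same scheme.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Apply a whole-file lock of the given type (F_WRLCK or F_UNLCK). */
static int set_lock(int fd, short type)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0; /* 0 means "through end of file", i.e. the whole file */

    int rc;
    do {
        rc = fcntl(fd, F_SETLKW, &fl); /* waits until the lock is granted */
    } while (rc < 0 && errno == EINTR);
    return rc;
}

/*
 * Append one record under an exclusive advisory lock.
 * Returns 0 on success, -1 on error.
 */
int locked_append(int fd, const void *rec, size_t len)
{
    const unsigned char *p = rec;
    int rc = 0;

    if (set_lock(fd, F_WRLCK) < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            rc = -1;
            break;
        }
        p += n;
        len -= (size_t)n;
    }

    if (set_lock(fd, F_UNLCK) < 0)
        rc = -1;
    return rc;
}
```

Note that fcntl() locks belong to the process, so they do not serialize threads within one process; those would still need a mutex.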

Using fseek/fwrite from multiple processes to write to different areas of a file?

I recently came across a bit of not-well-tested legacy code for writing data that's distributed across multiple processes (these are part of an MPI-based parallel computation) into the same file. Is this actually guaranteed to work?
It goes like this:
All processes open the same file for writing.
Each process calls fseek to seek to a different location within the file. This position may be past the end of the file.
Each process then writes a block of data into the file with fwrite. The seek locations and block sizes are such that these writes completely tile a section of the file -- no gaps, no overlaps.
Is this guaranteed to work, or will it sometimes fail horribly? There is no locking to serialize the writes, and in fact they are likely to be starting from a synchronization point. On the other hand, we can guarantee that they are writing to different file positions, unlike other questions which have had issues with trying to write to the "end of the file" from multiple processes.
It occurs to me that the processes may be on different machines that mount the file via NFS, which I suspect probably answers my question -- but, would it work if the file is local?
I believe this will typically work but there is no guarantee that I can find. The POSIX specifications for fwrite(3) defer to ISO C and neither standard mentions concurrency.
So I suspect it will typically work, but fseek(3) and fwrite(3) are buffered I/O functions, so success will depend on internal details of the library implementation. So, absolutely no guarantee but various reasons to expect that it will work.
Now, should the program use lseek(2) and write(2) then I believe you could consider the results guaranteed, but now it's restricted to POSIX operating systems.
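The unbuffered variant can be sketched with pwrite(), which is swapped in here for the lseek(2) plus write(2) pair because it takes the offset as an argument and leaves the file position alone. `write_block_at` is a hypothetical helper, and the sketch assumes a local file and non-overlapping blocks as described in the question.

```c
#define _POSIX_C_SOURCE 200809L /* for pwrite() */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes at absolute offset off, retrying short writes.
 * pwrite() does not move the file offset, so each process can write
 * its own tile without a separate seek.
 * Returns 0 on success, -1 on error.
 */
int write_block_at(int fd, const void *data, size_t len, off_t off)
{
    const unsigned char *p = data;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal; retry */
            return -1;
        }
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}
```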
One thing seems ... odd ... why would an MPI program decide to share its data via NFS and not the message API? It would seem slower, less portable, more prone to trouble, and generally just a waste of the MPI feature set. It certainly is no more distributed given the reliance on a single NFS server.

What amount of data does select (2) guarantee to be able to be written to a file without blocking

select (2) (amongst other things) tells me whether I can write to a fd of a file without blocking. However, does it guarantee me that I can write a full 4096 bytes without blocking?
Note I am interested in normal files on disk. Not sockets or the like.
In other words: does select signal when we can just write one single byte to a file fd without blocking, or does it signal when we can write n (4096, ... ?) bytes to a file fd without blocking.
Whenever select() indicates that your file is ready, you can try writing N bytes, for any N>0. write() will return the number of bytes actually written. If it equals N, you can write again. If it's less than N, then the next write will block.
Note: normal files on disk don't block. Sockets, pipes and terminals do.
You tagged this "Linux", so what does the kernel source code tell you? It should be pretty easy to read the syscall implementation to find when select decides to treat a file descriptor as ready for writing.
If you're worried about blocking, though, you're doing it wrong. If you don't want to block, use O_NONBLOCK or equivalents. Even if select did guarantee a certain number of bytes could be written without blocking, that would only be true at the time select returns; it might not necessarily be true by the time you actually perform the write.
"Note I am interested in normal files on disk. Not sockets or the like."
select does not "work" with normal files, only sockets/pipes/ttys and possibly others, but not regular files. For regular files select will always signal the file descriptor as readable/writable - thus it is a rather useless exercise to use select with files.
Note that this applies to other I/O multiplexing facilities as well, such as poll/epoll. AIO will do asynchronous I/O to regular files, but operating system support might vary, and it is a rather complex API to use.
As to how much data you can write, there is no promise. 4096 is no magical number that select assumes you can write without blocking, when applied to file descriptors where using select does make sense (sockets/pipes/etc.). Because you can't know how much data you can write without blocking, you should always set the file descriptor to non-blocking, record how much was actually written as indicated by the return value of write/send, and start writing from that point the next time select indicates you can write data again.
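The bookkeeping described above can be sketched as follows. `struct out_buf`, `set_nonblocking` and `out_buf_pump` are hypothetical names; the idea is only that the caller remembers how far it got and resumes from there whenever select reports the descriptor writable again.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Pending output plus a record of how much has been sent so far. */
struct out_buf {
    const unsigned char *data;
    size_t len;
    size_t sent;
};

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Write as much as the fd accepts right now, resuming at b->sent.
 * Returns 1 when everything has been written, 0 if the fd would block
 * (call again after select says it is writable), and -1 on error.
 */
int out_buf_pump(struct out_buf *b, int fd)
{
    while (b->sent < b->len) {
        ssize_t n = write(fd, b->data + b->sent, b->len - b->sent);
        if (n > 0) {
            b->sent += (size_t)n;
        } else if (n < 0 && errno == EINTR) {
            continue;
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return 0;
        } else {
            return -1; /* real error, or an unexpected zero-byte write */
        }
    }
    return 1;
}
```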
select() only promises that the applicable call can be made without blocking; it does not guarantee an I/O amount (4096 in your case). Since select() can be used with different types of descriptors (files, sockets, serial connections, etc.), you may notice that for disk operations the observed behavior is that a full buffer can always be written, but again this is specific to the particular underlying operation and not a promise of select().
