When writing directly to a device in /dev, I open a file descriptor and perform a UNIX write() followed by a read(). Can I have multiple threads doing this write()/read() sequence on the same file descriptor, and not get jumbled data if two threads enter the write() function at the same time?
References to standard documentation would be immensely helpful; I've not been able to find anything, though. Someone has mentioned that such operations are atomic in the kernel, but I am sceptical.
Also, to clarify this is a file in /dev, so any insight as to how far the "file pointer" concept applies here is helpful as well.
File pointers (FILE *fp, for example) are a layer in the user-side code sitting above the function calls (such as write()). Access to fp is controlled by locks in a threaded environment (you can't have two threads modifying the same structure at the same time).
Inside the kernel, I'd expect there to be a lock on the file descriptor (and/or 'open file description') to prevent it being used from two threads at once.
You can look up the POSIX specifications for read() and getchar_unlocked() to find out more about locking etc. — at least for a POSIX-compliant implementation.
Note that POSIX still uses C99. Therefore, it is not cognizant of the C11 thread facilities. The C11 standard does not have read() et al (file I/O using file descriptors), so it says nothing about such system calls. Neither does it provide a getchar_unlocked() or any of its relatives.
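As an aside, here is a minimal sketch of the user-side stdio locking described above, using the POSIX flockfile()/funlockfile() pair; the shared stream logfp is hypothetical, and this serializes stdio access only, not raw fd access:

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>

    /* Hypothetical shared stream; assume it was opened elsewhere. */
    extern FILE *logfp;

    void log_record(const char *tag, const char *msg)
    {
        /* flockfile() takes the stream's internal lock, so the two fprintf()
           calls below cannot be interleaved with output from other threads
           using the same FILE *. */
        flockfile(logfp);
        fprintf(logfp, "[%s] ", tag);
        fprintf(logfp, "%s\n", msg);
        funlockfile(logfp);
    }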
Caveat: I have not been in kernels for a while, but this is the way it used to work.
For disk files:
Can you open the file in append mode, writing block sizes <= BLKSIZE?
Small enough block sizes guarantee atomic writes in POSIX environments (actually, the limit may be greater than BLKSIZE... I'm too lazy to hunt around for the alternate symbol).
Append mode guarantees seeks to the end of the file... for devices supporting seeks. Combined with atomic writes, you should be golden (see the sketch after this answer).
Each buffer must stand by itself under the assumption some "foreign" information may follow it.
For ttys:
Append mode makes no sense here. As before, but paying attention to line endings becomes even more important. And this very much does not apply to reads. Byte sequences that ttys treat as control sequences can also trip you up if a sequence, or the mode it enables, is split across blocks.
For other devices:
Can get tricky here. Depends on the device.
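Putting the disk-file advice into code, here is a minimal sketch of appending self-contained records with one small write() per record; the path and record contents are made up for illustration:

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        /* O_APPEND: every write() atomically positions at the current end of
           file.  Keep each record small enough (the BLKSIZE caveat above)
           and write it in a single call so records don't interleave. */
        int fd = open("/tmp/records.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char record[] = "event=example id=42\n";  /* one self-contained record */
        ssize_t n = write(fd, record, sizeof record - 1);
        if (n != (ssize_t)(sizeof record - 1))
            fprintf(stderr, "short or failed write: %zd\n", n);

        close(fd);
        return 0;
    }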
I'm going to assume that you are referring to a generic character device (e.g. a tty) since you were not specific. As far as I know, each fd-type operation (e.g. read()/write()) maps directly into a call into the driver.
Therefore, the driver will receive each write()'s data chunk as a whole and not see the next one's data until it is done (e.g. data is queued to be transmitted).
However, if the driver does not consume the entire chunk of data at once (i.e. write() returns less than the specified number of bytes), then there is no guarantee that the thread will be able to write the remainder before another thread does a different write().
Also, as Jonathan Leffler noted, if you use standard I/O with process-level buffering, all bets are off.
Bottom line: if you are using direct fd writes, each write will map directly to one driver function call. From there, it's up to the driver whether the write is atomic.
Edit: wlformyd brings up the question of locking between multiple threads on multiple processors. To my knowledge, there is no locking on a FD and, in fact, that would be ineffective as multiple FDs could be used to access the same device.
I believe it is up to the driver itself to do locking to prevent contention on internal queues and/or hardware. In that sense, on a multi-processor system, the kernel doesn't prevent multiple simultaneous accesses to a driver's write routine. However, a properly written driver should do the locking to prevent mixing of output between two write calls.
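Since the kernel won't serialize one thread's write()+read() pair against another's, the practical answer for the original question is to serialize the whole transaction in user space. A minimal sketch, assuming a hypothetical request/response device protocol:

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t dev_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Perform one command/response exchange on a shared device fd.
       The mutex ensures no other thread's write() or read() lands
       between this thread's write() and the matching read(). */
    ssize_t device_transaction(int fd, const void *cmd, size_t cmdlen,
                               void *resp, size_t resplen)
    {
        ssize_t n;

        pthread_mutex_lock(&dev_lock);
        if (write(fd, cmd, cmdlen) != (ssize_t)cmdlen) {
            pthread_mutex_unlock(&dev_lock);
            return -1;
        }
        n = read(fd, resp, resplen);   /* response to *our* command */
        pthread_mutex_unlock(&dev_lock);
        return n;
    }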
Background: Our kernel-level program invokes a process in user space to make some decisions on the basis of values in a file. The user-space program is a short-lived process that compares a value passed by the kernel with the file contents. Usually, many instances of the user-space program can be invoked at a time. The file has fewer than one thousand lines.
Question: What is the preferred way to read a small file that is shared among many short-lived processes? Currently we are using file I/O (fopen, fread).
Note: The question "When should I use mmap for file access?" discusses this very nicely, but there is no discussion of the case of many short-lived processes.
What is the preferred way to read a small file that is shared among many short-lived processes?
getline() or fread() using standard POSIX I/O from <stdio.h>, or low-level <unistd.h> open() and read() into a sufficiently large buffer (with a sufficiently aggressive growth policy), depending on how the read data is parsed/interpreted.
You don't use memory mapping for reading a file once; it is just not as efficient as read()/fread(), due to the mapping overhead.
Note that if the file contains many numbers, the actual bottleneck is the string-to-integer and string-to-floating-point conversions (strtol(), strtod(), sscanf(), etc.), because if accessed often enough the file contents will stay in the page cache. The standard implementations of string conversion functions are designed for correctness, not for efficiency.
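For illustration, a minimal sketch of the low-level open()/read() approach with a doubling buffer; the helper name slurp() and the growth policy are my own choices, not a standard API:

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>

    /* Read the whole file into a malloc'd buffer; returns NULL on error.
       *lenp receives the number of bytes read. */
    char *slurp(const char *path, size_t *lenp)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;

        size_t cap = 4096, len = 0;
        char *buf = malloc(cap);
        ssize_t n = 0;

        while (buf && (n = read(fd, buf + len, cap - len)) > 0) {
            len += (size_t)n;
            if (len == cap) {                      /* aggressive growth: double */
                char *tmp = realloc(buf, cap *= 2);
                if (!tmp) { free(buf); buf = NULL; }
                else buf = tmp;
            }
        }
        close(fd);
        if (!buf || n < 0) { free(buf); return NULL; }   /* alloc or read error */
        *lenp = len;
        return buf;
    }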
Our kernel level program invokes a process in user space for making some decisions on the basis of values in a file.
Seems very inefficient to me. Personally, I'd keep the "file" in-kernel, as a structure, and only expose a userspace interface, probably a character device, to modify its contents.
That way you only incur a context switch whenever the "file" is changed by a userspace process, and kernel-space code can simply examine the contents of the structure directly, in native format, with no overhead.
This is what e.g. netfilter (built-in firewall) and other existing stuff do.
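A minimal sketch of that idea, using the Linux misc-device API; the device name /dev/mycfg, the integer "contents", and the locking are assumptions for illustration, not an existing driver:

    #include <linux/module.h>
    #include <linux/miscdevice.h>
    #include <linux/fs.h>
    #include <linux/uaccess.h>
    #include <linux/mutex.h>
    #include <linux/kernel.h>

    static DEFINE_MUTEX(cfg_lock);
    static long cfg_value;              /* the in-kernel "file" contents */

    /* Kernel-side code reads the value directly; no file I/O involved. */
    long mycfg_get(void)
    {
        long v;
        mutex_lock(&cfg_lock);
        v = cfg_value;
        mutex_unlock(&cfg_lock);
        return v;
    }

    /* Userspace updates it with e.g.: echo 42 > /dev/mycfg */
    static ssize_t mycfg_write(struct file *f, const char __user *ubuf,
                               size_t len, loff_t *off)
    {
        char kbuf[32];
        long v;

        if (len >= sizeof(kbuf))
            return -EINVAL;
        if (copy_from_user(kbuf, ubuf, len))
            return -EFAULT;
        kbuf[len] = '\0';
        if (kstrtol(kbuf, 10, &v))
            return -EINVAL;

        mutex_lock(&cfg_lock);
        cfg_value = v;
        mutex_unlock(&cfg_lock);
        return len;
    }

    static const struct file_operations mycfg_fops = {
        .owner = THIS_MODULE,
        .write = mycfg_write,
    };

    static struct miscdevice mycfg_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "mycfg",
        .fops  = &mycfg_fops,
    };

    module_misc_device(mycfg_dev);
    MODULE_LICENSE("GPL");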
POSIX compliance is a standard that is followed by many companies.
I have a few questions in this area:
1. Do all file systems need to be POSIX compliant?
2. Are applications also required to be POSIX compliant?
3. Are there any non-POSIX file systems?
In the area of "requires POSIX filesystem semantics" what is typically meant is:
allows hierarchical file names and resolution (., .., ...)
supports at least close-to-open semantics
umask/unix permissions, 3 filetimes
8bit byte support
supports atomic renames on same filesystem
fsync() (including fsync() on directories) durability guarantee/limitation
supports multi-user protection (e.g. resizing a file returns 0 bytes, not previous content)
rename and delete open files (Windows does not do that)
file names supporting all bytes besides '/' and '\0'
Sometimes it also means symlink/hardlink support, as well as file names and 32-bit file pointers (at minimum). In some cases it is also used to refer to specific API features like fcntl() locking, mmap(), truncate(), or AIO.
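Two of those bullets, atomic rename and fsync() durability, combine into the classic safe-update pattern; a minimal sketch with made-up paths:

    #include <fcntl.h>
    #include <unistd.h>

    /* Atomically replace 'final' with new contents: readers see either the
       old file or the new one, never a partially written file. */
    int replace_file(const char *dir, const char *tmp, const char *final,
                     const void *data, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd); unlink(tmp); return -1;
        }
        close(fd);

        if (rename(tmp, final) != 0) { unlink(tmp); return -1; }

        /* fsync the containing directory so the rename itself is durable. */
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd >= 0) { fsync(dfd); close(dfd); }
        return 0;
    }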
When I think about POSIX compliance for distributed file systems, I use the general standard that a distributed file system is POSIX compliant if multiple processes running on different nodes see the same behavior as if they were running on the same node using a local file system. This basically has two implications:
If the system has multiple buffer-caches, it needs to ensure cache consistency.
Various mechanisms to do so include locks and leases. An example of incorrect behavior in this case would be a writer who writes successfully on one node but then a reader on a different node receives old data.
Note, however, that if the writer and reader are independently racing one another, there is no correct defined behavior, because they do not know which operation will occur first. But if they are coordinating with each other via some mechanism like messaging, then it would be incorrect if the writer completes (especially if it issues a sync call), sends a message to the reader which is successfully received, and the reader then reads and gets stale data.
If data is striped across multiple data servers, reads and writes that span multiple stripes must be atomic.
For example, when a reader reads across stripes at the same time as a writer writes across those same stripes, then the reader should either receive all stripes as they were before the write or all stripes as they were after the write. Incorrect behavior would be for the reader to receive some old and some new.
Unlike the above, this behavior must be correct even when the writer and reader are racing.
Although my examples were reads/writes to a single file, correct behavior also includes write/write access to a single file, as well as read/write and write/write access to the hierarchical namespace via calls such as stat/readdir/mkdir/unlink/etc.
Answering your questions in a very objective way:
1. Do all file systems need to be POSIX compliant?
Actually, no. In fact, POSIX defines some standards for operating systems in general. They are good to have, but not really required.
2. Are applications also required to be POSIX compliant?
No.
3. Are there any non-POSIX file systems?
HDFS (the Hadoop Distributed File System)
The pipe(7) man page (man 7 pipe) documents writing to the pipe very well.
The part important to me is that a write will only ever be partially completed if O_NONBLOCK is set and the write length is greater than PIPE_BUF.
However, nothing is said about the read end.
I am sending structures representing events through my pipe in blocking mode at the write end.
At the read end, I am processing those events (and other things) in an update loop in non-blocking mode.
Since my struct is smaller than PIPE_BUF, will read() ALWAYS read a whole number of structs? Or do I need to handle the possibility of only part of my struct being read?
Common sense tells me that read behavior will mirror the documented write behavior, but I would be happier if this were specified.
I'm working on Linux (kernel 3.8, x86_64), but it is important that my code is portable across different UNIX flavors and CPU architectures.
Thanks.
Chris.
The comments are right: read is not atomic. The whole point of atomicity of write is to allow multiple writers without corruption from interleaving data. Multiple readers are much less useful, but even if they were useful, supporting atomic reads would require maintaining packet boundaries in pipes, which do not exist.
Reads from a pipe are not atomic.
From the standard's page for read():
The standard developers considered adding atomicity requirements to a pipe or FIFO, but recognized that due to the nature of pipes and FIFOs there could be no guarantee of atomicity of reads of {PIPE_BUF} or any other size that would be an aid to applications portability.
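Given that reads are not atomic, the read end has to reassemble records itself. A minimal sketch, assuming a hypothetical struct event and the non-blocking read end described in the question; buf and filled persist across calls and together hold the partially read struct:

    #include <errno.h>
    #include <unistd.h>
    #include <string.h>

    struct event { int type; int payload; };   /* hypothetical event record */

    /* Accumulates bytes until a full struct event is available.
       Returns 1 when *ev is filled, 0 if more data is needed (EAGAIN or a
       short read), -1 on error/EOF.  'buf' must hold sizeof(struct event). */
    int read_event(int fd, struct event *ev, unsigned char *buf, size_t *filled)
    {
        ssize_t n = read(fd, buf + *filled, sizeof *ev - *filled);
        if (n < 0)
            return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
        if (n == 0)
            return -1;                 /* writer closed the pipe */

        *filled += (size_t)n;
        if (*filled < sizeof *ev)
            return 0;                  /* partial struct: wait for the rest */

        memcpy(ev, buf, sizeof *ev);
        *filled = 0;
        return 1;
    }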
The POSIX standard specifies that writes of less than PIPE_BUF bytes to a pipe or FIFO are guaranteed to be atomic, that is, our write doesn't mix with other processes' writes. But I failed to find out what the standard says about regular files. Is it also true that when we write less than PIPE_BUF bytes to a regular file, the write is guaranteed to be atomic? Does a regular file have such a limit at all? I mean, a pipe has a fixed capacity, so when a write to the pipe exceeds its capacity the kernel will put the writer to sleep and other processes will get a chance to write, but a regular file doesn't seem to need such a limitation. Am I right?
What I'm doing: several processes generate log entries and write them to a file. Of course, with O_APPEND set.
Quote from http://pubs.opengroup.org/onlinepubs/9699919799/toc.htm (Single UNIX Specification, Version 4, 2010 Edition):
This volume of POSIX.1-2008 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control.
The specification does address the semantics of writes in the presence of multiple readers, but as you can see from the above, the behaviour for multiple, concurrent writers is not defined by the specification.
Note that the above talks about regular files. For pipes and FIFOs the PIPE_BUF semantics apply: concurrent writes are guaranteed to be non-divisible up to PIPE_BUF bytes.
Write requests to a pipe or FIFO shall be handled in the same way as a regular file with the following exceptions:
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
For real file systems the situation is complex. Some local file systems may enforce atomic writes up to arbitrary sizes (memory limit) by locking a file handle during writing, some might not (I tried to look at ext4 logic, but lost track somewhere around http://lxr.linux.no/linux+v3.5.3/fs/jbd2/transaction.c#L147).
For non-local file systems the result is more or less up for grabs. Just don't try concurrent writing on a networked file system without some form of explicit locking (unless you're positively, absolutely sure about the semantics of the network file system you're using).
BTW, O_APPEND guarantees that all writes by different processes go to the end of the file. However, as SUS above notes, if the writes are really concurrent (occurring at the same time), then the behavior is undefined. On earlier uniprocessor and non-preemptive UNIXes this didn't really matter, as a call to write(2) completed before someone else got a chance to write...
This question could be answered definitely for specific combination of operating system (Linux?) and file system (ext4?). A general answer? As SUS reads -- "undefined behavior".
I think this is useful to you: "the data written by writev() is written as a single block that is not intermingled with output from writes in other processes", so you can use writev().
Several writers to a file may mix things up. But files opened with O_APPEND are appended atomically per write access.
If you want to keep to the C stdio interface instead of the lower-level one, fopen() the file with "a" or "a+" (which map to O_APPEND), set up a buffer large enough that there is no need to write inside your records, and use fsync to force the write when you are done building them. I'm not sure it is guaranteed by POSIX (C says nothing about that).
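A minimal sketch of that stdio approach; the buffer size and record format are arbitrary, and I use fflush() to push each completed record out of the stdio buffer in a single underlying write, which is one way to read the advice above:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *fp = fopen("app.log", "a");          /* "a" maps to O_APPEND */
        if (!fp) { perror("fopen"); return 1; }

        /* Big enough that a whole record never forces a write mid-record. */
        static char buf[64 * 1024];
        setvbuf(fp, buf, _IOFBF, sizeof buf);

        fprintf(fp, "pid=%ld event=%s\n", (long)getpid(), "startup");
        fflush(fp);                                /* push the record out */

        fclose(fp);
        return 0;
    }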
There is the ultimate solution to all questions of atomicity: a mutex. Wrap your writes to the log file in a mutex and all will be done atomically.
A simpler solution might be to use the GLOG libraries from Google. A fantastic logging system, far better than anything I ever dreamed up, free, not-GPL, and atomic.
One way to interleave them safely would be to have all writers lock the file, write, and unlock.
Functions that can be used for locking are flock(), lockf(), and fcntl().
Beware that ALL writers must lock (and they should all use the same mechanism to do the locking) or one that doesn't bother getting a lock could still write at the same time as another that holds a lock.
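For example, a minimal flock()-based sketch; the function name and record format are illustrative, and every cooperating writer must take the same lock:

    #include <sys/file.h>
    #include <unistd.h>
    #include <string.h>

    int append_locked(int fd, const char *line)
    {
        /* Block until we hold the exclusive lock; every cooperating writer
           must do the same, or the lock protects nothing. */
        if (flock(fd, LOCK_EX) != 0)
            return -1;

        ssize_t n = write(fd, line, strlen(line));

        flock(fd, LOCK_UN);
        return n == (ssize_t)strlen(line) ? 0 : -1;
    }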
I recently came across a bit of not-well-tested legacy code for writing data that's distributed across multiple processes (these are part of an MPI-based parallel computation) into the same file. Is this actually guaranteed to work?
It goes like this:
All processes open the same file for writing.
Each process calls fseek to seek to a different location within the file. This position may be past the end of the file.
Each process then writes a block of data into the file with fwrite. The seek locations and block sizes are such that these writes completely tile a section of the file -- no gaps, no overlaps.
Is this guaranteed to work, or will it sometimes fail horribly? There is no locking to serialize the writes, and in fact they are likely to be starting from a synchronization point. On the other hand, we can guarantee that they are writing to different file positions, unlike other questions which have had issues with trying to write to the "end of the file" from multiple processes.
It occurs to me that the processes may be on different machines that mount the file via NFS, which I suspect probably answers my question -- but, would it work if the file is local?
I believe this will typically work but there is no guarantee that I can find. The Posix specifications for fwrite(3) defer to ISO C and neither standard mentions concurrency.
I suspect it will typically work, but fseek(3) and fwrite(3) are buffered I/O functions, so success will depend on internal details of the library implementation. So: absolutely no guarantee, but various reasons to expect that it will work.
Now, should the program use lseek(2) and write(2) then I believe you could consider the results guaranteed, but now it's restricted to Posix operating systems.
One thing seems ... odd ... why would an MPI program decide to share its data via NFS and not the message API? It would seem slower, less portable, more prone to trouble, and generally just a waste of the MPI feature set. It certainly is no more distributed given the reliance on a single NFS server.
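If the code is ever reworked to use the low-level interface, pwrite() sidesteps the shared file position entirely: each process writes its block at an absolute offset in one call. A minimal sketch; the rank/block-size layout is my reading of the question, not code from it:

    #include <fcntl.h>
    #include <unistd.h>

    /* Each process writes its own non-overlapping block at rank * blksize.
       pwrite() takes an explicit offset, so nothing depends on a shared
       file position being preserved between a seek and a write. */
    int write_my_block(const char *path, int rank, const void *block, size_t blksize)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) return -1;

        off_t off = (off_t)rank * (off_t)blksize;
        ssize_t n = pwrite(fd, block, blksize, off);

        close(fd);
        return n == (ssize_t)blksize ? 0 : -1;
    }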