read() from files - blocking vs. non-blocking behavior - c

Let's assume we opened a file using fopen() and, from the file pointer received, fetched the file descriptor using fileno(). Then we do lots (>10^8) of random read()s of relatively small chunks, between 4 bytes and 10 KBytes in size, from this file:
Is it expected behaviour that such a read() might return fewer bytes than requested, without setting errno, if the file system is
1. ext3
2. NFS
3. OCFS2
4. a combination of 2 and 3 (OCFS2 via NFS)?
My reading led me to conclude that it should not be possible for 1. (as long as the file does not have O_NONBLOCK set, if it is even possible for ext3 to have it set), but for the other three (2., 3., 4.) I'm uncertain.
(Btw: Could I assume having O_NONBLOCK not set to be the default in any case?)
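Rather than assuming the default, the flag can be inspected at run time. The following is a minimal sketch, assuming a POSIX system; the helper name stream_is_nonblocking is made up for illustration and is not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if O_NONBLOCK is set on the descriptor
 * behind fp, 0 if it is clear, and -1 if the flags could not be read. */
int stream_is_nonblocking(FILE *fp)
{
    assert(fp != NULL);

    int fd = fileno(fp);
    if (fd == -1)
        return -1;

    /* F_GETFL returns the file status flags, which include O_NONBLOCK. */
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;

    return (flags & O_NONBLOCK) ? 1 : 0;
}
```

Calling this once right after fopen() would show whether the descriptor really is in blocking mode before the read()s start.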
This question arose because I observed read()s returning fewer bytes than requested, without errno set, in case 4.
The problem with drilling down by testing is that such behaviour happens in fewer than 1 in 1,000,000,000 cases ... - which is still too often :-}
Update: The average file size is somewhere between around 1 GByte and a few TBytes.

You should not assume that read() will never return fewer bytes than requested, for any filesystem. This is particularly true in the case of large reads, as POSIX.1 indicates that read() behavior for sizes larger than SSIZE_MAX is implementation-dependent. On this mainstream Unix box I'm using right now, SSIZE_MAX is 32767 bytes. The fact that read() always returns the full amount today does not mean that it will in the future.
One possible reason might be that I/O priorities are more fully fleshed out in the kernel in the future. E.g. you're trying to read from the same device as another higher priority process and the other process would get better throughput if your process wasn't causing head movement away from the sectors the other process wants. The kernel might choose to give your read() a short count to get you out of the way for a while, instead of continuing to do inefficient interleaved block reads. Stranger things have been done for the sake of I/O efficiency. What is not prohibited often becomes compulsory.

We solved the problem described above, where read() returned fewer bytes than requested when reading from a file located on an NFS mount pointing to an OCFS2 file system (case 4 in my question).
It is a fact that, with the setup mentioned above, read()s on file descriptors sometimes return fewer bytes than requested without errno being set.
To get all the data, it is enough to simply read() again and again until the requested amount of data has been read.
Moreover, this setup sometimes makes read() fail with EIO, and even then a simple re-read() succeeds and the data arrives.
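The retry approach described above can be sketched roughly as follows. This is an illustration only, not the code we actually run: the name read_full and the MAX_EIO_RETRIES limit are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical limit on how often a read failing with EIO is retried. */
enum { MAX_EIO_RETRIES = 3 };

/* Reads up to count bytes from fd, calling read() repeatedly until the
 * request is satisfied, end-of-file is reached, or an error occurs.
 * Returns the number of bytes read (less than count only at EOF),
 * or -1 with errno set. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    assert(buf != NULL || count == 0);

    unsigned char *p = buf;
    size_t done = 0;
    int eio_retries = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);
        if (n > 0) {
            done += (size_t)n;      /* possibly a short read: keep going */
            eio_retries = 0;
        } else if (n == 0) {
            break;                  /* end-of-file */
        } else if (errno == EINTR) {
            continue;               /* interrupted before any data arrived */
        } else if (errno == EIO && eio_retries < MAX_EIO_RETRIES) {
            eio_retries++;          /* transient EIO, as seen on OCFS2 via NFS */
        } else {
            return -1;              /* any other error is reported as-is */
        }
    }
    return (ssize_t)done;
}
```

Bounding the EIO retries is a design choice of this sketch, so that a genuinely failing device still surfaces as an error instead of looping forever.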
My conclusion: Reading via OCFS2 via NFS makes read()ing from files behave like read()ing from sockets which is inconsistent with the specifications of read() http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html :
When attempting to read a file (other than a pipe or FIFO) that
supports non-blocking reads and has no data currently available:
If O_NONBLOCK is set, read() shall return -1 and set errno to [EAGAIN].
If O_NONBLOCK is clear, read() shall block the calling thread until some data becomes available.
Needless to say, we never tried, nor even thought about, setting O_NONBLOCK on the file descriptors in question.

Related

Understanding read syscall

I'm reading the read manual page (man read) and discovered that it is possible to read fewer than the number of bytes passed in as a parameter:
It is not an error if this number is smaller than the number of bytes
requested; this may happen for example because fewer bytes are
actually available right now (maybe because we were close to
end-of-file, or because we are reading from a pipe, or from a
terminal), or because read() was interrupted by a signal.
I have the following situation:
1. Some process moves a file into a directory that I'm watching for IN_MOVED_TO inotify events.
2. I receive an IN_MOVED_TO event, open the file and read it until EOF is reached.
3. No other process modifies the file moved in step 1 (after it is moved, it is left unchanged).
Is it guaranteed that if read returns fewer bytes than I requested, then the next call to read will return 0? I mean, is a situation like 'reading a gigabyte file in 1 000 000 000 single-byte reads' forbidden by the documentation?
Is it guaranteed that if read returns fewer bytes than I requested, then the next call to read will return 0?
No, not in practice. It should be true if the file system is entirely POSIX compliant, but many of them are not (in corner cases). In particular NFS (see nfs(5)) and FUSE or proc (see proc(5)) are not exactly POSIX compliant.
So in practice I strongly recommend handling the "read returns a smaller number of bytes than wanted" case, even if you are right to believe that it should not happen. Handling that "impossible" case should be easy for you.
Notice also that inotify(7) facilities don't work with bizarre filesystems like NFS, proc, FUSE, ... Think also of corner cases like, inside an Ext4 file system, a symlink to an NFS file; or bind mounts, etc...

Is write() safe to be called from multiple threads simultaneously?

Assuming I have opened /dev/poll as mDevPoll, is it safe for me to call code like this
struct pollfd tmp_pfd;
tmp_pfd.fd = fd;
tmp_pfd.events = POLLIN;
// Write pollfd to /dev/poll
write(mDevPoll, &tmp_pfd, sizeof(struct pollfd));
...simultaneously from multiple threads, or do I need to add my own synchronisation primitive around mDevPoll?
Solaris 10 claims to be POSIX compliant. The write() function is not among the handful of system interfaces that POSIX permits to be non-thread-safe, so we can conclude that on Solaris 10, it is safe in a general sense to call write() simultaneously from two or more threads.
POSIX also designates write() among those functions whose effects are atomic relative to each other when they operate on regular files or symbolic links. Specifically, it says that
If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them.
If your writes were directed to a regular file then that would be sufficient to conclude that your proposed multi-thread actions are safe, in the sense that they would not interfere with one another, and the data written in one call would not be commingled with that written by a different call in any thread. Unfortunately, /dev/poll is not a regular file, so that does not apply directly to you.
You should also be aware that write() is not in general required to transfer the full number of bytes specified in a single call. For general purposes, one must therefore be prepared to transfer the desired bytes over multiple calls, by using a loop. Solaris may provide applicable guarantees beyond those expressed by POSIX, perhaps specific to the destination device, but absent such guarantees it is conceivable that one of your threads performs a partial write, and the next write is performed by a different thread. That very likely would not produce the results you want or expect.
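The loop mentioned above might look like the following minimal sketch. The name write_all is made up for this example, and nothing here is specific to Solaris or /dev/poll.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling write() until all count bytes have
 * been transferred. Returns 0 on success, -1 with errno set on error. */
int write_all(int fd, const void *buf, size_t count)
{
    assert(buf != NULL || count == 0);

    const unsigned char *p = buf;

    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything */
            return -1;
        }
        p += n;                 /* partial write: advance past what was sent */
        count -= (size_t)n;
    }
    return 0;
}
```

Note that such a loop does not by itself protect against interleaving: if several threads share the descriptor, another thread's write() could land between two iterations, so the loop as a whole would still need to be serialized, for example with a mutex.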
It's not safe in theory, even though write() is completely thread-safe (barring implementation bugs...). Per the POSIX write() standard (emphasis mine):
The write() function shall attempt to write nbyte bytes from the
buffer pointed to by buf to the file associated with the open file
descriptor, fildes.
...
RETURN VALUE
Upon successful completion, these functions shall return the number of bytes actually written ...
There is no guarantee that you won't get a partial write(), so even if each individual write() call is atomic, it's not necessarily complete, so you could still get interleaved data because it may take more than one call to write() to completely write all data.
In practice, if you're only doing relatively small write() calls, you will likely never see a partial write(), with "small" and "likely" being indeterminate values dependent on your implementation.
I've routinely delivered code that uses unlocked single write() calls on regular files opened with O_APPEND in order to improve the performance of logging - build a log entry then write() the entire entry with one call. I've never seen a partial or interleaved write() result over almost a couple of decades of doing that on Linux and Solaris systems, even when many processes write to the same log file. But then again, it's a text log file and if a partial or interleaved write() does happen there would be no real damage done or even data lost.
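As a rough illustration of that logging pattern (not the code I actually shipped), the sketch below formats a complete entry into a buffer and hands it to the kernel with one write() on a descriptor opened with O_APPEND. The names log_open and log_line, the entry format and the 512-byte limit are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Opens (or creates) a log file so that every write() appends at the end. */
int log_open(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
}

/* Builds one complete entry, then issues exactly one write() for it.
 * Returns 0 if the whole entry was written, -1 otherwise. */
int log_line(int fd, const char *tag, const char *msg)
{
    assert(tag != NULL && msg != NULL);

    char entry[512];
    int len = snprintf(entry, sizeof entry, "[%s] %s\n", tag, msg);
    if (len < 0 || (size_t)len >= sizeof entry)
        return -1;              /* formatting failed or entry too long */

    /* No retry loop on purpose: a second write() could be separated
     * from the first by another writer's entry. */
    ssize_t n = write(fd, entry, (size_t)len);
    return n == (ssize_t)len ? 0 : -1;
}
```

A partial write is reported as a failure here rather than completed with a second call, which matches the "no real damage done" trade-off accepted for a text log.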
In this case, though, you're "writing" a handful of bytes to a kernel structure. You can dig through the Solaris /dev/poll kernel driver source code at Illumos.org and see how likely a partial write() is. I'd suspect it's practically impossible - because I just went back and looked at the multiplatform poll class that I wrote for my company's software library a decade ago. On Solaris it uses /dev/poll and unlocked write() calls from multiple threads. And it's been working fine for a decade...
Solaris /dev/poll Device Driver Source Code Analysis
The (Open)Solaris source code can be found here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/devpoll.c#628
The dpwrite() function is the code in the /dev/poll driver that actually performs the "write" operation. I use quotes because it's not really a write operation at all - data isn't transferred as much as the data in the kernel that represents the set of file descriptors being polled is updated.
Data is copied from user space into kernel space - to a memory buffer obtained with kmem_alloc(). I don't see any possible way that can be a partial copy. Either the allocation succeeds or it doesn't. The code can get interrupted before doing anything, as it waits for exclusive write() access to the kernel structures.
After that, the last return call is at the end - and if there's no error, the entire call is marked successful, or the entire call fails on any error:
	if (error == 0) {
		/*
		 * The state of uio_resid is updated only after the pollcache
		 * is successfully modified.
		 */
		uioskip(uiop, copysize);
	}
	return (error);
}
If you dig through Solaris kernel code, you'll see that uio_resid is what ends up determining the value returned by write() after a successful call.
So the call certainly appears to be all-or-nothing. While there appear to be ways for the code to return an error on a file descriptor after successfully processing an earlier descriptor when multiple descriptors are passed in, the code doesn't appear to return any partial success indications.
If you're only processing one file descriptor at a time, I'd say the /dev/poll write() operation is completely thread-safe, and it's almost certainly thread-safe for "writing" updates to multiple file descriptors as there's no apparent way for the driver to return a partial write() result.

Is read() on a nonblocking socket "greedy" on platforms other than Linux (OSX, FreeBSD)?

Consider the following invocation of read() on a nonblocking stream-mode socket (SOCK_STREAM):
ssize_t n = read(socket_fd, buffer, size);
Assume that the remote peer will not close the connection, and will not shut down its writing half of the connection (the reading half, from a local point of view).
On Linux, a short read (n > 0 && n < size) under these circumstances means that the kernel-level read buffer has been exhausted, and an immediate follow-up invocation would normally fail with EAGAIN/EWOULDBLOCK (it would fail unless new data manages to arrive in between the two calls).
In other words, on Linux, an invocation of read() will always consume everything that is immediately available provided that size is large enough.
Likewise for write(), on Linux a short write always means that the kernel-level buffer was filled, and an immediate follow-up invocation is likely to fail with EAGAIN/EWOULDBLOCK.
Question 1: Is this also guaranteed on macOS/OSX?
Question 2: Is this also guaranteed on FreeBSD?
Question 3: Is this required/guaranteed by POSIX?
I know this is true on Linux, because of the following note in the manual page for epoll (section 7):
For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.)
EDIT: As a motivation for the question, consider a case where you want to process input on a number of sockets simultaneously, and for whatever reason, you want to do this by fully exhausting in-kernel buffers for each socket in turn (i.e., "depth first" rather than "breadth first"). This can obviously be done by repeating a read on a read-ready socket until it fails with EAGAIN/EWOULDBLOCK, but the last invocation would be redundant if the previous read was short, and we knew that a short read was a guarantee of exhaustion.
It is guaranteed by POSIX:
data shall be returned to the user as soon as it becomes available.
... and therefore on all the other platforms you mention as well, and also Windows, OS/2, NetWare, ...
Any other implementation would be pointless.
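The "depth first" draining strategy from the question can be sketched as below. This is only an illustration under the assumption discussed in this thread: when stop_on_short_read is nonzero, a short read is taken as a sign that the kernel buffer is empty; when it is zero, the loop always continues until EAGAIN/EWOULDBLOCK. The names set_nonblocking and drain_stream are made up, and a real program would process the data instead of discarding it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Puts fd into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Consumes everything currently readable on a nonblocking stream
 * descriptor. Returns the number of bytes consumed, or -1 on a real
 * error. *calls receives the number of read() calls that were made. */
ssize_t drain_stream(int fd, int stop_on_short_read, int *calls)
{
    assert(calls != NULL);

    char buf[4096];
    ssize_t total = 0;

    *calls = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        ++*calls;
        if (n > 0) {
            total += n;         /* data in buf would be processed here */
            if (stop_on_short_read && (size_t)n < sizeof buf)
                break;          /* assumption: short read means drained */
        } else if (n == 0) {
            break;              /* peer closed its writing half */
        } else if (errno == EINTR) {
            continue;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;              /* nothing more available right now */
        } else {
            return -1;
        }
    }
    return total;
}
```

With 10 bytes pending and a 4096-byte buffer, the first mode needs one read() call and the second needs two, which is exactly the redundant call the question wants to avoid.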

What amount of data does select (2) guarantee to be able to be written to a file without blocking

select (2) (amongst other things) tells me whether I can write to a fd of a file without blocking. However, does it guarantee that I can write a full 4096 bytes without blocking?
Note I am interested in normal files on disk. Not sockets or the like.
In other words: does select signal when we can just write one single byte to a file fd without blocking, or does it signal when we can write n (4096, ... ?) bytes to a file fd without blocking.
Whenever select() indicates that your file is ready, you can try writing N bytes, for any N>0. write() will return the number of bytes actually written. If it equals N, you can write again. If it's less than N, then the next write will block.
Note: normal files on disk don't block. Sockets, pipes and terminals do.
You tagged this "Linux", so what does the kernel source code tell you? It should be pretty easy to read the syscall implementation to find when select decides to treat a file descriptor as ready for writing.
If you're worried about blocking, though, you're doing it wrong. If you don't want to block, use O_NONBLOCK or equivalents. Even if select did guarantee a certain number of bytes could be written without blocking, that would only be true at the time select returns; it might not necessarily be true by the time you actually perform the write.
Note I am interested in normal files on disk. Not sockets or the like.
select does not "work" with normal files, only sockets/pipes/ttys and possibly others, but not regular files. For regular files select will always signal the file descriptor as readable/writable - thus it is a rather useless exercise to use select with files.
Note that this applies to other I/O multiplexing facilities as well, such as poll/epoll. AIO will do asynchronous I/O to regular files, but operating system support might vary, and it is a rather complex API to use.
As to how much data you can write, there is no promise. 4096 is no magical number that select assumes you can write without blocking when applied to file descriptors where using select does make sense (sockets/pipes/etc.). Because you can't know how much data you can write without blocking, you should always set the file descriptor to non-blocking, record how much was actually written as indicated by the return value of write/send, and start writing from that point the next time select indicates you can write data again.
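A minimal sketch of that bookkeeping is shown below, assuming a POSIX system. The struct outbuf and the helpers set_nonblocking, wait_writable and flush_pending are hypothetical names for this example, not an existing API.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Pending output: off records how much has been written so far. */
struct outbuf {
    const char *data;
    size_t len;
    size_t off;
};

/* Puts fd into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Waits up to timeout_ms for fd to become writable.
 * Returns 1 if writable, 0 on timeout, -1 on error. */
int wait_writable(int fd, int timeout_ms)
{
    fd_set wfds;
    struct timeval tv;

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int r = select(fd + 1, NULL, &wfds, NULL, &tv);
    if (r < 0)
        return -1;
    return r > 0 ? 1 : 0;
}

/* Writes as much pending data as the descriptor accepts right now.
 * Returns 1 when everything is written, 0 when more remains (wait for
 * select() to report writability again), -1 on error. */
int flush_pending(int fd, struct outbuf *ob)
{
    assert(ob != NULL && ob->off <= ob->len);

    while (ob->off < ob->len) {
        ssize_t n = write(fd, ob->data + ob->off, ob->len - ob->off);
        if (n > 0) {
            ob->off += (size_t)n;   /* resume from here next time */
        } else if (n == 0) {
            return 0;
        } else if (errno == EINTR) {
            continue;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return 0;               /* would block: try again later */
        } else {
            return -1;
        }
    }
    return 1;
}
```

A caller would alternate wait_writable() and flush_pending() until the latter returns 1, never relying on select() to promise any particular number of bytes.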
select() only promises that the applicable call can be made without blocking; it does not guarantee an I/O amount (4096 in your case). Since select() can be used with different types of descriptors (file, sockets, serial connections, etc.) you may notice that for disk operations the observed behavior is that a full buffer can always be written, but again this is specific to the particular underlying operation and not a promise of select().

"short read" from filesystem, when can it happen?

It is obvious that in general the read(2) system call can return fewer bytes than what was asked to be read. However, quite a few programs assume that when working with local files, read(2) never returns less than what was asked (unless the file is shorter, of course).
So, my question is: on Linux, in which cases can read(2) return less than what was requested if reading from an open file and EOF is not encountered and the amount being read is a few kilobytes at maximum?
Some guesses:
Can received signals interrupt a read like that, but not make it fail?
Can different filesystems affect this behavior? Is there anything special about jffs2?
POSIX.1-2008 states:
The value returned may be less than
nbyte if the number of bytes left in
the file is less than nbyte, if the
read() request was interrupted by a
signal, or if the file is a pipe or
FIFO or special file and has fewer
than nbyte bytes immediately available
for reading.
Disk-based filesystems generally use uninterruptible reads, which means that the
read operation generally cannot be interrupted by a signal. Network-based
filesystems sometimes use interruptible reads, which can return partial data or no data.
(In the case of NFS this is configurable using the intr mount option.)
They sometimes also implement timeouts.
Keep in mind that even /some/arbitrary/file/path may refer to a FIFO or
special file, so what you thought was a regular file may not be. It is therefore
good practice to handle partial reads even though they may be unlikely.
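If a program wants to know what a path really refers to, it can ask after opening it. The sketch below is one possible way to do that, assuming a POSIX system; the name open_regular and the choice of EINVAL for "not a regular file" are assumptions of this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Opens path read-only, but only if it refers to a regular file.
 * Returns a blocking-mode descriptor, or -1 with errno set. */
int open_regular(const char *path)
{
    assert(path != NULL);

    /* O_NONBLOCK keeps open() from hanging if path is a FIFO. */
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd == -1)
        return -1;

    struct stat st;
    int saved;

    if (fstat(fd, &st) == -1) {
        saved = errno;
        goto fail;
    }
    if (!S_ISREG(st.st_mode)) {
        saved = EINVAL;         /* FIFO, device, directory, socket, ... */
        goto fail;
    }

    /* Clear O_NONBLOCK again so later read()s use blocking semantics. */
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) == -1) {
        saved = errno;
        goto fail;
    }
    return fd;

fail:
    close(fd);
    errno = saved;
    return -1;
}
```

Even with such a check in place, handling partial reads remains the safer habit, since a regular file on a network filesystem can still produce them.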
I have to ask: "why do you care about the reason"? If read can return a number of bytes less than the requested amount (which, as you point out, it certainly can) why would you not want to deal with that situation?
A received signal only makes read() fail if it hasn't yet read a single byte. Otherwise, it will return partial data.
And I guess alternate filesystems may indeed return short reads in other situations. For example, it makes some sense (to me) to have a network-based filesystem behave just like a network socket wrt short reads (= having them often).
If it's really a file you are reading, then you can get short read as the last read before end of file.
However, it's generally best to behave as if ANY read could be a short read. If what you are reading is a pipe or an input device (stdin) rather than a file, you can get a short read whenever your buffer is larger than what is currently in the input buffer.
I am not sure but this situation could arise when the OS is running out of pages in the page cache. You could suggest that flush thread will be invoked in that case, but it depends on the heuristic used in the I/O scheduler. This situation could cause a read to return fewer bytes.
What I have always seen called a "short read" is not related to the file-access call read(2) but to the physical read of a disk sector. It happens when, while reading the data part of the sector, fewer valid magnetic signals are found than are needed to make up the 512 (or 4096 or whatever) bytes of a sector. That makes for an invalid sector and a read fault. As for "when", or rather why, it happens: most probably because power to the drive failed while that sector was being written.
Could it be that a read(2) ends with a physical error code called "short read"?
