"short read" from filesystem, when can it happen? - c

It is obvious that in general the read(2) system call can return fewer bytes than were asked to be read. However, quite a few programs assume that when working with local files, read(2) never returns less than what was asked for (unless the file is shorter, of course).
So, my question is: on Linux, in which cases can read(2) return less than what was requested when reading from an open file, if EOF is not encountered and the amount being read is a few kilobytes at most?
Some guesses:
Can received signals interrupt a read like that, but not make it fail?
Can different filesystems affect this behavior? Is there anything special about jffs2?

POSIX.1-2008 states:
The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading.
Disk-based filesystems generally use uninterruptible reads, which means that the read operation generally cannot be interrupted by a signal. Network-based filesystems sometimes use interruptible reads, which can return partial data or no data. (In the case of NFS this is configurable using the intr mount option.) They sometimes also implement timeouts.
Keep in mind that even /some/arbitrary/file/path may refer to a FIFO or special file, so what you thought was a regular file may not be. It is therefore good practice to handle partial reads even though they may be unlikely.
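For illustration, here is a minimal sketch of such a handling loop in C; read_full is a hypothetical helper name, not a standard function. It retries on EINTR (the signal case discussed below) and keeps reading until the requested count, EOF, or a real error:

#include <errno.h>
#include <unistd.h>

ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte was read: retry */
            return -1;      /* genuine error */
        }
        if (n == 0)
            break;          /* EOF: return what we have so far */
        done += (size_t)n;
    }
    return (ssize_t)done;
}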

I have to ask: "why do you care about the reason"? If read can return a number of bytes less than the requested amount (which, as you point out, it certainly can) why would you not want to deal with that situation?

A received signal only makes read() fail if it hasn't yet read a single byte. Otherwise, it will return partial data.
And I guess alternate filesystems may indeed return short reads in other situations. For example, it makes some sense (to me) to have a network-based filesystem behave just like a network socket with respect to short reads (i.e., having them often).

If it's really a regular file you are reading, then you can get a short read as the last read before end of file.
However, it's generally best to behave as if ANY read could be a short read. If what you are reading is a pipe or an input device (stdin) rather than a file, you can get a short read whenever your buffer is larger than what is currently in the input buffer.

I am not sure, but this situation could arise when the OS is running out of pages in the page cache. One might expect the flusher thread to be invoked in that case, but that depends on the heuristics used in the I/O scheduler. This situation could cause a read to return fewer bytes.

What I have always heard called a "short read" is not related to the file-access read(2) but to the physical read of a disk sector. It happens when, while reading the data part of a sector, fewer valid magnetic signals are found than are needed to make up the 512 (or 4096 or whatever) bytes of the sector. That makes an invalid sector and a read fault. As for when, or rather why, this happens: most probably the power feeding the drive dropped while that sector was being written.
Could it be that a read(2) ends with a physical error code called "short read"?

Related

Understanding read syscall

I'm reading the man read manual page and discovered that it is possible to read fewer than the desired number of bytes passed in as a parameter:
It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal.
I have the following situation:
Some process moves a file into a directory I'm watching for IN_MOVED_TO inotify events.
I receive an IN_MOVED_TO event, open the file and start reading it until EOF is reached.
No other process modifies the file moved in step 1 (after it is moved, it is left unchanged the whole time).
Is it guaranteed that if read returns fewer bytes than I requested, then the next call to read will return 0? I mean, a situation like reading a gigabyte file one byte at a time over 1,000,000,000 calls is forbidden by the documentation.
Is it guaranteed that if read returns fewer bytes than I requested, then the next call to read will return 0?
No, not in practice. It should be true if the file system is entirely POSIX compliant, but many of them are not (in corner cases). In particular NFS (see nfs(5)) and FUSE or proc (see proc(5)) are not exactly POSIX compliant.
So in practice I strongly recommend handling the "read returns a smaller number of bytes than wanted case", even if you are right to believe that it should not happen. Handling that "impossible" case should be easy for you.
Notice also that inotify(7) facilities don't work with bizarre filesystems like NFS, proc, FUSE, ... Think also of corner cases like a symlink inside an Ext4 file system pointing to an NFS file, or bind mounts, etc.
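A minimal sketch of that recommendation for the inotify scenario: read the file in a loop and treat only a return of 0 as EOF, never a short count. drain_file and handle_chunk are hypothetical names used for illustration:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

void handle_chunk(const char *data, size_t len);   /* hypothetical consumer */

int drain_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[65536];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* signal: just retry */
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                 /* read() == 0 is the only reliable EOF signal */
        handle_chunk(buf, (size_t)n);
    }
    close(fd);
    return 0;
}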

Will multi-threaded write() calls interleave?

If I have two threads, thread0 and thread1.
thread0 does:
const char *msg = "thread0 => 0000000000\n";
write(fd, msg, strlen(msg));
thread1 does:
const char *msg = "thread1 => 111111111\n";
write(fd, msg, strlen(msg));
Will the output interleave? E.g.
thread0 => 000000111
thread1 => 111111000
First, note that your question is "Will data be interleaved?", not "Are write() calls [required to be] atomic?" Those are different questions...
"TL;DR" summary:
write() calls to a pipe or FIFO of PIPE_BUF or fewer bytes won't be interleaved
write() calls to anything else will fall somewhere in the range between "probably won't be interleaved" and "won't ever be interleaved", with the majority of implementations in the "almost certainly won't be interleaved" to "won't ever be interleaved" range.
Full Answer
If you're writing to a pipe or FIFO, your data will not be interleaved at all for write() calls of PIPE_BUF or fewer bytes.
Per the POSIX standard for write() (note in particular the atomicity requirement):
RATIONALE
...
An attempt to write to a pipe or FIFO has several major characteristics:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of POSIX.1-2008 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
...
Applicability of POSIX standards to Windows systems, however, is debatable at best.
So, for pipes or FIFOs, data won't be interleaved up to PIPE_BUF bytes.
How does that apply to files?
First, file append operations have to be atomic. Per that same POSIX standard (note especially the final clause):
If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.
Also see Is file append atomic in UNIX?
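For illustration, a minimal sketch of relying on that guarantee: each writer opens the shared file with O_APPEND ("app.log" is a placeholder path):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* With O_APPEND, the seek to end-of-file and the data write happen
     * as a single atomic step, so concurrent writers cannot land inside
     * each other's record (subject to the partial-write caveat below). */
    int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    const char *msg = "thread0 => 0000000000\n";
    write(fd, msg, strlen(msg));
    close(fd);
    return 0;
}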
So how does that apply to non-append write() calls?
Commonality of implementation. See the Linux read/write syscall implementations for an example. (Note that the "problem" is handed directly to the VFS implementation, though, so the answer might also be "It might very well depend on your file system...")
Most implementations of the write() system call inside the kernel use the same code to do the actual data write for both append-mode and "normal" write() calls, and for pwrite() calls too. The only difference is the source of the offset used:
For "normal" write() calls, the offset used is the current file offset.
For append write() calls, the offset used is the current end of the file.
For pwrite() calls, the offset is supplied by the caller (except that Linux is broken: it uses the current file size instead of the supplied offset parameter as the target offset for pwrite() calls on files opened in append mode; see the "BUGS" section of the Linux pwrite() man page).
So appending data has to be atomic, and that same code will almost certainly be used for non-append write() calls in all implementations.
But the "write operation" in the append-must-be-atomic requirement is allowed to return less than the total number of bytes requested:
The write() function shall attempt to write nbyte bytes ...
Partial write() results are allowed even in append operations. But even then, the data that does get written must be written atomically.
What are the odds of a partial write()? That depends on what you're writing to. I've never seen a partial write() result to a file outside of the disk filling up or an actual hardware failure. Or even a partial read() result. I can't see any way for a write() operation that has all its data on a single page in kernel memory resulting in a partial write() in anything other than a disk full or hardware failure situation.
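Still, if you want to be immune to that almost-never partial write, a loop like this sketch works; write_full is a hypothetical name. Note that once a partial write has occurred, the retry is no longer atomic with respect to other writers:

#include <errno.h>
#include <unistd.h>

ssize_t write_full(int fd, const void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = write(fd, (const char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte was written: retry */
            return -1;
        }
        done += (size_t)n;  /* advance past the partially written prefix */
    }
    return (ssize_t)done;
}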
If you look at Is file append atomic in UNIX? again, you'll see that actual testing shows that append write() operations are in fact atomic.
So the answer to "Will multi-threaded write() calls interleave?" is: no, the data will almost certainly not be interleaved for writes that are at or under 4 KB (the page size), as long as the data does not cross a page boundary in kernel space. And even crossing a page boundary probably doesn't change the odds all that much.
If you're writing small chunks of data, it depends on your willingness to deal with the almost-certain-to-never-happen-but-it-might-anyway result of interleaved data. If it's a text log file, I'd opine that it won't matter anyway.
And note that it's not likely to be any faster to use multiple threads to write to the same file - the kernel is likely going to lock things and effectively single-thread the actual write() calls anyway to ensure it can meet the atomicity requirements of writing to a pipe and appending to a file.

Flushing pipe without closing in C

I have found a lot of threads on here asking how it is possible to flush a pipe after writing to it without closing it.
In every thread I could see different suggestions, but I could not find a definitive solution.
Here is a quick summary:
The easiest way to avoid read blocking on the pipe would be to write the exact number of bytes that is being read.
It could also be done by using ptmx instead of a pipe, but people said that might be too much.
Note: It's not possible to use fsync with pipes
Are there any other more efficient solutions?
Edit:
The flush would be convenient when the sender wants to write n characters but the client reads m characters (where m > n). The client will block waiting for another m - n characters. If the sender wants to communicate with the client again, that rules out closing the pipe, and just sending the exact number of bytes could be a good source of bugs.
The receiver operates like this and it cannot be modified:
while ((n = read(0, buf, 100)) > 0) {
    process(buf);
}
so the sender wants "file1" and "file2" to get processed, for which it will have to:
write(pipe[1], "file1\0*95", 100);   /* "file1" padded with 95 NUL bytes */
write(pipe[1], "file2\0*95", 100);
what I am looking for is a way to do something like this (without necessarily using \n as the delimiter):
write(pipe[1], "file1\nfile2", 11); //it would have worked if it was ptmx
(Using read and write)
Flushing in the sense of fflush() is irrelevant to pipes, because they are not represented as C streams. There is therefore no userland buffer to flush. Similarly, fsync() is also irrelevant to pipes, because there is no back-end storage for the data. Data successfully written to a pipe exist in the kernel and only in the kernel until they are successfully read, so there is no work for fsync() to do. Overall, flushing / syncing is applicable only where there is intermediate storage involved, which is not the case with pipes.
With the clarification, your question seems to be about establishing message boundaries for communication via a pipe. You are correct that closing the write end of the pipe will signal a boundary -- not just of one message, but of the whole stream of communication -- but of course that's final. You are also correct that there are no inherent message boundaries. Nevertheless, you seem to be working from at least somewhat of a misconception here:
The easiest way to avoid read blocking on the pipe would be to write the exact number of bytes that is being read.
[...]
The flush would be convenient when the sender wants to write n characters but the client reads m characters (where m > n). The client will block waiting for another m - n characters.
Whether the reader will block is entirely dependent on how the reader is implemented. In particular, the read(2) system call in no way guarantees to transfer the requested number of bytes before returning. It can and will perform a short read under some circumstances. Although the details are unspecified, you can ordinarily rely on a short read when at least one character can be transferred without blocking, but not the whole number requested. Similar applies to write(2). Thus, the easiest way to avoid read() blocking is to ensure that you write at least one byte to the pipe for that read() call to transfer.
In fact, people usually come at this issue from the opposite direction: needing to be certain to receive a specific number of bytes, and therefore having to deal with the potential for short reads as a complication (to be dealt with by performing the read() in a loop). You'll need to consider that, too, but you have the benefit that your client is unlikely to block under the circumstances you describe; it just isn't the problem you think it is.
There is an inherent message-boundary problem in any kind of stream communication, however, and you'll need to deal with it. There are several approaches; among the most commonly used are
Fixed-length messages. The receiver can then read until it successfully transfers the required number of bytes; any blocking involved is appropriate and needful. With this approach, the scenario you postulate simply does not arise, but the writer might need to pad its messages.
Delimited messages. The receiver then reads until it finds that it has received a message delimiter (a newline or a null byte, for example). In this case, the receiver will need to be prepared for the possibility of message boundaries not being aligned with the byte sequences transferred by read() calls. Marking the end of a message by closing the channel can be considered a special case of this alternative.
Embedded message-length metadata. This can take many forms, but one of the simplest is to structure messages as a fixed-length integer message length field, followed by that number of bytes of message data. The reader then knows at every point how many bytes it needs to read, so it will not block needlessly.
These can be used individually or in combination to implement an application-layer protocol for communicating between your processes. Naturally, the parties to the communication must agree on the protocol details for communication to be successful.
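As an illustration of the third approach, here is a minimal sketch of a 4-byte big-endian length prefix over a pipe. send_msg and recv_msg are hypothetical names, and the read_full/write_full loops are assumed to be the kind sketched earlier, not a standard API:

#include <arpa/inet.h>   /* htonl / ntohl */
#include <stdint.h>
#include <unistd.h>

ssize_t read_full(int fd, void *buf, size_t count);          /* loop as sketched earlier */
ssize_t write_full(int fd, const void *buf, size_t count);

int send_msg(int fd, const char *data, uint32_t len)
{
    uint32_t hdr = htonl(len);   /* fixed 4-byte big-endian length prefix */
    if (write_full(fd, &hdr, sizeof hdr) != (ssize_t)sizeof hdr)
        return -1;
    return write_full(fd, data, len) == (ssize_t)len ? 0 : -1;
}

int recv_msg(int fd, char *buf, uint32_t bufsize, uint32_t *out_len)
{
    uint32_t hdr;
    if (read_full(fd, &hdr, sizeof hdr) != (ssize_t)sizeof hdr)
        return -1;               /* EOF or error inside the header */
    uint32_t len = ntohl(hdr);
    if (len > bufsize)
        return -1;               /* message larger than caller's buffer */
    if (read_full(fd, buf, len) != (ssize_t)len)
        return -1;               /* short: peer closed mid-message */
    *out_len = len;
    return 0;
}

The reader always knows exactly how many bytes to wait for, so it never blocks needlessly on a partial message.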

are fread and fwrite different in handling the internal buffer?

I keep reading that fread() and fwrite() are buffered library calls. In the case of fwrite(), I understood that once we write to the file, it won't be written to the hard disk immediately; it will fill the internal buffer, and once the buffer is full, the write() system call will be invoked to actually write the data to the file.
But I am not able to understand how this buffering works in the case of fread(). Does buffering in the case of fread() mean that, once we call fread(), it will read more data than we originally asked for, and that extra data will be stored in a buffer (so that when a second fread() occurs, it can be served directly from the buffer instead of going to the hard disk)?
And I have following queries also.
If fread() works as I mention above, then will the first fread() call read an amount of data equal to the size of the internal buffer? If that is the case, what will happen if my fread() call asks for more bytes than the internal buffer size?
If fread() works as I mention above, that means at least one read() system call to the kernel will happen for sure in the case of fread(). But in the case of fwrite(), if we only call fwrite() once during the program execution, we can't say for sure that the write() system call will be called. Is my understanding correct?
Will the internal buffer be maintained by the OS?
Does fclose() flush the internal buffer?
There is buffering or caching at many different levels in a modern system. This might be typical:
C standard library
OS kernel
disk controller (esp. if using hardware RAID)
disk drive
When you use fread(), it may request 8 KB or so even if you asked for less. This will be stored in user space, so there is no system call and context switch on the next sequential read.
The kernel may read ahead also; there are library functions to give it hints on how to do this for your particular application. The OS cache could be gigabytes in size since it uses main memory.
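One such hint, as a minimal sketch (posix_fadvise() is purely advisory, and the kernel is free to ignore it):

#define _XOPEN_SOURCE 600   /* for posix_fadvise() */
#include <fcntl.h>

void hint_sequential(int fd)
{
    /* offset 0, length 0 means "the whole file": tell the kernel we
     * expect to read sequentially so it can read ahead more aggressively */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}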
The disk controller may read ahead too, and could have a cache size up to hundreds of megabytes on smallish systems. It can't do as much in terms of read-ahead, because it doesn't know where the next logical block is for the current file (indeed it doesn't even know what file it is reading).
Finally, the disk drive itself has a cache, perhaps 16 MB or so. Like the controller, it doesn't know what file it is reading. For many years one disk block was 512 bytes, but it got a little larger (a few KB) recently with multi-terabyte disks.
When you call fclose(), it will probably deallocate the user-space buffer, but not the others.
Your understanding is correct. And any buffered fwrite data will be flushed when the FILE* is closed. The buffered I/O is mostly transparent for I/O on regular files.
But for terminals and other character devices you may care. Another instance where buffered I/O may be an issue is if you read from the file that one process is writing to from another process -- a common example is if a program writes text to a log file during operation, and the user runs a command like tail -f program.log to watch the content of the log file live. If the writing process has buffering enabled and it doesn't explicitly flush the log file, it will make it difficult to monitor the log file.
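A minimal sketch of one common fix, assuming the writing program can be changed: switch the stream to line buffering (or call fflush() after each record). "program.log" is a placeholder path:

#include <stdio.h>

int main(void)
{
    FILE *log = fopen("program.log", "a");
    if (!log)
        return 1;

    /* Line buffering: the stream is flushed to the kernel at every
     * newline instead of every few kilobytes. Must be set before the
     * first I/O operation on the stream. */
    setvbuf(log, NULL, _IOLBF, BUFSIZ);

    fprintf(log, "service started\n");   /* promptly visible to tail -f */

    fclose(log);   /* fclose() flushes whatever is still buffered */
    return 0;
}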

read() from files - blocking vs. non-blocking behavior

Let's assume we open a file using fopen() and, from the file pointer received, fetch the file descriptor using fileno(). Then we do lots (>10^8) of random read()s of relatively small chunks, between 4 bytes and 10 KB in size, from this file:
Is it expected behaviour that such a read() might return fewer bytes than requested, without setting errno, if the file system is an
ext3
NFS
OCFS2
a combination of 2 and 3 (OCFS2 via NFS)
?
My reading led me to the conclusion that it should not be possible for 1 (assuming the file does not have O_NONBLOCK set, if it is even possible for an ext3 file to have it set), but for the other three (2, 3, 4) I'm uncertain.
(Btw: could I assume that O_NONBLOCK being unset is the default in every case?)
This question arose because I observed read()s returning fewer bytes than requested, without errno set, in case 4.
The problem with drilling this down by testing is that such behaviour happens in fewer than 1 in 1,000,000,000 cases... which is still too often :-}
Update: The file sizes range from around 1 GB up to several TB.
You should not assume that read() will never return fewer bytes than requested, for any filesystem. This is particularly true in the case of large reads, as POSIX.1 indicates that read() behavior for sizes larger than SSIZE_MAX is implementation-dependent. On the mainstream Unix box I'm using right now, SSIZE_MAX is 32767 bytes. The fact that read() always returns the full amount today does not mean that it will in the future.
One possible reason might be that I/O priorities are more fully fleshed out in the kernel in the future. E.g. you're trying to read from the same device as another higher priority process and the other process would get better throughput if your process wasn't causing head movement away from the sectors the other process wants. The kernel might choose to give your read() a short count to get you out of the way for a while, instead of continuing to do inefficient interleaved block reads. Stranger things have been done for the sake of I/O efficiency. What is not prohibited often becomes compulsory.
We solved the problem described, namely read() returning fewer bytes than requested when reading from a file located on an NFS mount pointing to an OCFS2 file system (case 4 in my question).
It is a fact that with the setup mentioned above, such read()s on file descriptors sometimes return fewer bytes than requested, without errno being set.
To have all the data read, it is as simple as just read()ing again and again until the amount of data requested has been read.
Moreover, such a setup sometimes makes read() fail with EIO, and even then a simple re-read() leads to success and the data arrives.
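A minimal sketch of the workaround just described, with a hypothetical name (stubborn_pread) and an arbitrary retry bound; retrying on EIO matches this specific NFS/OCFS2 setup and is not a generally recommended way to treat I/O errors:

#define _XOPEN_SOURCE 700   /* for pread() */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t stubborn_pread(int fd, void *buf, size_t count, off_t off)
{
    size_t done = 0;
    int io_retries = 3;        /* arbitrary bound on EIO retries */
    while (done < count) {
        ssize_t n = pread(fd, (char *)buf + done, count - done,
                          off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (errno == EIO && io_retries-- > 0)
                continue;      /* transient EIO on this setup: try again */
            return -1;
        }
        if (n == 0)
            break;             /* genuine EOF */
        done += (size_t)n;
    }
    return (ssize_t)done;
}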
My conclusion: reading via OCFS2 over NFS makes read()ing from files behave like read()ing from sockets, which is inconsistent with the specification of read() (http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html):
When attempting to read a file (other than a pipe or FIFO) that
supports non-blocking reads and has no data currently available:
If O_NONBLOCK is set, read() shall return -1 and set errno to [EAGAIN].
If O_NONBLOCK is clear, read() shall block the calling thread until some data becomes available.
Needless to say, we never set, nor even thought about setting, O_NONBLOCK on the file descriptors in question.
