Understanding read syscall - c

I'm reading the read(2) man page and discovered that it is possible for read to return fewer bytes than the number requested as a parameter:
It is not an error if this number is smaller than the number of bytes
requested; this may happen for example because fewer bytes are
actually available right now (maybe because we were close to
end-of-file, or because we are reading from a pipe, or from a
terminal), or because read() was interrupted by a signal.
I have the following situation:
Some process moves a file into a directory I'm watching for IN_MOVED_TO inotify events.
I receive an IN_MOVED_TO event, open the file, and start reading it until EOF is reached.
No other process modifies the file moved in step 1 (after the move it is left unchanged the whole time).
Is it guaranteed that if read returns fewer bytes than I requested, then the next call to read will return 0? In other words, is a situation like 'reading a gigabyte file in 1,000,000,000 single-byte reads' ruled out by the documentation?

Is it guaranteed that if read returns fewer bytes than I requested, then the next call to read will return 0?
No, not in practice. It should be true if the file system is entirely POSIX compliant, but many of them are not (in corner cases). In particular NFS (see nfs(5)) and FUSE or proc (see proc(5)) are not exactly POSIX compliant.
So in practice I strongly recommend handling the "read returns fewer bytes than wanted" case, even if you are right to believe that it should not happen. Handling that "impossible" case should be easy for you.
Notice also that inotify(7) facilities don't work with bizarre filesystems like NFS, proc, FUSE, and so on. Think also of corner cases such as a symlink inside an ext4 file system pointing to an NFS file, or bind mounts, etc.
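For illustration, a minimal sketch of such a loop, assuming the file has already been opened and fd is its descriptor (the chunk size, the retry on EINTR, and the placeholder processing step are my choices, not something mandated above):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Read from fd until end-of-file, tolerating short reads and EINTR.
   Returns 0 on success, -1 on error. */
static int read_until_eof(int fd)
{
    char buf[65536];                    /* arbitrary chunk size */

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            /* process n bytes of buf here; n may well be < sizeof buf */
            continue;
        }
        if (n == 0)
            return 0;                   /* genuine end-of-file */
        if (errno == EINTR)
            continue;                   /* interrupted before any data: retry */
        perror("read");
        return -1;
    }
}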

Related

read, fread partial reads

I can't seem to find info about this in the documentation.
The read system call documentation says it may read less than specified.
Does read attempt to read several times?
I know that fread is a wrapper for read. When I invoke fread, is it possible that it will read from the stream several times until it gets 0 or reads the specified number of bytes, or will it only attempt to read once?
I am reading from a char device created in my kernel module; it transfers info from a data structure and supports partial reads. I am interested in reading all of the data until it returns 0.
thanks
The general idea of read is that it returns as soon as some data is available¹. From an application's perspective, that's all you can assume.
If you're implementing the read callback in a kernel driver, it's up to you when read decides to return some data. But applications will² expect that read calls may be partial, and they should call read in a loop if they really need a certain number of bytes. Some applications want read not to block, so it would be a bad idea to block in a read call if some data is available.
The fread function blocks until it's read as many bytes as were requested, until it's reached the end of the file, or until an error occurs. It works by calling read in a loop.
¹ Whether and when read may return 0 bytes is beyond the scope of this answer.
² Or at least should. Buggy applications do exist.
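As a small illustration of that behaviour, a sketch of calling fread and interpreting a short count (the file name and buffer size are placeholders of mine):

#include <stdio.h>

int main(void)
{
    char buf[4096];
    FILE *fp = fopen("example.dat", "rb");      /* placeholder file name */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    size_t got = fread(buf, 1, sizeof buf, fp);
    if (got < sizeof buf) {
        /* fread only returns a short count at end-of-file or on error */
        if (ferror(fp))
            perror("fread");
        else if (feof(fp))
            printf("end of file after %zu bytes\n", got);
    }
    fclose(fp);
    return 0;
}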

Amount of data read() syscall will actually read

Suppose I have a file for which the file descriptor has more than n bytes left until EOF, and I invoke the read() syscall for n bytes. Is the function guaranteed to read n bytes into the buffer? Or can it read less?
The read system call is guaranteed to read as many characters as you asked for, except when it can't. But it turns out that there are so many exceptions -- so many cases where it can't read as many characters as you asked for -- that it basically ends up being safest to assume that any given read call probably won't read as many characters as you asked for. I believe it's good practice to always write your code with that in mind.
The man page on my system says
The system guarantees to read the number of bytes requested if the descriptor references a normal file that has that many bytes left before the end-of-file, but in no other case.
So if it's not a normal file, or if it is a normal file but there aren't enough characters, you'll get fewer than you asked for. But in the case you asked about, yes, you should be guaranteed to get exactly as many characters as you asked for.
With that said, though, if you find yourself with a choice between assuming that read is allegedly guaranteed to read exactly the number of characters requested, versus acknowledging that it might return less, I would always write the code to assume it might return less. That is, if you have a call like
r = read(fd, buf, n);
there isn't usually much to be gained by assuming that if r is greater than 0, it must be exactly n. Your code has to be able to handle the r < n case so it will behave properly when it's almost at end-of-file, so unless you want to have two different code paths (one for "normal" reads, and one for the last read), you might as well write one piece of code, that can handle the r < n case, and let it operate all the time.
(Also, as Zan Lynx reminds in a comment, don't have the code notice that r < n, and infer from that that end-of-file is coming up soon. Wait for r == 0 before deciding you're at end-of-file.)
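As a sketch of that single code path (the helper name read_full is mine, not a standard function, and the EINTR retry is one possible policy):

#include <errno.h>
#include <unistd.h>

/* Keep calling read() until n bytes have been read, end-of-file is hit,
   or an error occurs. Returns the number of bytes actually read, or -1
   on error; a return value less than n means end-of-file was reached. */
static ssize_t read_full(int fd, void *buf, size_t n)
{
    size_t total = 0;

    while (total < n) {
        ssize_t r = read(fd, (char *)buf + total, n - total);
        if (r > 0)
            total += (size_t)r;         /* partial read: keep going */
        else if (r == 0)
            break;                      /* only r == 0 means end-of-file */
        else if (errno == EINTR)
            continue;                   /* interrupted, nothing read: retry */
        else
            return -1;
    }
    return (ssize_t)total;
}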
You could've read it from the man page yourself:
On Linux, read() (and similar system calls) will transfer at most
0x7ffff000 (2,147,479,552) bytes, returning the number of bytes
actually transferred. (This is true on both 32-bit and 64-bit
systems.)
So even if you had enough RAM and so on, you couldn't read a full-size DVD image in one go - however, this wouldn't be the sane thing to do either; to access such large files, mmap would be better.
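A rough sketch of the mmap alternative for such a large file (the path is a placeholder and error handling is trimmed to the essentials):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("huge.iso", O_RDONLY);        /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file; the kernel pages it in on demand, so no single
       multi-gigabyte read() is needed. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... access the file contents through p ... */

    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}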
Other than that, a signal might be delivered, which can cause the call to fail with EINTR before any data has been read, leaving the buffer contents indeterminate.
ERRORS
[...]
EINTR The call was interrupted by a signal before any data was read; see signal(7).
Is the function guaranteed to read n bytes into the buffer? Or can it
read less?
No, even if your file has more than n bytes before its end, the read(fd, buf, n) function is not guaranteed to read n bytes into the buffer and then return n. It can read less and return a positive value that is less than n.
See Linux man page at https://man7.org/linux/man-pages/man2/read.2.html
RETURN VALUE
It is not an error if this number is smaller than the number of
bytes requested; this may happen for example because fewer bytes
are actually available right now (maybe because we were close to
end-of-file, or because we are reading from a pipe, or from a
terminal), or because read() was interrupted by a signal.

read system call in Linux vs Windows

Is there any difference between using read() in Linux than in Windows?
Is it possible that in Windows, it will usually read less than I request, and in Linux it usually reads as much as I request?
read isn't a standard C function. Historically it is a POSIX syscall, and as such, Windows (assuming Windows means MSVC) isn't required to implement it at all. Still, they tried. And we can compare the two implementations:
Linux:
http://man7.org/linux/man-pages/man2/read.2.html
On success, the number of bytes read is returned (zero indicates end
of file), and the file position is advanced by this number. It is
not an error if this number is smaller than the number of bytes
requested; this may happen for example because fewer bytes are
actually available right now (maybe because we were close to end-of-
file, or because we are reading from a pipe, or from a terminal), or
because read() was interrupted by a signal. See also NOTES.
Windows:
https://msdn.microsoft.com/en-us/library/ms235412.aspx
https://msdn.microsoft.com/en-us/library/wyssk1bs.aspx
_read returns the number of bytes read, which might be less than count if there are fewer than count bytes left in the file or if the file was opened in text mode, in which case each carriage return–line feed (CR-LF) pair is replaced with a single linefeed character. Only the single linefeed character is counted in the return value. The replacement does not affect the file pointer.
So you should expect both implementations to return less than the requested number of bytes. Furthermore, there is a clear difference when reading files in text mode.
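If the text-mode translation is unwanted, the file can be opened in binary mode. A hedged, MSVC-specific sketch (the path is a placeholder):

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    /* _O_BINARY suppresses the CR-LF translation described above. */
    int fd = _open("data.bin", _O_RDONLY | _O_BINARY);   /* placeholder path */
    if (fd < 0) { perror("_open"); return 1; }

    char buf[4096];
    int n;
    while ((n = _read(fd, buf, sizeof buf)) > 0) {
        /* process n bytes; n may still be less than sizeof buf */
    }
    if (n < 0)
        perror("_read");
    _close(fd);
    return 0;
}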

read() from files - blocking vs. non-blocking behavior

Let's assume we open a file using fopen() and, from the file pointer received, fetch the file descriptor using fileno(). Then we do lots (>10^8) of random read()s of relatively small chunks, between 4 bytes and 10 KB in size, from this file:
Is it expected behaviour that such a read() might return fewer bytes than requested, without setting errno, if the file system is one of the following?
ext3
NFS
OCFS2
a combination of 2 and 3 (OCFS2 via NFS)
My reading led me to conclude that it should not be possible for 1 (provided the file does not have O_NONBLOCK set, if it is even possible for ext3 to have it set), but for the other three (2, 3, 4) I'm uncertain.
(By the way: can I assume that O_NONBLOCK not being set is the default in every case?)
This question arose because I observed read()s returning fewer bytes than requested, without errno set, in case 4.
The problem with drilling this down by testing is that such behaviour happens in fewer than 1 in 1,000,000,000 cases ... which is still too often :-}
Update: the average file size is between around 1 GB and some TB.
You should not assume that read() will never return fewer bytes than requested, for any filesystem. This is particularly true in the case of large reads, as POSIX.1 indicates that read() behavior for sizes larger than SSIZE_MAX is implementation-dependent. On this mainstream Unix box I'm using right now, SSIZE_MAX is 32767 bytes. The fact that read() always returns the full amount today does not mean that it will in the future.
One possible reason might be that I/O priorities are more fully fleshed out in the kernel in the future. E.g. you're trying to read from the same device as another higher priority process and the other process would get better throughput if your process wasn't causing head movement away from the sectors the other process wants. The kernel might choose to give your read() a short count to get you out of the way for a while, instead of continuing to do inefficient interleaved block reads. Stranger things have been done for the sake of I/O efficiency. What is not prohibited often becomes compulsory.
We solved the problem described above, where read() returns fewer bytes than requested when reading from a file located on an NFS mount pointing to an OCFS2 file system (case 4 in my question).
It is a fact that with the setup mentioned above, such read()s on file descriptors sometimes return fewer bytes than requested, without errno being set.
To read all the data, it is as simple as read()ing again and again until the requested amount of data has been read.
Moreover, this setup sometimes makes read() fail with EIO, and even then a simple re-read() succeeds and the data arrives.
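A sketch of that workaround as code -- loop on short reads and, specifically for this OCFS2-over-NFS setup, retry a few times on EIO (the retry limit of 3 and the helper name are arbitrary choices of mine, not anything the specification blesses):

#include <errno.h>
#include <unistd.h>

/* Workaround sketch for the behaviour described above: keep reading on
   short reads, and retry a bounded number of times on EIO. Returns the
   number of bytes read, or -1 on a persistent error. */
static ssize_t read_retry(int fd, void *buf, size_t n)
{
    size_t total = 0;
    int eio_retries = 3;                /* arbitrary limit */

    while (total < n) {
        ssize_t r = read(fd, (char *)buf + total, n - total);
        if (r > 0)
            total += (size_t)r;
        else if (r == 0)
            break;                      /* real end-of-file */
        else if (errno == EINTR)
            continue;
        else if (errno == EIO && eio_retries-- > 0)
            continue;                   /* a re-read was observed to succeed here */
        else
            return -1;
    }
    return (ssize_t)total;
}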
My conclusion: reading from OCFS2 via NFS makes read()ing from files behave like read()ing from sockets, which is inconsistent with the specification of read() (http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html):
When attempting to read a file (other than a pipe or FIFO) that
supports non-blocking reads and has no data currently available:
If O_NONBLOCK is set, read() shall return -1 and set errno to [EAGAIN].
If O_NONBLOCK is clear, read() shall block the calling thread until some data becomes available.
Needless to say, we never tried, nor even thought about, setting O_NONBLOCK on the file descriptors in question.
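For completeness, a quick sketch of how one could double-check that O_NONBLOCK really is clear on a descriptor obtained via fileno():

#include <fcntl.h>
#include <stdio.h>

/* Returns 1 if O_NONBLOCK is set on fd, 0 if it is clear, -1 on error. */
static int fd_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0) {
        perror("fcntl");
        return -1;
    }
    return (flags & O_NONBLOCK) ? 1 : 0;
}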

"short read" from filesystem, when can it happen?

It is obvious that in general the read(2) system call can return fewer bytes than what was asked to be read. However, quite a few programs assume that when working with local files, read(2) never returns less than what was asked (unless the file is shorter, of course).
So, my question is: on Linux, in which cases can read(2) return less than what was requested if reading from an open file and EOF is not encountered and the amount being read is a few kilobytes at maximum?
Some guesses:
Can received signals interrupt a read like that, but not make it fail?
Can different filesystems affect this behavior? Is there anything special about jffs2?
POSIX.1-2008 states:
The value returned may be less than nbyte if the number of bytes left in
the file is less than nbyte, if the read() request was interrupted by a
signal, or if the file is a pipe or FIFO or special file and has fewer
than nbyte bytes immediately available for reading.
Disk-based filesystems generally use uninterruptible reads, which means that the
read operation generally cannot be interrupted by a signal. Network-based
filesystems sometimes use interruptible reads, which can return partial data or no data.
(In the case of NFS this is configurable using the intr mount option.)
They sometimes also implement timeouts.
Keep in mind that even /some/arbitrary/file/path may refer to a FIFO or
special file, so what you thought was a regular file may not be. It is therefore
good practice to handle partial reads even though they may be unlikely.
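A short sketch of such a check, verifying that a path really refers to a regular file before assuming regular-file read semantics (the helper name is mine):

#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path is a regular file, 0 if it is something else
   (FIFO, device, ...), -1 on error. */
static int is_regular_file(const char *path)
{
    struct stat st;
    if (stat(path, &st) < 0) {
        perror("stat");
        return -1;
    }
    return S_ISREG(st.st_mode) ? 1 : 0;
}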
I have to ask: "why do you care about the reason"? If read can return a number of bytes less than the requested amount (which, as you point out, it certainly can) why would you not want to deal with that situation?
A received signal only makes read() fail if it hasn't yet read a single byte. Otherwise, it will return partial data.
And I guess alternate filesystems may indeed return short reads in other situations. For example, it makes some sense (to me) to have a network-based filesystem behave just like a network socket wrt short reads (= having them often).
If it's really a file you are reading, then you can get a short read as the last read before end of file.
However, it's generally best to behave as if ANY read could be a short read. If what you are reading is a pipe or an input device (stdin) rather than a file, you can get a short read whenever your buffer is larger than what is currently in the input buffer.
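A tiny sketch that demonstrates this with a pipe (the payload and buffer size are arbitrary): only the bytes currently in the pipe are returned, not the full requested amount.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[4096];

    if (pipe(fds) < 0) { perror("pipe"); return 1; }

    /* Put just 5 bytes into the pipe. */
    if (write(fds[1], "hello", 5) != 5) { perror("write"); return 1; }

    ssize_t n = read(fds[0], buf, sizeof buf);
    printf("asked for %zu bytes, got %zd\n", sizeof buf, n);   /* got 5 */

    close(fds[0]);
    close(fds[1]);
    return 0;
}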
I am not sure, but this situation could arise when the OS is running out of pages in the page cache. You could suggest that the flush thread will be invoked in that case, but that depends on the heuristics used in the I/O scheduler. This situation could cause a read to return fewer bytes.
What I have always seen called a "short read" is not related to the file access read(2) but to the physical read of a disk sector. It happens when, while reading the data part of a sector, fewer valid magnetic signals are found than are needed to make up the 512 (or 4096, or whatever) bytes of the sector. That makes an invalid sector and a read fault. As for "when", or rather why, it happens: most probably because the power feeding the drive dropped while that sector was being written.
Could it be that a read(2) ends with a physical error code called "short read"?

Resources