I write about 50 KB of data to a file (stored on a USB disk, FAT32, mounted on Linux 2.6.37) every 200 ms, using O_NONBLOCK. Does the write() function have any risk of returning EAGAIN? If yes, why and in what case? I have already run the program for half an hour, and no error return has been reported.
Copy of correct-but-deleted answer:
No. The O_NONBLOCK flag doesn't affect working with regular files.
Some reference, for completeness:
That applies only to pipes; for regular files, it's ignored.
If the O_NONBLOCK flag is clear, a write request may cause the thread to block, but on normal completion it shall return nbyte.
If the O_NONBLOCK flag is set, write() requests shall be handled differently, in the following ways:
The write() function shall not block the thread.
A write request for {PIPE_BUF} or fewer bytes shall have the following effect: if there is sufficient space available in the pipe, write() shall transfer all the data and return the number of bytes requested. Otherwise, write() shall transfer no data and return -1 with errno set to [EAGAIN].
A write request for more than {PIPE_BUF} bytes shall cause one of the following:
When at least one byte can be written, transfer what it can and return the number of bytes written. When all data previously written to the pipe is read, it shall transfer at least {PIPE_BUF} bytes.
When no data can be written, transfer no data, and return -1 with errno set to [EAGAIN].
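For illustration, here is a minimal sketch of where EAGAIN actually can show up: a write() to a pipe opened non-blocking. The pipe2() call is Linux-specific, and error handling is abbreviated; a regular file descriptor would never take the EAGAIN branch.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe2(fds, O_NONBLOCK) == -1) {   /* nobody ever reads fds[0] */
        perror("pipe2");
        return 1;
    }
    const char msg[] = "hello";
    for (;;) {
        ssize_t n = write(fds[1], msg, sizeof msg);
        if (n == -1 && errno == EAGAIN) {
            /* Pipe buffer is full; with O_NONBLOCK we get EAGAIN
               instead of blocking. A regular file never does this. */
            puts("write would block: EAGAIN");
            break;
        }
        if (n == -1) {
            perror("write");
            return 1;
        }
    }
    return 0;
}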
Related
If I have two threads, thread0 and thread1.
thread0 does:
const char *msg = "thread0 => 0000000000\n";
write(fd, msg, strlen(msg));
thread1 does:
const char *msg = "thread1 => 111111111\n";
write(fd, msg, strlen(msg));
Will the output interleave? E.g.
thread0 => 000000111
thread1 => 111111000
First, note that your question is "Will data be interleaved?", not "Are write() calls [required to be] atomic?" Those are different questions...
"TL;DR" summary:
write() calls to a pipe or FIFO of PIPE_BUF bytes or fewer won't be interleaved
write() calls to anything else will fall somewhere between "probably won't be interleaved" and "won't ever be interleaved", with the majority of implementations in the "almost certainly won't be interleaved" to "won't ever be interleaved" range.
Full Answer
If you're writing to a pipe or FIFO, your data will not be interleaved at all for write() calls of PIPE_BUF bytes or fewer.
Per the POSIX standard for write() (note the last sentence of the excerpt):
RATIONALE
...
An attempt to write to a pipe or FIFO has several major characteristics:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process.
This is useful when there are multiple writers sending data to a
single reader. Applications need to know how large a write request can
be expected to be performed atomically. This maximum is called
{PIPE_BUF}. This volume of POSIX.1-2008 does not say whether write
requests for more than {PIPE_BUF} bytes are atomic, but requires that
writes of {PIPE_BUF} or fewer bytes shall be atomic.
...
Applicability of POSIX standards to Windows systems, however, is debatable at best.
So, for pipes or FIFOs, data won't be interleaved up to PIPE_BUF bytes.
How does that apply to files?
First, file append operations have to be atomic. Per that same POSIX standard (again, note the final clause):
If the O_APPEND flag of the file status flags is set, the file offset
shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation.
Also see Is file append atomic in UNIX?
So how does that apply to non-append write() calls?
Commonality of implementation. See the Linux read/write syscall implementations for an example. (Note that the "problem" is handed directly to the VFS implementation, though, so the answer might also be "It might very well depend on your file system...")
Most implementations of the write() system call inside the kernel are going to use the same code to do the actual data write for both append mode and "normal" write() calls - and for pwrite() calls, too. The only difference will be the source of the offset used - for "normal" write() calls the offset used will be the current file offset. For append write() calls the offset used will be the current end of the file. For pwrite() calls the offset used will be supplied by the caller (except that Linux is broken - it uses the current file size instead of the supplied offset parameter as the target offset for pwrite() calls on files opened in append mode. See the "BUGS" section of the Linux pwrite() man page.)
So appending data has to be atomic, and that same code will almost certainly be used for non-append write() calls in all implementations.
But the "write operation" in the append-must-be-atomic requirement is allowed to return less than the total number of bytes requested:
The write() function shall attempt to write nbyte bytes ...
Partial write() results are allowed even in append operations. But even then, the data that does get written must be written atomically.
What are the odds of a partial write()? That depends on what you're writing to. I've never seen a partial write() result to a file outside of the disk filling up or an actual hardware failure. Or even a partial read() result. I can't see any way for a write() operation that has all its data on a single page in kernel memory resulting in a partial write() in anything other than a disk full or hardware failure situation.
If you look at Is file append atomic in UNIX? again, you'll see that actual testing shows that append write() operations are in fact atomic.
So the answer to "Will multi-thread do write() interleaved?" is, "No, the data will almost certainly not be interleaved for writes that are at or under 4KB (page size) as long as the data does not cross a page boundary in kernel space." And even crossing a page boundary probably doesn't change the odds all that much.
If you're writing small chunks of data, it depends on your willingness to deal with the almost-certain-to-never-happen-but-it-might-anyway result of interleaved data. If it's a text log file, I'd opine that it won't matter anyway.
And note that it's not likely to be any faster to use multiple threads to write to the same file - the kernel is likely going to lock things and effectively single-thread the actual write() calls anyway to ensure it can meet the atomicity requirements of writing to a pipe and appending to a file.
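To make the append case concrete, here is a sketch of the pattern discussed above: two threads each issuing one write() per complete line on a shared O_APPEND descriptor. The file name and iteration count are arbitrary, and this is an illustration rather than a stress test; compile with -pthread.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int fd;   /* shared O_APPEND descriptor */

static void *writer(void *arg)
{
    const char *msg = arg;
    for (int i = 0; i < 1000; i++)
        if (write(fd, msg, strlen(msg)) == -1)   /* one line per write() */
            perror("write");
    return NULL;
}

int main(void)
{
    fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) { perror("open"); return 1; }

    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer, (void *)"thread0 => 0000000000\n");
    pthread_create(&t1, NULL, writer, (void *)"thread1 => 1111111111\n");
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return close(fd) == -1;
}

Per the O_APPEND guarantee quoted above, each line should land whole at the end of the file, never interleaved character-by-character.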
Consider the following invocation of read() on a nonblocking stream-mode socket (SOCK_STREAM):
ssize_t n = read(socket_fd, buffer, size);
Assume that the remote peer will not close the connection, and will not shut down its writing half of the connection (the reading half, from a local point of view).
On Linux, a short read (n > 0 && n < size) under these circumstances means that the kernel-level read buffer has been exhausted, and an immediate follow-up invocation would normally fail with EAGAIN/EWOULDBLOCK (it would fail unless new data manages to arrive in between the two calls).
In other words, on Linux, an invocation of read() will always consume everything that is immediately available provided that size is large enough.
Likewise for write(), on Linux a short write always means that the kernel-level buffer was filled, and an immediate follow-up invocation is likely to fail with EAGAIN/EWOULDBLOCK.
Question 1: Is this also guaranteed on macOS/OSX?
Question 2: Is this also guaranteed on FreeBSD?
Question 3: Is this required/guaranteed by POSIX?
I know this is true on Linux, because of the following note in the manual page for epoll (section 7):
For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.)
EDIT: As a motivation for the question, consider a case where you want to process input on a number of sockets simultaneously, and for whatever reason, you want to do this by fully exhausting in-kernel buffers for each socket in turn (i.e., "depth first" rather than "breadth first"). This can obviously be done by repeating a read on a read-ready socket until it fails with EAGAIN/EWOULDBLOCK, but the last invocation would be redundant if the previous read was short, and we knew that a short read was a guarantee of exhaustion.
It is guaranteed by POSIX:
data shall be returned to the user as soon as it becomes available.
... and therefore on all the other platforms you mention as well, and also Windows, OS/2, NetWare, ...
Any other implementation would be pointless.
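Assuming the guarantee holds, the "depth first" drain from the question's EDIT can be sketched like this. handle_data() is a hypothetical callback, and the early break on a short read is exactly the epoll(7) technique quoted above:

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

extern void handle_data(const char *data, size_t len);   /* hypothetical */

void drain_socket(int socket_fd)
{
    char buffer[4096];
    for (;;) {
        ssize_t n = read(socket_fd, buffer, sizeof buffer);
        if (n > 0) {
            handle_data(buffer, (size_t)n);
            if ((size_t)n < sizeof buffer)
                break;   /* short read: kernel read buffer exhausted */
        } else if (n == 0) {
            break;       /* peer closed its writing half after all */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;       /* nothing available right now */
        } else if (errno != EINTR) {
            break;       /* real error: handle appropriately */
        }
    }
}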
the docs say for send:
When the message does not fit into the send buffer of the socket,
send() normally blocks, unless the socket has been placed in non-block-
ing I/O mode. In non-blocking mode it would return EAGAIN in this
case. The select(2) call may be used to determine when it is possible
to send more data.
I am in blocking mode, doing something along the lines of:
char *buf = malloc(size);
send(socket, buf, size, 0);
free(buf);
Assume buf is very large, larger than the send buffer can hold at a time (so it would need to go into the buffer as two chunks, let's say). Anyway, in blocking mode, which I'm in, after send() returns, can I feel safe that the data has been fully copied or dealt with, and is thus safe to free?
In blocking mode, send blocks until I/O is complete, or an error is triggered. You should check the returned value, because a send operation does not guarantee that the number of bytes sent equals the number of bytes passed as the third argument.
Only when send returns a value equal to the size of the buffer can you be sure that the whole block has been copied into kernel memory, or passed through device memory, or sent to the destination.
The short answer is: Yes, you can free the buffer after the send() call successfully returns (without errors) when the file descriptor is in blocking mode.
The reason for this is based on the blocking concept itself: The send() call (targeting a blocking file descriptor) will only return when an error occur or the requested size bytes of the data in the buf is buffered or transmitted by the underlying layer of the operating system (typically the kernel).
Also note that a successful return of send() doesn't mean that the data was transmitted. It means that it was, at least, buffered by the underlying layer.
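A sketch of the return-value checking both answers recommend, wrapped in a hypothetical send_all() helper; only after it reports full success is it safe to free() the buffer:

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: returns 0 once every byte has been accepted
   by the kernel, -1 on error. */
int send_all(int sock, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = send(sock, buf + off, len - off, 0);
        if (n == -1) {
            if (errno == EINTR)
                continue;    /* interrupted before any transfer: retry */
            return -1;       /* real error */
        }
        off += (size_t)n;    /* short send: advance and loop */
    }
    return 0;
}

With that, the question's pattern becomes malloc(), send_all(), and then free() only after send_all() reports success.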
Let's assume we opened a file using fopen() and, from the file pointer received, fetch the file descriptor using fileno(). Then we do lots (>10^8) of random read()s of relatively small chunks, between 4 bytes and 10 KBytes in size, from this file:
Is it expected behaviour that such a read() might return fewer bytes than requested, without setting errno, if the file system is
1. ext3
2. NFS
3. OCFS2
4. a combination of 2 and 3 (OCFS2 via NFS)
?
My reading led me to the conclusion that it should not be possible for 1. (if the file does not have O_NONBLOCK set, if it is even possible for ext3 to have it set), but for the other three (2., 3., 4.) I'm uncertain.
(Btw: could I assume O_NONBLOCK being clear to be the default in every case?)
This question arose because I observed read()s returning fewer bytes than requested, without errno set, in case 4.
The problem with drilling this down by testing is that such behaviour happens in fewer than 1 in 10^9 cases... which is still too often :-}
Update: The average file size is between around 1 GByte and some TBytes.
You should not assume that read() will never return fewer bytes than requested, for any filesystem. This is particularly true in the case of large reads, as POSIX.1 indicates that read() behavior for sizes larger than SSIZE_MAX is implementation-dependent. On the mainstream Unix box I'm using right now, SSIZE_MAX is 32767 bytes. The fact that read() always returns the full amount today does not mean it will in the future.
One possible reason might be that I/O priorities are more fully fleshed out in the kernel in the future. E.g. you're trying to read from the same device as another higher priority process and the other process would get better throughput if your process wasn't causing head movement away from the sectors the other process wants. The kernel might choose to give your read() a short count to get you out of the way for a while, instead of continuing to do inefficient interleaved block reads. Stranger things have been done for the sake of I/O efficiency. What is not prohibited often becomes compulsory.
We solved the problem described as read() returning fewer bytes than requested when reading from a file located on an NFS mount pointing to an OCFS2 file system (case 4 in my question).
It is a fact that, using the setup mentioned above, such read()s on file descriptors sometimes return fewer bytes than requested, without errno being set.
To read all the data, it is as simple as just read()ing again and again until the requested amount of data has been read.
Moreover, such a setup sometimes makes read() fail with EIO, and even then a simple re-read() leads to success and the data arrives.
My conclusion: reading from OCFS2 via NFS makes read()ing from files behave like read()ing from sockets, which is inconsistent with the specification of read() (http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html):
When attempting to read a file (other than a pipe or FIFO) that
supports non-blocking reads and has no data currently available:
If O_NONBLOCK is set, read() shall return -1 and set errno to [EAGAIN].
If O_NONBLOCK is clear, read() shall block the calling thread until some data becomes available.
Needless to say, we never tried, nor even thought about, setting O_NONBLOCK for the file descriptors in question.
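A sketch of the workaround described above: a hypothetical read_fully() that re-read()s after a short count and, specifically for this OCFS2-over-NFS setup, retries a sporadic EIO a bounded number of times. Blindly retrying EIO is not generally safe; treat that part as specific to this setup.

#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: returns bytes read, or -1 on a non-retryable error. */
ssize_t read_fully(int fd, char *buf, size_t count)
{
    size_t got = 0;
    int eio_retries = 0;
    while (got < count) {
        ssize_t n = read(fd, buf + got, count - got);
        if (n > 0)
            got += (size_t)n;                 /* short read: keep going */
        else if (n == 0)
            break;                            /* genuine end of file */
        else if (errno == EINTR)
            continue;                         /* interrupted: retry */
        else if (errno == EIO && ++eio_retries < 3)
            continue;                         /* transient on this setup */
        else
            return -1;
    }
    return (ssize_t)got;
}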
The read and write functions (and relatives like send, recv, readv, ...) can return a number of bytes less than the requested read/write length if interrupted by a signal (under certain circumstances), and perhaps in other cases too. Is there a well-defined set of conditions for when this can happen, or is it largely up to the implementation? Here are some particular questions I'm interested in the answers to:
If a signal handler is non-interrupting (SA_RESTART) that will cause IO operations interrupted before any data is transferred to be restarted after the signal handler returns. But if a partial read/write has already occurred and the signal handler is non-interrupting, will the syscall return immediately with the partial length, or will it be resumed attempting to read/write the remainder?
Obviously read functions can return short reads on network, pipe, and terminal file descriptors when less data than the requested amount is available. But can write functions return short writes in these cases due to limited buffer size, or will they block until all the data can be written?
I'd be interested in all three of standards-required, common, and Linux-specific behavior.
For your second question: write() can return short writes due to a limited buffer size if the descriptor is non-blocking.
There's at least one standard condition that can cause write on a regular file to return a short size:
If a write() requests that more bytes
be written than there is room for (for
example, [XSI] the file size limit
of the process or the physical end of
a medium), only as many bytes as there
is room for shall be written. For
example, suppose there is space for 20
bytes more in a file before reaching a
limit. A write of 512 bytes will
return 20. The next write of a
non-zero number of bytes would give a
failure return (except as noted
below).
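Putting the two questions together, here is a sketch of a write loop that restarts after EINTR (as when the handler lacks SA_RESTART and no data was transferred yet) and continues after a partial write; a short final count then only remains on a hard condition like the file size limit quoted above.

#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of bytes actually written;
   a value short of count means a hard condition (e.g. EFBIG/ENOSPC)
   or another non-retryable error was hit. */
ssize_t write_all(int fd, const char *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = write(fd, buf + done, count - done);
        if (n == -1) {
            if (errno == EINTR)
                continue;    /* signal before any transfer: restart */
            break;           /* limit reached or real error */
        }
        done += (size_t)n;   /* partial write: write the remainder */
    }
    return (ssize_t)done;
}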