Linux read function implementation - c

I want to know how read() behaves when it is passed a socket descriptor versus a file descriptor for a regular file. With a regular file, it always returns the n bytes requested, or fewer if fewer than n bytes remain. With a socket descriptor, however, it is not guaranteed to return n bytes. So to be sure we have received n bytes, we have to add application logic that counts how many bytes have arrived and stops when the count reaches n. My question is: why don't we need the same application logic when reading from a file?

Read the read(2) man page:
man 2 read
You'd better assume that it may always return a byte count less than the size of the buffer you passed to it (in particular, because it can be difficult to know whether the file descriptor refers to a socket, a tty, some other device, a pipe, a FIFO, or a plain file, and also because some file systems have non-POSIX-compliant semantics). You might also have reached the end of file (EOF), etc.
For TCP sockets, remember that they are only a stream of bytes, and a single send may be received across several reads. In particular, message chunks can be split or reassembled by "the network" (e.g. routers).
For plain files, remember that some other process could change the file (e.g. write into it, truncate it, etc.) while your process is reading it.
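The advice above amounts to always wrapping read() in a loop, whatever the descriptor refers to. A minimal sketch follows; the helper name read_full is hypothetical and not part of any library, and it assumes a blocking descriptor.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until `count` bytes have arrived,
 * end of file is reached, or a real error occurs.
 * Returns the number of bytes stored in buf (less than count only at EOF),
 * or -1 on error with errno set by read(). */
ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);
        if (n > 0) {
            done += (size_t)n;      /* short read: just ask for the rest */
        } else if (n == 0) {
            break;                  /* EOF: nothing more will arrive */
        } else if (errno != EINTR) {
            return -1;              /* real error */
        }
        /* EINTR: interrupted by a signal before any data arrived; retry */
    }
    return (ssize_t)done;
}
```

The same loop works for sockets, pipes and plain files, so the caller does not need to know which kind of descriptor it was handed.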

Related

Is there an official document that marks read/write as thread-safe functions?

The man pages of read/write don't mention anything about their thread safety.
According to this link, these functions are thread-safe, but that comment does not link to an official document.
On the other hand, this link says:
The read() function shall attempt to read nbyte bytes
from the file associated with the open file descriptor,
fildes, into the buffer pointed to by buf.
The behavior of multiple concurrent reads on the same pipe, FIFO, or
terminal device is unspecified.
I concluded that the read function is not thread-safe.
I am so confused now. Please send me a link to an official document about the thread safety of these functions.
I tested these functions with a pipe and there wasn't any problem (of course, I know I can't draw a firm conclusion from testing a few examples).
Thanks in advance :)
The thread safe versions of read and write are pread and pwrite:
pread(2)
The pread() and pwrite() system calls are especially useful in
multithreaded applications. They allow multiple threads to perform
I/O on the same file descriptor without being affected by changes to
the file offset by other threads.
When two threads call write() at the same time, the order in which the calls complete is not specified, so without synchronization the behaviour is unspecified.
read() and write() are not strictly thread-safe, and there is no documentation that says they are, as the location where the data is read from or written to can be modified by another thread.
Per the POSIX read documentation:
The read() function shall attempt to read nbyte bytes from the file associated with the open file descriptor, fildes, into the buffer pointed to by buf. The behavior of multiple concurrent reads on the same pipe, FIFO, or terminal device is unspecified.
That's the part you noticed - but that does not cover all possible types of file descriptors, such as regular files. It only applies to "pipe[s], FIFO[s]" and "terminal device[s]". This part covers almost everything else (weird things like "files" in /proc that are generated on the fly by the kernel are, well, weird and highly implementation-specific):
On files that support seeking (for example, a regular file), the read() shall start at a position in the file given by the file offset associated with fildes. The file offset shall be incremented by the number of bytes actually read.
Since the "file offset associated with fildes" is subject to modification from other threads in the process, the following code is not guaranteed to return the same results even given the exact same file contents and inputs for fd, offset, buffer, and bytes:
lseek( fd, offset, SEEK_SET );
read( fd, buffer, bytes );
Since both read() and write() depend upon a state (current file offset) that can be modified at any moment by another thread, they are not thread-safe.
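To make the contrast with pread() concrete, here is a small sketch. The function names read_at_racy and read_at are made up for illustration; the only real APIs used are lseek(), read() and pread().

```c
#include <sys/types.h>
#include <unistd.h>

/* Racy when other threads use the same fd: the shared file offset can
 * change between the lseek() and the read(). */
ssize_t read_at_racy(int fd, void *buf, size_t count, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, count);
}

/* pread() takes the offset as an argument; it neither uses nor changes
 * the file offset associated with fd. */
ssize_t read_at(int fd, void *buf, size_t count, off_t offset)
{
    return pread(fd, buf, count, offset);
}
```

Note that pread() only works on descriptors that support seeking, so it is not a replacement for read() on pipes, FIFOs or sockets.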
On some embedded file systems, or really old desktop systems that weren't designed to facilitate multitasking support (e.g. MS-DOS 3.0), an attempt to perform an fread() on one file while an fread() is being performed on another file may result in arbitrary system corruption.
Any modern operating system and language runtime will guarantee that such corruption won't occur as a result of operations performed on unrelated files, or when independent file descriptors are used to access the same file in ways that do not modify it. Functions like fread() and fwrite() will be thread-safe when used in that fashion.
The act of reading data from a disk file does not modify it, but reading data from many kinds of stream will modify them by removing data. If two threads both perform actions that modify the same stream, such actions may interfere with each other in unspecified ways even if such modifications are performed by fread() operations.

Can I determine how many bytes are in the stdio userspace read buffer associated with a FILE?

I'm writing a C program that connects to another machine over a TCP socket and reads newline-delimited text over that TCP connection.
I use poll to check whether data is available on the file descriptor associated with the socket, and then I read characters into a buffer until I get a newline. However, to make that character-by-character read efficient, I'm using a stdio FILE instead of using the read system call.
When more than one short line of input arrives over the socket quickly, my current approach has a bug. When I start reading characters, stdio buffers several lines of data in userspace. Once I've read one line and processed it, I then poll the socket file descriptor again to determine whether there is more data to read.
Unfortunately, that poll (and fstat, and every other method I know to get the number of bytes in a file) doesn't know about any leftover data that is buffered in userspace as part of the FILE. This results in my program blocking on that poll when it should be consuming data that has been buffered into userspace.
How can I check how much data is buffered in userspace? The specs specifically tell you not to rely on setvbuf for this purpose (the representation format is undefined), so I'm hoping for another option.
Right now, it seems like my best option is to implement my own userspace buffering where I have control over this, but I wanted to check before going down that road.
EDIT:
Some comments did provide a way to test whether at least one character is available, by setting the file to nonblocking and trying to fgetc/ungetc a single character, but this can't tell you how many bytes are available.
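For reference, the "own userspace buffering" option mentioned in the question could look roughly like the sketch below. Everything here (struct linebuf, lb_buffered, lb_fill, lb_take_line, the 4096-byte buffer size) is a hypothetical design, not an existing API; it assumes lines are shorter than the buffer.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical line buffer layered directly on read(). */
struct linebuf {
    int    fd;
    size_t len;          /* bytes currently buffered in data[] */
    char   data[4096];
};

/* Number of bytes waiting in userspace; this is what stdio does not expose. */
size_t lb_buffered(const struct linebuf *lb)
{
    return lb->len;
}

/* Do one read() into the free space. Returns read()'s result
 * (0 = EOF, -1 = error), or -1 if the buffer is already full. */
ssize_t lb_fill(struct linebuf *lb)
{
    if (lb->len == sizeof lb->data)
        return -1;                       /* a line longer than the buffer */
    ssize_t n = read(lb->fd, lb->data + lb->len, sizeof lb->data - lb->len);
    if (n > 0)
        lb->len += (size_t)n;
    return n;
}

/* Copy the next complete line (without '\n') into out as a C string,
 * truncating if out is too small.
 * Returns 1 if a line was extracted, 0 if no complete line is buffered. */
int lb_take_line(struct linebuf *lb, char *out, size_t outsz)
{
    char *nl = memchr(lb->data, '\n', lb->len);
    if (nl == NULL || outsz == 0)
        return 0;

    size_t linelen = (size_t)(nl - lb->data);
    size_t copy = linelen < outsz - 1 ? linelen : outsz - 1;
    memcpy(out, lb->data, copy);
    out[copy] = '\0';

    size_t used = linelen + 1;           /* drop the line and its newline */
    memmove(lb->data, lb->data + used, lb->len - used);
    lb->len -= used;
    return 1;
}
```

With this layout the main loop would call lb_fill() once after poll() reports the socket readable, then call lb_take_line() until it returns 0, and only then go back to poll().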

Will multi-thread do write() interleaved

If I have two threads, thread0 and thread1.
thread0 does:
const char *msg = "thread0 => 0000000000\n";
write(fd, msg, strlen(msg));
thread1 does:
const char *msg = "thread1 => 111111111\n";
write(fd, msg, strlen(msg));
Will the output interleave? E.g.
thread0 => 000000111
thread1 => 111111000
First, note that your question is "Will data be interleaved?", not "Are write() calls [required to be] atomic?" Those are different questions...
"TL;DR" summary:
write() to a pipe or FIFO less than or equal to PIPE_BUF bytes won't be interleaved
write() calls to anything else will be somewhere in the range between "probably won't be interleaved" to "won't ever be interleaved" with the majority of implementations in the "almost certainly won't be interleaved" to "won't ever be interleaved" range.
Full Answer
If you're writing to a pipe or FIFO, your data will not be interleaved at all for write() calls of PIPE_BUF or fewer bytes.
Per the POSIX standard for write():
RATIONALE
...
An attempt to write to a pipe or FIFO has several major characteristics:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process.
This is useful when there are multiple writers sending data to a
single reader. Applications need to know how large a write request can
be expected to be performed atomically. This maximum is called
{PIPE_BUF}. This volume of POSIX.1-2008 does not say whether write
requests for more than {PIPE_BUF} bytes are atomic, but requires that
writes of {PIPE_BUF} or fewer bytes shall be atomic.
...
Applicability of POSIX standards to Windows systems, however, is debatable at best.
So, for pipes or FIFOs, data won't be interleaved up to PIPE_BUF bytes.
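A sender that wants to rely on this guarantee can check the record size before writing. The sketch below is one way to do it; write_record is a hypothetical name, and the limit is queried with fpathconf(fd, _PC_PIPE_BUF) rather than assuming a particular value.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: send one record to a pipe or FIFO with a single
 * write(), refusing records larger than the pipe's atomic-write limit.
 * Returns 0 on success, -1 on failure. */
int write_record(int pipe_fd, const char *msg)
{
    size_t len = strlen(msg);
    long limit = fpathconf(pipe_fd, _PC_PIPE_BUF);

    if (limit < 0)
        return -1;                  /* not a pipe/FIFO, or limit unknown */
    if (len > (size_t)limit) {
        errno = EMSGSIZE;           /* too big to be guaranteed atomic */
        return -1;
    }
    ssize_t n = write(pipe_fd, msg, len);
    return n == (ssize_t)len ? 0 : -1;
}
```

Each thread or process would then call write_record() once per complete line, never splitting a line across two calls.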
How does that apply to files?
First, file append operations have to be atomic. Per that same POSIX standard:
If the O_APPEND flag of the file status flags is set, the file offset
shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation.
Also see Is file append atomic in UNIX?
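As an illustration of the append case, a logger might open the file with O_APPEND and emit each line with one write() call. This is only a sketch; append_line is a hypothetical helper, and the 0644 mode is an arbitrary choice.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one line to a file using a single write().
 * Returns 0 if the whole line was written, -1 otherwise. */
int append_line(const char *path, const char *line)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(line);
    /* With O_APPEND the offset is moved to end of file as part of the write. */
    ssize_t n = write(fd, line, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A long-running program would normally keep the descriptor open instead of reopening the file for every line; the open/close pair is here only to keep the example self-contained.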
So how does that apply to non-append write() calls?
Commonality of implementation. See the Linux read/write syscall implementations for an example. (Note that the "problem" is handed directly to the VFS implementation, though, so the answer might also be "It might very well depend on your file system...")
Most implementations of the write() system call inside the kernel are going to use the same code to do the actual data write for both append mode and "normal" write() calls - and for pwrite() calls, too. The only difference will be the source of the offset used - for "normal" write() calls the offset used will be the current file offset. For append write() calls the offset used will be the current end of the file. For pwrite() calls the offset used will be supplied by the caller (except that Linux is broken - it uses the current file size instead of the supplied offset parameter as the target offset for pwrite() calls on files opened in append mode. See the "BUGS" section of the Linux pwrite() man page.)
So appending data has to be atomic, and that same code will almost certainly be used for non-append write() calls in all implementations.
But the "write operation" in the append-must-be-atomic requirement is allowed to return less than the total number of bytes requested:
The write() function shall attempt to write nbyte bytes ...
Partial write() results are allowed even in append operations. But even then, the data that does get written must be written atomically.
What are the odds of a partial write()? That depends on what you're writing to. I've never seen a partial write() result to a file outside of the disk filling up or an actual hardware failure, nor a partial read() result. I can't see any way for a write() operation that has all its data on a single page in kernel memory to result in a partial write() in anything other than a disk-full or hardware-failure situation.
If you look at Is file append atomic in UNIX? again, you'll see that actual testing shows that append write() operations are in fact atomic.
So the answer to "Will multi-thread do write() interleaved?" is, "No, the data will almost certainly not be interleaved for writes that are at or under 4KB (page size) as long as the data does not cross a page boundary in kernel space." And even crossing a page boundary probably doesn't change the odds all that much.
If you're writing small chunks of data, it depends on your willingness to deal with the almost-certain-to-never-happen-but-it-might-anyway result of interleaved data. If it's a text log file, I'd opine that it won't matter anyway.
And note that it's not likely to be any faster to use multiple threads to write to the same file - the kernel is likely going to lock things and effectively single-thread the actual write() calls anyway to ensure it can meet the atomicity requirements of writing to a pipe and appending to a file.

Flushing pipe without closing in C

I have found a lot of threads in here asking about how it is possible to flush a pipe after writing to it without closing it.
In every thread I could see different suggestions but i could not find a definite solution.
Here is a quick summary:
The easiest way to avoid read blocking on the pipe would be to write exactly the number of bytes that the reader is reading.
It could also be done by using ptmx instead of a pipe, but people said that could be too much.
Note: It's not possible to use fsync with pipes
Are there any other more efficient solutions?
Edit:
The flush would be convenient when the sender wants to write n characters but the client reads m characters (where m>n). The client will block waiting for another m-n characters. If the sender wants to communicate with the client again, closing the pipe is not an option, and sending exactly the right number of bytes every time could be a good source of bugs.
The receiver operates like this and it cannot be modified:
while ((n = read(0, buf, 100)) > 0) {
    process(buf);
}
So for the sender to get "file1" and "file2" processed, it has to do:
char msg1[100] = "file1";  /* the remaining 95 bytes are zero-filled */
char msg2[100] = "file2";
write(pipe[1], msg1, sizeof msg1);
write(pipe[1], msg2, sizeof msg2);
What I am looking for is a way to do something like this (without having to use \n as the delimiter):
write(pipe[1], "file1\nfile2", 11); //it would have worked if it was ptmx
(Using read and write)
Flushing in the sense of fflush() is irrelevant to pipes, because they are not represented as C streams. There is therefore no userland buffer to flush. Similarly, fsync() is also irrelevant to pipes, because there is no back-end storage for the data. Data successfully written to a pipe exist in the kernel and only in the kernel until they are successfully read, so there is no work for fsync() to do. Overall, flushing / syncing is applicable only where there is intermediate storage involved, which is not the case with pipes.
With the clarification, your question seems to be about establishing message boundaries for communication via a pipe. You are correct that closing the write end of the pipe will signal a boundary -- not just of one message, but of the whole stream of communication -- but of course that's final. You are also correct that there are no inherent message boundaries. Nevertheless, you seem to be working from at least somewhat of a misconception here:
The easiest way to avoid read blocking on the pipe would be to write
the exact number of bytes that is reading.
[...]
The flush would be convenient when the sender wants to write n
characters but the client reads m characters (where m>n). The client
will block waiting for another m-n characters.
Whether the reader will block is entirely dependent on how the reader is implemented. In particular, the read(2) system call in no way guarantees to transfer the requested number of bytes before returning. It can and will perform a short read under some circumstances. Although the details are unspecified, you can ordinarily rely on a short read when at least one character can be transferred without blocking, but not the whole number requested. The same applies to write(2). Thus, the easiest way to avoid read() blocking is to ensure that you write at least one byte to the pipe for that read() call to transfer.
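The short-read behaviour is easy to observe with a pipe inside a single process. The function below is a throwaway demonstration with a made-up name (short_read_demo); it assumes len is small enough to fit in the pipe's buffer.

```c
#include <stddef.h>
#include <unistd.h>

/* Writes `len` bytes into a fresh pipe, then asks read() for up to `bufsz`
 * bytes. Returns whatever read() returned, or -1 on setup failure. */
ssize_t short_read_demo(const char *data, size_t len, char *buf, size_t bufsz)
{
    int fds[2];
    ssize_t n = -1;

    if (len == 0)
        return -1;              /* an empty pipe would make read() block */
    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], data, len) == (ssize_t)len)
        n = read(fds[0], buf, bufsz);   /* returns once some data is available */
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling short_read_demo("file1", 5, buf, 100) with a 100-byte buf returns 5 on Linux rather than blocking until 100 bytes have been written.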
In fact, people usually come at this issue from the opposite direction: needing to be certain to receive a specific number of bytes, and therefore having to deal with the potential for short reads as a complication (to be dealt with by performing the read() in a loop). You'll need to consider that, too, but you have the benefit that your client is unlikely to block under the circumstances you describe; it just isn't the problem you think it is.
There is an inherent message-boundary problem in any kind of stream communication, however, and you'll need to deal with it. There are several approaches; among the most commonly used are
Fixed-length messages. The receiver can then read until it successfully transfers the required number of bytes; any blocking involved is appropriate and needful. With this approach, the scenario you postulate simply does not arise, but the writer might need to pad its messages.
Delimited messages. The receiver then reads until it finds that it has received a message delimiter (a newline or a null byte, for example). In this case, the receiver will need to be prepared for the possibility of message boundaries not being aligned with the byte sequences transferred by read() calls. Marking the end of a message by closing the channel can be considered a special case of this alternative.
Embedded message-length metadata. This can take many forms, but one of the simplest is to structure messages as a fixed-length integer message length field, followed by that number of bytes of message data. The reader then knows at every point how many bytes it needs to read, so it will not block needlessly.
These can be used individually or in combination to implement an application-layer protocol for communicating between your processes. Naturally, the parties to the communication must agree on the protocol details for communication to be successful.
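As an illustration of the third approach (embedded message length), here is a minimal sketch of a framing layer. The names send_msg, recv_msg, read_exact and write_exact are hypothetical, and the 4-byte big-endian length prefix is just one possible convention. It assumes blocking descriptors and a single writer; with several writers the header and payload of different messages could interleave.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Read exactly `count` bytes. Returns 0 on success, -1 on error or early EOF. */
static int read_exact(int fd, void *buf, size_t count)
{
    char *p = buf;
    while (count > 0) {
        ssize_t n = read(fd, p, count);
        if (n == 0)
            return -1;                      /* EOF in the middle of a message */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        count -= (size_t)n;
    }
    return 0;
}

/* Write exactly `count` bytes, retrying after short writes. */
static int write_exact(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        count -= (size_t)n;
    }
    return 0;
}

/* Frame layout: 4-byte big-endian payload length, then the payload. */
int send_msg(int fd, const void *payload, uint32_t len)
{
    uint32_t be = htonl(len);
    if (write_exact(fd, &be, sizeof be) != 0)
        return -1;
    return write_exact(fd, payload, len);
}

/* Receive one frame into buf. Returns the payload length, or -1 on error,
 * EOF, or a payload larger than bufsz. */
long recv_msg(int fd, void *buf, uint32_t bufsz)
{
    uint32_t be;
    if (read_exact(fd, &be, sizeof be) != 0)
        return -1;
    uint32_t len = ntohl(be);
    if (len > bufsz)
        return -1;
    if (read_exact(fd, buf, len) != 0)
        return -1;
    return (long)len;
}
```

The receiver always knows how many bytes to wait for, so any blocking it does is for data the sender has actually promised.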

Is read() on a nonblocking socket "greedy" on platforms other than Linux (OSX, FreeBSD)?

Consider the following invocation of read() on a nonblocking stream-mode socket (SOCK_STREAM):
ssize_t n = read(socket_fd, buffer, size);
Assume that the remote peer will not close the connection, and will not shut down its writing half of the connection (the reading half, from a local point of view).
On Linux, a short read (n > 0 && n < size) under these circumstances means that the kernel-level read buffer has been exhausted, and an immediate follow-up invocation would normally fail with EAGAIN/EWOULDBLOCK (it would fail unless new data manages to arrive in between the two calls).
In other words, on Linux, an invocation of read() will always consume everything that is immediately available provided that size is large enough.
Likewise for write(), on Linux a short write always means that the kernel-level buffer was filled, and an immediate follow-up invocation is likely to fail with EAGAIN/EWOULDBLOCK.
Question 1: Is this also guaranteed on macOS/OSX?
Question 2: Is this also guaranteed on FreeBSD?
Question 3: Is this required/guaranteed by POSIX?
I know this is true on Linux, because of the following note in the manual page for epoll (section 7):
For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.)
EDIT: As a motivation for the question, consider a case where you want to process input on a number of sockets simultaneously, and for whatever reason, you want to do this by fully exhausting in-kernel buffers for each socket in turn (i.e., "depth first" rather than "breadth first"). This can obviously be done by repeating a read on a read-ready socket until it fails with EAGAIN/EWOULDBLOCK, but the last invocation would be redundant if the previous read was short, and we knew that a short read was a guarantee of exhaustion.
It is guaranteed by Posix:
data shall be returned to the user as soon as it becomes available.
... and therefore on all the other platforms you mention as well, and also Windows, OS/2, NetWare, ...
Any other implementation would be pointless.
