How does list I/O writev internally work? - c

The writev function takes an array of struct iovec as input argument
writev(int fd, const struct iovec *iov, int iovcnt);
The input is a list of memory buffers that need to be written to a file (say). What I want to know is:
Does writev internally do this:
for (each element in iov)
write(element)
such that every element of iov is written to file in a separate I/O call? Or does writev write everything to file in a single I/O call?

Per the standards, the for loop you mentioned is not a valid implementation of writev, for several reasons:
The loop could fail to finish writing one iov before proceeding to the next, in the event of a short write - but this could be worked around by making the loop more elaborate.
The loop could have incorrect behavior with respect to atomicity for pipes: if the total write length is smaller than PIPE_BUF, the pipe write is required to be atomic, but the loop would break the atomicity requirement. This issue cannot be worked around except by moving all the iov entries into a single buffer before writing when the total length is at most PIPE_BUF.
The loop might have cases where it could result in blocking, where the single writev call would be required to perform a partial write without blocking. As far as I know, this issue would be impossible to work around in the general case.
Possibly other reasons I haven't thought of.
I'm not sure about point #3, but it definitely exists in the opposite direction, when reading. Calling read in a loop could block if a terminal has some data (shorter than the total iov length) available followed by an EOF indicator; calling readv should return immediately with a partial read in this case. However, due to a bug in Linux, readv on terminals is actually implemented as a read loop in kernelspace, and it does exhibit this blocking bug. I had to work around this bug in implementing musl's stdio:
http://git.etalabs.net/cgi-bin/gitweb.cgi?p=musl;a=commit;h=2cff36a84f268c09f4c9dc5a1340652c8e298dc0
To answer the last part of your question:
Or does writev write everything to file in a single I/O call?
In all cases, a conformant writev implementation will be a single syscall. Getting down to how it's implemented on Linux: for ordinary files and for most devices, the underlying file driver has methods that implement iov-style io directly, without any sort of internal loop. But the terminal driver on Linux is highly outdated and lacks the modern io methods, causing the kernel to fallback to a write/read loop for writev/readv when operating on a terminal.

The direct way to know how code works is read the source code.
see http://www.oschina.net/code/explore/glibc-2.9/sysdeps/posix/writev.c
It simplely alloca() or malloc() a buffer, copy all vectors into it, and call write() once.
That how it works. Nothing mysterious.

Or does writev write everything to file in a single I/O call?
I'm afarid not everything, though sys_writev try its best to write everything in a single call. it's depends on vfs's implement, if the vfs doesn't give an implement of writev, then kenerl will call vfs' write() in a loop. it's better to check the return value of writev/readv to see how many bytes wrotten as you do in write().
you can find the code of writev in kernel, fs/read_write.c:do_readv_writev.

Related

Is there a portable way to discard a number of readable bytes from a socket-like file descriptor?

Is there a portable way to discard a number of incoming bytes from a socket without copying them to userspace? On a regular file, I could use lseek(), but on a socket, it's not possible. I have two scenarios where I might need it:
A stream of records is arriving on a file descriptor (which can be a TCP, a SOCK_STREAM type UNIX domain socket or potentially a pipe). Each record is preceeded by a fixed size header specifying its type and length, followed by data of variable length. I want to read the header first and if it's not of the type I'm interested in, I want to just discard the following data segment without transferring them into user space into a dummy buffer.
A stream of records of varying and unpredictable length is arriving on a file descriptor. Due to asynchronous nature, the records may still be incomplete when the fd becomes readable, or they may be complete but a piece of the next record already may be there when I try to read a fixed number of bytes into a buffer. I want to stop reading the fd at the exact boundary between the records so I don't need to manage partially loaded records I accidentally read from the fd. So, I use recv() with MSG_PEEK flag to read into a buffer, parse the record to determine its completeness and length, and then read again properly (thus actually removing data from the socket) to the exact length. This would copy the data twice - I want to avoid that by simply discarding the data buffered in the socket by an exact amount.
On Linux, I gather it is possible to achieve that by using splice() and redirecting the data to /dev/null without copying them to userspace. However, splice() is Linux-only, and the similar sendfile() that is supported on more platforms can't use a socket as input. My questions are:
Is there a portable way to achieve this? Something that would work on other UNIXes (primarily Solaris) as well that do not have splice()?
Is splice()-ing into /dev/null an efficient way to do this on Linux, or would it be a waste of effort?
Ideally, I would love to have a ssize_t discard(int fd, size_t count) that simply removes count of readable bytes from a file descriptor fd in kernel (i.e. without copying anything to userspace), blocks on blockable fd until the requested number of bytes is discarded, or returns the number of successfully discarded bytes or EAGAIN on a non-blocking fd just like read() would do. And advances the seek position on a regular file of course :)
The short answer is No, there is no portable way to do that.
The sendfile() approach is Linux-specific, because on most other OSes implementing it, the source must be a file or a shared memory object. (I haven't even checked if/in which Linux kernel versions, sendfile() from a socket descriptor to /dev/null is supported. I would be very suspicious of code that does that, to be honest.)
Looking at e.g. Linux kernel sources, and considering how little a ssize_t discard(fd, len) differs from a standard ssize_t read(fd, buf, len), it is obviously possible to add such support. One could even add it via an ioctl (say, SIOCISKIP) for easy support detection.
However, the problem is that you have designed an inefficient approach, and rather than fix the approach at the algorithmic level, you are looking for crutches that would make your approach perform better.
You see, it is very hard to show a case where the "extra copy" (from kernel buffers to userspace buffers) is an actual performance bottleneck. The number of syscalls (context switches between userspace and kernel space) sometimes is. If you sent a patch upstream implementing e.g. ioctl(socketfd, SIOCISKIP, bytes) for TCP and/or Unix domain stream sockets, they would point out that the performance increase this hopes to achieve is better obtained by not trying to obtain the data you don't need in the first place. (In other words, the way you are trying to do things, is inherently inefficient, and rather than create crutches to make that approach work better, you should just choose a better-performing approach.)
In your first case, a process receiving structured data framed by a type and length identifier, wishing to skip unneeded frames, is better fixed by fixing the transfer protocol. For example, the receiving side could inform the sending side which frames it is interested in (i.e., basic filtering approach). If you are stuck with a stupid protocol that you cannot replace for external reasons, you're on your own. (The FLOSS developer community is not, and should not be responsible for maintaining stupid decisions just because someone wails about it. Anyone is free to do so, but they'd need to do it in a manner that does not require others to work extra too.)
In your second case, you already read your data. Don't do that. Instead, use an userspace buffer large enough to hold two full size frames. Whenever you need more data, but the start of the frame is already past the midway of the buffer, memmove() the frame to start at the beginning of the buffer first.
When you have a partially read frame, and you have N unread bytes from that left that you are not interested in, read them into the unused portion of the buffer. There is always enough room, because you can overwrite the portion already used by the current frame, and its beginning is always within the first half of the buffer.
If the frames are small, say 65536 bytes maximum, you should use a tunable for the maximum buffer size. On most desktop and server machines, with high-bandwidth stream sockets, something like 2 MiB (2097152 bytes or more) is much more reasonable. It's not too much memory wasted, but you rarely do any memory copies (and when you do, they tend to be short). (You can even optimize the memory moves so that only full cachelines are copied, aligned, since leaving almost one cacheline of garbage at the start of the buffer is insignificant.)
I do HPC with large datasets (including text-form molecular data, where records are separated by newlines, and custom parsers for converting decimal integers or floating-point values are used for better performance), and this approach does work well in practice. Simply put, skipping data already in your buffer is not something you need to optimize; it is insignificant overhead compared to simply avoiding doing the things you do not need.
There is also the question of what you wish to optimize by doing that: the CPU time/resources used, or the wall clock used in the overall task. They are completely different things.
For example, if you need to sort a large number of text lines from some file, you use the least CPU time if you simply read the entire dataset to memory, construct an array of pointers to each line, sort the pointers, and finally write each line (using either internal buffering and/or POSIX writev() so that you do not need to do a write() syscall for each separate line).
However, if you wish to minimize the wall clock time used, you can use a binary heap or a balanced binary tree instead of an array of pointers, and heapify or insert-in-order each line completely read, so that when the last line is finally read, you already have the lines in their correct order. This is because the storage I/O (for all but pathological input cases, something like single-character lines) takes longer than sorting them using any robust sorting algorithm! The sorting algorithms that work inline (as data comes in) are typically not as CPU-efficient as those that work offline (on complete datasets), so this ends up using somewhat more CPU time; but because the CPU work is done at a time that is otherwise wasted waiting for the entire dataset to load into memory, it is completed in less wall clock time!
If there is need and interest, I can provide a practical example to illustrate the techniques. However, there is absolutely no magic involved, and any C programmer should be able to implement these (both the buffering scheme, and the sort scheme) on their own. (I do consider using resources like Linux man pages online and Wikipedia articles and pseudocode on for example binary heaps doing it "on your own". As long as you do not just copy-paste existing code, I consider it doing it "on your own", even if somebody or some resource helps you find the good, robust ways to do it.)

Is write() safe to be called from multiple threads simultaneously?

Assuming I have opened dev/poll as mDevPoll, is it safe for me to call code like this
struct pollfd tmp_pfd;
tmp_pfd.fd = fd;
tmp_pfd.events = POLLIN;
// Write pollfd to /dev/poll
write(mDevPoll, &tmp_pfd, sizeof(struct pollfd));
...simultaneously from multiple threads, or do I need to add my own synchronisation primitive around mDevPoll?
Solaris 10 claims to be POSIX compliant. The write() function is not among the handful of system interfaces that POSIX permits to be non-thread-safe, so we can conclude that that on Solaris 10, it is safe in a general sense to call write() simultaneously from two or more threads.
POSIX also designates write() among those functions whose effects are atomic relative to each other when they operate on regular files or symbolic links. Specifically, it says that
If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them.
If your writes were directed to a regular file then that would be sufficient to conclude that your proposed multi-thread actions are safe, in the sense that they would not interfere with one another, and the data written in one call would not be commingled with that written by a different call in any thread. Unfortunately, /dev/poll is not a regular file, so that does not apply directly to you.
You should also be aware that write() is not in general required to transfer the full number of bytes specified in a single call. For general purposes, one must therefore be prepared to transfer the desired bytes over multiple calls, by using a loop. Solaris may provide applicable guarantees beyond those expressed by POSIX, perhaps specific to the destination device, but absent such guarantees it is conceivable that one of your threads performs a partial write, and the next write is performed by a different thread. That very likely would not produce the results you want or expect.
It's not safe in theory, even though write() is completely thread-safe (barring implementation bugs...). Per the POSIX write() standard (emphasis mine):
.
The write() function shall attempt to write nbyte bytes from the
buffer pointed to by buf to the file associated with the open file
descriptor, fildes.
...
RETURN VALUE
Upon successful completion, these functions shall return the number of bytes actually written ...
There is no guarantee that you won't get a partial write(), so even if each individual write() call is atomic, it's not necessarily complete, so you could still get interleaved data because it may take more than one call to write() to completely write all data.
In practice, if you're only doing relatively small write() calls, you will likely never see a partial write(), with "small" and "likely" being indeterminate values dependent on your implementation.
I've routinely delivered code that uses unlocked single write() calls on regular files opened with O_APPEND in order to improve the performance of logging - build a log entry then write() the entire entry with one call. I've never seen a partial or interleaved write() result over almost a couple of decades of doing that on Linux and Solaris systems, even when many processes write to the same log file. But then again, it's a text log file and if a partial or interleaved write() does happen there would be no real damage done or even data lost.
In this case, though, you're "writing" a handful of bytes to a kernel structure. You can dig through the Solaris /dev/poll kernel driver source code at Illumos.org and see how likely a partial write() is. I'd suspect it's practically impossible - because I just went back and looked at the multiplatform poll class that I wrote for my company's software library a decade ago. On Solaris it uses /dev/poll and unlocked write() calls from multiple threads. And it's been working fine for a decade...
Solaris /dev/pool Device Driver Source Code Analysis
The (Open)Solaris source code can be found here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/devpoll.c#628
The dpwrite() function is the code in the /dev/poll driver that actually performs the "write" operation. I use quotes because it's not really a write operation at all - data isn't transferred as much as the data in the kernel that represents the set of file descriptors being polled is updated.
Data is copied from user space into kernel space - to a memory buffer obtained with kmem_alloc(). I don't see any possible way that can be a partial copy. Either the allocation succeeds or it doesn't. The code can get interrupted before doing anything, as it wait for exclusive write() access to the kernel structures.
After that, the last return call is at the end - and if there's no error, the entire call is marked successful, or the entire call fails on any error:
995 if (error == 0) {
996 /*
997 * The state of uio_resid is updated only after the pollcache
998 * is successfully modified.
999 */
1000 uioskip(uiop, copysize);
1001 }
1002 return (error);
1003}
If you dig through Solaris kernel code, you'll see that uio_resid is what ends up being the value returned by write() after a successful call.
So the call certainly appears to be all-or-nothing. While there appear to be ways for the code to return an error on a file descriptor after successfully processing an earlier descriptor when multiple descriptors are passed in, the code doesn't appear to return any partial success indications.
If you're only processing one file descriptor at a time, I'd say the /dev/poll write() operation is completely thread-safe, and it's almost certainly thread-safe for "writing" updates to multiple file descriptors as there's no apparent way for the driver to return a partial write() result.

Atomicity of UNIX read()/write() when sending data to device

When writing directly to a device in /dev, I open a file descriptor and perform a UNIX write() followed by a read(). Can I have multiple threads doing this write()/read() sequence on the same file descriptor, and not get jumbled data if two threads enter the write() function at the same time?
References to std documentation would be immensely helpful. I've not been able to find anything though. Someone has mentioned that such operations are atomic in the kernel, but I am sceptical.
Also, to clarify this is a file in /dev, so any insight as to how far the "file pointer" concept applies here is helpful as well.
File pointers (FILE *fp, for example), are a layer in the user-side code sitting above the function calls (such as write()). Access to fp is controlled by locks in a threaded environment (you can't have to threads modifying the same structure at the same time).
Inside the kernel, I'd expect there to be a lock on the file descriptor (and/or 'open file description') to prevent it being used from two threads at once.
You can look up the POSIX specification for read() and
getchar_unlocked()
to find out more about locking etc — at least for a POSIX compliant implementation.
Note that POSIX still uses C99. Therefore, it is not cognizant of the C11 thread facilities. The C11 standard does not have read() et al (file I/O using file descriptors), so it says nothing about such system calls. Neither does it provide a getchar_unlocked() or any of its relatives.
Caveat: I have not been in kernels for awhile, but this is the way it used to work.
For disk files:
Can you open the file in append mode, writing block sizes <= BLKSIZE ?
Small enough block sizes guarantee, in POSIX environments, atomic writes in POSIX environments (actually, the limit may be greater than BLKSIZE... I'm too lazy to hunt around for the alternate symbol).
Append guarantees seeks to the end of the file... for devices supporting seeks. Combine with atomic writes you should be golden.
Each buffer must stand by itself under the assumption some "foreign" information may follow it.
For ttys:
Append mode makes no sense here. As before, but paying attention to line endings gets even more important. And this very much does not apply to reads. Codes ttys treat as control sequences can also trip you up if even the modes the sequences enable split across blocks.
For other devices:
Can get tricky here. Depends on the device.
I'm going to assume that you are referring to a generic character device (e.g. a tty) since you were not specific. As far as I know, each fd-type operation (e.g. read()/write()) maps directly into a call into the driver.
Therefore, the driver will receive each write()'s data chunk as a whole and not see the next one's data until it is done (e.g. data is queued to be transmitted).
However, if the driver does not consume the entire chunk of data at once (i.e. write() returns less than the specified number of bytes, then there is no guarantee, that the thread will be able to write again with the remainder before another thread does a different write().
Also, as Johnathan Leffler noted, if you use standard I/O with process-level buffering, all bets are off.
Bottom line, if you are using direct fd writes, each write will map directly to one driver function call. From there, it's up to the driver as to if write is atomic.
Edit: wlformyd brings up the question of locking between multiple threads on multiple processors. To my knowledge, there is no locking on a FD and, in fact, that would be ineffective as multiple FDs could be used to access the same device.
I believe it is up to the driver itself to do locking to prevent contention to internal queues and/or hardware. in that sense, on a multi-processor system, the kernel doesn't prevent multiple simultaneous access to a driver's write routine. However, a properly written driver should do the locking to prevent mixing of output between two write calls.

under what conditions are pipe reads atomic?

man pipe -s7 documents writing to the pipe very well.
The part important to me being that the write will only ever be partially completed if O_NONBLOCK is set, and write length is greater than PIPE_BUF.
However, nothing is said about the read end.
I am sending structures representing events through my pipe in blocking mode at the write end.
At the read end, i am processing those events (and other things) in an update loop in non-blocking mode.
Since my struct is smaller than PIPE_BUF, will read ALWAYS read a whole number of structs? Or do i need to handle the possibility of only part my struct being read ?
Common sense tells me that read behavior will mirror the documented write behavior, but i would be happier if this was specified.
I'm working on Linux ( kernel 3.8, x86_64 ). But it is important that my code is portable across different UNIX flavors, and CPU architectures.
Thanks.
Chris.
The comments are right: read is not atomic. The whole point of atomicity of write is to allow multiple writers without corruption from interleaving data. Multiple readers are much less useful, but even if they were useful, supporting atomic reads would require maintaining packet boundaries in pipes, which do not exist.
Reads from a pipe are not atomic.
From the standard page for read()
The standard developers considered adding atomicity requirements to a pipe or FIFO, but recognized that due to the nature of pipes and FIFOs there could be no guarantee of atomicity of reads of {PIPE_BUF} or any other size that would be an aid to applications portability.

What amount of data does select (2) guarantee to be able to be written to a file without blocking

select (2) (amongst other things) tells me whether I can write to a fd of a file without blocking. However, does it guarentee me that I can write a full 4096 bytes without blocking?
Note I am interested in normal files on disk. Not sockets or the like.
In other words: does select signal when we can just write one single byte to a file fd without blocking, or does it signal when we can write n (4096, ... ?) bytes to a file fd without blocking.
Whenever select() indicates that your file is ready, you can try writing N bytes, for any N>0. write() will return the number of bytes actually written. If it equals N, you can write again. If it's less than N, then the next write will block.
Note Normal files on disk don't block. Sockets, pipes and terminals do.
You tagged this "Linux", so what does the kernel source code tell you? It should be pretty easy to read the syscall implementation to find when select decides to treat a file descriptor as ready for writing.
If you're worried about blocking, though, you're doing it wrong. If you don't want to block, use O_NONBLOCK or equivalents. Even if select did guarantee a certain number of bytes could be written without blocking, that would only be true at the time select returns; it might not necessarily be true by the time you actually perform the write.
Note I am interested in normal files on disk. Not sockets or the like.
select does not "work" with normal files, only sockets/pipes/ttys and possibly others, but not regular files. For regular files select will always signal the file descriptor as readable/writable - thus it is a rather useless exercise to use select with files.
note that that applies to other io multiplexing facilities as well, such as poll/epoll. AIO will do asynchonous io to regular files, but operating system support might vary, and it is a rather complex api to use
As to how much data you can write, there is no promise. 4096 is no magical number that select assumes you can write without blocking, when applied to filedescriptors where using select does make sense (sockets/pipes/etc.) . Because you can't know how much data you can write without blocking, you should always set the file descriptor to non-blocking, record how much was actually written as indicated by the return value of write/send and start writing from that point the next time select indicates you can write data again.
select() only promises that the applicable call can be made without blocking, it does not guarantee an I/O amount (4096) in your case. Since select() can be used with different types of descriptors (file, sockets, serial connections, etc.) you may notice that for disk operations the observed behavior is that a full buffer can always be written, but again this is specific to the particular underlying operation and not a promise of select().

Resources