using splice with socket may cause starvation - c

I'm writing a TCP proxy, using edge-triggered epoll to monitor the file descriptors and splice(2) to move the data. Here is the problem:
How do I know the socket receive buffer is empty?
For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor.
But I found that even splice(sock, 0, pfd[1], 0, 65536, SPLICE_F_NONBLOCK) < 65536 may sometimes lead to starvation.
O_NONBLOCK enabled, n > PIPE_BUF
If the pipe is full, then write(2) fails, with errno set to EAGAIN. Otherwise, from 1 to n bytes may be written (i.e., a "partial write" may occur; the caller should check the return value from write(2) to see how many bytes were actually written), and these bytes may be interleaved with writes by other processes.
So I should repeat calling splice till EAGAIN? But how can I know whether the socket receive buffer is empty or the pipe buffer is full?

Maybe you can use the getsockopt syscall with SO_ERROR; then you will know which socket really returned EAGAIN, and then use epoll to watch the read/write events on that socket.
I also ran into this problem when adding a reverse HTTP proxy to my web server. I believe this approach should work, though I'm not sure it is the best solution.
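For reference, here is a minimal sketch of one common way to sidestep the ambiguity: flush the pipe to the destination after every successful splice, so that an EAGAIN from the socket side can only mean the receive buffer is empty. The function name, the 64 KiB chunk size and the return-code convention are assumptions made for the example, not anything from the original question.

```c
/* Sketch only: drain one direction of the proxy until the source socket
 * reports EAGAIN. Assumes in_fd, out_fd and both pipe ends are non-blocking;
 * error handling is abbreviated. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

static int proxy_drain(int in_fd, int out_fd, int pipefd[2])
{
    for (;;) {
        ssize_t n = splice(in_fd, NULL, pipefd[1], NULL, 65536,
                           SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        if (n == 0)                     /* peer closed the connection */
            return 0;
        if (n < 0) {
            if (errno == EAGAIN)        /* pipe is empty here, so the socket
                                           receive buffer must be empty */
                return 1;               /* caller waits for EPOLLIN on in_fd */
            return -1;
        }
        /* Push what we just buffered out of the pipe before reading more,
         * so a full pipe cannot be mistaken for an empty socket. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(pipefd[0], NULL, out_fd, NULL, left,
                               SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
            if (m < 0) {
                if (errno == EAGAIN)    /* out_fd's send buffer is full */
                    return 2;           /* caller waits for EPOLLOUT on out_fd */
                return -1;
            }
            left -= m;
        }
    }
}
```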

Related

c sockets sendmsg MSG_DONTWAIT - buffer reuse

I'm using C Sockets to send ICMP packets with the MSG_DONTWAIT flag set.
My program is single-threaded, but it is expected to send messages at high frequency, so I'm making the sends non-blocking.
Is it safe to share/modify/reuse the message buffer after each call ? (Unless EAGAIN or EWOULDBLOCK is returned).
msg_control (the ancillary data) is reused, and the ipi_ifindex field (the outbound interface index) of the struct in_pktinfo it carries is modified between calls.
The iov.iov_base buffer content (not pointer!) and iov.iov_len can also change between calls.
(Less likely but still possible).
Is it OK to change the ifindex and the iov_base contents between calls at high frequency in non-blocking mode? (Unless I get back EAGAIN or EWOULDBLOCK.)
Thanks !
Yes, it's safe. On Linux, all the data you specify gets immediately copied into a buffer in the kernel, before send returns. If the kernel's buffer is full, it returns EAGAIN or EWOULDBLOCK (which are the same thing in Linux, apparently) and nothing happens. You don't have to worry that the kernel will go and send the packet later after you've changed the data in the buffer.
On Windows, non-blocking "overlapped" operations do remember your buffer and use it later - so watch out for that if you ever do non-blocking I/O on Windows. (You'll know if you do, because it's totally different from blocking I/O)
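A minimal sketch of the reuse pattern being discussed, assuming a datagram or raw socket fd that is already set up; the helper name send_one and the fixed buffer sizes are invented for illustration. The point is that both the payload and the ancillary data may be rewritten freely before each call, because the kernel copies them before sendmsg() returns.

```c
/* One msghdr, one iovec and one control buffer, set up once and patched
 * between non-blocking sendmsg() calls. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

static char payload[64];
static char ctrl[CMSG_SPACE(sizeof(struct in_pktinfo))];
static struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
static struct msghdr msg = {
    .msg_iov = &iov, .msg_iovlen = 1,
    .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
};

static ssize_t send_one(int fd, int ifindex, size_t len)
{
    /* Safe to rewrite both the payload and the ancillary data here: once
     * sendmsg() returns >= 0 the kernel has already copied them. */
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = IPPROTO_IP;
    cm->cmsg_type  = IP_PKTINFO;
    cm->cmsg_len   = CMSG_LEN(sizeof(struct in_pktinfo));
    struct in_pktinfo *pi = (struct in_pktinfo *)CMSG_DATA(cm);
    memset(pi, 0, sizeof(*pi));
    pi->ipi_ifindex = ifindex;          /* outbound interface for this packet */

    iov.iov_len = len;                  /* payload is filled in by the caller */
    return sendmsg(fd, &msg, MSG_DONTWAIT);
}
```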

Is it necessary to use non-blocking file descriptors with IO multiplexing?

POSIX supports blocking and non-blocking file descriptors; the latter can be opened with the O_NONBLOCK flag. I have a main loop in my app which polls a set of file descriptors (the poll syscall) for POLLIN and POLLOUT events. May I still use blocking file descriptors, since I write only when POLLOUT is set and read only when POLLIN is set?
According to the poll(2) man page:
POLLOUT Writing is now possible, though a write larger than the available space in a socket or pipe will still block (unless O_NONBLOCK is set).
In other words: if there is not enough space in the kernel buffer associated with this fd, writing a chunk of data larger than the available space will block. If there is enough space, blocking and non-blocking descriptors behave identically.
So you should set all your file descriptors to non-blocking, especially TCP sockets: if the process on the other side has a slow connection, a write call may block until the client has acknowledged every IP packet.
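A short sketch of that advice, assuming the fd is a connected socket; the helper names make_nonblocking and write_when_ready are made up for the example. The descriptor is switched to non-blocking mode once, and writes are only attempted after poll() reports POLLOUT, with EAGAIN treated as "try again later".

```c
#include <fcntl.h>
#include <poll.h>
#include <errno.h>
#include <unistd.h>

static int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

static ssize_t write_when_ready(int fd, const char *buf, size_t len)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    if (poll(&pfd, 1, -1) < 0)          /* wait until the fd is writable */
        return -1;
    ssize_t n = write(fd, buf, len);    /* may still be a partial write */
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;                       /* buffer filled up in the meantime */
    return n;
}
```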

When a non-blocking send() only transfers partial data, can we assume it would return EWOULDBLOCK the next call?

Two cases are well-documented in the man pages for non-blocking sockets:
If send() returns the same length as the transfer buffer, the entire transfer finished successfully, and the socket may or may not return EAGAIN/EWOULDBLOCK on the next call with >0 bytes to transfer.
If send() returns -1 and errno is EAGAIN/EWOULDBLOCK, none of the transfer finished, and the program needs to wait until the socket is ready for more data (EPOLLOUT in the epoll case).
What's not documented for nonblocking sockets is:
If send() returns a positive value smaller than the buffer size.
Is it safe to assume that send() would return EAGAIN/EWOULDBLOCK for even one more byte of data? Or should a non-blocking program try send() one more time to get a conclusive EAGAIN/EWOULDBLOCK? I'm worried about putting an EPOLLOUT watcher on the socket if it's not actually in a "would block" state that it can later come out of.
Obviously, the latter strategy (trying again to get something conclusive) has well-defined behavior, but it's more verbose and puts a hit on performance.
A call to send has three possible outcomes:
There is at least one byte available in the send buffer → send succeeds and returns the number of bytes accepted (possibly fewer than you asked for).
The send buffer is completely full at the time you call send:
→ if the socket is blocking, send blocks;
→ if the socket is non-blocking, send fails with EWOULDBLOCK/EAGAIN.
An error occurred (e.g. user pulled the network cable, connection reset by peer) → send fails with another error.
If the number of bytes accepted by send is smaller than the amount you asked for, then this means that the send buffer is now completely full. However, this is purely circumstantial and non-authoritative with respect to any future calls to send.
The information returned by send is merely a "snapshot" of the current state at the time you called send. By the time send has returned or by the time you call send again, this information may already be outdated. The network card might put a datagram on the wire while your program is inside send, or a nanosecond later, or at any other time -- there is no way of knowing. You'll know when the next call succeeds (or when it doesn't).
In other words, this does not imply that the next call to send will return EWOULDBLOCK/EAGAIN (or would block if the socket wasn't non-blocking). Trying until what you called "getting a conclusive EWOULDBLOCK" is the correct thing to do.
If send() returns the same length as the transfer buffer, the entire transfer finished successfully, and the socket may or may not be in a blocking state.
No. The socket remains in the mode it was in: in this case, non-blocking mode, assumed below throughout.
If send() returns -1 and errno is EAGAIN/EWOULDBLOCK, none of the transfer finished, and the program needs to wait until the socket isn't blocking anymore.
Until the send buffer isn't full any more. The socket remains in non-blocking mode.
If send() returns a positive value smaller than the buffer size.
There was only that much room in the socket send buffer.
Is it safe to assume that the send() would block on even one more byte of data?
It isn't 'safe' to 'assume [it] would block' at all. It won't. It's in non-blocking mode. EWOULDBLOCK means it would have blocked in blocking mode.
Or should a non-blocking program try to send() one more time to get a conclusive EAGAIN/EWOULDBLOCK?
That's up to you. The API works whichever you decide.
I'm worried about putting an EPOLLOUT watcher on the socket if it's not actually blocking on that.
It isn't 'blocking on that'. It isn't blocking on anything. It's in non-blocking mode. The send buffer got filled at that instant. It might be completely empty a moment later.
I don't see what you're worried about. If you have pending data and the last write didn't send it all, select for writability, and write when you get it. If such a write sends everything, don't select for writability next time.
Sockets are usually writable, unless their send buffer is full, so don't select for writability all the time, as you just get a spin loop.
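To make that strategy concrete, here is a hedged sketch: try the send, and register interest in EPOLLOUT only when the kernel did not accept all of the data. The epfd argument, the helper name send_or_arm, and the assumption that the caller keeps the unsent tail around are all inventions for the example.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <errno.h>

/* fd is assumed to already be registered with epfd (EPOLL_CTL_ADD elsewhere). */
static int send_or_arm(int epfd, int fd, const char *buf, size_t len)
{
    ssize_t n = send(fd, buf, len, MSG_DONTWAIT);
    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
        return -1;                              /* real error */

    size_t sent = n > 0 ? (size_t)n : 0;
    struct epoll_event ev = { .data.fd = fd };
    if (sent < len) {
        /* Partial send or EAGAIN: remember buf+sent..len and wait for
         * writability before trying the remainder again. */
        ev.events = EPOLLIN | EPOLLOUT;
    } else {
        /* Everything was accepted: stop watching EPOLLOUT, or epoll_wait()
         * will wake up constantly on an always-writable socket. */
        ev.events = EPOLLIN;
    }
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```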

Can send() on a TCP socket return >=0 and <length?

I've seen a number of questions regarding send() that discuss the underlying protocol. I'm fully aware that for TCP any message may be broken up into parts as it's sent and there's no guarantee that the receiver will get the message in one atomic operation. In this question I'm talking solely about the behavior of the send() system call as it interacts with the networking layer of the local system.
According to the POSIX standard, and the send() documentation I've read, the length of the message to be sent is specified by the length argument. Note that: send() sends one message, of length length. Further:
If space is not available at the sending socket to hold the message to be transmitted, and the socket file descriptor does not have O_NONBLOCK set, send() shall block until space is available. If space is not available at the sending socket to hold the message to be transmitted, and the socket file descriptor does have O_NONBLOCK set, send() shall fail.
I don't see any possibility in this definition for send() to ever return any value other than -1 (which means no data is queued in the kernel to be transmitted) or length, which means the entire message is queued in the kernel to be transmitted. I.e., it seems to me that send() must be atomic with respect to locally queuing the message for delivery in the kernel.
If there is enough room in the socket queue in the kernel for the entire message and no signal occurs (normal case), it's copied and returns length.
If a signal occurs during send(), then it must return -1. Obviously we cannot have queued part of the message in this case, since we don't know how much was sent. So nothing can be sent in this situation.
If there is not enough room in the socket queue in the kernel for the entire message and the socket is blocking, then according to the above statement send() must block until space becomes available. Then the message will be queued and send() returns length.
If there is not enough room in the socket queue in the kernel for the entire message and the socket is non-blocking, then send() must fail (return -1) and errno will be set to EAGAIN or EWOULDBLOCK. Again, since we return -1 it's clear that in this situation no part of the message can be queued.
Am I missing something? Is it possible for send() to return a value which is >=0 && <length? In what situation? What about non-POSIX/UNIX systems? Is the Windows send() implementation conforming with this?
Your point 2 is over-simplified. The normal condition under which send returns a value greater than zero but less than length (note that, as others have said, it can never return zero except possibly when the length argument is zero) is when the message is sufficiently long to cause blocking, and an interrupting signal arrives after some content has already been sent. In this case, send cannot fail with EINTR (because this would prevent the application from knowing it had already successfully sent some data) and it cannot re-block (since the signal is interrupting, and the whole point of that is to get out of blocking), so it has to return the number of bytes already sent, which is less than the total length requested.
According to the Posix specification and all the man 2 send pages I have ever seen in 30 years, yes, send() can return any value > 0 and <= length. Note that it cannot return zero.
According to a discussion a few years ago on news:comp.protocols.tcp-ip, where all the TCP implementors are, a blocking send() won't actually return until it has transferred all the data to the socket send buffer: in other words, the return value is either -1 or length. It was agreed that this was true of all known implementations, and also true of write(), writev(), and sendmsg().
I know how the thing works on Linux, with the GNU C Library. Point 4 of your question reads differently in this case. If you set the flag O_NONBLOCK for the file descriptor, and if it is not possible to queue the entire message in the kernel atomically, send() returns the number of bytes actually sent (it can be between 1 and length), and errno is set to EWOULDBLOCK.
(With a file descriptor working in the blocking mode, send() would block.)
It is possible for send() to return a value >= 0 && < length. This could happen if the send buffer has less room than the length of the message upon a call to send(). Similarly, if the current receiver window size known to the sender is smaller than the length of the message, only part of the message may be sent. Anecdotally, I've seen this happen on Linux over a localhost connection when the receiving process was slow to drain the data from its receive buffer.
My sense is that one's actual experience will vary a good bit by implementation. From this Microsoft link, it's clear that a non-error return value less than the length can occur.
It is also possible to get a return value of zero (again, at least with some implementations) if a zero-length message is sent.
This answer is based on my experience, as well as drawing upon this SO answer particularly.
Edit: From this answer and its comments, evidently an EINTR failure may only result if the interruption comes before any data is sent, which would be another possible way to get such a return value.
On a 64-bit Linux system:
sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4294967296, 0, NULL, 0) = 2147479552
So, even when asked to send a lowly 4 GB, Linux chickens out and sends less than 2 GB. So if you think you'll ask it to send 1 TB and it will patiently sit there, keep wishing.
Similarly, on an embedded system with just a few KBs free, don't expect it to fail or to wait for something: it'll send as much as it can and tell you how much that was, letting you retry with the rest (or do something else in the meantime).
Everyone agrees that in case of EINTR, there can be a short send. But EINTR can happen at any time, so there can always be a short send.
And finally, POSIX says that the number of bytes sent is returned, period. The whole of Unix, and the POSIX standard that formalizes it, is built on the concept of short reads/writes, which lets implementations of POSIX systems scale from the tiniest embedded devices to supercomputers with proverbial "big data". So there is no need to read between the lines and look for indulgences for the particular ad-hoc implementation you have at hand. There are many more implementations out there, and as long as you follow the letter of the standard, your app will be portable among them.
To clarify a little, where it says:
shall block until space is available.
there are several ways to wake up from that block/sleep:
Enough space becomes available.
A signal interrupts the current blocking operation.
SO_SNDTIMEO is set for the socket and the timeout expires.
Other, e.g. the socket is closed in another thread.
So things end up thus:
If there is enough room in the socket queue in the kernel for the entire message and no signal occurs (normal case), it's copied and returns length.
If a signal occurs during send(), then it must return -1. Obviously we cannot have queued part of the message in this case, since we don't know how much was sent. So nothing can be sent in this situation.
If there is not enough room in the socket queue in the kernel for the entire message and the socket is blocking, then according to the above statement send() must block until space becomes available, after which the message will be queued and send() returns length. But send() can also be interrupted by a signal, or the send timeout can elapse, causing a short send / partial write. Reasonable implementations will return -1 and set errno to an appropriate value if nothing was copied to the send buffer.
If there is not enough room in the socket queue in the kernel for the entire message and the socket is non-blocking, then send() must fail (return -1) and errno will be set to EAGAIN or EWOULDBLOCK. Again, since we return -1 it's clear that in this situation no part of the message can be queued.
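Drawing these answers together, here is a minimal "send it all" wrapper for a blocking socket; the name send_all is made up for the example. Partial returns and EINTR simply mean "continue with the rest".

```c
#include <sys/socket.h>
#include <errno.h>

static ssize_t send_all(int fd, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted before any bytes were
                                           accepted in this call: just retry */
            return -1;                  /* genuine error */
        }
        off += (size_t)n;               /* short send: advance and try again */
    }
    return (ssize_t)off;
}
```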

C select() writefds

I am having trouble understanding what it means to add a descriptor to the writefds set for select() in Linux. I wrote some simple code that adds the descriptor for stdout to the writefds set and uses a timeout of NULL. Now my code just loops forever checking whether this descriptor is set, and if it is, it prints "WRITING". When I run my code it just keeps printing "WRITING" forever. The same thing happens when I do this for stdin. Again, there is no other code in the loop. Are stdin/stdout always just ready for writing?
It means you can call write on that fd and the kernel promises not to block and to consume at least 1 byte.
In more detail: if your socket is not in non-blocking mode and the kernel buffers associated with the socket are full, the kernel will put your thread to sleep until it can empty some of the buffer and accept part of your write.
If your socket is in non-blocking mode and the kernel buffers are full, the write will return immediately without consuming any bytes.
The answer to the question "Is stdout always ready for writing" is "It depends."
stdout can be connected to anything that can be opened as a file descriptor - like a disk file, a network socket, or a pipe. The usual case is that it's connected to a terminal device.
Most of these types of file descriptors can block on writing (which means they might not be marked writeable after select() returns), but usually only if you've just written a very large amount of data to them (and so filled some kind of buffer). "Large amount" varies between the device types: if your stdout is a 9600-baud serial device, you could fill the write buffer pretty easily; an xterm, not so much.
Some devices will never block, like disk files or /dev/null, for example. (A write() to a disk file might not complete immediately, but this isn't considered "blocking"; it's a "disk wait".)
Yes, a truthy return from FD_ISSET(fd, &writefds) means fd is writeable. If you call select() with that FD set in the writefds after you get EWOULDBLOCK or EAGAIN (equivalent on Linux, at least) it blocks until the FD is again writeable.
There's more to it than that. For instance, an FD is also considered writeable if you've done a non-blocking connect() on it, you got EAGAIN, and call select() to wait for the connection to be established. That establishment is signalled in the writefds.
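For completeness, here is a tiny sketch of the experiment the question describes; stdout is almost always reported writable, so the loop prints on every pass.

```c
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        fd_set writefds;
        FD_ZERO(&writefds);
        FD_SET(STDOUT_FILENO, &writefds);
        /* NULL timeout: block until at least one fd is ready. */
        if (select(STDOUT_FILENO + 1, NULL, &writefds, NULL, NULL) < 0)
            return 1;
        if (FD_ISSET(STDOUT_FILENO, &writefds))
            printf("WRITING\n");        /* fires on every iteration */
    }
}
```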