I'm writing a TCP proxy, using edge-triggered epoll to monitor the file descriptors and splice(2) to transfer the data. Here is the problem:
How do I know the socket receive buffer is empty?
For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor.
But I found that even a splice(sock, 0, pfd[1], 0, 65536, SPLICE_F_NONBLOCK) that returns less than 65536 may sometimes lead to starvation if I treat it as having drained the socket.
O_NONBLOCK enabled, n > PIPE_BUF
If the pipe is full, then write(2) fails, with errno set to EAGAIN. Otherwise, from 1 to n bytes may be written (i.e., a "partial write" may occur; the caller should check the return value from write(2) to see how many bytes were actually written), and these bytes may be interleaved with writes by other processes.
So I should repeat calling splice till EAGAIN? But how can I know whether the socket receive buffer is empty or the pipe buffer is full?
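One way to make that decidable, as a minimal sketch under stated assumptions (both sockets have O_NONBLOCK set, the pipe was created with pipe2(pfd, O_NONBLOCK), and forward() is a hypothetical helper, not a definitive implementation): always empty the pipe completely before the next socket-to-pipe splice. Then the pipe cannot be full at that point, so an EAGAIN from the inbound splice can only mean the socket receive buffer is empty.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <errno.h>
    #include <unistd.h>

    /* Move data in_fd -> pipe -> out_fd until the source is drained.
     * Returns 0 on EOF, 1 when the receive buffer is empty, 2 when
     * out_fd is not writable (pipe still holds data), -1 on error. */
    static int forward(int in_fd, int out_fd, int pfd[2])
    {
        for (;;) {
            ssize_t n = splice(in_fd, NULL, pfd[1], NULL, 65536,
                               SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
            if (n == 0)
                return 0;                /* peer closed the connection */
            if (n < 0)                   /* pipe is empty at this point, */
                return (errno == EAGAIN) ? 1 : -1;  /* so EAGAIN = drained */
            /* Empty the pipe into the destination before reading more. */
            while (n > 0) {
                ssize_t m = splice(pfd[0], NULL, out_fd, NULL, (size_t)n,
                                   SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
                if (m < 0)
                    return (errno == EAGAIN) ? 2 : -1;
                n -= m;
            }
        }
    }

When forward() returns 2, the leftover pipe contents must be flushed after EPOLLOUT arrives on out_fd before this loop runs for that direction again.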
Maybe you can use the getsockopt syscall with SO_ERROR; then you will know which socket really hit EAGAIN, and then use epoll to watch the read/write events on that socket.
I also had this problem when adding a reverse HTTP proxy to my web server. I believe this approach should work, though I'm not sure it is the best solution.
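A minimal sketch of that suggestion, hedged: SO_ERROR reports (and clears) a pending error on the socket, but note that it does not by itself distinguish "receive buffer empty" from "pipe full"; sock_pending_error is a hypothetical helper.

    #include <sys/socket.h>

    /* Returns the pending socket error (0 if none), or -1 if the
     * getsockopt call itself failed. */
    int sock_pending_error(int fd)
    {
        int err = 0;
        socklen_t len = sizeof(err);
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
            return -1;
        return err;
    }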
Two cases are well-documented in the man pages for non-blocking sockets:
If send() returns the same length as the transfer buffer, the entire transfer finished successfully, and the socket may or may not return EAGAIN/EWOULDBLOCK on the next call with >0 bytes to transfer.
If send() returns -1 and errno is EAGAIN/EWOULDBLOCK, none of the transfer finished, and the program needs to wait until the socket is ready for more data (EPOLLOUT in the epoll case).
What's not documented for nonblocking sockets is:
If send() returns a positive value smaller than the buffer size.
Is it safe to assume that send() would return EAGAIN/EWOULDBLOCK for even one more byte of data? Or should a non-blocking program call send() one more time to get a conclusive EAGAIN/EWOULDBLOCK? I'm worried about registering an EPOLLOUT watcher on a socket that isn't actually in a "would block" state, since then there is no transition out of that state to wake it up.
Obviously, the latter strategy (trying again to get something conclusive) has well-defined behavior, but it's more verbose and hurts performance.
A call to send has three possible outcomes:
There is at least one byte available in the send buffer → send succeeds and returns the number of bytes accepted (possibly fewer than you asked for).
The send buffer is completely full at the time you call send.
→ if the socket is blocking, send blocks
→ if the socket is non-blocking, send fails with EWOULDBLOCK/EAGAIN
An error occurred (e.g. user pulled network cable, connection reset by peer) → send fails with another error
If the number of bytes accepted by send is smaller than the amount you asked for, then this consequently means that the send buffer is now completely full. However, this is purely circumstantial and non-authoritative with respect to any future calls to send.
The information returned by send is merely a "snapshot" of the current state at the time you called send. By the time send has returned or by the time you call send again, this information may already be outdated. The network card might put a datagram on the wire while your program is inside send, or a nanosecond later, or at any other time -- there is no way of knowing. You'll know when the next call succeeds (or when it doesn't).
In other words, this does not imply that the next call to send will return EWOULDBLOCK/EAGAIN (or would block if the socket wasn't non-blocking). Trying until what you called "getting a conclusive EWOULDBLOCK" is the correct thing to do.
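A minimal sketch of "trying until a conclusive EWOULDBLOCK" on a non-blocking socket (send_some and its contract are made up for this illustration):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <errno.h>

    /* Returns the number of bytes accepted, or -1 on a real error.
     * *would_block is set once a conclusive EAGAIN/EWOULDBLOCK is seen. */
    ssize_t send_some(int fd, const char *buf, size_t len, int *would_block)
    {
        size_t off = 0;
        *would_block = 0;
        while (off < len) {
            ssize_t n = send(fd, buf + off, len - off, 0);
            if (n >= 0) {
                off += (size_t)n;     /* partial send: just keep going */
                continue;
            }
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                *would_block = 1;     /* now it is conclusive: wait for EPOLLOUT */
                break;
            }
            if (errno == EINTR)
                continue;             /* interrupted before anything was queued */
            return -1;
        }
        return (ssize_t)off;
    }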
If send() returns the same length as the transfer buffer, the entire transfer finished successfully, and the socket may or may not be in a blocking state.
No. The socket remains in the mode it was in: in this case, non-blocking mode, assumed below throughout.
If send() returns -1 and errno is EAGAIN/EWOULDBLOCK, none of the transfer finished, and the program needs to wait until the socket isn't blocking anymore.
Until the send buffer isn't full any more. The socket remains in non-blocking mode.
If send() returns a positive value smaller than the buffer size.
There was only that much room in the socket send buffer.
Is it safe to assume that the send() would block on even one more byte of data?
It isn't 'safe' to 'assume [it] would block' at all. It won't. It's in non-blocking mode. EWOULDBLOCK means it would have blocked in blocking mode.
Or should a non-blocking program try to send() one more time to get a conclusive EAGAIN/EWOULDBLOCK?
That's up to you. The API works whichever you decide.
I'm worried about putting an EPOLLOUT watcher on the socket if it's not actually blocking on that.
It isn't 'blocking on that'. It isn't blocking on anything. It's in non-blocking mode. The send buffer got filled at that instant. It might be completely empty a moment later.
I don't see what you're worried about. If you have pending data and the last write didn't send it all, select for writability, and write when you get it. If such a write sends everything, don't select for writability next time.
Sockets are usually writable, unless their send buffer is full, so don't select for writability all the time, as you just get a spin loop.
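As an illustration of that rule with epoll (a sketch assuming a level-triggered registration that always keeps EPOLLIN; watch_writable is a hypothetical helper):

    #include <sys/epoll.h>

    /* Subscribe to EPOLLOUT only while there is unsent data pending. */
    static int watch_writable(int epfd, int fd, int want_write)
    {
        struct epoll_event ev = {0};
        ev.data.fd = fd;
        ev.events = EPOLLIN | (want_write ? EPOLLOUT : 0);
        return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
    }

Call it with want_write = 1 after a short or failed write, and with want_write = 0 once the pending data has been flushed, to avoid the spin loop described above.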
I've seen a number of questions regarding send() that discuss the underlying protocol. I'm fully aware that for TCP any message may be broken up into parts as it's sent and there's no guarantee that the receiver will get the message in one atomic operation. In this question I'm talking solely about the behavior of the send() system call as it interacts with the networking layer of the local system.
According to the POSIX standard, and the send() documentation I've read, the length of the message to be sent is specified by the length argument. Note that: send() sends one message, of length length. Further:
If space is not available at the sending socket to hold the message to
be transmitted, and the socket file descriptor does not have
O_NONBLOCK set, send() shall block until space is available. If space
is not available at the sending socket to hold the message to be
transmitted, and the socket file descriptor does have O_NONBLOCK set,
send() shall fail.
I don't see any possibility in this definition for send() to ever return any value other than -1 (which means no data is queued in the kernel to be transmitted) or length, which means the entire message is queued in the kernel to be transmitted. I.e., it seems to me that send() must be atomic with respect to locally queuing the message for delivery in the kernel.
If there is enough room in the socket queue in the kernel for the entire message and no signal occurs (normal case), it's copied and returns length.
If a signal occurs during send(), then it must return -1. Obviously we cannot have queued part of the message in this case, since we don't know how much was sent. So nothing can be sent in this situation.
If there is not enough room in the socket queue in the kernel for the entire message and the socket is blocking, then according to the above statement send() must block until space becomes available. Then the message will be queued and send() returns length.
If there is not enough room in the socket queue in the kernel for the entire message and the socket is non-blocking, then send() must fail (return -1) and errno will be set to EAGAIN or EWOULDBLOCK. Again, since we return -1 it's clear that in this situation no part of the message can be queued.
Am I missing something? Is it possible for send() to return a value which is >=0 && <length? In what situation? What about non-POSIX/UNIX systems? Does the Windows send() implementation conform to this?
Your point 2 is over-simplified. The normal condition under which send returns a value greater than zero but less than length (note that, as others have said, it can never return zero except possibly when the length argument is zero) is when the message is sufficiently long to cause blocking, and an interrupting signal arrives after some content has already been sent. In this case, send cannot fail with EINTR (because this would prevent the application from knowing it had already successfully sent some data) and it cannot re-block (since the signal is interrupting, and the whole point of that is to get out of blocking), so it has to return the number of bytes already sent, which is less than the total length requested.
According to the POSIX specification and all the man 2 send pages I have ever seen in 30 years, yes, send() can return any value > 0 and <= length. Note that it cannot return zero.
According to a discussion a few years ago on news:comp.protocols.tcp-ip, where all the TCP implementors are, a blocking send() won't actually return until it has transferred all the data to the socket send buffer: in other words, the return value is either -1 or length. It was agreed that this was true of all known implementations, and also true of write(), writev(), and sendmsg().
I know how this works on Linux, with the GNU C Library. Point 4 of your question reads differently in this case. If you set the O_NONBLOCK flag for the file descriptor, and it is not possible to queue the entire message in the kernel atomically, send() returns the number of bytes actually sent (it can be between 1 and length); it fails with errno set to EWOULDBLOCK only when nothing could be queued at all.
(With a file descriptor working in the blocking mode, send() would block.)
It is possible for send() to return a value >= 0 && < length. This could happen if the send buffer has less room than the length of the message upon a call to send(). Similarly, if the current receiver window size known to the sender is smaller than the length of the message, only part of the message may be sent. Anecdotally, I've seen this happen on Linux over a localhost connection when the receiving process was slow to unload the data it was receiving from its receive buffer.
My sense is that one's actual experience will vary a good bit by implementation. From this Microsoft link, it's clear that a non-error return value less than the length can occur.
It is also possible to get a return value of zero (again, at least with some implementations) if a zero-length message is sent.
This answer is based on my experience, as well as drawing upon this SO answer particularly.
Edit: From this answer and its comments, evidently an EINTR failure may only result if the interruption comes before any data is sent, which would be another possible way to get such a return value.
On a 64-bit Linux system:
sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4294967296, 0, NULL, 0) = 2147479552
So, even when asked to send a lowly 4GB, Linux chickens out and sends less than 2GB. So, if you think that you'll ask it to send 1TB and it will patiently sit there, keep wishing.
Similarly, on an embedded system with just a few KBs free, don't think that it'll fail or wait for something - it'll send as much as it can and tell you how much that was, letting you retry with the rest (or do something else in the meantime).
Everyone agrees that in case of EINTR, there can be a short send. But EINTR can happen at any time, so there can always be a short send.
And finally, POSIX says that the number of bytes sent is returned, period. The whole of Unix, and the POSIX standard that formalizes it, is built on the concept of short reads/writes, which allows POSIX implementations to scale from the tiniest embedded systems to supercomputers with the proverbial "big data". So there is no need to read between the lines looking for indulgences for the particular ad-hoc implementation you have on hand. There are many more implementations out there, and as long as you follow the word of the standard, your app will be portable among them.
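The classic idiom that follows from this, sketched for a blocking socket (send_all is a hypothetical helper; a non-blocking socket additionally needs the EAGAIN handling discussed earlier):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <errno.h>

    /* Treat every return value as a possible short write and loop
     * over the remainder. Returns 0 on success, -1 on error. */
    int send_all(int fd, const void *buf, size_t len)
    {
        const char *p = buf;
        while (len > 0) {
            ssize_t n = send(fd, p, len, 0);
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted before any byte was queued */
                return -1;
            }
            p += n;             /* short write: advance and retry */
            len -= (size_t)n;
        }
        return 0;
    }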
To clarify a little, where it says:
shall block until space is available.
there are several ways to wake up from that block/sleep:
Enough space becomes available.
A signal interrupts the current blocking operation.
SO_SNDTIMEO is set for the socket and the timeout expires (a sketch follows after this answer).
Other reasons, e.g. the socket is closed in another thread.
So things end up thus:
If there is enough room in the socket queue in the kernel for the entire message and no signal occurs (normal case), it's copied and returns length.
If a signal occurs during send(), then it must return -1. Obviously we cannot have queued part of the message in this case, since we don't know how much was sent. So nothing can be sent in this situation.
If there is not enough room in the socket queue in the kernel for the entire message and the socket is blocking, then according to the above statement send() must block until space becomes available. Then the message will be queued and send() returns length. However, send() can be interrupted by a signal, the send timeout can elapse, and so on, causing a short send (partial write). Reasonable implementations will return -1 and set errno to an adequate value if nothing was copied to the send buffer.
If there is not enough room in the socket queue in the kernel for the entire message and the socket is non-blocking, then send() must fail (return -1) and errno will be set to EAGAIN or EWOULDBLOCK. Again, since we return -1 it's clear that in this situation no part of the message can be queued.
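For completeness, the SO_SNDTIMEO wakeup listed earlier is enabled per socket with setsockopt; a minimal sketch (set_send_timeout is a hypothetical helper):

    #include <sys/socket.h>
    #include <sys/time.h>

    /* After this, a blocking send() that cannot make progress within
     * the timeout fails with EAGAIN/EWOULDBLOCK instead of blocking
     * forever (a partial count may be returned if some bytes were
     * already queued). */
    int set_send_timeout(int fd, long seconds)
    {
        struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
        return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);
    }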
The goal is to read data from a socket without blocking. The Linux manual page says:
The receive calls normally return any data available, up to the
requested amount, rather than waiting for receipt of the full amount
requested.
Does it mean that I don't have to pass MSG_DONTWAIT flag to recv() after polling the socket descriptor with select()/poll()/epoll()?
The behaviour of recv/read depends on the characteristics of the socket itself. If the socket is marked as non-blocking and no data is available, these calls fail immediately with EAGAIN/EWOULDBLOCK rather than blocking the process.
The socket can be marked as non-blocking prior to reading from it, usually via fcntl or ioctl.
What this excerpt from the manual says is that, basically, reads on both blocking and non-blocking sockets are not required to fill the whole buffer that is supplied. That is why it is important to check the result of the recv/read calls in order to know how much of the buffer contains the actual data and how much is garbage.
It is not a good idea at all to use blocking sockets in conjunction with the IO polling calls such as select/poll/epoll. Even if the polling call indicates that a particular socket is ready for reading, a blocking socket would sometimes still block.
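Both routes to a non-blocking read, sketched (make_nonblocking is a hypothetical helper; MSG_DONTWAIT affects only the single call it is passed to, which answers the question above):

    #include <fcntl.h>
    #include <sys/socket.h>
    #include <errno.h>

    /* Mark the descriptor non-blocking once; every later recv/read
     * then fails with EAGAIN/EWOULDBLOCK instead of blocking. */
    int make_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* Alternatively, leave the socket blocking and opt out per call:
     *     ssize_t n = recv(fd, buf, sizeof buf, MSG_DONTWAIT);
     *     if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
     *         ...no data right now...
     */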
When a socket is signalled as being OK to write by a call to select(), how can I know how much data I can send without blocking? (In the case of full send buffers etc.)
Does inclusion in the set returned by select() signify that the socket is ready for at least one byte of data, and will send() then return a short count of written bytes?
Or will it block when I call send() with a len parameter that is bigger than the available buffer space? If so, how do I know the maximum amount?
I'm using regular C sockets on Linux.
The send call should not block on the first call, and should send at least one byte on the first call -- assuming you are using a stream protocol and assuming it's not interrupted by a signal, etc. However, there are really only two ways to figure out how much data you can send:
Call select after every call to send to see if more data can be sent (a sketch follows after this answer).
Put the socket in non-blocking mode, and call send until it gives an EAGAIN or EWOULDBLOCK error.
The second option is preferred. (The third option is to do it in a different thread and simply let the thread block, which is also a good solution. In the past, threading implementations weren't as mature so non-blocking mode was seen as necessary for high-performance servers.)
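For the select-based first option, a minimal sketch (wait_writable is a hypothetical helper; note that fd must be below FD_SETSIZE):

    #include <sys/select.h>

    /* Block until the socket is reported writable (>= 1 byte of
     * send-buffer space). Returns -1 on error. */
    int wait_writable(int fd)
    {
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        return select(fd + 1, NULL, &wfds, NULL, NULL);
    }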
You cannot know. You have to set the socket to non-blocking, and then pay attention to the return value that tells you how much it has written.