How can I buffer non-blocking IO?

How can I buffer non-blocking IO? - c

When I need buffered IO on blocking file descriptor I use stdio. But if I turn file descriptor into non-blocking mode according to manual stdio buffering is unusable. After some research I see that BIO can be usable for buffering non-blocking IO.
But may be there are other alternatives?
I need this to avoid using threads in a multi-connection environment.

I think what you are talking about is the Reactor Pattern. This is a pretty standard way of processing lots of network connections without threads, and is very common in multiplayer game server engines. Another implementation (in python) is twisted matrix.
The basic algorith is:
have a buffer for each socket
check which sockets are ready to read (select(), poll(), or just iterate)
for each socket:
call recv() and accumulate the contents into the socket's buffer until recv returns 0 or an error with EWOULDBLOCK
call application level data handler for the socket with the contents of the buffer
clear the socket's buffer

I see the question has been edited now, and is at least more understandable than before.
Anyway, isn't this a contradiction?
You make I/O non-blocking because you want to be able to read small amounts quickly, typically sacrificing throughput for latency.
You make it buffered because you don't care that much about latency, but want to make efficient use of the I/O subsystem by trading latency for throughput.
Doing them both at the same time seems like a contradiction, and is hard to imagine.
What are the semantics you're after? If you do this:
int fd;
char buf[1024];
ssize_t got;
fd = setup_non_blocking_io(...);
got = read(fd, buf, sizeof buf);
What behavior do you expect if there is 3 bytes available? Blocking/buffered I/O might block until able to read more satisfy your request, non-blocking I/O would return the 3 available bytes immediately.
Of course, if you have some protocol on top, that defines some kind of message structure so that you can know that "this I/O is incomplete, I can't parse it until I have more data", you can buffer it yourself at that level, and not pass data on upwards until a full message has been received.

Depending on the protocol, it is certainly possible that you will need to buffer your reads for a non-blocking network node (client or server).
Typically, these buffers provide multiple indexes (offsets) that both record the position of the last byte processed and last byte read (which is either the same or greater than the processed offset). And they also (should) provide richer semantics of compacting the buffer, transparent buffer size management, etc.
In Java (at least) the non-blocking network io (NIO) packages also provide a set of data structures (ByteBuffer, etc.) that are geared towards providing a general data structure.
There either exists such data structures for C, or you must roll your own. Once you have it, then simply read as much data as available and let the buffer manage issues such as overflow (e.g. reading bytes across message frame boundaries) and use the marker offset to mark off the bytes that you have processed.
As Android pointed out, you will (very likely) need to create matched buffers for each open connection.

You could create a struct with buffers for each open file descriptor, then accumulate these buffers until recv() returns 0 or you have data enough to process in your buffer.
If I understand your question correctly, you can't buffer because with non-blocking you're writing to the same buffer with multiple connections (if global) or just writing small pieces of data (if local).
In any case, your program has to be able to identify where the data is coming (possibly by file descriptor) from and buffer it accordingly.
Threading is also an option, it's not as scary as many make it sound out to be.

Ryan Dahl's evcom library which does exactly what you wanted.
I use it in my job and it works great. Be aware, though, that it doesn't (yet, but coming soon) have async DNS resolving. Ryan suggests udns by Michael Tokarev for that. I'm trying to adopt udns instead of blocking getaddrinfo() now.

Related

how to design a server for variable size messages

I want some feedback or suggestion on how to design a server that handles variable size messages.
to simplify the answer lets assume:
single thread epoll() based
the protocol is: data-size + data
data is stored on a ringbuffer
the read code, with some simplification for clarity, looks like this:
if (client->readable) {
if (client->remaining > 0) {
/* SIMPLIFIED FOR CLARITY - assume we are always able to read 1+ bytes */
rd = read(client->sock, client->buffer, client->remaining);
client->buffer += rd;
client->remaining -= rd;
} else {
/* SIMPLIFIED FOR CLARITY - assume we are always able to read 4 bytes */
read(client->sock, &(client->remaining), 4);
client->buffer = acquire_ringbuf_slot(client->remaining);
}
}
please, do not focus on the 4 byte. just assume we have the data size in the beginning compressed or not does not make difference for this discussion.
now, the question is: what is the best way to do the above?
assume both small "data", few bytes and large data MBs
how can we reduce the number of read() calls? e.g. in case we have 4 message of 16 bytes on the stream, it seems a waste doing 8 calls to read().
are there better alternatives to this design?

PART of the solution depends on the transport layer protocol you use.
I assume you are using TCP which provides connection oriented and reliable communication.
From your code I assume you understand TCP is a stream-oriented protocol
(So when a client sends a piece of data, that data is stored in the socket send buffer and TCP may use one or more TCP segments to convey it to the other end (server)).
So the code, looks very good so far (considering you have error checks and other things in the real code).
Now for your questions, I give you my responses, what I think is best based on my experience (but there could be better solutions):
1-This is a solution with challenges similar to how an OS manages memory, dealing with fragmentation.
For handling different message sizes, you have to understand there are always trade-offs depending on your performance goals.
One solution to improve memory utilization and parallelization is to have a list of free buffer chunks of certain size, say 4KB.
You will retrieve as many as you need for storing your received message. In the last one you will have unused data. You play with internal fragmentation.
The drawback could be when you need to apply certain type of processing (maybe a visitor pattern) on the message, like parsing/routing/transformation/etc. It will be more complex and less efficient than a case of a huge buffer of contiguous memory. On the other side, the drawback of a huge buffer is much less efficient memory utilization, memory bottlenecks, and less parallelization.
You can implement something smarter in the middle (think about chunks that could also be contiguous whenever available). Always depending on your goals. Something useful is to implement an abstraction over the fragmented memory so that every function (or visitor) that is applied works as it were dealing with contiguous memory.
If you use these chunks, when the message was processed and dropped/forwarded/eaten/whatever, you return the unused chunks to the list of free chunks.
2-The number of read calls will depend on how fast TCP conveys the data from client to server. Remember this is stream oriented and you don't have much control over it. Of course, I'm assuming you try to read the max possible (remaining) data in each read.
If you use the chunks I mentioned above the max data to read will also depend on the chunk size.
Something you can do at TCP layer is to increase the server receive buffer. Thus, it can receive more data even when server cannot read it fast enough.
3-The ring buffer is OK, if you used chunked, the ring buffer should provide the abstraction. But I don't know why you need a ring buffer.
I like ring buffers because there is a way of implementing producer-consumer synchronization without locking (Linux Kernel uses this for moving packets from L2 to IP layer) but I don't know if that's your goal.
To pass messages to other components and/or upper-layers you could also use ring buffers of pointers to messages.

A better design may be as follows:
Set up your user-space socket read buffer to be the same size as the kernel socket buffer. If your user-space socket read buffer is smaller, then you would need more than one read syscall to read the kernel buffer. If your user-space buffer is bigger, then the extra space is wasted.
Your read function should only read as much data as possible in one read syscall. This function must not know anything about the protocol. This way you do not need to re-implement this function for different wire formats.
When your read function has read into the user-space buffer it should call a callback passing the iterators to the data available in the buffer. That callback is a parser function that should extract all available complete messages and pass these messages to another higher-level callback. Upon return the parser function should return the number of bytes consumed, so that these bytes can be discarded from the user-space socket buffer.

Can I get socket buffer remainder size?

I have a real-time system, so I using the non-blocking socket to send my data
But there is happened that the socket buffer is full,
so the send function's return value less than my data length.
If I save the return length and re-send, there is not different with blocking socket?
So can I get the socket buffer's remainder size? I can check it first,
if it is enough then I call send, skip send else.
Thank you, all.

Well there is a difference between blocking and non-blocking - if you experience a short write you don't block. That's the whole point of non-blocking. It give you an opportunity to do something more pressing while waiting for some buffer space to free up.
Your concern seems to be the repeated attempts to write a full message, that is, a form of polling. But a check of the bytes free in the buffer is the same thing, you are just substituting the call to the availability with the call to write. You really don't gain anything efficiency wise.
The commonplace solution to this is to use something like select or poll that monitors the socket descriptor for the ability to write (and least some) bytes. This allows you stop polling and foist some of the work off on the kernel to monitor the space availability for you.
That said, if you really want to check to see how much space is available there are usually work arounds that tend to somewhat platform specific, mostly ioctl calls with various platform specific parameters like FIONWRITE, SIOCOUTQ, etc. You would need to investigate exactly what your platform provides. But, again, it is better to consider if this is really something you need in the first place.

If the asynchronous send fails with EWOULDBLOCK/EAGAIN, no data is sent. You could then try to send something else, or wait until the buffer is free again.
Also see https://stackoverflow.com/questions/19391208/when-a-non-blocking-send-only-transfers-partial-data-can-we-assume-it-would-r - a related issue is discussed there.

Send data to multiple sockets using pipes, tee() and splice()

I'm duplicating a "master" pipe with tee() to write to multiple sockets using splice(). Naturally these pipes will get emptied at different rates depending on how much I can splice() to the destination sockets. So when I next go to add data to the "master" pipe and then tee() it again, I may have a situation where I can write 64KB to the pipe but only tee 4KB to one of the "slave" pipes. I'm guessing then that if I splice() all of the "master" pipe to the socket, I will never be able to tee() the remaining 60KB to that slave pipe. Is that true? I guess I can keep track of a tee_offset (starting at 0) which I set to the start of the "unteed" data and then don't splice() past it. So in this case I would set tee_offset to 4096 and not splice more than that until I'm able to tee it to all to the other pipes. Am I on the right track here? Any tips/warnings for me?

If I understand correctly, you've got some realtime source of data that you want to multiplex to multiple sockets. You've got a single "source" pipe hooked up to whatever's producing your data, and you've got a "destination" pipe for each socket over which you wish to send the data. What you're doing is using tee() to copy data from the source pipe to each of the destination pipes and splice() to copy it from the destination pipes to the sockets themselves.
The fundamental issue you're going to hit here is if one of the sockets simply can't keep up - if you're producing data faster than you can send it, then you're going to have a problem. This isn't related to your use of pipes, it's just a fundamental issue. So, you'll want to pick a strategy to cope in this case - I suggest handling this even if you don't expect it to be common as these things often come up to bite you later. Your basic choices are to either close the offending socket, or to skip data until it's cleared its output buffer - the latter choice might be more suitable for audio/video streaming, for example.
The issue which is related to your use of pipes, however, is that on Linux the size of a pipe's buffer is somewhat inflexible. It defaults to 64K since Linux 2.6.11 (the tee() call was added in 2.6.17) - see the pipe manpage. Since 2.6.35 this value can be changed via the F_SETPIPE_SZ option to fcntl() (see the fcntl manpage) up to the limit specified by /proc/sys/fs/pipe-size-max, but the buffering is still more awkward to change on-demand than a dynamically allocated scheme in user-space would be. This means that your ability to cope with slow sockets will be somewhat limited - whether this is acceptable depends on the rate at which you expect to receive and be able to send data.
Assuming this buffering strategy is acceptable, you're correct in your assumption that you'll need to track how much data each destination pipe has consumed from the source, and it's only safe to discard data which all destination pipes have consumed. This is somewhat complicated by the fact that tee() doesn't have the concept of an offset - you can only copy from the start of the pipe. The consequence of this is that you can only copy at the speed of the slowest socket, since you can't use tee() to copy to a destination pipe until some of the data has been consumed from the source, and you can't do this until all the sockets have the data you're about to consume.
How you handle this depends on the importance of your data. If you really need the speed of tee() and splice(), and you're confident that a slow socket will be an extremely rare event, you could do something like this (I've assumed you're using non-blocking IO and a single thread, but something similar would also work with multiple threads):
Make sure all pipes are non-blocking (use fcntl(d, F_SETFL, O_NONBLOCK) to make each file descriptor non-blocking).
Initialise a read_counter variable for each destination pipe to zero.
Use something like epoll() to wait until there's something in the source pipe.
Loop over all destination pipes where read_counter is zero, calling tee() to transfer data to each one. Make sure you pass SPLICE_F_NONBLOCK in the flags.
Increment read_counter for each destination pipe by the amount transferred by tee(). Keep track of the lowest resultant value.
Find the lowest resultant value of read_counter - if this is non-zero, then discard that amount of data from the source pipe (using a splice() call with a destination opened on /dev/null, for example). After discarding data, subtract the amount discarded from read_counter on all the pipes (since this was the lowest value then this cannot result in any of them becoming negative).
Repeat from step 3.
Note: one thing that's tripped me up in the past is that SPLICE_F_NONBLOCK affects whether the tee() and splice() operations on the pipes are non-blocking, and the O_NONBLOCK you set with fnctl() affects whether the interactions with other calls (e.g. read() and write()) are non-blocking. If you want everything to be non-blocking, set both. Also remember to make your sockets non-blocking or the splice() calls to transfer data to them might block (unless that's what you want, if you're using a threaded approach).
As you can see, this strategy has a major problem - as soon as one socket blocks up, everything halts - the destination pipe for that socket will fill up, and then the source pipe will become stagnant. So, if you reach the stage where tee() returns EAGAIN in step 4 then you'll want to either close that socket, or at least "disconnect" it (i.e. take it out of your loop) such that you don't write anything else to it until its output buffer is empty. Which you choose depends on whether your data stream can recovery from having bits of it skipped.
If you want to cope with network latency more gracefully then you're going to need to do more buffering, and this is going to involve either user-space buffers (which rather negates the advantages of tee() and splice()) or perhaps disk-based buffer. The disk-based buffering will almost certainly be significantly slower than user-space buffering, and hence not appropriate given that presumably you want a lot of speed since you've chosen tee() and splice() in the first place, but I mention it for completeness.
One thing that's worth noting if you end up inserting data from user-space at any point is the vmsplice() call which can perform "gather output" from user-space into a pipe, in a similar way to the writev() call. This might be useful if you're doing enough buffering that you've split your data among multiple different allocated buffers (for example if you're using a pool allocator approach).
Finally, you could imagine swapping sockets between the "fast" scheme of using tee() and splice() and, if they fail to keep up, moving them on to a slower user-space buffering. This is going to complicate your implementation, but if you're handling large numbers of connections and only a very small proportion of them are slow then you're still reducing the amount of copying to user-space that's involved somewhat. However, this would only ever be a short-term measure to cope with transient network issues - as I said originally, you've got a fundamental problem if your sockets are slower than your source. You'd eventually hit some buffering limit and need to skip data or close connections.
Overall, I would carefully consider why you need the speed of tee() and splice() and whether, for your use-case, simply user-space buffering in memory or on disk would be more appropriate. If you're confident that the speeds will always be high, however, and limited buffering is acceptable then the approach I outlined above should work.
Also, one thing I should mention is that this will make your code extremely Linux-specific - I'm not aware of these calls being support in other Unix variants. The sendfile() call is more restricted than splice(), but might be rather more portable. If you really want things to be portable, stick to user-space buffering.
Let me know if there's anything I've covered which you'd like more detail on.

Non-blocking TCP socket handling - How to detect blocking state prior to writing on to the socket?

I'm currently writing a proxy application that reads from one socket and writes on another. Both are set as non-blocking, allowing multiple sockets pairs to be handle.
To control a proper flow between the sockets, the application should NOT read from the source socket if the writing on the target socket may block.
The idea is nice, however I found no way to detect a blocking state of the target socket without first writing to it... and that is not what is needed.
I know of an option to use SIOCOUTQ (using ioctl()) and calculate the remaining buffer, but this seems ugly compared to a simple check if the target socket is ready for writing.
I guess I can also use select() for just this socket, but that is so much waste of a such heavy system call.

select or poll should be able to give you the information.
I assume you're already using one of them to detect which of your reading sockets has data
When you have a reading socket available for read, replace it with the corresponding writing socket (but put it in the write fds of course), and call select again. Then, if the writing socket is available, you can read and write.
Note that it's possible that the writing socket is ready to get data, but not as much as you want. So you might manage to read 100 bytes, and write only 50.

Thank you all for the feedback.
Summarizing all comments and answers up until now:
Directly to the question, the only known way to detect that a socket will block on an attempt to write to it is using select/poll/epoll.
The original objective was to build a proxy, which reads from one socket and writes to another, keeping proper balance between them (reading at the same rate of writing and vice versa) and using no major buffering in the application. For this, the following options were presented:
Use of SIOCOUTQ to find out how much buffer is left on the destination socket and transmit no more than that. As pointed out by #ugoren, this has a disadvantage of being unreliable, mainly because in between reading the value, calculating it and attempting to write the actual value may change. It also introduces some issues of busy-waiting if wrongly managed. I guess, if this technique is to be used, it should be followed by a more reliable one for full protection.
Use of select/poll/epoll and adding a small limited buffer per read socket: Initially, all read sockets will be added to the poll, when ready for read, we read it in a limited buffer size, then remove the read socket from the poll and add instead of it the destination socket for writing, when we return to the poll and we detect that the destination socket is ready for writing, we write the buffer, if all data was accepted by the socket, we remove the destination socket from the poll and return the read socket back. The disadvantage here is that we increase the number of system calls (adding/removing to the poll/select) and we need to keep an internal buffer per socket. This seems to be the preferred approach, and some optimization could be added to try and reduce the call to system calls (for example, trying to write right after reading the read socket, and only if something is left to perform the above).
Thank you all again for participating in this discussion, it helped me a lot ordering the ideas.

Are repeated recv() calls expensive?

I have a question about a situation that I face quite often. From time to time I have to implement various TCP-based protocols. Most of them define variable-length data packets that begin with a common header ([packet ID, length, payload] or something really similar). Obviously, there can be two approaches to reading these packets:
Read header (since header length is usually fixed), extract the payload length, read the payload
Read all available data and store it in a buffer; parse the buffer afterwards
Obviously, the first approach is simple, but requires two calls to read() (or probably more). The second one is slightly more complicated, but requires less calls.
The question is: does the first approach affect the performance badly enough to worry about it?

yes, system calls are generally expensive, compared to memory copies. IMHO it is particularly true on x86 architecture, and arguable on RISC machine (arm, mips, ...).
To be honest, unless you must handle hundreds or thousands of request per second, you will hardly notice the difference.
Depending on what is exactly the protocol, an hybrid approach could be the best. When the protocol uses a lot of small packets and less big ones, you can read the header and a partial amount of data. When it is a small packet, you win by avoiding a large memcpy, when the packet is big, you win by issuing a second syscall only for that case.

If your application is a server capable of handling multiple clients simultaneously and non-blocking sockets are used to handle multiple clients in one thread, you have little choice but to only ever issue one recv() syscall when a socket becomes ready for read.
The reason for that is if you keep calling recv() in a loop and the client sends a large volume of data, what can happen is that your recv() loop may block the thread for long time from doing anything else. E.g., recv() reads some amount of data from the socket, determines that there is now a complete message in the buffer and forwards that message to the callback. The callback processes the message somehow and returns. If you call recv() once more there can be more messages that have arrived while the callback was processing the previous message. This leads to a busy recv() loop on one socket preventing the thread from processing any other pending events.
This issue is exacerbated if the socket read buffer in your application is smaller than the kernel socket receive buffer. In other words, the whole contents of the kernel receive buffer can not be read in one recv() call. Anecdotal evidence is that I hit this issue on a busy production system when there was a 16Kb user-space buffer for a 2Mb kernel socket receive buffer. A client sending many messages in succession would block the thread in that recv() loop for minutes because more messages would arrive when the just read messages were being processed, leading to disruption of the service.
In such event-driven architectures it is best to have the user-space read buffer equal to the size of the kernel socket receive buffer (or the maximum message size, whichever is bigger), so that all the data available in the kernel buffer can be read in one recv() call. This works by doing one recv() call, processing all complete messages in the user-space read buffer and then returning control to the event loop. This way a connections with a lot of incoming data is not blocking the thread from processing other events and connections, rather it round-robin's processing of all connections with incoming data available.

The best way to get your answer is to measure. The strace program is decent for the purpose of measuring system call times. Using it adds a lot of overhead in itself, but if you merely compare the cost of one recv for this purpose versus the cost of two, it should be reasonably meaningful. Use the -tt option to get times. Or you can use the -c option to get an overview of time spent separated by which syscall it was spent on.
A better way to measure, albeit with more of a learning curve, is oprofile.
Also note that if you do decide buffering is worthwhile, you may be able to use fdopen and the stdio functions to take care of it for you. This is extremely easy and will work well if you're only dealing with a single connection or if you have a thread/process per connection, but won't work at all if you want to use a select/poll-based model.

Note that you generally have to "read all the available data into a buffer and process it afterwards" anyway, to account for the (unlikely, but possible) scenario where a recv() call returns only part of your header - so you might as well go the whole hog and use option 2.

Yes, depending upon the scenario the read/recv calls may be expensive. For example, if you are issuing huge number of recv() calls to read very small amount of data every small interval, it would be a performance hit. In such scenario you could issue a recv() with reasonably large buffer, let say 4k, and then parse that 4k buffer. It may contain multiple header+data combo. By reading header first you can find the data and its length. And to avoid the mem copy of data into a new buffer, you can just use the offset from where the actual data start, and store that pointer.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight