Send data to multiple sockets using pipes, tee() and splice()

Send data to multiple sockets using pipes, tee() and splice() - c

I'm duplicating a "master" pipe with tee() to write to multiple sockets using splice(). Naturally these pipes will get emptied at different rates depending on how much I can splice() to the destination sockets. So when I next go to add data to the "master" pipe and then tee() it again, I may have a situation where I can write 64KB to the pipe but only tee 4KB to one of the "slave" pipes. I'm guessing then that if I splice() all of the "master" pipe to the socket, I will never be able to tee() the remaining 60KB to that slave pipe. Is that true? I guess I can keep track of a tee_offset (starting at 0) which I set to the start of the "unteed" data and then don't splice() past it. So in this case I would set tee_offset to 4096 and not splice more than that until I'm able to tee it to all to the other pipes. Am I on the right track here? Any tips/warnings for me?

If I understand correctly, you've got some realtime source of data that you want to multiplex to multiple sockets. You've got a single "source" pipe hooked up to whatever's producing your data, and you've got a "destination" pipe for each socket over which you wish to send the data. What you're doing is using tee() to copy data from the source pipe to each of the destination pipes and splice() to copy it from the destination pipes to the sockets themselves.
The fundamental issue you're going to hit here is if one of the sockets simply can't keep up - if you're producing data faster than you can send it, then you're going to have a problem. This isn't related to your use of pipes, it's just a fundamental issue. So, you'll want to pick a strategy to cope in this case - I suggest handling this even if you don't expect it to be common as these things often come up to bite you later. Your basic choices are to either close the offending socket, or to skip data until it's cleared its output buffer - the latter choice might be more suitable for audio/video streaming, for example.
The issue which is related to your use of pipes, however, is that on Linux the size of a pipe's buffer is somewhat inflexible. It defaults to 64K since Linux 2.6.11 (the tee() call was added in 2.6.17) - see the pipe manpage. Since 2.6.35 this value can be changed via the F_SETPIPE_SZ option to fcntl() (see the fcntl manpage) up to the limit specified by /proc/sys/fs/pipe-size-max, but the buffering is still more awkward to change on-demand than a dynamically allocated scheme in user-space would be. This means that your ability to cope with slow sockets will be somewhat limited - whether this is acceptable depends on the rate at which you expect to receive and be able to send data.
Assuming this buffering strategy is acceptable, you're correct in your assumption that you'll need to track how much data each destination pipe has consumed from the source, and it's only safe to discard data which all destination pipes have consumed. This is somewhat complicated by the fact that tee() doesn't have the concept of an offset - you can only copy from the start of the pipe. The consequence of this is that you can only copy at the speed of the slowest socket, since you can't use tee() to copy to a destination pipe until some of the data has been consumed from the source, and you can't do this until all the sockets have the data you're about to consume.
How you handle this depends on the importance of your data. If you really need the speed of tee() and splice(), and you're confident that a slow socket will be an extremely rare event, you could do something like this (I've assumed you're using non-blocking IO and a single thread, but something similar would also work with multiple threads):
Make sure all pipes are non-blocking (use fcntl(d, F_SETFL, O_NONBLOCK) to make each file descriptor non-blocking).
Initialise a read_counter variable for each destination pipe to zero.
Use something like epoll() to wait until there's something in the source pipe.
Loop over all destination pipes where read_counter is zero, calling tee() to transfer data to each one. Make sure you pass SPLICE_F_NONBLOCK in the flags.
Increment read_counter for each destination pipe by the amount transferred by tee(). Keep track of the lowest resultant value.
Find the lowest resultant value of read_counter - if this is non-zero, then discard that amount of data from the source pipe (using a splice() call with a destination opened on /dev/null, for example). After discarding data, subtract the amount discarded from read_counter on all the pipes (since this was the lowest value then this cannot result in any of them becoming negative).
Repeat from step 3.
Note: one thing that's tripped me up in the past is that SPLICE_F_NONBLOCK affects whether the tee() and splice() operations on the pipes are non-blocking, and the O_NONBLOCK you set with fnctl() affects whether the interactions with other calls (e.g. read() and write()) are non-blocking. If you want everything to be non-blocking, set both. Also remember to make your sockets non-blocking or the splice() calls to transfer data to them might block (unless that's what you want, if you're using a threaded approach).
As you can see, this strategy has a major problem - as soon as one socket blocks up, everything halts - the destination pipe for that socket will fill up, and then the source pipe will become stagnant. So, if you reach the stage where tee() returns EAGAIN in step 4 then you'll want to either close that socket, or at least "disconnect" it (i.e. take it out of your loop) such that you don't write anything else to it until its output buffer is empty. Which you choose depends on whether your data stream can recovery from having bits of it skipped.
If you want to cope with network latency more gracefully then you're going to need to do more buffering, and this is going to involve either user-space buffers (which rather negates the advantages of tee() and splice()) or perhaps disk-based buffer. The disk-based buffering will almost certainly be significantly slower than user-space buffering, and hence not appropriate given that presumably you want a lot of speed since you've chosen tee() and splice() in the first place, but I mention it for completeness.
One thing that's worth noting if you end up inserting data from user-space at any point is the vmsplice() call which can perform "gather output" from user-space into a pipe, in a similar way to the writev() call. This might be useful if you're doing enough buffering that you've split your data among multiple different allocated buffers (for example if you're using a pool allocator approach).
Finally, you could imagine swapping sockets between the "fast" scheme of using tee() and splice() and, if they fail to keep up, moving them on to a slower user-space buffering. This is going to complicate your implementation, but if you're handling large numbers of connections and only a very small proportion of them are slow then you're still reducing the amount of copying to user-space that's involved somewhat. However, this would only ever be a short-term measure to cope with transient network issues - as I said originally, you've got a fundamental problem if your sockets are slower than your source. You'd eventually hit some buffering limit and need to skip data or close connections.
Overall, I would carefully consider why you need the speed of tee() and splice() and whether, for your use-case, simply user-space buffering in memory or on disk would be more appropriate. If you're confident that the speeds will always be high, however, and limited buffering is acceptable then the approach I outlined above should work.
Also, one thing I should mention is that this will make your code extremely Linux-specific - I'm not aware of these calls being support in other Unix variants. The sendfile() call is more restricted than splice(), but might be rather more portable. If you really want things to be portable, stick to user-space buffering.
Let me know if there's anything I've covered which you'd like more detail on.

Related

Flushing pipe without closing in C

I have found a lot of threads in here asking about how it is possible to flush a pipe after writing to it without closing it.
In every thread I could see different suggestions but i could not find a definite solution.
Here is a quick summary:
The easiest way to avoid read blocking on the pipe would be to write the exact number of bytes that is reading.
It could be also done by using ptmx instead of a pipe but people said it could be to much.
Note: It's not possible to use fsync with pipes
Are there any other more efficient solutions?
Edit:
The flush would be convenient when the sender wants to write n characters but the client reads m characters (where m>n). The client will block waiting for another m-n characters. If the sender wants to communicate again with the client leaves him without the option of closing the pipe and just sending the exact number of bytes could be a good source of bugs.
The receiver operates like this and it cannot be modified:
while((n=read(0, buf, 100)>0){
process(buf)
so the sender wants to get processed: "file1" and "file2" for which will have to:
write(pipe[1], "file1\0*95", 100);
write(pipe[1], "file2\0*95", 100);
what I am looking is for a way to do something like that (without being necessary to use the \n as the delimeter):
write(pipe[1], "file1\nfile2", 11); //it would have worked if it was ptmx
(Using read and write)

Flushing in the sense of fflush() is irrelevant to pipes, because they are not represented as C streams. There is therefore no userland buffer to flush. Similarly, fsync() is also irrelevant to pipes, because there is no back-end storage for the data. Data successfully written to a pipe exist in the kernel and only in the kernel until they are successfully read, so there is no work for fsync() to do. Overall, flushing / syncing is applicable only where there is intermediate storage involved, which is not the case with pipes.
With the clarification, your question seems to be about establishing message boundaries for communication via a pipe. You are correct that closing the write end of the pipe will signal a boundary -- not just of one message, but of the whole stream of communication -- but of course that's final. You are also correct that there are no inherent message boundaries. Nevertheless, you seem to be working from at least somewhat of a misconception here:
The easiest way to avoid read blocking on the pipe would be to write
the exact number of bytes that is reading.
[...]
The flush would be convenient when the sender wants to write n
characters but the client reads m characters (where m>n). The client
will block waiting for another m-n characters.
Whether the reader will block is entirely dependent on how the reader is implemented. In particular, the read(2) system call in no way guarantees to transfer the requested number of bytes before returning. It can and will perform a short read under some circumstances. Although the details are unspecified, you can ordinarily rely on a short read when at least one character can be transferred without blocking, but not the whole number requested. Similar applies to write(2). Thus, the easiest way to avoid read() blocking is to ensure that you write at least one byte to the pipe for that read() call to transfer.
In fact, people usually come at this issue from the opposite direction: needing to be certain to receive a specific number of bytes, and therefore having to deal with the potential for short reads as a complication (to be dealt with by performing the read() in a loop). You'll need to consider that, too, but you have the benefit that your client is unlikely to block under the circumstances you describe; it just isn't the problem you think it is.
There is an inherent message-boundary problem in any kind of stream communication, however, and you'll need to deal with it. There are several approaches; among the most commonly used are
Fixed-length messages. The receiver can then read until it successfully transfers the required number of bytes; any blocking involved is appropriate and needful. With this approach, the scenario you postulate simply does not arise, but the writer might need to pad its messages.
Delimited messages. The receiver then reads until it finds that it has received a message delimiter (a newline or a null byte, for example). In this case, the receiver will need to be prepared for the possibility of message boundaries not being aligned with the byte sequences transferred by read() calls. Marking the end of a message by closing the channel can be considered a special case of this alternative.
Embedded message-length metadata. This can take many forms, but one of the simplest is to structure messages as a fixed-length integer message length field, followed by that number of bytes of message data. The reader then knows at every point how many bytes it needs to read, so it will not block needlessly.
These can be used individually or in combination to implement an application-layer protocol for communicating between your processes. Naturally, the parties to the communication must agree on the protocol details for communication to be successful.

Is level triggered or edge triggered more performant?

I am trying to figure out what is more performant, edge triggered or level triggered epoll.
Mainly I am considering "performant" as:
Ability to handle multiple connections without degredation.
Ability to keep the uptmost speed per inbound message.
I am actually more concerned about #2, but #1 is also important.
I've been running tests with a single threaded consumer (accept/read multiple socket connections using epoll_wait), and multiple producers.
So far I've seen no difference, even up to 1000 file descriptors.
I've been laboring under the idea (delusion?) that edge triggered should be more performant because less interupts will be received. Is this a correct assumption?
One issue with my test, that might be masking performance differences, is that I don't dispatch my messages to threads once they are received, so the less interrupts don't really matter. I've been loath to do this test because I've been using __asm__ rdtsc to get my "timestamps," so I don't want to have to reconcile what core my original timestamp came from.
What makes me even more suspicious is that level triggered epoll performs better on some benchmarks I've seen.
Which is better? Under what circumstances? Is there no difference? Any insights would be appreciated.
My sockets are non-blocking.

I wouldn't expect to see a huge performance difference between edge and level triggered.
For edge-triggered you always have to drain the input buffer, so you have one useless (just returning EWOULDBLOCK) recv syscall. But for level triggered you might use more epoll_wait syscalls. As the man page points out, avoiding starvation might be slightly easier in level triggered mode.
The real difference is that when you want to use multiple threads you'll have to use edge-triggered mode (although you'll still have to be careful with getting synchronization right).

The difference is only visible when you use long-lived sessions and you're forced to constantly stop/start because of buffers full/empty (typically with a proxy). When you're doing this, you most often need an event cache, and when your event cache is processing events, you can use ET and avoid all the epoll_ctl(DEL)+epoll_ctl(ADD) dance. For short-lived sessions, the savings are less obvious, because for ET you'll need at least one epoll_ctl(ADD) call to enable polling on the FD, and if you don't expect to have more of them during the session's life (eg: exchanges are smaller than buffers most of the time), then you shouldn't expect any difference. Most of your savings will generally come from using an event cache only since you can often perform a lot of operations (eg: writes) without polling thanks to kernel buffers.

When used as an edge-triggered interface, for performance reasons, it
is possible to add the file descriptor inside the epoll interface
(EPOLL_CTL_ADD) once by specifying (EPOLLIN|EPOLLOUT). This allows you
to avoid continuously switching between EPOLLIN and EPOLLOUT calling
epoll_ctl(2) with EPOLL_CTL_MOD.
Q9 Do I need to continuously read/write a file descriptor until EAGAIN
when using the EPOLLET flag (edge-triggered behavior) ?
A9 Receiving an event from epoll_wait(2) should suggest to you that
such file descriptor is ready for the requested I/O operation. You
must consider it ready until the next (nonblocking) read/write
yields EAGAIN. When and how you will use the file descriptor is
entirely up to you.
For packet/token-oriented files (e.g., datagram socket, terminal in
canonical mode), the only way to detect the end of the read/write
I/O space is to continue to read/write until EAGAIN.
For stream-oriented files (e.g., pipe, FIFO, stream socket), the
condition that the read/write I/O space is exhausted can also be
detected by checking the amount of data read from / written to the
target file descriptor. For example, if you call read(2) by asking
to read a certain amount of data and read(2) returns a lower number
of bytes, you can be sure of having exhausted the read I/O space
for the file descriptor. The same is true when writing using
write(2). (Avoid this latter technique if you cannot guarantee
that the monitored file descriptor always refers to a stream-ori‐
ented file.)

What amount of data does select (2) guarantee to be able to be written to a file without blocking

select (2) (amongst other things) tells me whether I can write to a fd of a file without blocking. However, does it guarentee me that I can write a full 4096 bytes without blocking?
Note I am interested in normal files on disk. Not sockets or the like.
In other words: does select signal when we can just write one single byte to a file fd without blocking, or does it signal when we can write n (4096, ... ?) bytes to a file fd without blocking.

Whenever select() indicates that your file is ready, you can try writing N bytes, for any N>0. write() will return the number of bytes actually written. If it equals N, you can write again. If it's less than N, then the next write will block.
Note Normal files on disk don't block. Sockets, pipes and terminals do.

You tagged this "Linux", so what does the kernel source code tell you? It should be pretty easy to read the syscall implementation to find when select decides to treat a file descriptor as ready for writing.
If you're worried about blocking, though, you're doing it wrong. If you don't want to block, use O_NONBLOCK or equivalents. Even if select did guarantee a certain number of bytes could be written without blocking, that would only be true at the time select returns; it might not necessarily be true by the time you actually perform the write.

Note I am interested in normal files on disk. Not sockets or the like.
select does not "work" with normal files, only sockets/pipes/ttys and possibly others, but not regular files. For regular files select will always signal the file descriptor as readable/writable - thus it is a rather useless exercise to use select with files.
note that that applies to other io multiplexing facilities as well, such as poll/epoll. AIO will do asynchonous io to regular files, but operating system support might vary, and it is a rather complex api to use
As to how much data you can write, there is no promise. 4096 is no magical number that select assumes you can write without blocking, when applied to filedescriptors where using select does make sense (sockets/pipes/etc.) . Because you can't know how much data you can write without blocking, you should always set the file descriptor to non-blocking, record how much was actually written as indicated by the return value of write/send and start writing from that point the next time select indicates you can write data again.

select() only promises that the applicable call can be made without blocking, it does not guarantee an I/O amount (4096) in your case. Since select() can be used with different types of descriptors (file, sockets, serial connections, etc.) you may notice that for disk operations the observed behavior is that a full buffer can always be written, but again this is specific to the particular underlying operation and not a promise of select().

Are repeated recv() calls expensive?

I have a question about a situation that I face quite often. From time to time I have to implement various TCP-based protocols. Most of them define variable-length data packets that begin with a common header ([packet ID, length, payload] or something really similar). Obviously, there can be two approaches to reading these packets:
Read header (since header length is usually fixed), extract the payload length, read the payload
Read all available data and store it in a buffer; parse the buffer afterwards
Obviously, the first approach is simple, but requires two calls to read() (or probably more). The second one is slightly more complicated, but requires less calls.
The question is: does the first approach affect the performance badly enough to worry about it?

yes, system calls are generally expensive, compared to memory copies. IMHO it is particularly true on x86 architecture, and arguable on RISC machine (arm, mips, ...).
To be honest, unless you must handle hundreds or thousands of request per second, you will hardly notice the difference.
Depending on what is exactly the protocol, an hybrid approach could be the best. When the protocol uses a lot of small packets and less big ones, you can read the header and a partial amount of data. When it is a small packet, you win by avoiding a large memcpy, when the packet is big, you win by issuing a second syscall only for that case.

If your application is a server capable of handling multiple clients simultaneously and non-blocking sockets are used to handle multiple clients in one thread, you have little choice but to only ever issue one recv() syscall when a socket becomes ready for read.
The reason for that is if you keep calling recv() in a loop and the client sends a large volume of data, what can happen is that your recv() loop may block the thread for long time from doing anything else. E.g., recv() reads some amount of data from the socket, determines that there is now a complete message in the buffer and forwards that message to the callback. The callback processes the message somehow and returns. If you call recv() once more there can be more messages that have arrived while the callback was processing the previous message. This leads to a busy recv() loop on one socket preventing the thread from processing any other pending events.
This issue is exacerbated if the socket read buffer in your application is smaller than the kernel socket receive buffer. In other words, the whole contents of the kernel receive buffer can not be read in one recv() call. Anecdotal evidence is that I hit this issue on a busy production system when there was a 16Kb user-space buffer for a 2Mb kernel socket receive buffer. A client sending many messages in succession would block the thread in that recv() loop for minutes because more messages would arrive when the just read messages were being processed, leading to disruption of the service.
In such event-driven architectures it is best to have the user-space read buffer equal to the size of the kernel socket receive buffer (or the maximum message size, whichever is bigger), so that all the data available in the kernel buffer can be read in one recv() call. This works by doing one recv() call, processing all complete messages in the user-space read buffer and then returning control to the event loop. This way a connections with a lot of incoming data is not blocking the thread from processing other events and connections, rather it round-robin's processing of all connections with incoming data available.

The best way to get your answer is to measure. The strace program is decent for the purpose of measuring system call times. Using it adds a lot of overhead in itself, but if you merely compare the cost of one recv for this purpose versus the cost of two, it should be reasonably meaningful. Use the -tt option to get times. Or you can use the -c option to get an overview of time spent separated by which syscall it was spent on.
A better way to measure, albeit with more of a learning curve, is oprofile.
Also note that if you do decide buffering is worthwhile, you may be able to use fdopen and the stdio functions to take care of it for you. This is extremely easy and will work well if you're only dealing with a single connection or if you have a thread/process per connection, but won't work at all if you want to use a select/poll-based model.

Note that you generally have to "read all the available data into a buffer and process it afterwards" anyway, to account for the (unlikely, but possible) scenario where a recv() call returns only part of your header - so you might as well go the whole hog and use option 2.

Yes, depending upon the scenario the read/recv calls may be expensive. For example, if you are issuing huge number of recv() calls to read very small amount of data every small interval, it would be a performance hit. In such scenario you could issue a recv() with reasonably large buffer, let say 4k, and then parse that 4k buffer. It may contain multiple header+data combo. By reading header first you can find the data and its length. And to avoid the mem copy of data into a new buffer, you can just use the offset from where the actual data start, and store that pointer.

How can I buffer non-blocking IO?

When I need buffered IO on blocking file descriptor I use stdio. But if I turn file descriptor into non-blocking mode according to manual stdio buffering is unusable. After some research I see that BIO can be usable for buffering non-blocking IO.
But may be there are other alternatives?
I need this to avoid using threads in a multi-connection environment.

I think what you are talking about is the Reactor Pattern. This is a pretty standard way of processing lots of network connections without threads, and is very common in multiplayer game server engines. Another implementation (in python) is twisted matrix.
The basic algorith is:
have a buffer for each socket
check which sockets are ready to read (select(), poll(), or just iterate)
for each socket:
call recv() and accumulate the contents into the socket's buffer until recv returns 0 or an error with EWOULDBLOCK
call application level data handler for the socket with the contents of the buffer
clear the socket's buffer

I see the question has been edited now, and is at least more understandable than before.
Anyway, isn't this a contradiction?
You make I/O non-blocking because you want to be able to read small amounts quickly, typically sacrificing throughput for latency.
You make it buffered because you don't care that much about latency, but want to make efficient use of the I/O subsystem by trading latency for throughput.
Doing them both at the same time seems like a contradiction, and is hard to imagine.
What are the semantics you're after? If you do this:
int fd;
char buf[1024];
ssize_t got;
fd = setup_non_blocking_io(...);
got = read(fd, buf, sizeof buf);
What behavior do you expect if there is 3 bytes available? Blocking/buffered I/O might block until able to read more satisfy your request, non-blocking I/O would return the 3 available bytes immediately.
Of course, if you have some protocol on top, that defines some kind of message structure so that you can know that "this I/O is incomplete, I can't parse it until I have more data", you can buffer it yourself at that level, and not pass data on upwards until a full message has been received.

Depending on the protocol, it is certainly possible that you will need to buffer your reads for a non-blocking network node (client or server).
Typically, these buffers provide multiple indexes (offsets) that both record the position of the last byte processed and last byte read (which is either the same or greater than the processed offset). And they also (should) provide richer semantics of compacting the buffer, transparent buffer size management, etc.
In Java (at least) the non-blocking network io (NIO) packages also provide a set of data structures (ByteBuffer, etc.) that are geared towards providing a general data structure.
There either exists such data structures for C, or you must roll your own. Once you have it, then simply read as much data as available and let the buffer manage issues such as overflow (e.g. reading bytes across message frame boundaries) and use the marker offset to mark off the bytes that you have processed.
As Android pointed out, you will (very likely) need to create matched buffers for each open connection.

You could create a struct with buffers for each open file descriptor, then accumulate these buffers until recv() returns 0 or you have data enough to process in your buffer.
If I understand your question correctly, you can't buffer because with non-blocking you're writing to the same buffer with multiple connections (if global) or just writing small pieces of data (if local).
In any case, your program has to be able to identify where the data is coming (possibly by file descriptor) from and buffer it accordingly.
Threading is also an option, it's not as scary as many make it sound out to be.

Ryan Dahl's evcom library which does exactly what you wanted.
I use it in my job and it works great. Be aware, though, that it doesn't (yet, but coming soon) have async DNS resolving. Ryan suggests udns by Michael Tokarev for that. I'm trying to adopt udns instead of blocking getaddrinfo() now.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight