Which method do you suggest for reading a multicast stream in Linux? - c

I have written a program in Linux using C/C++ that reads multicast packets and tries to determine, as quickly as possible, whether a specific event has occurred. Latency is the key point here.
In the protocol, the first two bytes represent the message type.
In my current implementation, I read the first two bytes and decide how many bytes to read for the payload according to the message type. In other words, I perform 2 read operations for 1 packet: one for the two-byte header, which determines the length, and one for the payload. So, there are 2 I/O operations.
Alternatively, I could read as much as I can, check the first 2 bytes (let's say they give N), take the next N bytes to form packet1, and do the same for packet2. If bytes remain after extracting packet1 and packet2, I read more bytes and process the byte buffer again as above. With this method I do 1 I/O operation, but I have to traverse the byte buffer.
Which one is faster theoretically? I know I must implement and measure both but I just wanted to hear your suggestions.
Thanks

The fastest method I know of is:
Open a raw packet socket (AF_PACKET)
Attach a BPF filter that selects the packets you need as specifically as possible
Switch to a memory-mapped ringbuffer (PACKET_MMAP/PACKET_RX_RING)
Read the packets directly from memory instead of using recv(). This can be done using poll() or, alternatively, by busy-looping over the in-memory packet meta-data to avoid the poll() syscall.
Process the packet directly in the ring-buffer (zero-copy)
Mark the buffer as "free for reuse"
This way, no syscalls at all are necessary (when busy-looping rather than calling poll()), the path through the kernel is short, and the latency should be minimal.
For more information, see the packet mmap kernel documentation.
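The "read from memory, process in place, mark as free" steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the TPACKET_V1 frame layout (struct tpacket_hdr), and it leaves out the socket setup (socket(AF_PACKET, ...), the PACKET_RX_RING setsockopt() and the mmap() call). The names drain_rx_ring, frame_handler and sum_lengths are hypothetical and do not come from the answer.

```c
#include <assert.h>
#include <linux/if_packet.h>
#include <stddef.h>

/* Hypothetical callback: gets a pointer to the frame data inside the ring. */
typedef void (*frame_handler)(const unsigned char *data, unsigned int len, void *ctx);

/* Example handler (assumption): adds up the captured lengths into a size_t. */
static void sum_lengths(const unsigned char *data, unsigned int len, void *ctx)
{
    (void)data;
    *(size_t *)ctx += len;
}

/* Process every frame the kernel has already handed over, without a syscall.
 * ring        start of the mmap()ed RX ring (TPACKET_V1 layout assumed)
 * frame_size  tp_frame_size from the tpacket_req used to create the ring
 * frame_count tp_frame_nr from the same tpacket_req
 * next        index of the next frame to inspect, updated in place
 * Returns the number of frames processed (0 means nothing was ready). */
static size_t drain_rx_ring(unsigned char *ring, unsigned int frame_size,
                            unsigned int frame_count, unsigned int *next,
                            frame_handler handle, void *ctx)
{
    size_t processed = 0;

    assert(frame_count > 0 && frame_size >= sizeof(struct tpacket_hdr));

    while (processed < frame_count) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + (size_t)*next * frame_size);
        unsigned long status = *(volatile unsigned long *)&hdr->tp_status;

        /* The kernel sets TP_STATUS_USER once the frame is completely written. */
        if (!(status & TP_STATUS_USER))
            break;
        __sync_synchronize();              /* read the payload only after the status */

        /* tp_mac is the offset of the link-layer header within the frame. */
        handle((const unsigned char *)hdr + hdr->tp_mac, hdr->tp_snaplen, ctx);

        __sync_synchronize();              /* finish reading before releasing */
        hdr->tp_status = TP_STATUS_KERNEL; /* mark the frame as free for reuse */

        *next = (*next + 1) % frame_count;
        processed++;
    }
    return processed;
}
```

A caller would either spin on drain_rx_ring() for the lowest latency, or call poll() on the packet socket whenever it returns 0.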

Related

How large should UDP or TCP buffer size be on Linux?

I need to write a couple of C++ applications on Linux, one to receive data via UDP and the second TCP.
The only thing I'm unsure about is regarding the buffer.
How do I choose what size buffer?
If I make the buffer large enough, am I guaranteed to avoid scenarios where half of a packet is at the end of my buffer and I need to copy the bytes to the beginning and then receive the remaining half of the packet?
I am going to use the Linux socket API functions if it matters.
If I make the buffer large enough, am I guaranteed to avoid scenarios
where half of a packet is at the end of my buffer and I need to copy
the bytes to the beginning and then receive the remaining half of the
packet?
Based on the above paragraph, I'm going to surmise that the buffer you are referring to is the application-space buffer that you pass into your recv() calls, and not the in-kernel buffer that the networking stack maintains on your application's behalf.
For UDP, the answer is simple: Your buffer needs to be large enough to hold the largest possible datagram you expect to receive. Since UDP datagrams are typically less than 1500 bytes (to avoid fragmentation) and in all cases are <= 65507 bytes (since that is the maximum UDP payload that fits in an IPv4 packet), you can always make your receive buffer 65507 bytes long, or smaller if you want to save a bit on RAM usage.
For TCP, the protocol is stream-based, so the amount of data written into your recv-buffer by a given recv() call is unrelated to packet sizes. Another consequence of TCP being stream-based is that it doesn't do any message-framing -- that means you will have to handle partial messages regardless of how big or small you make your buffer. The only advantage of a larger TCP buffer is that it's a bit more efficient to handle more bytes at a time instead of fewer, again at the cost of using a little more RAM.
If I make the buffer large enough, am I guaranteed to avoid scenarios where half of a packet is at the end of my buffer and I need to copy the bytes to the beginning and then receive the remaining half of the packet?
For TCP: It doesn't matter. Packets are an implementation detail. The application doesn't even have to think about them. TCP is a byte-stream protocol and all you ever get from the API is a stream of bytes. Message boundaries are never preserved.
For UDP: Packets are still an implementation detail. You send and receive datagrams. Your read function always gets an entire datagram so long as your buffer is as large as the largest datagram your application protocol supports.
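As a rough illustration of the UDP case, the sketch below receives exactly one datagram per call and reports whether the buffer was too small. The helper name recv_one_datagram and the constant MAX_UDP_PAYLOAD are made up for this example; recvmsg() and the MSG_TRUNC flag are the standard socket API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

enum { MAX_UDP_PAYLOAD = 65507 };  /* largest UDP payload over IPv4 */

/* Receive one datagram into buf (capacity cap).
 * Returns the number of bytes stored, or -1 on error.
 * *truncated is set to 1 if the datagram was larger than cap,
 * in which case the excess bytes have been discarded. */
static ssize_t recv_one_datagram(int fd, void *buf, size_t cap, int *truncated)
{
    struct iovec iov;
    struct msghdr msg = { 0 };
    ssize_t n;

    assert(buf != NULL && truncated != NULL);

    iov.iov_base = buf;
    iov.iov_len = cap;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    n = recvmsg(fd, &msg, 0);
    if (n >= 0)
        *truncated = (msg.msg_flags & MSG_TRUNC) != 0;
    return n;
}
```

With a buffer of MAX_UDP_PAYLOAD bytes the truncated flag should never be set for IPv4 traffic; with a smaller buffer it tells you that the application protocol's maximum size was underestimated.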

how to design a server for variable size messages

I want some feedback or suggestions on how to design a server that handles variable-size messages.
To simplify the answer, let's assume:
single thread, epoll() based
the protocol is: data-size + data
data is stored in a ring buffer
The read code, with some simplification for clarity, looks like this:
if (client->readable) {
    if (client->remaining > 0) {
        /* SIMPLIFIED FOR CLARITY - assume we are always able to read 1+ bytes */
        rd = read(client->sock, client->buffer, client->remaining);
        client->buffer += rd;
        client->remaining -= rd;
    } else {
        /* SIMPLIFIED FOR CLARITY - assume we are always able to read 4 bytes */
        read(client->sock, &(client->remaining), 4);
        client->buffer = acquire_ringbuf_slot(client->remaining);
    }
}
Please do not focus on the 4 bytes. Just assume we have the data size at the beginning; whether it is compressed or not makes no difference for this discussion.
Now, the question is: what is the best way to do the above?
Assume both small "data" (a few bytes) and large data (MBs).
How can we reduce the number of read() calls? E.g. if we have 4 messages of 16 bytes on the stream, it seems a waste to make 8 calls to read().
Are there better alternatives to this design?
Part of the solution depends on the transport layer protocol you use.
I assume you are using TCP, which provides connection-oriented and reliable communication.
From your code I assume you understand that TCP is a stream-oriented protocol
(so when a client sends a piece of data, that data is stored in the socket send buffer and TCP may use one or more TCP segments to convey it to the other end, the server).
So the code looks very good so far (assuming you have error checks and other things in the real code).
Now for your questions, here are my responses, based on what I think is best from my experience (but there could be better solutions):
1-This is a solution with challenges similar to how an OS manages memory, dealing with fragmentation.
For handling different message sizes, you have to understand there are always trade-offs depending on your performance goals.
One solution to improve memory utilization and parallelization is to have a list of free buffer chunks of a certain size, say 4KB.
You retrieve as many as you need to store your received message. The last one will have unused space, so you trade this off against internal fragmentation.
The drawback could be when you need to apply a certain type of processing (maybe a visitor pattern) on the message, like parsing/routing/transformation/etc. It will be more complex and less efficient than with a huge buffer of contiguous memory. On the other hand, the drawbacks of a huge buffer are much less efficient memory utilization, memory bottlenecks, and less parallelization.
You can implement something smarter in the middle (think about chunks that could also be contiguous whenever available), always depending on your goals. Something useful is to implement an abstraction over the fragmented memory so that every function (or visitor) that is applied works as if it were dealing with contiguous memory.
If you use these chunks, then once the message has been processed and dropped/forwarded/eaten/whatever, you return the chunks that are no longer in use to the list of free chunks.
2-The number of read calls will depend on how fast TCP conveys the data from client to server. Remember this is stream oriented and you don't have much control over it. Of course, I'm assuming you try to read the max possible (remaining) data in each read.
If you use the chunks I mentioned above the max data to read will also depend on the chunk size.
Something you can do at the TCP layer is to increase the server's receive buffer. That way, it can receive more data even when the server cannot read it fast enough.
3-The ring buffer is OK; if you use chunks, the ring buffer should provide the abstraction. But I don't know why you need a ring buffer.
I like ring buffers because there is a way of implementing producer-consumer synchronization without locking (Linux Kernel uses this for moving packets from L2 to IP layer) but I don't know if that's your goal.
To pass messages to other components and/or upper-layers you could also use ring buffers of pointers to messages.
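The free list of fixed-size chunks from point 1 could look roughly like the sketch below. It is a minimal single-threaded illustration, not the answerer's implementation: the names chunk, chunk_pool, chunk_chain_get and chunk_chain_put are invented here, and the 4KB size is taken from the example figure in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { CHUNK_SIZE = 4096 };            /* example chunk size from the text */

struct chunk {
    struct chunk *next;                /* next chunk of the same message, or next free chunk */
    size_t used;                       /* bytes of data[] currently holding message data */
    unsigned char data[CHUNK_SIZE];
};

struct chunk_pool {
    struct chunk *free_list;           /* singly linked list of reusable chunks */
};

/* Number of chunks in a chain (used here only for inspection). */
static size_t chunk_chain_length(const struct chunk *c)
{
    size_t n = 0;
    for (; c != NULL; c = c->next)
        n++;
    return n;
}

/* Return every chunk of a message to the free list. */
static void chunk_chain_put(struct chunk_pool *pool, struct chunk *head)
{
    while (head != NULL) {
        struct chunk *next = head->next;
        head->next = pool->free_list;
        pool->free_list = head;
        head = next;
    }
}

/* Take one chunk from the free list, falling back to malloc(). */
static struct chunk *chunk_get(struct chunk_pool *pool)
{
    struct chunk *c = pool->free_list;
    if (c != NULL)
        pool->free_list = c->next;
    else
        c = malloc(sizeof *c);
    if (c != NULL) {
        c->next = NULL;
        c->used = 0;
    }
    return c;
}

/* Build a chain with room for len bytes; the last chunk may be partly unused. */
static struct chunk *chunk_chain_get(struct chunk_pool *pool, size_t len)
{
    struct chunk *head = NULL, **tail = &head;
    size_t needed = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;

    assert(pool != NULL);
    if (needed == 0)
        needed = 1;
    for (size_t i = 0; i < needed; i++) {
        struct chunk *c = chunk_get(pool);
        if (c == NULL) {               /* out of memory: give back what we took */
            chunk_chain_put(pool, head);
            return NULL;
        }
        *tail = c;
        tail = &c->next;
    }
    return head;
}
```

A 10000-byte message would occupy three chunks here, with the unused tail of the third one being the internal fragmentation mentioned above.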
A better design may be as follows:
Set up your user-space socket read buffer to be the same size as the kernel socket buffer. If your user-space socket read buffer is smaller, then you would need more than one read syscall to read the kernel buffer. If your user-space buffer is bigger, then the extra space is wasted.
Your read function should simply read as much data as possible in one read syscall. This function must not know anything about the protocol. This way you do not need to re-implement this function for different wire formats.
When your read function has read into the user-space buffer it should call a callback passing the iterators to the data available in the buffer. That callback is a parser function that should extract all available complete messages and pass these messages to another higher-level callback. Upon return the parser function should return the number of bytes consumed, so that these bytes can be discarded from the user-space socket buffer.
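A parser callback of the kind described above might be sketched like this for the "data-size + data" protocol from the question. The 4-byte big-endian length prefix is an assumption (the question does not specify the byte order), and parse_messages, message_handler, message_stats and count_message are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical higher-level callback: receives one complete message payload. */
typedef void (*message_handler)(const unsigned char *payload, uint32_t len, void *ctx);

/* Example handler state (assumption): counts messages and payload bytes. */
struct message_stats {
    size_t count;
    size_t bytes;
};

static void count_message(const unsigned char *payload, uint32_t len, void *ctx)
{
    struct message_stats *stats = ctx;
    (void)payload;
    stats->count++;
    stats->bytes += len;
}

/* Extract every complete message from buf[0..avail).
 * Assumed wire format: 4-byte big-endian length, then that many payload bytes.
 * Returns the number of bytes consumed; the caller discards them and keeps
 * any trailing partial message for the next read. */
static size_t parse_messages(const unsigned char *buf, size_t avail,
                             message_handler on_message, void *ctx)
{
    size_t pos = 0;

    assert(buf != NULL || avail == 0);

    while (avail - pos >= 4) {
        uint32_t len = ((uint32_t)buf[pos] << 24) |
                       ((uint32_t)buf[pos + 1] << 16) |
                       ((uint32_t)buf[pos + 2] << 8) |
                       (uint32_t)buf[pos + 3];

        if (avail - pos - 4 < len)
            break;                      /* payload not fully received yet */

        on_message(buf + pos + 4, len, ctx);
        pos += 4 + (size_t)len;
    }
    return pos;
}
```

With this shape, four 16-byte messages that arrive together cost one read() plus one pass over the buffer rather than eight read() calls. A message larger than the user-space buffer would still need separate handling, for example by growing the buffer.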

C - Proper size for a buffer to be sent via TCP

I'm writing a C client-server application.
The two sides exchange char buffer in order to communicate.
What is the proper size for these buffers?
Is there a limit on the number of bytes readable (or writable) by a read() (or a write()) on a stream-oriented socket?
Provided you write the code correctly, there is no limit as long as the connection is maintained. That's what a stream connection means.
Just remember that write() and read() can both return before they have written/read all of the data you provided/asked for. In that case the return value tells you how much was written/read, and it's your responsibility to call the function again to write/read any more.
It depends on whether you are aiming for high throughput or low latency: big buffers for high throughput and small buffers for low latency. Note also that when you send a buffer of x bytes, the read and write functions do not guarantee to transfer all x bytes. Make sure to check the return value to see how many bytes were sent/received, and continue sending/receiving the rest (this is often done with a while loop until you have sent/received the whole buffer of size x).
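The while loop mentioned in both answers could be written along these lines. This is a sketch for blocking descriptors; the helper names write_full and read_full are made up for this example, and retrying on EINTR is a common convention rather than something the answers state.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep calling write() until all len bytes are written.
 * Returns 0 on success, -1 on error (errno is set by write()). */
static int write_full(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }
        p += n;                         /* skip the part already written */
        len -= (size_t)n;
    }
    return 0;
}

/* Keep calling read() until exactly len bytes have arrived.
 * Returns 1 on success, 0 if the peer closed the connection first,
 * -1 on error (errno is set by read()). */
static int read_full(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == 0)
            return 0;                   /* end of stream */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 1;
}
```

Both helpers work on any stream descriptor, so they can be tried on a pipe before being used on a TCP socket.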

Socket and buffers

I know the standard C library functions fwrite and fread are a sort of buffering wrapper around the write and read system calls; buffers are used for performance reasons, which I totally understand.
What I don't understand is the role of buffers in the socket programming functions write and read.
Can you help me understand what they are used for, highlighting the differences and similarities with file buffers?
I'm a newbie in socket programming...
When the kernel receives packets, it has to put the data somewhere, so it stores it in a buffer. When your app does the next read, it can fetch the data from that buffer. With a UDP socket, if your app doesn't read from the buffer, it fills up and the kernel starts to drop the received packets. With a TCP connection, the kernel acknowledges data as long as there is free space in the buffer; after that it advertises a zero receive window to tell the sender that it cannot accept more.
Write buffers are necessary because the network interface is a scarce resource; the kernel typically cannot send a packet immediately. If you do a big write(), it could be chopped up into hundreds of packets, so the kernel stores that data in the buffers. The buffer also helps if you do a lot of small writes; see Nagle's algorithm.
Imagine if you sent your information one byte at a time. You'd be generating a packet with at least 40 bytes of TCP/IP headers to send 1 byte and, if it's a TCP connection, depending on the implementation, waiting for an ACK before sending more. Sounds pretty inefficient to me.
Instead, you use a buffer to store up a large amount of data and send that across in a single packet, just like storing up data before writing to disk.
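The size of the kernel-side receive buffer described above can be inspected and adjusted per socket with the SO_RCVBUF option. The sketch below shows one way to do that; get_rcvbuf and set_rcvbuf are hypothetical helper names, while getsockopt(), setsockopt() and SO_RCVBUF are the standard socket API. On Linux the kernel doubles the requested value for bookkeeping and caps it at a system-wide limit, so the effective size is read back rather than assumed.

```c
#include <assert.h>
#include <sys/socket.h>

/* Return the kernel receive-buffer size of a socket in bytes, or -1 on error. */
static int get_rcvbuf(int fd)
{
    int size = 0;
    socklen_t optlen = sizeof size;

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &optlen) != 0)
        return -1;
    return size;
}

/* Request a receive buffer of the given size.
 * Returns the size the kernel actually applied, or -1 on error. */
static int set_rcvbuf(int fd, int bytes)
{
    assert(bytes > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes) != 0)
        return -1;
    return get_rcvbuf(fd);              /* read back the effective value */
}
```

The send side works the same way with SO_SNDBUF.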

How can I buffer non-blocking IO?

When I need buffered IO on a blocking file descriptor, I use stdio. But if I turn the file descriptor into non-blocking mode, then according to the manual stdio buffering is unusable. After some research I see that BIO may be usable for buffering non-blocking IO.
But maybe there are other alternatives?
I need this to avoid using threads in a multi-connection environment.
I think what you are talking about is the Reactor Pattern. This is a pretty standard way of processing lots of network connections without threads, and is very common in multiplayer game server engines. Another implementation (in Python) is Twisted (Twisted Matrix).
The basic algorithm is:
have a buffer for each socket
check which sockets are ready to read (select(), poll(), or just iterate)
for each socket:
call recv() and accumulate the contents into the socket's buffer until recv returns 0 or an error with EWOULDBLOCK
call application level data handler for the socket with the contents of the buffer
clear the socket's buffer
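The per-socket step of the algorithm above ("call recv() and accumulate until it returns 0 or EWOULDBLOCK") might look like the following sketch. The readiness check with select() or poll() and the application-level handler are left out; conn_buf, drain_socket, set_nonblocking and the DRAIN_* result codes are names invented for this illustration, and the 4096-byte buffer size is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* One accumulation buffer per socket (size chosen arbitrarily here). */
struct conn_buf {
    unsigned char data[4096];
    size_t len;                         /* bytes accumulated so far */
};

enum drain_result {
    DRAIN_ERROR = -1,                   /* recv() failed, see errno */
    DRAIN_CLOSED = 0,                   /* peer closed the connection */
    DRAIN_WOULD_BLOCK = 1,              /* everything available has been read */
    DRAIN_FULL = 2                      /* buffer full, handle data and call again */
};

/* Put a descriptor into non-blocking mode. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Accumulate everything currently readable on a non-blocking socket. */
static enum drain_result drain_socket(int fd, struct conn_buf *cb)
{
    assert(cb != NULL && cb->len <= sizeof cb->data);

    for (;;) {
        ssize_t n;

        if (cb->len == sizeof cb->data)
            return DRAIN_FULL;

        n = recv(fd, cb->data + cb->len, sizeof cb->data - cb->len, 0);
        if (n > 0) {
            cb->len += (size_t)n;
            continue;
        }
        if (n == 0)
            return DRAIN_CLOSED;
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return DRAIN_WOULD_BLOCK;
        return DRAIN_ERROR;
    }
}
```

After drain_socket() returns, the caller would pass cb->data and cb->len to its data handler and then reset cb->len to 0, matching the last two steps of the list.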
I see the question has been edited now, and is at least more understandable than before.
Anyway, isn't this a contradiction?
You make I/O non-blocking because you want to be able to read small amounts quickly, typically sacrificing throughput for latency.
You make it buffered because you don't care that much about latency, but want to make efficient use of the I/O subsystem by trading latency for throughput.
Doing them both at the same time seems like a contradiction, and is hard to imagine.
What are the semantics you're after? If you do this:
int fd;
char buf[1024];
ssize_t got;
fd = setup_non_blocking_io(...);
got = read(fd, buf, sizeof buf);
What behavior do you expect if there are 3 bytes available? Blocking/buffered I/O might block until it is able to read more to satisfy your request; non-blocking I/O would return the 3 available bytes immediately.
Of course, if you have some protocol on top, that defines some kind of message structure so that you can know that "this I/O is incomplete, I can't parse it until I have more data", you can buffer it yourself at that level, and not pass data on upwards until a full message has been received.
Depending on the protocol, it is certainly possible that you will need to buffer your reads for a non-blocking network node (client or server).
Typically, these buffers provide multiple indexes (offsets) that record both the position of the last byte processed and the position of the last byte read (which is the same as or greater than the processed offset). They also (should) provide richer semantics such as compacting the buffer, transparent buffer size management, etc.
In Java (at least) the non-blocking network io (NIO) packages also provide a set of data structures (ByteBuffer, etc.) that are geared towards providing a general data structure.
Either such data structures exist for C, or you must roll your own. Once you have one, simply read as much data as is available and let the buffer manage issues such as overflow (e.g. reading bytes across message frame boundaries), and use the marker offset to mark off the bytes that you have processed.
As Android pointed out, you will (very likely) need to create matched buffers for each open connection.
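A home-grown buffer with the two offsets and the compaction step described in this answer could be sketched as follows. It is only an illustration under assumptions: io_buffer and its functions are invented names, the capacity is fixed at 8192 bytes instead of being managed transparently, and io_buffer_append() copies from memory where a real program would recv() directly into the free space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct io_buffer {
    unsigned char data[8192];
    size_t processed;   /* offset just past the last byte already processed */
    size_t filled;      /* offset just past the last byte read so far */
};

/* Bytes that have been read but not yet processed. */
static size_t io_buffer_pending(const struct io_buffer *b)
{
    return b->filled - b->processed;
}

/* Free space left at the end of the buffer. */
static size_t io_buffer_space(const struct io_buffer *b)
{
    return sizeof b->data - b->filled;
}

/* Mark n pending bytes as processed (e.g. after parsing a full message). */
static void io_buffer_consume(struct io_buffer *b, size_t n)
{
    assert(n <= io_buffer_pending(b));
    b->processed += n;
    if (b->processed == b->filled)      /* nothing pending: rewind for free */
        b->processed = b->filled = 0;
}

/* Move the unprocessed bytes to the front to reclaim the space before them. */
static void io_buffer_compact(struct io_buffer *b)
{
    size_t pending = io_buffer_pending(b);

    if (b->processed > 0 && pending > 0)
        memmove(b->data, b->data + b->processed, pending);
    b->processed = 0;
    b->filled = pending;
}

/* Append up to n bytes, compacting first if needed. Returns bytes stored. */
static size_t io_buffer_append(struct io_buffer *b, const void *src, size_t n)
{
    if (n > io_buffer_space(b))
        io_buffer_compact(b);
    if (n > io_buffer_space(b))
        n = io_buffer_space(b);
    memcpy(b->data + b->filled, src, n);
    b->filled += n;
    return n;
}
```

One such struct per open connection keeps partial messages from different peers apart.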
You could create a struct with buffers for each open file descriptor, then accumulate data in these buffers until recv() returns 0 or you have enough data in your buffer to process.
If I understand your question correctly, you can't buffer because with non-blocking you're writing to the same buffer with multiple connections (if global) or just writing small pieces of data (if local).
In any case, your program has to be able to identify where the data is coming from (possibly by file descriptor) and buffer it accordingly.
Threading is also an option; it's not as scary as many make it out to be.
Ryan Dahl's evcom library does exactly what you want.
I use it in my job and it works great. Be aware, though, that it doesn't (yet, but coming soon) have async DNS resolving. Ryan suggests udns by Michael Tokarev for that. I'm trying to adopt udns instead of blocking getaddrinfo() now.

Resources