Buffering characteristics of Unix sockets (C)

Does anyone know the buffering characteristics of Unix sockets when sending small chunks of data (a few bytes)?
When I am using TCP sockets, I can disable the Nagle algorithm to prevent latency in data transit. But there's no equivalent functionality (that I know of) for Unix Domain sockets.

There is no Nagle algorithm on Unix domain sockets.
Unix sockets are normally implemented as a memory buffer in the operating system kernel. Once you've written/sent data on the socket, it is copied into that buffer, and becomes immediately available to the peer.
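For reference, here is a minimal sketch of what disabling Nagle looks like on a TCP socket (the option the question refers to); the helper name is illustrative, and there is no corresponding option for an AF_UNIX socket, since writes there land directly in the kernel buffer described above.

    /* Sketch: disable the Nagle algorithm on a TCP socket via TCP_NODELAY.
     * No equivalent option exists for AF_UNIX sockets. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>   /* TCP_NODELAY */
    #include <sys/socket.h>

    int disable_nagle(int tcp_fd) {
        int one = 1;
        return setsockopt(tcp_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }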

Related

Working of Raw Sockets in the Linux kernel

I'm working on integrating the traffic control layer of the Linux kernel with a custom user-level network stack, and I'm using raw sockets to do this. My question is: if we use raw sockets with AF_PACKET, SOCK_RAW, and IPPROTO_RAW, will dev_queue_xmit (the function which, as far as I've read, is the entry point of the queueing layer) be called? Or does the sockets interface call the network card driver directly?
SOCK_RAW indicates that the userspace program should receive the L2 (link-layer) header in the message.
IPPROTO_RAW applies the same for the L3 (IP) header.
A userspace program uses SOCK_RAW and/or IPPROTO_RAW when it wants to parse and/or compose protocol headers itself. This guarantees that the kernel does not modify the corresponding layer's header on the way to/from userspace. A raw socket does not change how packets are received or transmitted - they are queued as usual. From the network driver's perspective, it does not matter who set the headers: the userspace program (raw sockets) or the kernel (e.g., SOCK_DGRAM).
Keep in mind that working with raw packets requires the CAP_NET_RAW capability - usually, the program needs to run with superuser privileges.
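As a rough illustration of the above (a sketch, not part of the original answer): an AF_PACKET + SOCK_RAW socket delivers complete L2 frames to userspace and, as noted, needs CAP_NET_RAW.

    /* Sketch: receive whole Ethernet frames on an AF_PACKET/SOCK_RAW socket.
     * Requires CAP_NET_RAW (typically run as root). */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>        /* htons */
    #include <linux/if_ether.h>   /* ETH_P_ALL */

    int main(void) {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        unsigned char frame[2048];
        ssize_t n = recv(fd, frame, sizeof(frame), 0);  /* frame begins with the L2 header */
        if (n > 0)
            printf("received %zd bytes\n", n);
        close(fd);
        return 0;
    }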

zero copy udp socket using sendfile instead of sendto

I'm working with UDP sockets in a real-time environment. I am currently using the standard socket function sendto(), which takes a relatively long time. I read that it is possible to use zero copy, which, if I understand correctly, avoids the extra time spent copying data between user space and the kernel. However, I see that sendfile() only allows copying from one file descriptor to another, and I can't see how to use that to send UDP packets, whose payload in my case is a buffer. So my questions are:
Is it even possible to use sendfile() to send UDP packets?
If so, what is the correct way of doing this?
Edit
I am working on a real-time platform where I have several operations plus the send over the socket, and together they must not take more than 1 ms. I tried three machines: the first has 4 cores at 3.4 GHz, the second 8 cores at 2.3 GHz, and the last 4 cores at 1.4 GHz. On the first it takes less than 1 µs to send a 720-byte packet, while on the other two it takes between 6 and 9 µs. I'm using a Linux low-latency kernel and have disabled all CPU power-management features, so all CPUs run at maximum frequency.
I noticed that if the time taken by sendto() exceeds 6 µs, the platform simply does not work. One other detail: I have several threads running in parallel, so maybe the CPU is just processing other threads while sendto() has not yet completed. I'm wondering if it is possible for sendto() to be interrupted partway through so something else can run.
This is why I was trying to find other ways to optimize elsewhere, and I thought that using sendfile() might save some additional time.
I am not sure whether sendfile() works with UDP sockets; however, memfd_create() creates a file descriptor backed by memory and could, in theory, avoid the copy from user space to the kernel.
Still, when sending, the kernel has to copy the data into the kernel socket buffer first, because it needs to prepend UDP, IP, and Ethernet headers to the user data, which cannot be done in place. This copy cannot be avoided even when using sendfile().
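A rough sketch of the memfd_create() idea (the helper name is made up, it assumes a recent glibc, and whether sendfile() accepts a datagram socket as the output descriptor is kernel-dependent, so treat this as an experiment rather than a known-good path):

    /* Sketch: stage the payload in a memory-backed fd, then hand it to
     * sendfile() with a connected UDP socket as the destination.
     * sendfile() may well refuse a datagram socket (EINVAL) - this only
     * illustrates the idea discussed above. */
    #define _GNU_SOURCE
    #include <sys/mman.h>       /* memfd_create (glibc >= 2.27) */
    #include <sys/sendfile.h>
    #include <unistd.h>

    ssize_t send_via_memfd(int udp_fd, const void *buf, size_t len) {
        int mfd = memfd_create("payload", 0);
        if (mfd < 0) return -1;
        if (write(mfd, buf, len) != (ssize_t)len) { close(mfd); return -1; }

        off_t off = 0;
        ssize_t sent = sendfile(udp_fd, mfd, &off, len);
        close(mfd);
        return sent;
    }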
For real zero-copy networking, you may want to have a look at the PF_RING ZC (Zero Copy) drivers:
On-Demand Kernel Bypass with PF_RING Aware Drivers
PF_RING™ ZC comes with a new generation of PF_RING™ aware drivers that can be used both in kernel or bypass mode. Once installed, the drivers operate as standard Linux drivers where you can do normal networking (e.g. ping or SSH). When used from PF_RING™ they are quicker than vanilla drivers, as they interact directly with it. If you open a device using a PF_RING-aware driver in zero copy (e.g. pfcount -i zc:eth1) the device becomes unavailable to standard networking as it is accessed in zero-copy through kernel bypass, as happened with the predecessor DNA. Once the application accessing the device is closed, standard networking activities can take place again.

Low latency packet processing with shared memory on Linux?

If I was to receive UDP packets on Linux (and I didn't mind changing some of the source code) what would be the fastest way for my application to read the packets?
Would I want to modify the network stack so that once a UDP packet is received it is written to shared memory and have the application access that memory?
Would there be any way for the stack to notify the application to react, rather than have the application continuously poll the shared memory?
Any advice/further resources are welcome - I have only seen:
http://www.kegel.com/c10k.html
If latency is a problem and the default UDP network stack does not perform as you wish, try a different existing (installable) network stack.
For example, try UDP-Lite: compared to the standard UDP stack, it lets you reduce checksum coverage to little more than the header, reducing latency at the cost of possibly delivering corrupted payloads to the application layer.
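A minimal sketch of what that looks like on Linux, assuming the kernel was built with UDP-Lite support (the 8-byte coverage value, i.e. header only, is just an illustration):

    /* Sketch: open a UDP-Lite socket and limit the checksum to the first
     * 8 bytes, so payload corruption is passed up to the application
     * rather than causing the datagram to be dropped. */
    #include <sys/socket.h>
    #include <netinet/in.h>        /* IPPROTO_UDPLITE */
    #include <netinet/udplite.h>   /* UDPLITE_SEND_CSCOV, UDPLITE_RECV_CSCOV */

    int open_udplite_socket(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDPLITE);
        if (fd < 0) return -1;
        int cov = 8;  /* checksum coverage in bytes */
        setsockopt(fd, IPPROTO_UDPLITE, UDPLITE_SEND_CSCOV, &cov, sizeof(cov));
        setsockopt(fd, IPPROTO_UDPLITE, UDPLITE_RECV_CSCOV, &cov, sizeof(cov));
        return fd;
    }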
Side note: you do not need a polling mechanism. Read the manual for select() (and its relatives such as pselect() or ppoll()); with such an API, the kernel will wake your application up as soon as there is something to read or write.
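For example, a small sketch of waiting on a socket with select() so the process sleeps until data arrives instead of polling:

    /* Sketch: block in select() until the socket is readable, then recv().
     * With a NULL timeout, the kernel wakes the process only when data is ready. */
    #include <sys/select.h>
    #include <sys/socket.h>

    ssize_t wait_and_recv(int fd, void *buf, size_t len) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        if (select(fd + 1, &rfds, NULL, NULL, NULL) <= 0)
            return -1;
        return recv(fd, buf, len, 0);
    }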

Does recv() remove packets from pcap's buffer?

Say there are two programs running on a computer (for simplicity, the only user programs running on Linux), one of which calls recv(), and one of which uses pcap to detect incoming packets. A packet arrives, and it is detected both by the program using pcap and by the program using recv(). But is there any case (for instance, recv() returning between calls to pcap_next()) in which one of these two will not get the packet?
I really don't understand how the buffering works here, so the more detailed the explanation, the better: is there any conceivable case in which one of these programs would see a packet that the other does not? And if so, what is it and how can I prevent it?
AFAIK, there do exist cases where one would receive the data and the other wouldn't (both ways). It's possible that I've gotten some of the details wrong here, but I'm sure someone will correct me.
Pcap uses different mechanisms to sniff on interfaces, but here's how the general case works:
A network card receives a packet (the driver is notified via an interrupt)
The kernel places that packet into appropriate listening queues: e.g.,
The TCP stack.
A bridge driver, if the interface is bridged.
The interface that PCAP uses (a raw socket connection).
Those buffers are flushed independently of each other:
As TCP streams are assembled and data delivered to processes.
As the bridge sends the packet to the appropriate connected interfaces.
As PCAP reads received packets.
I would guess that there is no hard way to guarantee that both programs receive every packet. That would require blocking on a buffer when it is full (which could lead to starvation, deadlock, and all kinds of problems). It may be possible with interconnects other than Ethernet, but the general philosophy there is best effort.
Unless the system is under heavy load, however, I would say that loss rates would be quite low and most packets would be received by both. You can decrease the risk of loss by increasing the buffer sizes; a quick Google search turns up ways to do that, and I'm sure there are a million more.
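As a sketch of the two obvious knobs (the 4 MiB value is arbitrary, and pcap_set_buffer_size() must be called on a handle created with pcap_create() before pcap_activate()):

    /* Sketch: enlarge the socket's kernel receive buffer and libpcap's
     * capture buffer to reduce drops under load. */
    #include <sys/socket.h>
    #include <pcap/pcap.h>

    void enlarge_buffers(int sock_fd, pcap_t *not_yet_activated) {
        int bytes = 4 * 1024 * 1024;  /* 4 MiB, illustrative value */
        setsockopt(sock_fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
        pcap_set_buffer_size(not_yet_activated, bytes);
    }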
If you need hard guarantees, I think a more powerful model of the network is needed. I've heard great things about Netgraph for these kinds of tasks. You could also just install a physical box that inspects packets (the hardest guarantee you can get).

Is sending data via UDP sockets on the same machine reliable?

If I use UDP sockets for interprocess communication, can I expect that all sent data is received by the other process, in the same order it was sent?
I know this is not true for UDP in general.
No. I have been bitten by this before. You may wonder how it can possibly fail, but you'll run into issues of buffers of pending packets filling up, and consequently packets will be dropped. How the network subsystem drops packets is implementation-dependent and not specified anywhere.
In short, no. You shouldn't be making any assumptions about the order of data received on a UDP socket, even over localhost. It might work, it might not, and it's not guaranteed to.
No, there is no such guarantee, even with local sockets. If you want an IPC mechanism that guarantees in-order delivery, you might look into using full-duplex pipes with popen(). This opens a pipe to the child process that either end can read from or write to arbitrarily. It guarantees in-order delivery and can be used with synchronous or asynchronous I/O (select() or poll()), depending on how you want to build the application.
On Unix there are other options, such as Unix domain sockets or System V message queues (some of which may be faster), but reading from and writing to a pipe is dead simple and works. As a bonus, it's easy to test your server process, because it is just reading and writing from stdio.
On Windows you could look into Named Pipes, which work somewhat differently from their Unix namesake but are used for precisely this sort of interprocess communication.
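A minimal sketch of the pipe approach (shown with pipe() and fork() rather than popen(), but the in-order, reliable delivery is the same):

    /* Sketch: a pipe between parent and child delivers bytes reliably and
     * in order, unlike loopback UDP. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) < 0) return 1;

        if (fork() == 0) {                    /* child: writer */
            close(fds[0]);
            const char *msg = "message 1\nmessage 2\n";
            write(fds[1], msg, strlen(msg));  /* arrives intact and in order */
            _exit(0);
        }
        close(fds[1]);                        /* parent: reader */
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        if (n > 0) { buf[n] = '\0'; fputs(buf, stdout); }
        return 0;
    }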
Loopback UDP is incredibly unreliable on many platforms, you can easily see 50%+ data loss. Various excuses have been given to the effect that there are far better transport mechanisms to use.
There are many middleware stacks available these days to make IPC easier to use and cross platform. Have a look at something like ZeroMQ or 29 West's LBM which use the same API for intra-process, inter-process (IPC), and network communications.
The socket interface will probably not apply flow control to the originator of the data, so you will probably see reliable transmission if you have higher-level flow control, but there is always the possibility that a memory crunch could still cause a dropped datagram.
Without flow control limiting kernel memory allocation for datagrams, I imagine it will be just as unreliable as network UDP.
