If I were to receive UDP packets on Linux (and I didn't mind changing some of the source code), what would be the fastest way for my application to read the packets?
Would I want to modify the network stack so that once a UDP packet is received it is written to shared memory and have the application access that memory?
Would there be any way for the stack to notify the application to react, rather than have the application continuously poll the shared memory?
Any advice/further resources are welcome; I have only seen:
http://www.kegel.com/c10k.html
If latency is a problem and the default UDP network stack does not perform as you wish, then try a different existing (installable) network stack.
For example, try UDP-Lite and compare it to the standard UDP stack: UDP-Lite lets you reduce the checksum coverage on the UDP datagram to as little as the header, lowering latency at the cost of possibly delivering corrupted data to the application layer.
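To make that concrete, here is a minimal UDP-Lite sender sketch for Linux. It assumes your kernel has UDP-Lite support and that UDPLITE_SEND_CSCOV is available in <netinet/udp.h> (udplite(7) documents the option value as 10 if your headers lack it); the port number is just an example.

    /* Minimal UDP-Lite sender sketch (Linux). Assumes kernel UDP-Lite support;
     * UDPLITE_SEND_CSCOV comes from <netinet/udp.h> (value 10 per udplite(7)). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDPLITE);
        if (fd < 0) { perror("socket"); return 1; }

        /* Checksum only the first 8 bytes (the UDP-Lite header) of each datagram;
         * the payload is delivered even if it arrives corrupted. */
        int cscov = 8;
        if (setsockopt(fd, IPPROTO_UDPLITE, UDPLITE_SEND_CSCOV, &cscov, sizeof cscov) < 0)
            perror("setsockopt(UDPLITE_SEND_CSCOV)");

        struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(9999) };
        dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        const char msg[] = "hello";
        sendto(fd, msg, sizeof msg, 0, (struct sockaddr *)&dst, sizeof dst);
        close(fd);
        return 0;
    }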
Side note: you do not need a "polling" mechanism. Read the manual for select (and its derivatives like pselect or ppoll); with such an API, the kernel will "wake up" your application as soon as there is something to read or write in the pipeline.
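For illustration, a minimal sketch of a UDP receiver that blocks in select() instead of polling (port 9999 is an arbitrary choice):

    /* Sketch: block in select() until the UDP socket is readable; no busy polling. */
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(9999),
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        bind(fd, (struct sockaddr *)&addr, sizeof addr);

        for (;;) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            /* The kernel wakes us up only when a datagram is waiting. */
            if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0) { perror("select"); break; }
            char buf[2048];
            ssize_t n = recv(fd, buf, sizeof buf, 0);
            if (n > 0)
                printf("got %zd bytes\n", n);
        }
        close(fd);
        return 0;
    }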
Assume I would like to avoid the overhead of the Linux kernel in handling incoming packets and instead would like to grab the packets directly from user space. I have googled around a bit and it seems that all I need to do is use raw sockets with some socket options. Is this the case? Or is it more involved than that? And if so, what can I google for or reference in order to implement something like this?
There are many techniques for networking with kernel bypass.
First, if you are sending messages to another process on the same machine, you can do so through a shared memory region with no jumps into the kernel.
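As a rough sketch (synchronization between the two processes is omitted, and the name "/fastpath" is made up), POSIX shared memory looks like this; link with -lrt on older glibc:

    /* Sketch: two processes share a region via POSIX shared memory, so data
     * moves between them without a copy through the kernel's socket path.
     * "/fastpath" is a made-up name; real code also needs synchronization
     * (e.g. a process-shared semaphore), which is omitted here. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHM_NAME "/fastpath"
    #define SHM_SIZE 4096

    int main(void)
    {
        int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return 1;
        if (ftruncate(fd, SHM_SIZE) < 0) return 1;

        char *region = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (region == MAP_FAILED) return 1;

        /* Producer side: write a message; a consumer that maps the same name sees it. */
        strcpy(region, "hello from the producer");

        munmap(region, SHM_SIZE);
        close(fd);
        return 0;
    }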
Passing packets over a network without involving the kernel gets more interesting, and involves specialized hardware that gets direct access to user memory. This idea is called RDMA.
Here's one way it can work (this is what InfiniBand hardware does). The application registers a memory buffer with the RDMA hardware. This buffer is pinned in physical memory, since swapping it out would obviously be bad (the hardware will keep writing to that physical memory region). A control region is also mapped into userspace memory. When an application is ready to use the buffer to send or receive a message, it writes a command to the control region. The hardware takes the data from a registered buffer on one end and places it into another registered buffer at the other end.
Clearly, this is too low level, so there are abstractions that make programming RDMA hardware easier. OFED verbs are one such abstraction.
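As a hedged illustration of the registration step described above, here is a sketch using libibverbs (link with -libverbs). It only registers a buffer; creating queue pairs, exchanging keys, and actually posting sends/receives takes considerably more setup than fits here.

    /* Sketch of the buffer-registration step using OFED/libibverbs.
     * Only registration is shown; queue-pair and connection setup are omitted. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

        size_t len = 4096;
        void *buf = malloc(len);

        /* Registering pins the buffer and hands the HCA keys it can use for DMA. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
        if (!mr) { perror("ibv_reg_mr"); return 1; }
        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }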
The InfiniBand software stack has one extra interesting bit: the Sockets Direct Protocol (SDP) that is used for compatibility with existing applications. It works by inserting an LD_PRELOAD shim that translates standard socket API calls into IB verbs.
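The SDP shim itself ships with the InfiniBand software stack, but the interposition trick is generic. Here is a sketch of how an LD_PRELOAD shim can intercept socket() (this is not libsdp, just an illustration of the mechanism):

    /* Sketch of how an LD_PRELOAD shim intercepts socket(2). This is not the
     * real SDP library, only an illustration of the interposition mechanism. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int socket(int domain, int type, int protocol)
    {
        /* Look up the "real" socket() that this shim is shadowing. */
        int (*real_socket)(int, int, int) =
            (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

        /* An SDP-style shim would redirect AF_INET stream sockets to its own
         * transport here; we just log and fall through. */
        fprintf(stderr, "shim: socket(%d, %d, %d)\n", domain, type, protocol);
        return real_socket(domain, type, protocol);
    }

Build it with something like gcc -shared -fPIC -o shim.so shim.c -ldl and run an unmodified application with LD_PRELOAD=./shim.so ./app.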
InfiniBand is just what I'm most familiar with. RoCE/iWARP hardware is very similar from the programmer's perspective, but uses a different transport than InfiniBand (TCP via an offload engine in iWARP, Ethernet in RoCE). There are/were also other approaches to RDMA (Quadrics, for example).
I can't find the answer to this question:
Is there any benefit/boost to sockets in general on a multi-core machine? I mean, is there some kind of shared access to the queue of packets coming into the kernel from the Ethernet card driver, or something like that?
I understand that at the level of the API calls there can be multiple threads working with one socket instance, but it is up to the programmer to synchronize and use read/write/close/select etc. correctly. So at that level I only see a benefit for working with dispatched packets, post-processing, and so on. Or is there no speed boost until the packet is copied during the system call and transferred to user space?
The benefit of multi-core depends on how much concurrency your algorithm can achieve. Take Ethernet receiving as an example; there are three tasks involved:
1) On receiving a packet, the NIC hardware triggers an interrupt and a CPU handles it in interrupt context.
2) The network stack handles the RX packet via the softirq mechanism. Softirq requests can run concurrently on multiple CPUs. In the network stack's RX function, the network buffer is passed to the socket and the user thread pending on that socket is woken up.
3) The user thread wakes up and continues the user application code that receives or processes the received network packets.
Task 1 can only run on one CPU, task 2 can run on multiple CPUs, and task 3 can also run on multiple CPUs for multi-process or multi-threaded applications (see the sketch below).
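As a rough sketch of task 3 scaling across cores, several worker threads can block in recvfrom() on the same UDP socket and the kernel will hand each datagram to one of them (port 9999 and four workers are arbitrary choices; compile with -pthread):

    /* Sketch: worker threads share one UDP socket; the kernel wakes one of
     * them per datagram, so application-level processing can spread across cores. */
    #include <netinet/in.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    #define NWORKERS 4

    static void *worker(void *arg)
    {
        int fd = *(int *)arg;
        char buf[2048];
        for (;;) {
            ssize_t n = recvfrom(fd, buf, sizeof buf, 0, NULL, NULL);
            if (n > 0)
                printf("thread %lu handled %zd bytes\n",
                       (unsigned long)pthread_self(), n);
        }
        return NULL;
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(9999);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, (struct sockaddr *)&addr, sizeof addr);

        pthread_t tids[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&tids[i], NULL, worker, &fd);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }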
Following on from Greg Inozemtsev's comment: NICs with multiple receive queues can filter incoming traffic into different queues. Each queue can be assigned its own interrupt to signal incoming packets, and each interrupt can be dispatched to a different CPU core.
Linux supports various techniques in the Kernel such as:
RSS: Receive Side Scaling
RPS: Receive Packet Steering
RFS: Receive Flow Steering
aRFS: Accelerated Receive Flow Steering
XPS: Transmit Packet Steering
Let's say all the packets are destined for your machine's IP on port 80, and you have used socket() and listen() to create a socket on port 80. Traffic is coming into your NIC from a variety of source IPs, so it's being hashed into multiple receive queues (meaning the hardware interrupts are being spread across multiple CPU cores thanks to RSS). You can use the native kernel socket option PACKET_FANOUT to spread the load across multiple worker threads within your application, as sketched below.
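A sketch of the PACKET_FANOUT part (requires CAP_NET_RAW; the group id and hash mode are arbitrary choices, and each worker thread or process would call this to open its own member socket):

    /* Sketch: each worker opens its own AF_PACKET socket and joins the same
     * PACKET_FANOUT group; the kernel then hashes incoming flows across the
     * group members. */
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int open_fanout_socket(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket(AF_PACKET)"); return -1; }

        int fanout_group = 42;                       /* same id in every worker */
        int fanout_arg = fanout_group | (PACKET_FANOUT_HASH << 16);
        if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                       &fanout_arg, sizeof fanout_arg) < 0) {
            perror("setsockopt(PACKET_FANOUT)");
            close(fd);
            return -1;
        }
        return fd;   /* each worker now recv()s its share of the traffic */
    }

    int main(void)
    {
        int fd = open_fanout_socket();
        if (fd < 0) return 1;
        char frame[2048];
        ssize_t n = recv(fd, frame, sizeof frame, 0);   /* one raw frame, for demo */
        printf("received %zd bytes\n", n);
        close(fd);
        return 0;
    }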
If you look at 3rd-party libraries such as NetMap, DPDK, and VPP, just as examples, these can all be used to scale up even further by taking you down to zero-copy RX/TX, with the caveat that you need to write some of the network protocol code yourself, depending on which 3rd-party library you use.
There are many, many things to consider here, far too much to cover in one SO question. In answer to your original question: yes. But I have tried to provide a bit of extra information as well.
For reading on RSS/RPS/RFS etc:
https://blog.cloudflare.com/how-to-receive-a-million-packets/
For further reading related to native PACKET_FANOUT and PACKET_MMAP:
http://kukuruku.co/hub/nix/capturing-packets-in-linux-at-a-speed-of-millions-of-packets-per-second-without-using-third-party-libraries
http://yusufonlinux.blogspot.co.uk/2010/11/data-link-access-and-zero-copy.html?m=1
https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
Further reading related to NetMap as an example of a 3rd-party library:
https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/
Further reading on NUMA and affinity:
https://null.53bits.co.uk/index.php?page=numa-and-queue-affinity
https://blog.cloudflare.com/how-to-achieve-low-latency/
I currently have a client listening for packets in its own thread. I was told to try to implement an ISR so that the packet received from the recv() call can be handled immediately, instead of waiting for that thread to get scheduled.
EDIT: this is on Windows now, but it will be ported to a DSP later.
ISRs by definition run in kernel space. Unless you are on an embedded system without memory protection, you will need to add kernel code to your project. Furthermore, to reimplement recv, it will need to handle IP and TCP or UDP as necessary to extract the data from the Ethernet packets.
The overhead of rescheduling and switching to a thread is minimal, and needs to happen anyway unless the packet is handled entirely in the kernel. Most operating systems have a highest-priority thread setting, sometimes called "real-time," which causes user space code to run with minimal delay after the driver receives data. This is often used for audio/video I/O as well as networking.
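For example, on POSIX systems you can place the receive thread in the real-time SCHED_FIFO class (this is just a sketch; it usually needs root or CAP_SYS_NICE, and on Windows the rough equivalent is SetThreadPriority with a time-critical priority). Compile with -pthread:

    /* Sketch: bump the calling thread to a real-time (SCHED_FIFO) priority so
     * it runs with minimal delay after the driver hands data to the kernel. */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void make_realtime(pthread_t thread)
    {
        struct sched_param sp = { .sched_priority = sched_get_priority_max(SCHED_FIFO) };
        int err = pthread_setschedparam(thread, SCHED_FIFO, &sp);
        if (err != 0)
            fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
    }

    int main(void)
    {
        make_realtime(pthread_self());
        /* ... the receive loop would run here ... */
        return 0;
    }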
Say there are two programs running on a computer (for the sake of simplification, the only user programs running on Linux), one of which calls recv(), and one of which is using pcap to detect incoming packets. A packet arrives, and it is detected both by the program using pcap and by the program using recv. But is there any case (for instance, recv() returning between calls to pcap_next()) in which one of these two will not get the packet?
I really don't understand how the buffering system works here, so the more detailed the explanation, the better. Is there any conceivable case in which one of these programs would see a packet that the other does not? And if so, what is it and how can I prevent it?
AFAIK, there do exist cases where one would receive the data and the other wouldn't (both ways). It's possible that I've gotten some of the details wrong here, but I'm sure someone will correct me.
Pcap uses different mechanisms to sniff on interfaces, but here's how the general case works:
A network card receives a packet (the driver is notified via an interrupt)
The kernel places that packet into appropriate listening queues: e.g.,
The TCP stack.
A bridge driver, if the interface is bridged.
The interface that PCAP uses (a raw socket connection).
Those buffers are flushed independently of each other:
As TCP streams are assembled and data delivered to processes.
As the bridge sends the packet to the appropriate connected interfaces.
As PCAP reads received packets.
I would guess that there is no hard way to guarantee that both programs receive every packet. That would require blocking on a buffer when it's full (and that could lead to starvation, deadlock, all kinds of problems). It may be possible with interconnects other than Ethernet, but the general philosophy there is best-effort.
Unless the system is under heavy load, however, I would say that the loss rates would be quite low and that most packets would be received by all. You can decrease the risk of loss by increasing the buffer sizes. A quick Google search turned this up, but I'm sure there are a million more ways to do it.
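For the socket side, here is a sketch of bumping the receive buffer with SO_RCVBUF (the kernel caps the request at net.core.rmem_max, so read the value back to see what you actually got); the pcap side has an analogous knob in pcap_set_buffer_size:

    /* Sketch: ask for a larger socket receive buffer so bursts are less likely
     * to overflow it before the application drains the socket. */
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int want = 4 * 1024 * 1024;          /* request 4 MiB */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof want) < 0)
            perror("setsockopt(SO_RCVBUF)");

        int got = 0;
        socklen_t len = sizeof got;
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
        printf("receive buffer is now %d bytes\n", got);   /* Linux reports it doubled */
        return 0;
    }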
If you need hard guarantees, I think a more powerful model of the network is needed. I've heard great things about Netgraph for these kinds of tasks. You could also just install a physical box that inspects packets (the hardest guarantee you can get).
If I use UDP sockets for interprocess communication, can I expect that all sent data is received by the other process in the same order?
I know this is not true for UDP in general.
No. I have been bitten by this before. You may wonder how it can possibly fail, but you'll run into issues of buffers of pending packets filling up, and consequently packets will be dropped. How the network subsystem drops packets is implementation-dependent and not specified anywhere.
In short, no. You shouldn't be making any assumptions about the order of data received on a UDP socket, even over localhost. It might work, it might not, and it's not guaranteed to.
No, there is no such guarantee, even with local sockets. If you want an IPC mechanism that guarantees in-order delivery, you might look into using full-duplex pipes with popen(). This opens a pipe to the child process that you can read from or write to arbitrarily. It guarantees in-order delivery and can be used with synchronous or asynchronous I/O (select() or poll()), depending on how you want to build the application.
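Here is a sketch of that kind of in-order, full-duplex channel using socketpair(AF_UNIX, SOCK_STREAM) between a parent and a forked child (bidirectional popen(), where supported, is built on similar plumbing):

    /* Sketch: a full-duplex, in-order byte stream between parent and child.
     * Bytes arrive in the order they were written, unlike UDP datagrams. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); return 1; }

        if (fork() == 0) {                       /* child: echo one message back */
            close(sv[0]);
            char buf[64];
            ssize_t n = read(sv[1], buf, sizeof buf);
            if (n > 0)
                write(sv[1], buf, n);
            return 0;
        }

        close(sv[1]);                            /* parent */
        const char msg[] = "in-order message";
        write(sv[0], msg, sizeof msg);
        char reply[64];
        ssize_t n = read(sv[0], reply, sizeof reply);
        if (n > 0)
            printf("child echoed: %s\n", reply);
        wait(NULL);
        return 0;
    }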
On Unix there are other options such as Unix domain sockets or System V message queues (some of which may be faster), but reading from and writing to a pipe is dead simple and works. As a bonus, it's easy to test your server process because it is just reading and writing via stdio.
On Windows you could look into Named Pipes, which work somewhat differently from their Unix namesake but are used for precisely this sort of interprocess communication.
Loopback UDP is incredibly unreliable on many platforms; you can easily see 50%+ data loss. Various excuses have been given, to the effect that there are far better transport mechanisms to use.
There are many middleware stacks available these days to make IPC easier to use and cross platform. Have a look at something like ZeroMQ or 29 West's LBM which use the same API for intra-process, inter-process (IPC), and network communications.
The socket interface will probably not flow-control the originator of the data, so you will probably see reliable transmission if you have higher-level flow control, but there is always the possibility that a memory crunch could still cause a dropped datagram.
Without flow control limiting kernel memory allocation for datagrams, I imagine it will be just as unreliable as network UDP.