Any benefit in sockets from multiple cores? (Linux) - c

I can't find an answer to this question myself:
Is there any benefit or speed boost for sockets in general on a multi-core machine? I mean, is there perhaps some kind of shared access to the queue of incoming packets that the Ethernet card driver hands to the kernel, or something similar?
I understand that at the API level multiple threads can work with one socket instance, but it is up to the programmer to synchronize them and to use read/write/close/select etc. correctly. So at that level I see a benefit only in working with dispatched packets, post-processing, and so on. Or is there no speed boost until the packet has been copied during the system call and transferred to user space?

The benefit of multi-core depends on how much concurrency your algorithm allows. Take Ethernet receiving as an example; there are three tasks involved in it.
1) On receiving a packet, the NIC hardware triggers an interrupt, and a CPU handles it in interrupt context.
2) The network stack handles the RX packet through the software IRQ (softirq) mechanism. Softirq requests can run concurrently on multiple CPUs. In its RX function, the network stack passes the network buffer to the socket and wakes up the user thread pending on that socket.
3) The user thread wakes up and continues running application code to receive or process the received packets.
Task 1) can only run on one CPU, 2) can run on multiple CPUs, and 3) can also run on multiple CPUs for multi-process or multi-threaded applications.
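To make task 3) concrete, here is a minimal sketch (not a benchmark) of several threads blocking on the same UDP socket, so the kernel wakes whichever one is free. The function name `receive_with_workers`, the loopback address and the datagram counts are my own assumptions for illustration.

```python
import socket
import threading

def receive_with_workers(num_workers=4, num_datagrams=50):
    """Drain one UDP socket with several blocking threads; return per-worker counts."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    rx.settimeout(0.5)          # a quiet period ends each worker
    counts = [0] * num_workers

    def worker(idx):
        while True:
            try:
                rx.recvfrom(2048)   # the kernel hands each datagram to exactly one thread
            except socket.timeout:
                return
            counts[idx] += 1        # each worker only touches its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()

    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for _ in range(num_datagrams):
        tx.sendto(b"x", rx.getsockname())
    tx.close()

    for t in threads:
        t.join()
    rx.close()
    return counts
```

Every datagram is delivered to exactly one worker, so the counts add up to the number sent. In CPython the blocking `recvfrom()` releases the interpreter lock, but pure-Python post-processing is still serialized; the same structure written in C with pthreads would let the processing itself run on several cores.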

Following on from @Greg Inozemtsev's comment: NICs with multiple receive queues can filter incoming traffic into different queues. Each queue can be assigned its own interrupt, which can be dispatched to a different CPU core to signal incoming packets.
Linux supports various techniques in the kernel, such as:
RSS: Receive Side Scaling
RPS: Receive Packet Steering
RFS: Receive Flow Steering
Accelerated Receive Flow Steering
XPS: Transmit Packet Steering
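The idea shared by RSS and RPS is that a hash over the flow's addresses and ports selects the queue (or CPU), so all packets of one flow land in the same place. A toy sketch of that mapping follows; real NICs typically use a Toeplitz hash with an indirection table, so the CRC32 here is only a stand-in, and `pick_queue` is a hypothetical name.

```python
import zlib

def pick_queue(src_ip, src_port, dst_ip, dst_port, num_queues):
    """Map a flow 4-tuple to a receive queue index (toy stand-in for RSS hashing)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_queues

# Packets of the same flow always map to the same queue;
# different source addresses spread across the available queues.
for src in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    print(src, "-> queue", pick_queue(src, 40000, "192.0.2.10", 80, 4))
```

Keeping a flow on one queue avoids reordering within the flow while still spreading different flows over the cores.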
Let's say all the packets are destined for your machine's IP on port 80 and you have used socket() and listen() to create a socket on port 80. Traffic is coming into your NIC from a variety of source IPs, so it's being hashed into multiple receive queues (meaning the hardware interrupts are being spread across multiple CPU cores thanks to RSS). You can use the native kernel socket option PACKET_FANOUT (available on AF_PACKET sockets) to spread the load across multiple worker threads within your application.
If you look at 3rd party libraries such as NetMap, DPDK and VPP, just as examples, these can all be used to scale up even further by taking you down to zero-copy RX/TX, with the caveat that you need to write some of the network protocol code yourself, depending on which library you use.
There are many, many things to consider here, far too many to cover in one SO question. In answer to your original question: yes. I have also tried to provide a bit of extra information.
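PACKET_FANOUT is set on AF_PACKET sockets and needs elevated privileges, so it is awkward to demonstrate in a few lines. The sketch below therefore swaps in a different technique, SO_REUSEPORT (Linux 3.9 and later), which lets several listening sockets share one TCP port so that the kernel distributes incoming connections among them. The helper name `open_reuseport_listeners` and the loopback address are assumptions for illustration.

```python
import socket

def open_reuseport_listeners(num_workers, host="127.0.0.1", port=0):
    """Open several listening TCP sockets bound to the same port via SO_REUSEPORT."""
    listeners = []
    for _ in range(num_workers):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Must be set on every socket before bind() for the port to be shared.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        port = s.getsockname()[1]   # reuse the kernel-chosen port for the rest
        s.listen(128)
        listeners.append(s)
    return listeners
```

Each worker thread or process would then call accept() on its own listener, which avoids contention on a single shared accept queue.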
For reading on RSS/RPS/RFS etc:
https://blog.cloudflare.com/how-to-receive-a-million-packets/
For further reading related to native PACKET_FANOUT and PACKET_MMAP:
http://kukuruku.co/hub/nix/capturing-packets-in-linux-at-a-speed-of-millions-of-packets-per-second-without-using-third-party-libraries
http://yusufonlinux.blogspot.co.uk/2010/11/data-link-access-and-zero-copy.html?m=1
https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
Further reading related to NetMap as an example of 3rd party library
https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/
Further reading on NUMA and affinity:
https://null.53bits.co.uk/index.php?page=numa-and-queue-affinity
https://blog.cloudflare.com/how-to-achieve-low-latency/

Related

Low latency packet processing with shared memory on Linux?

If I was to receive UDP packets on Linux (and I didn't mind changing some of the source code) what would be the fastest way for my application to read the packets?
Would I want to modify the network stack so that once a UDP packet is received it is written to shared memory and have the application access that memory?
Would there be any way for the stack to notify the application to react, rather than have the application continuously poll the shared memory?
Any advice/further resources are welcome- I have only seen:
http://www.kegel.com/c10k.html
If latency is a problem and the default UDP network stack does not perform as you wish, then try a different existing (installable) network stack.
For example, try UDP-Lite. Compared with standard UDP, it lets the checksum cover only part of the datagram (at minimum the header), which reduces checksum work at the cost of possibly delivering a corrupted payload to the application layer.
Side note: you do not need a "polling" mechanism. Read the manual for select (and its relatives such as pselect, poll, or ppoll); with such an API, the kernel will "wake up" your application as soon as there is something to read or write.
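As a minimal sketch of that side note, the helper below sleeps in select() until a datagram is readable or a timeout expires, instead of spinning on the socket. The name `wait_for_datagram` and the timeout values are assumptions for illustration.

```python
import select
import socket

def wait_for_datagram(sock, timeout_s):
    """Block in select() until sock is readable; return the datagram or None on timeout."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        return None                      # nothing arrived within timeout_s
    data, _addr = sock.recvfrom(65535)   # will not block: select said it is readable
    return data

if __name__ == "__main__":
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.sendto(b"ping", rx.getsockname())
    print(wait_for_datagram(rx, 1.0))
```

While the process sleeps in select() it uses no CPU; the kernel makes it runnable when the socket's receive queue becomes non-empty.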

How to change the recv() portion of a client to an Interrupt Service Routine?

I currently have a client listening for packets in its own thread. I was told to try to implement an ISR so that the packet received from the recv() call can be handled immediately, instead of waiting for that thread to get scheduled.
EDIT: this is on Windows now, but it will be ported to a DSP later.
ISRs by definition run in kernel space. Unless you are in an embedded system without memory protection, you will need to add kernel code to your project. Furthermore, to reimplement recv, it will need to handle IP and TCP or UDP as necessary to extract the data from the ethernet packets.
The overhead of rescheduling and switching to a thread is minimal, and needs to happen anyway unless the packet is handled entirely in the kernel. Most operating systems have a highest-priority thread setting, sometimes called "real-time," which causes user space code to run with minimal delay after the driver receives data. This is often used for audio/video I/O as well as networking.

Does recv remove packets from pcaps buffer?

Say there are two programs running on a computer (for the sake of simplification, the only user programs running on linux) one of which calls recv(), and one of which is using pcap to detect incoming packets. A packet arrives, and it is detected by both the program using pcap, and by the program using recv. But, is there any case (for instance recv() returning between calls to pcap_next()) in which one of these two will not get the packet?
I really don't understand how the buffering system works here, so the more detailed explanation the better - is there any conceivable case in which one of these programs would see a packet that the other does not? And if so, what is it and how can I prevent it?
AFAIK, there do exist cases where one would receive the data and the other wouldn't (both ways). It's possible that I've gotten some of the details wrong here, but I'm sure someone will correct me.
Pcap uses different mechanisms to sniff on interfaces, but here's how the general case works:
1) A network card receives a packet (the driver is notified via an interrupt).
2) The kernel places that packet into the appropriate listening queues, e.g.:
   a) The TCP stack.
   b) A bridge driver, if the interface is bridged.
   c) The interface that PCAP uses (a raw socket connection).
3) Those buffers are flushed independently of each other:
   a) As TCP streams are assembled and data delivered to processes.
   b) As the bridge sends the packet to the appropriate connected interfaces.
   c) As PCAP reads received packets.
I would guess that there is no hard way to guarantee that both programs receive every packet. That would require blocking on a buffer when it's full (and that could lead to starvation, deadlock, all kinds of problems). It may be possible with interconnects other than Ethernet, but the general philosophy there is best-effort.
Unless the system is under heavy load, however, I would say that the loss rates would be quite low and that most packets would be received by both. You can decrease the risk of loss by increasing the buffer sizes. A quick Google search turned this up, but I'm sure there are a million more ways to do it.
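For the recv() side, increasing the buffer size means raising SO_RCVBUF on the socket; a minimal sketch follows (the helper name `grow_receive_buffer` is my own). On Linux the kernel doubles the requested value for bookkeeping and caps it at net.core.rmem_max, so it is worth reading the value back. For the capture side, libpcap's C API offers pcap_set_buffer_size(), which is not shown here.

```python
import socket

def grow_receive_buffer(sock, requested_bytes):
    """Request a larger kernel receive buffer and return the size actually granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    # The granted size may differ from the request (doubling, system-wide caps).
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    print("granted:", grow_receive_buffer(s, 1 << 20))
    s.close()
```

A larger buffer only absorbs bursts; if the reader is slower than the sender on average, packets are still dropped eventually.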
If you need hard guarantees, I think a more powerful model of the network is needed. I've heard great things about Netgraph for these kinds of tasks. You could also just install a physical box that inspects packets (the hardest guarantee you can get).

Custom RS485 Protocols

I am writing a simple multi-drop RS485 protocol for serial communications within a distributed system. I am using an addressable model where slave devices are given a window of 20ms to respond. The master uC polls the connected devices for updates and they respond accordingly. I've employed checksums and take the necessary overrun precautions to ensure that connected devices will not respond to malformed messages. This method has proved effective in approximately 99% of situations, but I lose the packet if a new device is introduced during a communication session. Plugging in a new device "hot" will have negative effects on the signal being monitored by the slave devices, if only for an extremely short time. I'm on the software side of engineering, but how can I mitigate this situation without trying to recreate TCP? We use a polling model because it is fast and does the job well for our application, no need for RTOS functionality. I have an abundance of cycles on each CPU; think in basic terms.
Sending packets over RS485 is not reliable communication. You will have to handle the loss of packets anyway. Of course, you won't have to reinvent TCP, but you will have to detect lost packets by means of timeout monitoring and sequence numbers. In simple applications this can be done at the application level, which keeps you far away from the complexity of TCP. Since your polling model already discards all packets with an invalid checksum, this might be integrated with little effort.
If you want to check for collisions, which can be caused by hot plugs or misbehaving devices, there are probably some improvements to be made. Some hardware allows you to read back your own transmission. If you find a difference between the data sent and the data received, you can assume a collision and repeat the packet. This will also require some kind of sequence numbering.
Perhaps I've missed something in your question, but can't you just write the master so that if a response isn't seen from a device within the allowed time, it re-polls that device?
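A minimal sketch of the re-poll idea, combined with the sequence numbers and checksum mentioned above. The frame layout (address, sequence, payload, 1-byte sum), the `FakeBus` class and all names are hypothetical; a real master would replace `FakeBus.transact` with its UART send and timed receive.

```python
def checksum(data):
    """Simple 8-bit additive checksum (an assumption; use whatever your protocol defines)."""
    return sum(data) & 0xFF

def make_frame(addr, seq, payload=b""):
    body = bytes([addr, seq & 0xFF]) + payload
    return body + bytes([checksum(body)])

def parse_frame(frame):
    """Return (addr, seq, payload), or None if the frame is too short or corrupted."""
    if len(frame) < 3 or checksum(frame[:-1]) != frame[-1]:
        return None
    return frame[0], frame[1], frame[2:-1]

class FakeBus:
    """Stand-in for the RS485 link: the first `drops` polls get no reply."""
    def __init__(self, drops):
        self.drops = drops

    def transact(self, request, timeout_s):
        if self.drops > 0:
            self.drops -= 1
            return None                      # simulated timeout (e.g. hot-plug glitch)
        addr, seq, _ = parse_frame(request)
        return make_frame(addr, seq, b"OK")  # device echoes address and sequence

def poll_device(bus, addr, seq, retries=3, timeout_s=0.020):
    """Poll one device, re-polling on timeout, bad checksum, or mismatched sequence."""
    request = make_frame(addr, seq)
    for attempt in range(1, retries + 1):
        reply = bus.transact(request, timeout_s)
        parsed = parse_frame(reply) if reply else None
        if parsed and parsed[0] == addr and parsed[1] == (seq & 0xFF):
            return parsed[2], attempt
    return None, retries                     # give up; caller marks the device as absent

print(poll_device(FakeBus(drops=1), addr=5, seq=7))
```

Echoing the sequence number lets the master tell a late reply to an earlier poll apart from the reply to the current one, so a retry cannot be confused with stale data.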

Maximizing performance on udp

I'm working on a project with two clients, one for sending and the other for receiving UDP datagrams, between 2 machines wired directly to each other.
Each datagram is 1024 bytes in size, and it is sent using Winsock (blocking).
They are both running on very fast (separate) machines, with 16 GB RAM and 8 CPUs, with RAID 0 drives.
I'm looking for tips to maximize my throughput. Tips should be at the Winsock level, but if you have other tips, those would be great too.
Currently I'm getting 250-400 Mbit/s transfer speed. I'm looking for more.
Thanks.
Since I don't know what else besides sending and receiving that your applications do it's difficult to know what else might be limiting it, but here's a few things to try. I'm assuming that you're using IPv4, and I'm not a Windows programmer.
Maximize the size of the packets you are sending when the link is reliable. For 100 Mbit/s Ethernet the maximum frame is 1518 bytes; Ethernet uses 18 of those, IPv4 uses 20-60 (usually 20), and UDP uses 8. That means you should typically be able to send 1472 bytes of UDP payload per packet.
If you are using gigabit Ethernet equipment that supports it, your packet size can increase to 9000 bytes (jumbo frames), so sending something closer to that size should speed things up.
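The arithmetic above can be checked with a small sketch. The constants are the header sizes quoted in the answer; the function names are my own, and the efficiency figure ignores the Ethernet preamble and inter-frame gap.

```python
ETHERNET_OVERHEAD = 18   # 14-byte header + 4-byte frame check sequence
IPV4_HEADER = 20         # IPv4 header without options
UDP_HEADER = 8

def max_udp_payload(mtu, ipv4_header=IPV4_HEADER):
    """Largest UDP payload that fits in one frame without IP fragmentation."""
    return mtu - ipv4_header - UDP_HEADER

def wire_efficiency(payload_bytes):
    """Fraction of each frame that is application payload."""
    overhead = UDP_HEADER + IPV4_HEADER + ETHERNET_OVERHEAD
    return payload_bytes / (payload_bytes + overhead)

print(max_udp_payload(1500))            # standard Ethernet MTU
print(max_udp_payload(9000))            # jumbo frames
print(round(wire_efficiency(1024), 3))  # the 1024-byte datagrams from the question
print(round(wire_efficiency(1472), 3))
```

Larger datagrams help less through header savings than through fewer system calls and fewer packets per second for the same throughput.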
If you are sending any acknowledgments from your listener to your sender then try to make sure that they are sent rarely and can acknowledge more than just one packet at a time. Try to keep the listener from having to say much, and try to keep the sender from having to wait on the listener for permission to keep sending.
On the computer that the sender application lives on consider setting up a static ARP entry for the computer that the receiver lives on. Without this every few seconds there may be a pause while a new ARP request is made to make sure that the ARP cache is up to date. Some ARP implementations may do this request well before the ARP entry expires, which would decrease the impact, but some do not.
Turn off as many users of the network as possible. If you are using an Ethernet switch, then you should concentrate on the things that introduce traffic to or from the computers and network devices that your applications run on or use (this includes broadcast messages, like many ARP requests). If it's a hub, then you may want to quiet down the entire network. Windows tends to send out a constant stream of junk to networks, which in many cases isn't useful.
There may be limits set on how much of the network bandwidth one application or user can have. Or there may be limits on how much network bandwidth the OS will let itself use. These can probably be changed in the registry if they exist.
It is not uncommon for network interface chips to not actually support the maximum bandwidth of the network all the time. There are chips which may miss packets because they are busy handling a previous packet as well as some which just can't send packets as close together as Ethernet specifications would allow. Additionally the rest of the system might not be able to keep up even if it is.
Some things to look at:
Connected UDP sockets shortcut several operations in the kernel, so they are faster (see Stevens' UNP book for details).
Socket send and receive buffers: play with the SO_SNDBUF and SO_RCVBUF socket options to balance out spikes and packet drops.
See if you can bump up the link MTU and use jumbo frames.
Use a 1 Gbps network and upgrade your network hardware.
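A minimal sketch of the first item, a "connected" UDP socket: calling connect() on a datagram socket sends nothing on the wire, it only fixes the peer so that plain send() can be used instead of sendto() with an address on every call. The helper name and the loopback addresses are assumptions for illustration; the same connect()/send() calls exist in Winsock.

```python
import socket

def make_connected_udp_pair():
    """Return (tx, rx): tx is a UDP socket connect()ed to rx on the loopback interface."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))       # port 0: let the kernel pick a free port
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.connect(rx.getsockname())    # records the peer; no packet is sent
    return tx, rx

if __name__ == "__main__":
    tx, rx = make_connected_udp_pair()
    tx.send(b"x" * 1024)            # no destination argument needed per call
    print(len(rx.recv(2048)))
    tx.close()
    rx.close()
```

A connected socket also filters incoming datagrams to the connected peer only, which is usually what a point-to-point sender and receiver want.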
Test the packet limit of your hardware with an already proven piece of code such as iperf:
http://www.noc.ucf.edu/Tools/Iperf/
I'm linking a Windows build, it might be a good idea to boot off a Linux LiveCD and try a Linux build for comparison of IP stacks.
More likely your NIC isn't performing well, try an Intel Gigabit Server Adapter:
http://www.intel.com/network/connectivity/products/server_adapters.htm
For TCP connections it has been shown that using multiple parallel connections will better utilize the data connection. I'm not sure if that applies to UDP, but it might help with some of the latency issues of packet processing.
So you might want to try multiple threads of blocking calls.
As well as Nikolai's suggestion of send and recv buffers, if you can, switch to overlapped I/O and have many recvs pending, this also helps to minimise the number of datagrams that are dropped by the stack due to lack of buffer space.
If you're looking for reliable data transfer, consider UDT.

Resources