zero copy udp socket using sendfile instead of sendto - c

I'm working with UDP sockets in a real-time environment. I am currently using the standard socket function sendto(), which takes a relatively long time. I read that it is possible to use zero copy, which, if I understand correctly, avoids the extra time spent copying data between user space and kernel space. However, I see that sendfile() only copies from one file descriptor to another. I can't see how I can use that to send UDP packets, since in my case the data is in a buffer. So my questions are:
Is it even possible to use sendfile() to send UDP packets?
If so, what is the correct way of doing this?
Edit
I am working on a real-time platform where I have several operations plus the sending over the socket. All of these together should not take more than 1ms. I tried on three machines: the first has 4 cores at 3.4GHz, the second 8 cores at 2.3GHz and the last one 4 cores at 1.4GHz. On the first one it takes less than 1µs to send a 720-byte packet, while on the two others it takes between 6 and 9µs. I'm using a Linux low-latency kernel and have deactivated all CPU power management features, so all the CPUs run at max frequency.
I noticed that if the time taken by sendto() is larger than 6µs, the platform simply does not work. One more detail: I have several threads running in parallel, so maybe the CPU is just processing other threads while sendto() has not completed yet. I'm wondering if it is possible for sendto() to be interrupted partway through so that the CPU can do something else.
This is why I was trying to find other places to optimize, and I thought that using sendfile() would save some additional time.

I am not sure whether sendfile works with UDP sockets. However, memfd_create creates a file descriptor backed by memory and could theoretically allow bypassing the copy from user space to the kernel.
Still, when sending, the kernel has to copy the data into the kernel socket buffer first because it needs to prepend the UDP, IP and Ethernet headers to the user data, which cannot be done in place. This copy cannot be avoided even when using sendfile.
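To see how long a single sendto() really takes on a given machine, it can help to time the call in isolation. Below is a minimal sketch, assuming Linux and a datagram sent to the loopback address. The helper name time_one_sendto and the choice of port are made up for illustration, and a single cold call is only a rough indication, so repeated measurements would be needed for real numbers.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sends one UDP datagram of `len` bytes to
 * 127.0.0.1:`port` and returns the time spent inside sendto() in
 * nanoseconds, or -1 on error. */
long time_one_sendto(size_t len, unsigned short port)
{
    char buf[2048];
    struct sockaddr_in dst;
    struct timespec t0, t1;
    ssize_t n;
    int fd;

    if (len > sizeof buf)
        return -1;
    memset(buf, 0xAB, sizeof buf);

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* Only the sendto() call itself sits between the two timestamps. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    n = sendto(fd, buf, len, 0, (struct sockaddr *)&dst, sizeof dst);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (n != (ssize_t)len)
        return -1;
    return (t1.tv_sec - t0.tv_sec) * 1000000000L
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling time_one_sendto(720, 9) would time one 720-byte datagram like the one in the question. Running it in a loop and looking at the worst case, not the average, is probably closer to what a 1ms real-time budget cares about.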
To do real zero-copy networking you may like to have a look at PF_RING ZC (Zero Copy) drivers:
On-Demand Kernel Bypass with PF_RING Aware Drivers
PF_RING™ ZC comes with a new generation of PF_RING™ aware drivers that can be used both in kernel or bypass mode. Once installed, the drivers operate as standard Linux drivers where you can do normal networking (e.g. ping or SSH). When used from PF_RING™ they are quicker than vanilla drivers, as they interact directly with it. If you open a device using a PF_RING-aware driver in zero copy (e.g. pfcount -i zc:eth1) the device becomes unavailable to standard networking as it is accessed in zero-copy through kernel bypass, as happened with the predecessor DNA. Once the application accessing the device is closed, standard networking activities can take place again.

Related

Best Way To Receive/Process High Amounts Of Packets/Traffic Via AF_PACKET Socket + EPoll Questions

I've made a test C program that creates an AF_PACKET socket, creates x amount of threads via pthreads, and within each thread performs epoll on the socket's file descriptor. This program was made for Linux and I've compiled it using GCC on Ubuntu 18.04. I've submitted a GitHub Gist of the program here since it's 200+ lines of code. I am still fairly new to C and network programming. Therefore, I'm sure there are many improvements I can make to the code. I am open to suggestions!
I have two main questions:
Is there a better way to receive and process high amounts of packets/traffic in a user space program than the above? I've read using pthreads along with epoll would be the best option, but I've also looked into select and standard poll.
When the program above is executed without any debug output via fprintf(), each thread consumes 100% CPU on the epoll_wait() function within the while loop. Is this normal behavior or am I using epoll incorrectly? I've looked at some other examples and I use epoll the same way as the examples do. I've taken a look at the manual page for epoll and I believe I'm using it correctly in my case. I've also tried setting a timeout for the epoll_wait() function, but it was still consuming 100% CPU per thread (which I'd expect due to the while loop).
I plan to make a program that will redirect traffic after inspecting the traffic and I expect a lot of incoming packets which is why I wanted to see if there is a better way to receive and process high amounts of packets. I also understand I could just use standard SOCK_DGRAM or SOCK_STREAM sockets and bind them to an IP and port. However, I do want to process and inspect all incoming traffic to an interface and forward traffic if necessary (e.g. if the destination address matches a forwarding rule). I also wasn't sure if I should make multiple sockets in this case (perhaps a socket per thread). I did do this initially, but it resulted in unexpected behavior and it was only ever reading from one socket descriptor anyways. Perhaps I wasn't creating the new sockets properly.
Any help is highly appreciated and if you need any more information, please let me know.
Thank you for your time.
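Without seeing the gist it is hard to say why epoll_wait() spins here. One thing that is certain is that epoll_wait() returns immediately whenever a registered descriptor is already ready, so a loop around it will use 100% CPU if the socket always has data pending or is never drained. For comparison, here is a minimal sketch of a wait that does sleep. It assumes Linux, and the helper name epoll_recv_once is made up for illustration. It works on any readable descriptor, not just AF_PACKET sockets.

```c
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: waits up to `timeout_ms` for `fd` to become
 * readable, then reads one message. Returns the number of bytes
 * received, 0 on timeout, or -1 on error. (A zero-length datagram is
 * indistinguishable from a timeout in this simplified sketch.) */
long epoll_recv_once(int fd, void *buf, size_t cap, int timeout_ms)
{
    struct epoll_event ev, ready;
    long n = -1;
    int nready;
    int ep = epoll_create1(0);

    if (ep < 0)
        return -1;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;          /* level-triggered readiness for reading */
    ev.data.fd = fd;

    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0) {
        /* Sleeps in the kernel until data arrives or the timeout expires. */
        nready = epoll_wait(ep, &ready, 1, timeout_ms);
        if (nready == 0)
            n = 0;
        else if (nready == 1 && (ready.events & EPOLLIN))
            n = (long)recv(fd, buf, cap, 0);
    }
    close(ep);
    return n;
}
```

On an idle socket this call should block for the whole timeout without burning CPU. If a real program never sees that happen, the descriptor is probably readable all the time, which would be expected for a packet socket on a busy interface.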

Thousands of IP Addresses/Interfaces vs. slow program performance

I have a CentOS 5.9 machine set up with 5000+ IP addresses (secondary) for eth2.
My program only uses 2 for 2 UDP sockets (1 RX, 1 TX).
When I run the application, the CPU usage is almost 100% all the time.
When I reduce the number of IP addresses (to 10), everything goes back to normal, with barely 1% CPU usage.
The program is basically a client-server application. It uses non-blocking reads/writes and epoll_wait()
for event waiting.
Can someone please explain why the CPU usage is so high for a binary that uses only a small portion
of the configured addresses?
I don't think the question is about the number of sockets but rather the number of addresses on the interface. It does seem a little strange that your program's CPU usage goes so high with this number, but in general the number of addresses affects how quickly the IP stack can deal with incoming and outgoing packets. For example, when you call send and your socket is not bound, the kernel needs to choose a source IP address for the packet based on the destination address, and if that takes time it will show up in your process context.
But this still does not explain much. I guess running a profiler such as gprof would be a good idea.
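If the source-address selection described above is suspected, one cheap experiment is to bind the sending socket to a specific local address so that the source address is fixed up front. A minimal sketch, assuming IPv4 and POSIX sockets; the helper name udp_socket_bound_to is made up for illustration, and whether this actually reduces CPU usage on the machine in question is untested.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: creates a UDP socket bound to `local_ip` with an
 * ephemeral port. Returns the descriptor, or -1 on error. */
int udp_socket_bound_to(const char *local_ip)
{
    struct sockaddr_in local;
    int fd;

    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_port = htons(0);            /* let the kernel pick the port */
    if (inet_pton(AF_INET, local_ip, &local.sin_addr) != 1)
        return -1;                        /* not a valid IPv4 address */

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* After bind(), outgoing datagrams carry this source address. */
    if (bind(fd, (struct sockaddr *)&local, sizeof local) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The TX socket would then be created with one of the two addresses the program really uses, and CPU usage compared against the unbound version.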
Handling thousands of sockets takes specialized software. Most network programmers naively use "select" and expect that to scale up to thousands of sockets well... which it definitely does not. A more event-driven model scales much better ... the event being a new socket or data on the socket, etc.
For Linux and Windows I use Libevent. It's a socket wrapper that is not very hard to use, and it scales nicely to tens of thousands of sockets.
http://libevent.org/
Look at the website here and you can see the logarithmic graph that shows tens of thousands of sockets performing as though they were 100. Of course, if the sockets are super busy, then you are right back to low-performance, but most sockets in the world are mostly quiet and this is where libevent shines. There are other libraries as well like ZeroMq (C# mono), libev, Boost.ASIO.
http://zeromq.org/
http://libev.schmorp.de/bench.html
http://www.boost.org/doc/libs/1_36_0/doc/html/boost_asio.html
Here is my working, super-simple sample. You'll need to add threading protections but with less than an hour's work, you could easily support a few thousand simultaneous connections.
http://pastebin.com/g02S2RTi

How do I block packets coming on port 23 on my computer?

I am using the libpcap library. I have written a packet sniffer in C using pcap.h. Now I want to block packets arriving on port 23 on my computer via the eth0 device. I tried the pcap_filter function, but it is not useful for blocking.
Please explain how to code this functionality in a C program.
Libpcap is just used for packet capturing, i.e. making packets available for use in other programs. It does not perform any network setup, like blocking, opening ports. In this sense pcap is a purely passive monitoring tool.
I am not sure what you want to do. As far as I see it, there are two possibilities:
You actually want to block the packets, so that your computer will not process them in any way. You should use a firewall for that and just block this port. Any decent firewall should be able to do that fairly easily. But you should be aware that this also means no one will be able to connect to your system over Telnet (port 23 is the Telnet port). So if you rely on Telnet to reach a remote system, you will have effectively locked yourself out.
You still want other programs (such as a Telnet daemon) to listen on port 23, but all this traffic is annoying you in your application. Libpcap has a filtering function for that, and it is quite powerful. With this function you can pass small filter expressions to libpcap and have it report only the packets that match. Look up filtering in the pcap documentation for more information. This will, however, not "block the traffic"; it will just stop pcap from capturing it.
Actually, you cannot build a firewall using pcap. This is because the packets seen inside your sniffer (built using pcap) are just copies of packets that are consumed by the network stack, with or without the sniffer.
In other words, using filters in pcap only means that you will not see copies of the filtered-out packets (as far as I know, pcap compiles filters and installs them in the kernel so that the copy is not even made at kernel level); the original packet still goes to the network stack.
The solution to your problem can most probably be built with netfilter. You can register on the NF_IP_PRE_ROUTING hook (named NF_INET_PRE_ROUTING in newer kernels) and decide there whether to drop or accept given traffic.

Does recv remove packets from pcaps buffer?

Say there are two programs running on a computer (for the sake of simplification, the only user programs running on linux) one of which calls recv(), and one of which is using pcap to detect incoming packets. A packet arrives, and it is detected by both the program using pcap, and by the program using recv. But, is there any case (for instance recv() returning between calls to pcap_next()) in which one of these two will not get the packet?
I really don't understand how the buffering system works here, so the more detailed explanation the better - is there any conceivable case in which one of these programs would see a packet that the other does not? And if so, what is it and how can I prevent it?
AFAIK, there do exist cases where one would receive the data and the other wouldn't (both ways). It's possible that I've gotten some of the details wrong here, but I'm sure someone will correct me.
Pcap uses different mechanisms to sniff on interfaces, but here's how the general case works:
A network card receives a packet (the driver is notified via an interrupt).
The kernel places that packet into the appropriate listening queues, e.g.:
  The TCP stack.
  A bridge driver, if the interface is bridged.
  The interface that PCAP uses (a raw socket connection).
Those buffers are flushed independently of each other:
  As TCP streams are assembled and data delivered to processes.
  As the bridge sends the packet to the appropriate connected interfaces.
  As PCAP reads received packets.
I would guess that there is no hard way to guarantee that both programs receive every packet. That would require blocking on a buffer when it's full (and that could lead to starvation, deadlock, all kinds of problems). It may be possible with interconnects other than Ethernet, but the general philosophy there is best-effort.
Unless the system is under heavy load, however, I would say that the loss rates would be quite low and that most packets would be received by all. You can decrease the risk of loss by increasing the buffer sizes. A quick google search turned this up, but I'm sure there's a million more ways to do it.
If you need hard guarantees, I think a more powerful model of the network is needed. I've heard great things about Netgraph for these kinds of tasks. You could also just install a physical box that inspects packets (the hardest guarantee you can get).

Maximizing performance on udp

I'm working on a project with two clients, one for sending and the other for receiving UDP datagrams, between two machines wired directly to each other.
Each datagram is 1024 bytes in size, and it is sent using Winsock (blocking).
Both run on very fast, separate machines with 16 GB of RAM, 8 CPUs and RAID 0 drives.
I'm looking for tips to maximize my throughput. Tips should be at the Winsock level, but if you have other tips, that would be great too.
Currently I'm getting a transfer speed of 250-400 Mbit/s. I'm looking for more.
Thanks.
Since I don't know what else besides sending and receiving that your applications do it's difficult to know what else might be limiting it, but here's a few things to try. I'm assuming that you're using IPv4, and I'm not a Windows programmer.
Maximize the packet size that you are sending when you are using a reliable connection. For 100 Mb/s Ethernet the maximum frame is 1518 bytes, of which Ethernet uses 18, IPv4 uses 20-60 (usually 20, though), and UDP uses 8. That means that typically you should be able to send 1472 bytes of UDP payload per packet.
If you are using gigabit Ethernet equipment that supports it, your packet size increases to 9000 bytes (jumbo frames), so sending something closer to that size should speed things up.
If you are sending any acknowledgments from your listener to your sender then try to make sure that they are sent rarely and can acknowledge more than just one packet at a time. Try to keep the listener from having to say much, and try to keep the sender from having to wait on the listener for permission to keep sending.
On the computer that the sender application lives on consider setting up a static ARP entry for the computer that the receiver lives on. Without this every few seconds there may be a pause while a new ARP request is made to make sure that the ARP cache is up to date. Some ARP implementations may do this request well before the ARP entry expires, which would decrease the impact, but some do not.
Turn off as many users of the network as possible. If you are using an Ethernet switch then you should concentrate on the things that introduce traffic to or from the computers and network devices that your applications run on or use (this includes broadcast messages, like many ARP requests). If it's a hub then you may want to quiet down the entire network. Windows tends to send out a constant stream of junk to networks, which in many cases isn't useful.
There may be limits set on how much of the network bandwidth one application or user can have. Or there may be limits on how much network bandwidth the OS will let itself use. These can probably be changed in the registry if they exist.
It is not uncommon for network interface chips to not actually support the maximum bandwidth of the network all the time. There are chips which may miss packets because they are busy handling a previous packet as well as some which just can't send packets as close together as Ethernet specifications would allow. Additionally the rest of the system might not be able to keep up even if it is.
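The payload arithmetic from the first tip can be written down as a small helper, which makes it easy to redo for other frame sizes. This is only a sketch of the calculation under the overhead figures quoted above (18 bytes of Ethernet, 8 bytes of UDP); the function and constant names are made up for illustration.

```c
/* Overheads as quoted in the answer above. */
enum {
    ETHERNET_OVERHEAD = 18,  /* 14-byte header + 4-byte frame check sequence */
    UDP_HEADER_LEN    = 8
};

/* Hypothetical helper: largest UDP payload that fits in one Ethernet
 * frame of `max_frame` bytes, given the IPv4 header length in use
 * (20 bytes when no IP options are present). */
int max_udp_payload(int max_frame, int ipv4_header_len)
{
    return max_frame - ETHERNET_OVERHEAD - ipv4_header_len - UDP_HEADER_LEN;
}
```

With a 1518-byte frame and a 20-byte IPv4 header this gives 1472 bytes, so the 1024-byte datagrams in the question leave roughly 30% of each frame unused. If the 9000-byte jumbo figure is taken as the IP-level MTU (an assumption, since NICs and switches differ in how they count it), the frame is 9018 bytes and the payload limit becomes 8972 bytes.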
Some things to look at:
Connected UDP sockets (some info) shortcut several operations in the kernel, so are faster (see Stevens UnP book for details).
Socket send and receive buffers: play with the SO_SNDBUF and SO_RCVBUF socket options to balance out spikes and packet drops.
See if you can bump up the link MTU and use jumbo frames.
Use a 1 Gbps network and upgrade your network hardware.
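The first two suggestions can be combined in a few lines. The sketch below uses POSIX sockets for brevity; Winsock has connect() and setsockopt() with the same SO_SNDBUF and SO_RCVBUF options, but the headers, the socket type and the startup and cleanup calls differ. The helper name udp_connected_socket and the buffer size in the usage note are assumptions, not recommendations.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: creates a UDP socket, requests `bufsize` bytes
 * for both socket buffers, and connects it to ip:port so that plain
 * send()/recv() can be used. Returns the descriptor, or -1 on error. */
int udp_connected_socket(const char *ip, unsigned short port, int bufsize)
{
    struct sockaddr_in peer;
    int fd;

    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &peer.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* The OS may clamp or adjust the requested sizes. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof bufsize) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof bufsize) < 0 ||
        /* connect() on UDP sends nothing; it only fixes the peer address. */
        connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A sender would call something like udp_connected_socket("192.168.0.2", 5000, 1 << 20) once (address and port here are placeholders) and then use send(fd, data, len, 0) in its loop instead of sendto().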
Test the packet limit of your hardware with an already proven piece of code such as iperf:
http://www.noc.ucf.edu/Tools/Iperf/
I'm linking a Windows build, but it might be a good idea to boot off a Linux LiveCD and try a Linux build to compare the IP stacks.
More likely your NIC isn't performing well. Try an Intel Gigabit Server Adapter:
http://www.intel.com/network/connectivity/products/server_adapters.htm
For TCP connections it has been shown that using multiple parallel connections will better utilize the data connection. I'm not sure if that applies to UDP, but it might help with some of the latency issues of packet processing.
So you might want to try multiple threads of blocking calls.
In addition to Nikolai's suggestion about send and recv buffers, if you can, switch to overlapped I/O and keep many recvs pending. This also helps to minimise the number of datagrams that are dropped by the stack due to lack of buffer space.
If you're looking for reliable data transfer, consider UDT.
