Background
I have a very high throughput / low latency network app (goal is << 5 usec per packet) and I wanted to add some monitoring/metrics to it. I have heard about the statsd craze and seems a simple way to collect metrics and feed them into our time series database. Sending metrics is done via a small udp packet write to a daemon (typically running on same server).
I wanted to characterize the effects of sending ~5-10 udp packets in my data path to understand how much latency it would add and was surprised at how bad it is. I know this is a very obscure micro-benchmark but just wanted to get a rough idea on where it lands.
The question I have
I am trying to understand why it takes so long (relatively speaking) to send a UDP packet to localhost versus a remote host. Are there any tweaks I can make to reduce the latency to send a UDP packet? I am thinking the solution for me to push metric collection to an auxiliary core or actually run the statsd daemon on a seperate host.
My setup/benchmarks
CentOS 6.5 with some beefy server hardware.
The client test program I have been using is available here: https://gist.github.com/rishid/9178261
Compiled with gcc 4.7.3 gcc -O3 -std=gnu99 -mtune=native udp_send_bm.c -lrt -o udp_send_bm
The receiver side is running nc -ulk 127.0.0.1 12000 > /dev/null (ip change per IF)
I have ran this micro-benchmark with the following devices.
Some benchmark results:
loopback
Packet Size 500 // Time per sendto() 2159 nanosec // Total time 2.159518
integrated 1 Gb mobo controller
Packet Size 500 // Time per sendto() 397 nanosec // Total time 0.397234
intel ixgbe 10 Gb
Packet Size 500 // Time per sendto() 449 nanosec // Total time 0.449355
solarflare 10 Gb with userspace stack (onload)
Packet Size 500 // Time per sendto() 317 nanosec // Total time 0.317229
Writing to loopback will not be an efficient way to communicate inter-process for profiling. Generally the buffer will be copied multiple times before it's processed, and you run the risk of dropping packets since you're using udp. You're also making additional calls into the operating system, so you add to the risk of context switching (~2us).
goal is << 5 usec per packet
Is this a hard real-time requirement, or a soft requirement? Generally when you're handling things in microseconds, profiling should be zero overhead. You're using solarflare?, so I think you're serious. The best way I know to do this is tapping into the physical line, and sniffing traffic for metrics. A number of products do this.
i/o to disk or the network is very slow if you are incorporating it in a very tight (real time) processing loop. A solution might be to offload the i/o to a separate lower priority task. Let the real time loop pass the messages to the i/o task through a (best lock-free) queue.
Related
i'm trying to send over lwip a RT data (4 bytes) sampled at 100kHz for 10 channels.
I've understood that lwip has a timer which loops every 250ms and it cannot be changed.
In this case i'm saving the RT over RAM at 100kHz and every 250ms sending the sampled data over TCP.
The problem is that i cannot go over 65535 bytes every 250ms because i get the MEM_ERR.
i already increased the buffer up to 65535 but when i try to increase it more i get several error during compiling.
So my doubt is: can lwip manage buffer bigger than 16bits?
Thanks,
Marco
Better to focus on throughput.
You neglected to show any code, describe which Xilinx board/system you're using, or which OS you're using (e.g. FreeRTOS, linux, etc.).
Your RT data is: 4 bytes * 10 channels * 100kHz --> 400,000 bytes / second.
From your lwip description, you have 65536 byte packets * 4 packets/sec --> 256,000 bytes / second.
This is too slow. And, it much slower than what a typical TCP / Ethernet link can process, so I think your understanding of the maximum rate is off a bit.
You probably can't increase the packet size.
But, you probably can increase the number of packets sent.
You say that the 250ms interval for lwip can't be changed. I believe that it can.
From: https://www.xilinx.com/support/documentation/application_notes/xapp1026.pdf we have the section: Creating an lwIP Application Using the RAW API:
Set up a timer to interrupt at a constant interval. Usually, the interval is around 250 ms. In the timer interrupt, update necessary flags to invoke the lwIP TCP APIs tcp_fasttmr and tcp_slowtmr from the main application loop explained previously
The "usually" seems to me to imply that it's a default and not a maximum.
But, you may not need to increase the timer rate as I don't think it dictates the packet rate, just the servicing/completion rate [in software].
A few guesses ...
Once a packet is queued to the NIC, normally, other packets may be queued asynchronously. Modern NIC hardware often has a hardware queue. That is, the NIC H/W supports multiple pending requests. It can service those at line speed without CPU intervention.
The 250ms may just be a timer interrupt that retires packet descriptors of packets completed by the NIC hardware.
That is, more than one packet can be processed/completed per interrupt. If that were not the case, then only 4 packets / second could be sent and that would be ridiculously low.
Generating an interrupt from the NIC for each packet incurs an overhead. My guess is that interrupts from the NIC are disabled. And, the NIC is run in a "polled" mode. The polling occurs in the timer ISR.
The timer interrupt will occur 4 times per second. But, will process any packets it sees that are completed. So, the ISR overhead is only 4 interrupts / second.
This increases throughput because the ISR overhead is reduced.
UPDATE:
Thanks for the reply, indeed is 4 bytes * 10 channels * 100kHz --> 4,000,000 bytes / second but I agree that we are quite far from the 100Mbit/s.
Caveat: I don't know lwip, so most of what I'm saying is based on my experience with other network stacks (e.g. linux), but it appears that lwip should be similar.
Of course, lwip will provide a way to achieve full line speed.
Regarding the changing of the 250ms timer period, to achieve what i want it should be lowered more than 10times which seems it is too much and it can compromise the stability of the protocol.
When you say that, did you actually try that?
And, again, you didn't post your code or describe your target system, so it's difficult to provide help.
Issues could be because of the capabilities [or lack thereof] of your target system and its NIC.
Or, because of the way your code is structured, you're not making use of the features that can make it fast.
So basically your suggestion is to enable the interrupt on each message? In this case i can send the remaining data in the ACK callback if I understood correctly. – Marco Martines
No.
The mode for interrupt on every packet is useful for low data rates where the use of the link is sparse/sporadic.
If you have an interrupt for every packet, the overhead of entering/exiting the ISR (the ISR prolog/epilog) code will become significant [and possibly untenable] at higher data rates.
That's why the timer based callback is there. To accumulate finished request blocks and [quickly] loop on the queue/chain and complete/reclaim them periodically. If you wish to understand the concept, look at NAPI: https://en.wikipedia.org/wiki/New_API
Loosely, on most systems, when you do a send, a request block is created with all info related to the given buffer. This block is then queued to the transmit queue. If the transmitter is idle, it is started. For subsequent blocks, the block is appended to the queue. The driver [or NIC card] will, after completing a previous request, immediately start a new/fresh one from the front of the queue.
This allows you to queue multiple/many blocks quickly [and return immediately]. They will be sent in order at line speed.
What actually happens depends on system H/W, NIC controller, and OS and what lwip modes you use.
I am using a python tool to generate packets in one VM and I capture them in my program running as a linux process in another VM. Both the VMs are ubuntu and they are running on the same subnet. I notice that some of the packets get dropped in my program. What is the best tool to know where the packets are dropped?
I see that RcvbufErrors in netstat output gets increased as I send new packets.
# netstat -us
IcmpMsg:
InType0: 14
InType3: 1493
InType5: 204
InType8: 54
InType13: 5
InType17: 5
OutType0: 54
OutType3: 645946
OutType8: 584
OutType14: 5
Udp:
7686124 packets received
646545 packets to unknown port received.
33928069 packet receive errors
7157259 packets sent
RcvbufErrors: 33928069
IgnoredMulti: 345772
UdpLite:
IpExt:
InMcastPkts: 4
InBcastPkts: 363522
InOctets: 13243409806
OutOctets: 8445992434
InMcastOctets: 144
InBcastOctets: 114457552
InNoECTPkts: 100191590
InECT0Pkts: 143
Well, if you think that you need to find out where some packets get dropped, this might not be useful since you already know that at least RcvbufErrors figure is growing. This means that the NIC is able to deliver the packets to the kernel but the latter is unable to deliver the packets to the application, obviously, because a fixed-size receive buffer gets full faster than the application could read the data out (or just flush it).
So, I'd say such a result just gives an impression of poor application design. Perhaps, you should consider some good technique to enhance packet capture in your application so that more packets are captured/examined/dropped per second. A good example is PACKET_MMAP Rx ring which has an exhaustive description and is widely used in libpcap-based applications. The latter aspect makes it indeed reasonable to use wireshark or just tcpdump instead of your hand-made app to inspect the real capture rate, etc.
I am writing a code, which receives raw ethernet packets (no TCP/UDP) every 1ms from the server. For every packet received, my application has to reply with 14 raw packets. If the server doesn't receive the 14 packets before it sends it's packet scheduled for every 1ms, then the server raises an alarm and the application has to break out. The server-client communication is a one to one link.
The server is a hardware (FPGA) which generates packets at precise 1ms interval. The client application runs on a Linux (RHEL/Centos 7) machine with 10G SolarFlare NIC.
My first version of code is like this
while(1)
{
while(1)
{
numbytes = recvfrom(sockfd, buf, sizeof(buf), 0, NULL, NULL);
if(numbytes > 0)
{
//Some more lines here, to read packet number
break;
}
}
for (i=0;i<14;i++)
{
if (sendto(sockfd,(void *)(sym) , sizeof(sym), 0, NULL, NULL) < 0)
perror("Send failed\n");
}
}
I measure the receive time by taking timestamps (using clock_gettime) before the recvfrom call and one after it, I print the time differences of these timestamps and print them whenever the time difference exceeds allowable range of 900-1100 us.
The problem I am facing is that the packet receive time is fluctuating.Something like this (the prints are in microseconds)
Decode Time : 1234
Decode Time : 762
Decode Time : 1593
Decode Time : 406
Decode Time : 1703
Decode Time : 257
Decode Time : 1493
Decode Time : 514
and so on..
And sometimes the decode times exceed 2000us and application would break.
In this situation, application would break anywhere between 2 seconds to a few minutes.
Options tried by me till now.
Setting affinity to a particular isolated core.
Setting scheduling priorities to maximum with SCHED_FIFO
Increase socket buffer sizes
Setting network interface interrupt affinity to the same core which processes application
Spinning over recvfrom using poll(),select() calls.
All these options give a significant improvement over initial version of code. Now the application would run for ~1-2 hours. But this is still not enough.
A few observations:
I get a a huge dump of these decode time prints, whenever I take ssh sessions to Linux machine while the application is running (which makes me think network communication over other 1G Ethernet interface is creating interference with the 10G Ethernet interface).
The application performs better in RHEL (run times of about 2-3 hours) than Centos (run times of about 30 mins - 1.5 hours)
The run times is also varying with Linux machines with different hardware configurations with same OS.
Please suggest if there are any other methods to improve the run-time of the application.
Thanks in advance.
First, you need to verify the accuracy of the timestamping method; clock_gettime. The resolution is nanoseconds, but the accuracy and precision is in question. That is not the answer to your problem, but informs on how reliable the timestamping is before proceeding. See Difference between CLOCK_REALTIME and CLOCK_MONOTONIC? for why CLOCK_MONOTONIC should be used for your application.
I suspect the majority of the decode time fluctuation is either due to a variable number of operations per decode, context switching of the operating system, or IRQs.
Operations per decode I cannot comment on since the code has been simplified in your post. This issue can also be profiled and inspected.
Context switching per process can be easily inspected and monitored https://unix.stackexchange.com/a/84345
As Ron stated, these are very strict timing requirements for a network. It must be an isolated network, and single purpose. Your observation regarding decode over-time when ssh'ing indicates all other traffic must be prevented. This is disturbing, given separate NICs. Thus I suspect IRQs are the issue. See /proc/interrupts.
To achieve consistent decode times over long intervals (hours->days) will require drastically simplifying the OS. Removing unnecessary processes and services, hardware, and perhaps building your own kernel. All for the goal of reducing context switching and interrupts. At which point a real-time OS should be considered. This will only improve the probability of consistent decode time, not guarantee.
My work is developing a data acquisition system that is a combination of FPGA ADC, PC, and ethernet. Inevitably, the inconsistency of a multi-purpose PC means certain features must be moved to dedicated hardware. Consider the Pros/Cons of developing your application for PC versus moving it to hardware.
I have noticed that there is a difference of performance between sendto and recvfrom (UDP). I send about 100Kbytes from a server to a client, using WiFi (the estimated bandwidth is about 30Mb/s in both directions), and the sending time is about 4-5 ms (it depends, but this value is comparable to the ideal one, 3ms). On the client, the receiving time is ten-fifteen times higher, like 50-60ms. I'd like to have the two elapsed times quite similar. Any idea?
and the sending time is about 4-5 ms (it depends, but this value is comparable to the ideal one, 3ms)
30Mb/s (where the b means bits) is approximately (give or take to account for headers etc) 3 MB/s (where the B means bytes). It should take roughly 30 milliseconds to transmit 100kBytes.
The sendto is returning as soon as it has written all the data to the local buffer of the network stack of the sending machine. The recv obviously has to wait for the data to be transmitted, including latency and stuff needed for all the layers of protocols.
Issue summary: AF_UNIX stable sending, bursty receiving.
I have an application B that receives data over unix domain datagram socket. There is peer application A that sends data to it. Both A and B are running continuously (and are SCHED_FIFO). My application A also prints the time of reception.
The peer application B can send data at varying timings (varying in terms of milliseconds only). Ideally (what I expect) the packet send delay should exactly match with reception delay. For example:
A sends in time : 5ms 10ms 15ms 21ms 30ms 36ms
B should receive in time : 5+x ms 10+x ms 15+x ms 21+x ms ...
Where x is a constant delay.
But when I experimented what I observe in B is :
A sends in time : 5ms 10ms 15ms 21ms 30ms 36ms
B received in time : 5+w ms 10+x ms 15+y ms 21+z ms ...
(w,x,y,z are different constant delays). So I cannot predict reception time when sending time is given).
Is it because some buffering is involved in unix domain socket ? Please suggest some workaround for the issue so that the reception time is predicable from send time. I need 1 millisecond accuracy.
(I am using vanilla Linux 3.0 kernel)
As you are using blocking recv(), when no datagram is available your program will be unscheduled. This is bad for your use case--you want your program to stay hot. So make your recv() non-blocking, and handle EAGAIN by simply busy waiting. This will consume 100% of one core, but I think you'll find it helps you achieve your goal.