Packet loss caused by OpenSSL? Weird CPU usage - c

I'm writing a network application that reads packets from UDP sockets and then decrypts them using OpenSSL.
The main function looks like this:
receive() {
    while (1) {
        read(udp_sock);
        decrypt_packet();
    }
}
The program used to work fine until I added encryption. Now a lot of packets are lost between the kernel buffer and my application (netstat -su shows RcvbufErrors: 77123 and growing ;). Packets are rather big (60 KB) and I'm using 1 Gbps Ethernet, so the problem starts once I exceed about 100 Mbps.
That sounds normal: decryption takes too much time and packets arrive too fast. The problem is that CPU usage never exceeds 30% on either the sender or the receiver.
The problem disappears after commenting out this call in decrypt_packet():
AES_ctr128_encrypt();
My question is: is it possible that OpenSSL uses some set of instructions that is not counted in CPU usage (I use htop and the GNOME System Monitor)? If not, what else could cause such packet loss while CPU power is still available for processing?

How many CPU cores does your system have? Is your code single threaded? It could be maxing out a single core and thus using only 25% of the available CPU.

Using a profiler I was able to solve the problem. OpenSSL uses a special set of instructions that is executed in a dedicated part of the CPU. The reported CPU usage was low, but it was in fact occupied doing the encryption work, so my application couldn't drain the kernel buffer fast enough.
I moved decryption to another thread, which solved the problem. And now the thread handling all the encryption is shown as using 0% CPU all the time.
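For anyone hitting the same thing, here is a minimal sketch of that receive-thread/decrypt-thread split, assuming a fixed-size ring buffer protected by a mutex and condition variables; decrypt_packet(), the slot count, and the packet size are placeholders, not the original code.

#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define SLOTS   64
#define MAX_PKT 65536

static unsigned char slots[SLOTS][MAX_PKT];
static ssize_t lens[SLOTS];
static int head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

void decrypt_packet(unsigned char *pkt, ssize_t len);   /* placeholder */

void *reader_thread(void *arg)          /* only drains the socket */
{
    int sock = *(int *)arg;
    unsigned char tmp[MAX_PKT];
    for (;;) {
        ssize_t n = read(sock, tmp, sizeof tmp);
        if (n <= 0)
            continue;
        pthread_mutex_lock(&lock);
        while (count == SLOTS)          /* queue full: wait for decryptor */
            pthread_cond_wait(&not_full, &lock);
        memcpy(slots[tail], tmp, (size_t)n);
        lens[tail] = n;
        tail = (tail + 1) % SLOTS;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

void *decrypt_thread(void *arg)         /* does the crypto work */
{
    (void)arg;
    unsigned char pkt[MAX_PKT];
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &lock);
        ssize_t n = lens[head];
        memcpy(pkt, slots[head], (size_t)n);
        head = (head + 1) % SLOTS;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        decrypt_packet(pkt, n);         /* OpenSSL call happens off the read path */
    }
    return NULL;
}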

Related

lwip to send data bigger than 64kb

I'm trying to send RT data (4 bytes per sample, 10 channels, sampled at 100 kHz) over lwIP.
I've understood that lwIP has a timer which loops every 250 ms and that it cannot be changed.
So I'm buffering the RT data in RAM at 100 kHz and every 250 ms sending the sampled data over TCP.
The problem is that I cannot go over 65535 bytes every 250 ms because I get MEM_ERR.
I already increased the buffer up to 65535, but when I try to increase it further I get several errors during compiling.
So my question is: can lwIP manage a buffer bigger than 16 bits can address?
Thanks,
Marco
Better to focus on throughput.
You neglected to show any code, describe which Xilinx board/system you're using, or which OS you're using (e.g. FreeRTOS, linux, etc.).
Your RT data is: 4 bytes * 10 channels * 100 kHz --> 4,000,000 bytes / second.
From your lwip description, you have 65,536-byte packets * 4 packets/sec --> 262,144 bytes / second.
This is too slow. And it is much slower than what a typical TCP / Ethernet link can process, so I think your understanding of the maximum rate is off a bit.
You probably can't increase the packet size.
But, you probably can increase the number of packets sent.
You say that the 250ms interval for lwip can't be changed. I believe that it can.
From: https://www.xilinx.com/support/documentation/application_notes/xapp1026.pdf we have the section: Creating an lwIP Application Using the RAW API:
Set up a timer to interrupt at a constant interval. Usually, the interval is around 250 ms. In the timer interrupt, update necessary flags to invoke the lwIP TCP APIs tcp_fasttmr and tcp_slowtmr from the main application loop explained previously
The "usually" seems to me to imply that it's a default and not a maximum.
But, you may not need to increase the timer rate as I don't think it dictates the packet rate, just the servicing/completion rate [in software].
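For reference, the RAW-API timer pattern the app note describes looks roughly like the sketch below. The flag names and ISR hookup are illustrative, and the header location varies by lwIP version (lwip/tcp_impl.h in 1.4.x, lwip/priv/tcp_priv.h in 2.x).

#include "lwip/tcp_impl.h"   /* tcp_fasttmr(), tcp_slowtmr() */

volatile int tcp_fast_due;   /* set by the hardware timer ISR every 250 ms */
volatile int tcp_slow_due;   /* set every second tick, i.e. every 500 ms   */

void timer_isr(void)         /* hooked to the platform timer interrupt */
{
    static int ticks;
    tcp_fast_due = 1;
    if (++ticks & 1)
        tcp_slow_due = 1;
}

void main_loop(void)
{
    for (;;) {
        /* poll the MAC, feed received frames to lwIP, queue new data ... */
        if (tcp_fast_due) { tcp_fast_due = 0; tcp_fasttmr(); }
        if (tcp_slow_due) { tcp_slow_due = 0; tcp_slowtmr(); }
    }
}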
A few guesses ...
Once a packet is queued to the NIC, normally, other packets may be queued asynchronously. Modern NIC hardware often has a hardware queue. That is, the NIC H/W supports multiple pending requests. It can service those at line speed without CPU intervention.
The 250ms may just be a timer interrupt that retires packet descriptors of packets completed by the NIC hardware.
That is, more than one packet can be processed/completed per interrupt. If that were not the case, then only 4 packets / second could be sent and that would be ridiculously low.
Generating an interrupt from the NIC for each packet incurs an overhead. My guess is that interrupts from the NIC are disabled. And, the NIC is run in a "polled" mode. The polling occurs in the timer ISR.
The timer interrupt will occur 4 times per second. But, will process any packets it sees that are completed. So, the ISR overhead is only 4 interrupts / second.
This increases throughput because the ISR overhead is reduced.
UPDATE:
Thanks for the reply. Indeed it is 4 bytes * 10 channels * 100 kHz --> 4,000,000 bytes / second, and I agree that we are quite far from 100 Mbit/s.
Caveat: I don't know lwip, so most of what I'm saying is based on my experience with other network stacks (e.g. linux), but it appears that lwip should be similar.
Of course, lwip will provide a way to achieve full line speed.
Regarding changing the 250 ms timer period: to achieve what I want it would have to be lowered more than 10 times, which seems too much and could compromise the stability of the protocol.
When you say that, did you actually try that?
And, again, you didn't post your code or describe your target system, so it's difficult to provide help.
Issues could be because of the capabilities [or lack thereof] of your target system and its NIC.
Or, because of the way your code is structured, you're not making use of the features that can make it fast.
So basically your suggestion is to enable the interrupt on each message? In this case I can send the remaining data in the ACK callback, if I understood correctly. – Marco Martines
No.
The mode for interrupt on every packet is useful for low data rates where the use of the link is sparse/sporadic.
If you have an interrupt for every packet, the overhead of entering/exiting the ISR (the ISR prolog/epilog) code will become significant [and possibly untenable] at higher data rates.
That's why the timer based callback is there. To accumulate finished request blocks and [quickly] loop on the queue/chain and complete/reclaim them periodically. If you wish to understand the concept, look at NAPI: https://en.wikipedia.org/wiki/New_API
Loosely, on most systems, when you do a send, a request block is created with all info related to the given buffer. This block is then queued to the transmit queue. If the transmitter is idle, it is started. For subsequent blocks, the block is appended to the queue. The driver [or NIC card] will, after completing a previous request, immediately start a new/fresh one from the front of the queue.
This allows you to queue multiple/many blocks quickly [and return immediately]. They will be sent in order at line speed.
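The same "queue many, return immediately" idea applies at the lwIP API level, which also sidesteps the 65535-byte limit. A hedged sketch (illustrative only): tcp_write() takes a u16_t length, so a large buffer is handed over in chunks as tcp_sndbuf() reports space, and the remainder would be sent from the tcp_sent() callback.

#include "lwip/tcp.h"

err_t send_large(struct tcp_pcb *pcb, const u8_t *data, u32_t len)
{
    while (len > 0) {
        u32_t avail = tcp_sndbuf(pcb);              /* space left in the send buffer */
        if (avail == 0)
            break;                                  /* resume from the tcp_sent() callback */
        u32_t chunk = (len < avail) ? len : avail;
        if (chunk > 0xFFFF)
            chunk = 0xFFFF;                         /* tcp_write() length is a u16_t */
        err_t err = tcp_write(pcb, data, (u16_t)chunk, TCP_WRITE_FLAG_COPY);
        if (err != ERR_OK)
            return err;                             /* e.g. ERR_MEM: retry later */
        data += chunk;
        len  -= chunk;
    }
    return tcp_output(pcb);                         /* push the queued segments out */
}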
What actually happens depends on system H/W, NIC controller, and OS and what lwip modes you use.

Raw Sockets Vs Libpcap in sending performance

I'm currently attempting to get the best sending performance for an 802.11 frame. I am using libpcap, but I wondered if I could speed it up using raw sockets (or any other possible method).
Consider this simple example code for libpcap with a device handle already created previously:
char ourPacket[60][50] = { {0x01, 0x02, ... , 0x50}, ... , {0x01, 0x02, ... , 0x50} };
for ( ; ; )
{
    for (int i = 0; i < 60; ++i)
    {
        pcap_sendpacket(deviceHandle, ourPacket[i], 50);
    }
}
This code segment is done on a thread for each separate CPU core. Is there any faster way to do this for raw 802.11 frame/packets containing Radiotap headers that are stored in an array?
Looking at pcap's source code for pcap_inject (the same function but with a different return value), it doesn't seem to be using raw sockets to send packets? No clue.
I don't care about capturing performance, as a lot of the other questions have answered that. Are raw sockets even meant for sending layer-2 packets/frames?
As Gill Hamilton mentioned, the answer will depend on a lot of things. If you see super gains on one system, you may not see them on another, even if both are "running Linux". You're better off testing the code yourself. That being said, here is some info from what my team has found:
Note 1: all the gains were for code that did not just write frames/packets to sockets, but analyzed them and processed them, so it is likely that much or most of our gains were there rather than the read/write.
We are writing a direct raw-socket implementation to send/receive Ethernet frames and IP packets. We're seeing about a 250%-450% performance gain on our measliest R&D system, which is a MIPS 24K 5V system-on-a-chip with an MT7530 integrated Ethernet NIC/switch that can barely handle a sustained 80 Mbit/s. On a very modest but much beefier test system with an Intel Celeron J1900 and I211 gigabit controllers, the gain drops to about 50%-100% vs. C libpcap. In fact, we only saw about 80%-150% vs. a Python dpkt/scapy implementation. We only saw maybe a 20% gain on a generic i5 Linux dual-gigabit system vs. a C libpcap implementation. So, based on our non-rigorous testing, we saw up to a 20x difference in the performance gain depending on the system.
Note 2: All of these gains were with maximum optimizations and the strictest compile parameters for the custom C code, but not necessarily for the C libpcap code (making all warnings errors on some of the above systems makes the libpcap code not compile, and who wants to debug that?), so the differences may be less dramatic. We need to squeeze out every last ounce of performance to enable some sophisticated packet processing using no more than 5.0 V and 1.5 A, so we'll ultimately be going with a custom ASIC, which may be an FPGA. That being said, it's A LOT of work to get it working without bugs, and we're likely going to end up implementing significant portions of the Ethernet/IP/TCP/UDP stack, so I don't recommend it.
Last note: The CPU usage on the MIPS 24K system was about 1/10 with the custom code, but again, I would say that the vast majority of that gain was from the processing.
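For completeness, "direct raw socket" sending of a layer-2 frame on Linux looks roughly like the sketch below (AF_PACKET). The interface name is illustrative, and for radiotap/802.11 frames the interface would have to be in monitor mode.

#include <arpa/inet.h>         /* htons() */
#include <linux/if_ether.h>    /* ETH_P_ALL */
#include <linux/if_packet.h>   /* struct sockaddr_ll */
#include <net/if.h>            /* if_nametoindex() */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int send_raw_frame(const unsigned char *frame, size_t len)
{
    /* in a real benchmark you would create the socket once and reuse it */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof addr);
    addr.sll_family  = AF_PACKET;
    addr.sll_ifindex = if_nametoindex("wlan0");   /* illustrative interface name */

    ssize_t n = sendto(fd, frame, len, 0, (struct sockaddr *)&addr, sizeof addr);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}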

Receiving RAW socket packets with microseconds level accuracy

I am writing code that receives raw Ethernet packets (no TCP/UDP) from the server every 1 ms. For every packet received, my application has to reply with 14 raw packets. If the server doesn't receive the 14 packets before it sends its next packet (scheduled every 1 ms), it raises an alarm and the application has to break out. The server-client communication is a one-to-one link.
The server is hardware (an FPGA) which generates packets at a precise 1 ms interval. The client application runs on a Linux (RHEL/CentOS 7) machine with a 10G SolarFlare NIC.
My first version of code is like this
while(1)
{
    while(1)
    {
        numbytes = recvfrom(sockfd, buf, sizeof(buf), 0, NULL, NULL);
        if(numbytes > 0)
        {
            //Some more lines here, to read packet number
            break;
        }
    }
    for (i = 0; i < 14; i++)
    {
        if (sendto(sockfd, (void *)(sym), sizeof(sym), 0, NULL, NULL) < 0)
            perror("Send failed\n");
    }
}
I measure the receive time by taking timestamps (using clock_gettime) before and after the recvfrom call, and I print the difference whenever it falls outside the allowable range of 900-1100 us.
The problem I am facing is that the packet receive time fluctuates. Something like this (the prints are in microseconds):
Decode Time : 1234
Decode Time : 762
Decode Time : 1593
Decode Time : 406
Decode Time : 1703
Decode Time : 257
Decode Time : 1493
Decode Time : 514
and so on..
And sometimes the decode times exceed 2000us and application would break.
In this situation, application would break anywhere between 2 seconds to a few minutes.
Options I have tried so far (a rough sketch of the first two appears just after this list):
Setting affinity to a particular isolated core.
Setting scheduling priorities to maximum with SCHED_FIFO
Increase socket buffer sizes
Setting network interface interrupt affinity to the same core which processes application
Spinning over recvfrom using poll(),select() calls.
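Sketch of the affinity and SCHED_FIFO setup referred to above; the core number is illustrative and error handling is omitted.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_and_boost(int core)
{
    cpu_set_t set;                                   /* pin this thread to one isolated core */
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    struct sched_param sp = { .sched_priority = sched_get_priority_max(SCHED_FIFO) };
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);   /* needs root/CAP_SYS_NICE */
}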
All these options give a significant improvement over the initial version of the code. Now the application runs for ~1-2 hours. But this is still not enough.
A few observations:
I get a huge dump of these decode-time prints whenever I open an ssh session to the Linux machine while the application is running (which makes me think network traffic over the other 1G Ethernet interface is interfering with the 10G Ethernet interface).
The application performs better on RHEL (run times of about 2-3 hours) than on CentOS (run times of about 30 mins - 1.5 hours).
The run time also varies across Linux machines with different hardware configurations running the same OS.
Please suggest if there are any other methods to improve the run-time of the application.
Thanks in advance.
First, you need to verify the accuracy of the timestamping method, clock_gettime. The resolution is nanoseconds, but the accuracy and precision are in question. That is not the answer to your problem, but it tells you how reliable the timestamping is before proceeding. See Difference between CLOCK_REALTIME and CLOCK_MONOTONIC? for why CLOCK_MONOTONIC should be used for your application.
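A minimal sketch of the measurement itself with CLOCK_MONOTONIC (variable names are illustrative, not your actual code):

#include <time.h>

static long elapsed_us(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000000L + (b->tv_nsec - a->tv_nsec) / 1000L;
}

/* around the receive call:
 *   struct timespec t0, t1;
 *   clock_gettime(CLOCK_MONOTONIC, &t0);
 *   numbytes = recvfrom(sockfd, buf, sizeof(buf), 0, NULL, NULL);
 *   clock_gettime(CLOCK_MONOTONIC, &t1);
 *   long us = elapsed_us(&t0, &t1);
 *   if (us < 900 || us > 1100)
 *       printf("Decode Time : %ld\n", us);
 */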
I suspect the majority of the decode time fluctuation is either due to a variable number of operations per decode, context switching of the operating system, or IRQs.
Operations per decode I cannot comment on since the code has been simplified in your post. This issue can also be profiled and inspected.
Context switching per process can be easily inspected and monitored: https://unix.stackexchange.com/a/84345
As Ron stated, these are very strict timing requirements for a network. It must be an isolated network, and single purpose. Your observation regarding decode over-time when ssh'ing indicates all other traffic must be prevented. This is disturbing, given separate NICs. Thus I suspect IRQs are the issue. See /proc/interrupts.
To achieve consistent decode times over long intervals (hours->days) will require drastically simplifying the OS. Removing unnecessary processes and services, hardware, and perhaps building your own kernel. All for the goal of reducing context switching and interrupts. At which point a real-time OS should be considered. This will only improve the probability of consistent decode time, not guarantee.
My work is developing a data acquisition system that is a combination of FPGA ADC, PC, and ethernet. Inevitably, the inconsistency of a multi-purpose PC means certain features must be moved to dedicated hardware. Consider the Pros/Cons of developing your application for PC versus moving it to hardware.

Scheduling routines in C and timing requirements

I'm working on a C program that transmits samples over USB3 for a set period of time (1-10 us), and then receives samples for 100-1000 us. I have a rudimentary pthread implementation where the TX and RX routines are each handled as a thread. The reason for this is that in order to test the actual TX routine, the RX needs to run and sample before the transmitter is activated.
Note that I have very little C experience outside of embedded applications and this is my first time dabbling with pthread.
My question is, since I know exactly how many samples I need to transmit and receive, how can I e.g. start the RX thread once the TX thread is done executing and vice versa? How can I ensure that the timing stays consistent? Sampling at 10 MHz causes some harsh timing requirements.
Thanks!
EDIT:
To provide a little more detail, my device is a bladeRF x40 SDR, and communication to the device is handled by an FX3 microcontroller over a USB 3 connection. I'm running Xubuntu 14.04. Processing, scheduling and configuration, however, are handled by a C program which runs on the PC.
You don't say anything about your platform, except that it supports pthreads.
So, assuming Linux, you're going to have to realize that in general Linux is not a real-time operating system, and what you're doing sure sounds as if it has real-time timing requirements.
There are real-time variants of Linux, I'm not sure how they'd suit your needs. You might also be able to achieve better performance by doing the work in a kernel driver, but then you won't have access to pthreads so you're going to have to be a bit more low-level.
Thought I'd post my solution.
While the next build of the bladeRF firmware and FPGA image will include the option to add metadata (timestamps) to the synchronous interface, until then there's no real way in which I can know at which time instants certain events occurred.
What I do know is my sampling rate, and exactly how many samples I need to transmit and receive at which times relative to each other. Therefore, by using condition variables (with pthread), I can signal my receiver to start receiving samples at the desired instant. Since TX and RX operations happen in a very specific sequence, I can calculate delays by counting the number of samples and dividing by the sampling rate, which has proven to be 95-98% accurate.
This obviously means that since my TX and RX threads are running simultaneously, there are chunks of data within the received set of samples that will be useless, and I have another routine in place to discard those samples.
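A bare-bones version of that condition-variable hand-off (names are illustrative, not the actual bladeRF code):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static bool rx_may_start = false;

void *tx_thread(void *arg)
{
    /* ... transmit the known number of samples ... */
    pthread_mutex_lock(&m);
    rx_may_start = true;           /* TX done: let RX proceed */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return NULL;
}

void *rx_thread(void *arg)
{
    pthread_mutex_lock(&m);
    while (!rx_may_start)          /* guard against spurious wakeups */
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
    /* ... receive the known number of samples ... */
    return NULL;
}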

UDP sendto performance over loopback

Background
I have a very high-throughput / low-latency network app (the goal is << 5 usec per packet) and I wanted to add some monitoring/metrics to it. I have heard about the statsd craze and it seems a simple way to collect metrics and feed them into our time-series database. Sending a metric is done via a small UDP packet written to a daemon (typically running on the same server).
I wanted to characterize the effect of sending ~5-10 UDP packets in my data path to understand how much latency it would add, and I was surprised at how bad it is. I know this is a very obscure micro-benchmark, but I just wanted to get a rough idea of where it lands.
The question I have
I am trying to understand why it takes so long (relatively speaking) to send a UDP packet to localhost versus a remote host. Are there any tweaks I can make to reduce the latency of sending a UDP packet? I am thinking the solution for me is to push metric collection to an auxiliary core, or to actually run the statsd daemon on a separate host.
My setup/benchmarks
CentOS 6.5 with some beefy server hardware.
The client test program I have been using is available here: https://gist.github.com/rishid/9178261
Compiled with gcc 4.7.3: gcc -O3 -std=gnu99 -mtune=native udp_send_bm.c -lrt -o udp_send_bm
The receiver side is running nc -ulk 127.0.0.1 12000 > /dev/null (the IP changes per interface)
I have run this micro-benchmark with the following devices.
Some benchmark results:
loopback
Packet Size 500 // Time per sendto() 2159 nanosec // Total time 2.159518
integrated 1 Gb mobo controller
Packet Size 500 // Time per sendto() 397 nanosec // Total time 0.397234
intel ixgbe 10 Gb
Packet Size 500 // Time per sendto() 449 nanosec // Total time 0.449355
solarflare 10 Gb with userspace stack (onload)
Packet Size 500 // Time per sendto() 317 nanosec // Total time 0.317229
Writing to loopback will not be an efficient way to communicate inter-process for profiling. Generally the buffer will be copied multiple times before it's processed, and you run the risk of dropping packets since you're using UDP. You're also making additional calls into the operating system, so you add to the risk of context switching (~2 us).
goal is << 5 usec per packet
Is this a hard real-time requirement, or a soft requirement? Generally, when you're handling things in microseconds, profiling should be zero-overhead. You're using Solarflare, so I think you're serious. The best way I know to do this is tapping into the physical line and sniffing traffic for metrics. A number of products do this.
I/O to disk or the network is very slow if you are incorporating it in a very tight (real-time) processing loop. A solution might be to offload the I/O to a separate, lower-priority task. Let the real-time loop pass the messages to the I/O task through a (preferably lock-free) queue.
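A sketch of that hand-off: the hot loop pushes metric messages into a single-producer/single-consumer ring and a low-priority thread drains it and does the sendto(). Uses C11 atomics; the message type and sizes are illustrative.

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 1024                     /* power of two */

struct metric_msg { char text[64]; };

static struct metric_msg ring[RING_SIZE];
static _Atomic unsigned head;              /* advanced by the consumer */
static _Atomic unsigned tail;              /* advanced by the producer */

/* called from the hot path: never blocks, drops the metric if the ring is full */
bool ring_push(const struct metric_msg *m)
{
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (t - h == RING_SIZE)
        return false;                      /* full: drop rather than stall */
    ring[t % RING_SIZE] = *m;
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return true;
}

/* called from the low-priority I/O thread, which then does the UDP send */
bool ring_pop(struct metric_msg *out)
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (t == h)
        return false;                      /* empty */
    *out = ring[h % RING_SIZE];
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return true;
}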
