Receiving RAW socket packets with microsecond-level accuracy in C

I am writing code that receives raw Ethernet packets (no TCP/UDP) from the server every 1 ms. For every packet received, my application has to reply with 14 raw packets. If the server doesn't receive those 14 packets before it sends its next packet (scheduled every 1 ms), it raises an alarm and the application has to break out. The server-client communication is a one-to-one link.
The server is hardware (an FPGA) which generates packets at a precise 1 ms interval. The client application runs on a Linux (RHEL/CentOS 7) machine with a 10G SolarFlare NIC.
My first version of the code looks like this:
while(1)
{
    while(1)
    {
        numbytes = recvfrom(sockfd, buf, sizeof(buf), 0, NULL, NULL);
        if(numbytes > 0)
        {
            //Some more lines here, to read packet number
            break;
        }
    }

    for (i = 0; i < 14; i++)
    {
        if (sendto(sockfd, (void *)(sym), sizeof(sym), 0, NULL, NULL) < 0)
            perror("Send failed\n");
    }
}
I measure the receive time by taking timestamps (using clock_gettime) before and after the recvfrom call, and I print the difference whenever it falls outside the allowable range of 900-1100 µs.
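In outline, the measurement is something like this (a simplified sketch of what is described above; CLOCK_MONOTONIC is assumed as the clock source and the variables continue the snippet):

    struct timespec t_start, t_end;
    long decode_us;

    clock_gettime(CLOCK_MONOTONIC, &t_start);
    numbytes = recvfrom(sockfd, buf, sizeof(buf), 0, NULL, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t_end);

    /* elapsed time in microseconds */
    decode_us = (t_end.tv_sec  - t_start.tv_sec) * 1000000L
              + (t_end.tv_nsec - t_start.tv_nsec) / 1000L;

    if (decode_us < 900 || decode_us > 1100)
        printf("Decode Time : %ld\n", decode_us);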
The problem I am facing is that the packet receive time fluctuates. Something like this (the prints are in microseconds):
Decode Time : 1234
Decode Time : 762
Decode Time : 1593
Decode Time : 406
Decode Time : 1703
Decode Time : 257
Decode Time : 1493
Decode Time : 514
and so on..
And sometimes the decode times exceed 2000 µs and the application breaks.
In this situation, the application breaks anywhere between 2 seconds and a few minutes after starting.
Options I have tried so far:
Setting affinity to a particular isolated core.
Setting scheduling priorities to maximum with SCHED_FIFO
Increase socket buffer sizes
Setting network interface interrupt affinity to the same core which processes application
Spinning over recvfrom using poll()/select() calls.
All these options give a significant improvement over the initial version of the code (a sketch of the tuning is below). Now the application runs for ~1-2 hours, but this is still not enough.
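For reference, the affinity / priority / socket-buffer tuning is done roughly like this (a simplified sketch, assuming core 3 is the isolated core and error handling is omitted):

    /* needs #define _GNU_SOURCE for CPU_SET()/sched_setaffinity() */
    #include <sched.h>
    #include <sys/socket.h>

    /* Pin the process to the isolated core (core 3 here, as an example) */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    sched_setaffinity(0, sizeof(set), &set);

    /* Run with the highest SCHED_FIFO priority */
    struct sched_param sp = { .sched_priority = sched_get_priority_max(SCHED_FIFO) };
    sched_setscheduler(0, SCHED_FIFO, &sp);

    /* Enlarge the socket receive buffer (4 MB is an arbitrary example value) */
    int rcvbuf = 4 * 1024 * 1024;
    setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));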
A few observations:
I get a huge dump of these decode-time prints whenever I open SSH sessions to the Linux machine while the application is running (which makes me think network traffic over the other 1G Ethernet interface interferes with the 10G Ethernet interface).
The application performs better on RHEL (run times of about 2-3 hours) than on CentOS (run times of about 30 minutes to 1.5 hours).
The run times also vary between Linux machines with different hardware configurations running the same OS.
Please suggest any other methods that might improve the run time of the application.
Thanks in advance.

First, you need to verify the accuracy of the timestamping method, clock_gettime. Its resolution is nanoseconds, but its accuracy and precision are in question. That is not the answer to your problem, but it tells you how reliable the timestamping is before proceeding. See "Difference between CLOCK_REALTIME and CLOCK_MONOTONIC?" for why CLOCK_MONOTONIC should be used for your application.
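A quick way to see what the clock claims for itself is clock_getres(); keep in mind this reports resolution, not accuracy (a minimal sketch):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec res;
        clock_getres(CLOCK_MONOTONIC, &res);   /* reported resolution, not measured accuracy */
        printf("CLOCK_MONOTONIC resolution: %ld ns\n", res.tv_nsec);
        return 0;
    }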
I suspect the majority of the decode time fluctuation is either due to a variable number of operations per decode, context switching of the operating system, or IRQs.
Operations per decode I cannot comment on since the code has been simplified in your post. This issue can also be profiled and inspected.
Context switching per process can be easily inspected and monitored: https://unix.stackexchange.com/a/84345
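You can also read the counters from inside the process itself via getrusage(); the involuntary switches are the ones to worry about here (a minimal sketch):

    #include <stdio.h>
    #include <sys/resource.h>

    void print_ctx_switches(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        printf("voluntary: %ld  involuntary: %ld\n",
               ru.ru_nvcsw, ru.ru_nivcsw);
    }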
As Ron stated, these are very strict timing requirements for a network. It must be an isolated network, and single purpose. Your observation regarding decode over-time when ssh'ing indicates all other traffic must be prevented. This is disturbing, given separate NICs. Thus I suspect IRQs are the issue. See /proc/interrupts.
Achieving consistent decode times over long intervals (hours to days) will require drastically simplifying the OS: removing unnecessary processes, services, and hardware, and perhaps building your own kernel, all with the goal of reducing context switching and interrupts. At that point a real-time OS should be considered. This will only improve the probability of consistent decode times, not guarantee them.
My work is developing a data acquisition system that is a combination of FPGA ADC, PC, and ethernet. Inevitably, the inconsistency of a multi-purpose PC means certain features must be moved to dedicated hardware. Consider the Pros/Cons of developing your application for PC versus moving it to hardware.

Related

Implementing Primary NTP Server (GPS Receiver)

I'm trying to implement an NTP server based on an NMEA GPS Receiver. I'm not sure what to fill the root delay field with.
I've read the NTPv4 specification and it's written that root delay is the total round-trip delay to the reference clock.
If I'm working with a secondary server, root delay can be calculated from the time difference between the timestamps when making the packet requests with the reference server (am I correct?).
But I'm not sure what to fill it with if I'm using a GPS Receiver as the reference clock, should I fill it with 0 instead?
It will depend largely on how you're setting the time in your server from the GPS. If you're reading the NMEA sentence, interpreting it and setting the clock, the root delay would be the time taken to do that. But it wouldn't be a very good clock; there's a lot of non-deterministic delays (jitter) involved in reading RS232 (assuming that is how you're connected to the GPS).
You can use the 1 pulse per second output of a GPS receiver to fix that. It's normally on the Data Carrier Detect pin. Using a proper RS232 port (not a USB one) you can have the server's clock synchronised to that (DCD can be used to raise an interrupt), so now you get very good alignment to GPS time. This could certainly be done in Solaris (a native part of the kernel), and in Linux too (http://support.ntp.org/bin/view/Support/ConfiguringNMEARefclocks). If you're doing this then I think that the root delay would be small, but there's the matter of the OS and hardware's response time to interrupts.
EDIT
According to this NTP docs page,
Root Delay: This is the total roundtrip delay to the primary reference source at the root of the synchronization subnet, in seconds. Note that this variable can take on both positive and negative values, depending on clock Precision and Skew.
So with 1PPS it's going to be pretty low. So far as I can tell it's a field that a secondary NTP server uses to tell its clients what its delay to a reference clock is. So if you have a 1PPS locked GPS time source, you are a reference clock. In which case, perhaps zero is correct enough; I don't think that NTP can achieve cross-network time synchronisation accuracies (1ms at best) better than the IRQ response time of a computer (< 50us hopefully with a good CONFIG_PREEMPT_RT linux kernel with nothing else going on).

Scheduling routines in C and timing requirements

I'm working on a C program that transmits samples over USB3 for a set period of time (1-10 us), and then receives samples for 100-1000 us. I have a rudimentary pthread implementation where the TX and RX routines are each handled as a thread. The reason for this is that in order to test the actual TX routine, the RX needs to run and sample before the transmitter is activated.
Note that I have very little C experience outside of embedded applications and this is my first time dabbling with pthread.
My question is, since I know exactly how many samples I need to transmit and receive, how can I e.g. start the RX thread once the TX thread is done executing and vice versa? How can I ensure that the timing stays consistent? Sampling at 10 MHz causes some harsh timing requirements.
Thanks!
EDIT:
To provide a little more detail, my device is a bladeRF x40 SDR, and communication to the device is handled by a FX3 microcontroller, which occurs over a USB3 connection. I'm running Xubuntu 14.04. Processing, scheduling and configuration however is handled by a C program which runs on the PC.
You don't say anything about your platform, except that it supports pthreads.
So, assuming Linux, you're going to have to realize that in general Linux is not a real-time operating system, and what you're doing sure sounds as if it has real-time timing requirements.
There are real-time variants of Linux, I'm not sure how they'd suit your needs. You might also be able to achieve better performance by doing the work in a kernel driver, but then you won't have access to pthreads so you're going to have to be a bit more low-level.
Thought I'd post my solution.
While the next build of the bladeRF firmware and FPGA image will include the option to add metadata (timestamps) to the synchronous interface, until then there's no real way in which I can know at which time instants certain events occurred.
What I do know is my sampling rate, and exactly how many samples I need to transmit and receive at which times relative to each other. Therefore, by using condition variables (with pthread), I can signal my receiver to start receiving samples at the desired instant. Since TX and RX operations happen in a very specific sequence, I can calculate delays by counting the number of samples and dividing by the sampling rate, which has proven to be 95-98% accurate.
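A minimal sketch of that condition-variable handshake (the names are illustrative, not the actual bladeRF code):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  rx_go = PTHREAD_COND_INITIALIZER;
    static bool rx_should_start  = false;

    /* TX thread: wake the receiver at the desired instant */
    void signal_rx_start(void)
    {
        pthread_mutex_lock(&lock);
        rx_should_start = true;
        pthread_cond_signal(&rx_go);
        pthread_mutex_unlock(&lock);
    }

    /* RX thread: block until signalled to start sampling */
    void wait_for_rx_start(void)
    {
        pthread_mutex_lock(&lock);
        while (!rx_should_start)
            pthread_cond_wait(&rx_go, &lock);
        pthread_mutex_unlock(&lock);
    }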
This obviously means that since my TX and RX threads are running simultaneously, there are chunks of data within the received set of samples that will be useless, and I have another routine in place to discard those samples.

UDP sendto performance over loopback

Background
I have a very high throughput / low latency network app (goal is << 5 usec per packet) and I wanted to add some monitoring/metrics to it. I have heard about the statsd craze, and it seems a simple way to collect metrics and feed them into our time-series database. Sending metrics is done via a small UDP packet written to a daemon (typically running on the same server).
I wanted to characterize the effect of sending ~5-10 UDP packets in my data path to understand how much latency it would add, and was surprised at how bad it is. I know this is a very obscure micro-benchmark, but I just wanted to get a rough idea of where it lands.
The question I have
I am trying to understand why it takes so long (relatively speaking) to send a UDP packet to localhost versus a remote host. Are there any tweaks I can make to reduce the latency of sending a UDP packet? I am thinking the solution for me is to push metric collection onto an auxiliary core, or actually run the statsd daemon on a separate host.
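For concreteness, the thing being timed is essentially a tight sendto() loop like this (a stripped-down sketch; the full benchmark is in the gist linked below, and sock/dest are assumed to be set up already):

    struct timespec t0, t1;
    char payload[500] = {0};
    const int iterations = 1000000;      /* example count */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++)
        sendto(sock, payload, sizeof(payload), 0,
               (struct sockaddr *)&dest, sizeof(dest));
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("Packet Size %zu // Time per sendto() %.0f nanosec // Total time %f\n",
           sizeof(payload), total * 1e9 / iterations, total);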
My setup/benchmarks
CentOS 6.5 with some beefy server hardware.
The client test program I have been using is available here: https://gist.github.com/rishid/9178261
Compiled with gcc 4.7.3: gcc -O3 -std=gnu99 -mtune=native udp_send_bm.c -lrt -o udp_send_bm
The receiver side is running nc -ulk 127.0.0.1 12000 > /dev/null (with the IP changed per interface).
I have run this micro-benchmark with the following devices.
Some benchmark results:
loopback
Packet Size 500 // Time per sendto() 2159 nanosec // Total time 2.159518
integrated 1 Gb mobo controller
Packet Size 500 // Time per sendto() 397 nanosec // Total time 0.397234
intel ixgbe 10 Gb
Packet Size 500 // Time per sendto() 449 nanosec // Total time 0.449355
solarflare 10 Gb with userspace stack (onload)
Packet Size 500 // Time per sendto() 317 nanosec // Total time 0.317229
Writing to loopback will not be an efficient way to communicate inter-process for profiling. Generally the buffer will be copied multiple times before it's processed, and you run the risk of dropping packets since you're using UDP. You're also making additional calls into the operating system, so you add to the risk of context switching (~2 µs).
goal is << 5 usec per packet
Is this a hard real-time requirement, or a soft requirement? Generally when you're handling things in microseconds, profiling should be zero overhead. You're using Solarflare, so I think you're serious. The best way I know to do this is to tap into the physical line and sniff traffic for metrics. A number of products do this.
I/O to disk or the network is very slow if you are incorporating it into a very tight (real-time) processing loop. A solution might be to offload the I/O to a separate, lower-priority task. Let the real-time loop pass the messages to the I/O task through a (preferably lock-free) queue, as sketched below.
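A minimal sketch of such a hand-off, here as a single-producer/single-consumer ring buffer using C11 atomics (illustrative only; a production queue needs more care):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define QCAP 1024                     /* must be a power of two */

    struct metric { char text[64]; };

    struct spsc_queue {
        struct metric slots[QCAP];
        _Atomic unsigned head;            /* advanced by the consumer */
        _Atomic unsigned tail;            /* advanced by the producer */
    };

    /* Called from the real-time loop: never blocks, drops on overflow. */
    static bool queue_push(struct spsc_queue *q, const struct metric *m)
    {
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail - head == QCAP)
            return false;                 /* full: drop the metric, don't stall */
        q->slots[tail & (QCAP - 1)] = *m;
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Called from the low-priority I/O task, which does the actual send. */
    static bool queue_pop(struct spsc_queue *q, struct metric *out)
    {
        unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head == tail)
            return false;                 /* empty */
        *out = q->slots[head & (QCAP - 1)];
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return true;
    }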

EPP port watchdog timer: how does it work?

I am working on a project that involves fast data acquisition (a scientific experiment). I will build an MCU-based module that will supply (at its fastest rate) 2 to 4 bytes of data every 10 microseconds. This data will have to be transferred to a PC in real time for further processing. In order to keep the cost of the equipment low I have chosen to use the Enhanced Parallel Port (EPP) of the PC for the connection. Its data rate (500 KB/s to 2 MB/s) should be sufficient.
The control program will be programmed in C and will run under DOS (I use DJGPP) and the EPP port will be handled by direct I/O port reading/writing for maximum efficiency.
Unfortunately, most of the documents I found on the net about programming the EPP port are badly written and confusing. My first request is actually for a pointer/link to a comprehensive document that would clearly and logically explain the operation of the EPP port.
Anyway, I managed to find out most of the things I needed, but there is one thing that baffles me. The documents mention a 'watchdog timer' in the EPP port that will set bit 0 of the status register if there is no response from the attached device within about 10 µs. One of the docs even suggests monitoring this status bit and resetting it if it goes active. AFAIK that is nonsense: the status port is read-only. So how does this watchdog timer really work? I assume that the logical way would be for the LPT controller circuit to reset this bit every time a new read or write operation is initiated. Is this assumption correct? If not, how should I handle this signal?
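For reference, the check being described would look something like this under DJGPP (a sketch; 0x378 is just the usual LPT1 base address, and how the timeout bit gets cleared is exactly the chipset-dependent question above):

    #include <pc.h>                      /* inportb() */

    #define LPT_BASE   0x378             /* typical LPT1 base; adjust to your port */
    #define EPP_STATUS (LPT_BASE + 1)    /* status register */

    static int epp_timed_out(void)
    {
        return inportb(EPP_STATUS) & 0x01;   /* bit 0 = EPP timeout flag */
    }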

Why does the measured network latency change if I use a sleep?

I'm trying to determine the time that it takes for a machine to receive a packet, process it and give back an answer.
This machine, that I'll call 'server', runs a very simple program, which receives a packet (recv(2)) in a buffer, copies the received content (memcpy(3)) to another buffer and sends the packet back (send(2)). The server runs NetBSD 5.1.2.
My client measures the round-trip time a number of times (pkt_count):
struct timespec start, end;
for(i = 0; i < pkt_count; ++i)
{
    printf("%d ", i+1);
    clock_gettime(CLOCK_MONOTONIC, &start);
    send(sock, send_buf, pkt_size, 0);
    recv(sock, recv_buf, pkt_size, 0);
    clock_gettime(CLOCK_MONOTONIC, &end);
    //struct timespec nsleep = {.tv_sec = 0, .tv_nsec = 100000};
    //nanosleep(&nsleep, NULL);
    printf("%.3f ", timespec_diff_usec(&end, &start));
}
I removed error checks and other minor things for clarity. The client runs on an Ubuntu 12.04 64-bit. Both programs run in real-time priority, although only the Ubuntu kernel is real time (-rt). The connection between the programs is TCP. This works fine and gives me an average of 750 microseconds.
However, if I enable the commented out nanosleep call (with a sleep of 100 µs), my measures drop 100 µs, giving an average of 650 µs. If I sleep for 200 µs, the measures drop to 550 µs, and so on. This goes up until a sleep of 600 µs, giving an average of 150 µs. Then, if I raise the sleep to 700 µs, my measures go way up to 800 µs in average. I confirmed my program's measures with Wireshark.
I can't figure out what is happening. I already set the TCP_NODELAY socket option in both client and server, no difference. I used UDP, no difference (same behavior). So I guess this behavior is not due to the Nagle algorithm. What could it be?
[UPDATE]
Here's a screenshot of the output of the client together with Wireshark. Now, I ran my server in another machine. I used the same OS with the same configuration (as it is a Live System in a pen drive), but the hardware is different. This behaviour didn't show up, everything worked as expected. But the question remains: why does it happen in the previous hardware?
[UPDATE 2: More info]
As I said before, I tested my pair of programs (client/server) in two different server computers. I plotted the two results obtained.
The first server (the weird one) is a RTD Single Board Computer, with a 1Gbps Ethernet interface. The second server (the normal one) is a Diamond Single Board Computer with a 100Mbps Ethernet interface. Both of them run the SAME OS (NetBSD 5.1.2) from the SAME pendrive.
From these results, I do believe that this behaviour is due either to the driver or to the NIC itself, although I still can't imagine why it happens...
OK, I reached a conclusion.
I tried my program using Linux, instead of NetBSD, in the server. It ran as expected, i.e., no matter how much I [nano]sleep in that point of the code, the result is the same.
This fact tells me that the problem might lie in the NetBSD's interface driver. To identify the driver, I read the dmesg output. This is the relevant part:
wm0 at pci0 dev 25 function 0: 82801I mobile (AMT) LAN Controller, rev. 3
wm0: interrupting at ioapic0 pin 20
wm0: PCI-Express bus
wm0: FLASH
wm0: Ethernet address [OMMITED]
ukphy0 at wm0 phy 2: Generic IEEE 802.3u media interface
ukphy0: OUI 0x000ac2, model 0x000b, rev. 1
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
So, as you can see, my interface is called wm0. According to this (page 9) I should check which driver is loaded by consulting the file sys/dev/pci/files.pci, line 625 (here). It shows:
# Intel i8254x Gigabit Ethernet
device wm: ether, ifnet, arp, mii, mii_bitbang
attach wm at pci
file dev/pci/if_wm.c wm
Then, searching through the driver source code (dev/pci/if_wm.c, here), I found a snippet of code that might change the driver behavior:
/*
 * For N interrupts/sec, set this value to:
 * 1000000000 / (N * 256). Note that we set the
 * absolute and packet timer values to this value
 * divided by 4 to get "simple timer" behavior.
 */
sc->sc_itr = 1500;              /* 2604 ints/sec */
CSR_WRITE(sc, WMREG_ITR, sc->sc_itr);
Then I changed this 1500 value to 1 (trying to increase the number of interrupts per second allowed) and to 0 (trying to eliminate the interrupt throttling altogether), but both of these values produced the same result:
Without nanosleep: latency of ~400 us
With a nanosleep of 100 us: latency of ~230 us
With a nanosleep of 200 us: latency of ~120 us
With a nanosleep of 260 us: latency of ~70 us
With a nanosleep of 270 us: latency of ~60 us (minimum latency I could achieve)
With a nanosleep of anything above 300 us: ~420 us
This is, at least, better behaved than the previous situation.
Therefore, I concluded that the behavior is due to the interface driver of the server. I am not willing to investigate it further in order to find other culprits, as I am moving from NetBSD to Linux for the project involving this Single Board Computer.
This is a (hopefully educated) guess, but I think it might explain what you're seeing.
I'm not sure how real time the linux kernel is. It might not be fully pre-emptive... So, with that disclaimer, continuing :)...
Depending on the scheduler, a task will possibly have what is called a "quanta", which is just an amount of time it can run for before another task of the same priority will be scheduled in. If the kernel is not fully pre-emptive, this might also be the point where a higher priority task can run. This depends on the details of the scheduler, which I don't know enough about.
Anywhere between your first gettime and second gettime your task may be pre-empted. This just means that it is "paused" and another task gets to use the CPU for a certain amount of time.
The loop without the sleep might go something like this
clock_gettime(CLOCK_MONOTONIC, &start);
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
printf("%.3f ", timespec_diff_usec(&end, &start));
clock_gettime(CLOCK_MONOTONIC, &start);
<----- PREEMPTION .. your task's quanta has run out and the scheduler kicks in
... another task runs for a little while
<----- PREEMPTION again and you're back on the CPU
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
// Because you got pre-empted, your time measurement is artificially long
printf("%.3f ", timespec_diff_usec(&end, &start));
clock_gettime(CLOCK_MONOTONIC, &start);
<----- PREEMPTION .. your task's quanta has run out and the scheduler kicks in
... another task runs for a little while
<----- PREEMPTION again and you're back on the CPU
and so on....
When you put the nanosecond sleep in, this is most likely a point where the scheduler is able to run before the current task's quanta expires (the same would apply to recv() too, which blocks). So perhaps what you get is something like this
clock_gettime(CLOCK_MONOTONIC, &start);
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
struct timespec nsleep = {.tv_sec = 0, .tv_nsec = 100000};
nanosleep(&nsleep, NULL);
<----- PREEMPTION .. nanosleep allows the scheduler to kick in because this is a pre-emption point
... another task runs for a little while
<----- PREEMPTION again and you're back on the CPU
// Now it so happens that because your task got preempted where it did, the time
// measurement has not been artificially increased. Your task can then finish the rest of
// its quanta
printf("%.3f ", timespec_diff_usec(&end, &start));
clock_gettime(CLOCK_MONOTONIC, &start);
... and so on
Some kind of interleaving will then occur where sometimes you are preempted between the two gettime()'s and sometimes outside of them because of the nanosleep. Depending on x (the sleep duration), you might hit a sweet spot where you happen (by chance) to get your pre-emption point, on average, to be outside your time measurement block.
Anyway, that's my two-pennies worth, hope it helps explain things :)
A little note on "nanoseconds" to finish with...
I think one needs to be cautious with "nanoseconds" sleep. The reason I say this is that I think it is unlikely that an average computer can actually do this unless it uses special hardware.
Normally an OS will have a regular system "tick", generated at perhaps 5 ms intervals. This is an interrupt generated by, say, an RTC (Real Time Clock - just a bit of hardware). Using this "tick" the system then generates its internal time representation. Thus, the average OS will only have a time resolution of a few milliseconds. The reason this tick is not faster is that there is a balance to be achieved between keeping very accurate time and not swamping the system with timer interrupts.
Not sure if I'm a little out of date with your average modern PC... I think some of them do have higher-resolution timers, but still not into the nanosecond range, and they might even struggle at 100 µs.
So, in summary, keep in mind that the best time resolution you're likely to get is normally in the milliseconds range.
EDIT: Just revisiting this and thought I'd add the following... it doesn't explain what you're seeing but might provide another avenue for investigation...
As mentioned the timing accuracy of the nanosleep is unlikely to be better than milliseconds. Also your task can be pre-empted which will also cause timing problems. There is also the problem that the time taken for a packet to go up the protocol stack can vary, as well as network delay.
One thing you could try, if your NIC supports it, is IEEE 1588 (aka PTP). If your NIC supports it, it can timestamp PTP event packets as they leave and enter the PHY. This will give you the best possible estimate of network delay. This eliminates any problems you might have with software pre-emption etc. I know squat about Linux PTP, I'm afraid, but you could try http://linuxptp.sourceforge.net/
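If the NIC and driver support it, Linux can also hand hardware receive timestamps to the application through SO_TIMESTAMPING (a rough sketch; enabling hardware timestamping on the interface itself usually needs an additional SIOCSHWTSTAMP ioctl, and header requirements vary by kernel):

    #include <linux/net_tstamp.h>
    #include <sys/socket.h>

    int flags = SOF_TIMESTAMPING_RX_HARDWARE |
                SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
    /* Timestamps then arrive as SCM_TIMESTAMPING ancillary data on recvmsg(). */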
I think 'quanta' is the best theory to explain this.
On Linux it comes down to the context switch frequency.
The kernel gives a process a quantum of time, but a process is preempted in three situations:
The process calls a system procedure
Its quantum of time is used up
A hardware interrupt comes in (from the network, HDD, USB, clock, etc.)
Unused quantum time is assigned to another ready-to-run process, according to priorities, real-time class, etc.
Actually the context switch frequency is configured at 10000 times per second, which gives about 100 µs per quantum. But context switching itself takes time and is CPU dependent; see this:
http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
I don't understand why the context switch frequency is that high, but that is a discussion for a Linux kernel forum.
You can find a partially similar problem here:
https://serverfault.com/questions/14199/how-many-context-switches-is-normal-as-a-function-of-cpu-cores-or-other
If the amount of data being sent by the application is large and fast enough, it could be filling the kernel buffers, which leads to a delay on each send(). Since the sleep is outside the measured section, it would then be eating the time that would otherwise be spent blocking on the send() call.
One way to help check for this case would be to run with a relatively small number of iterations, and then a moderate number of iterations. If the problem occurs with a small number of iterations (say 20) and small packet sizes (say < 1k), then this is likely an incorrect diagnosis.
Keep in mind that your process and the kernel can easily overwhelm the network adapter and the wire-speed of the ethernet (or other media type) if sending data in a tight loop like this.
I'm having trouble reading the screenshots. If Wireshark shows a constant rate of transmission on the wire, then it suggests this is the correct diagnosis. Of course, doing the math - dividing the wire speed by the packet size (+ header) - should give an idea of the maximum rate at which the packets can be sent.
As for the 700 microseconds leading to increased delay, that's harder to determine. I don't have any thoughts on that one.
I have an advice on how to create a more accurate performance measurement.
Use the RDTSC instruction (or even better the intrinsic __rdtsc() function). This involves reading a CPU counter without leaving ring3 (no system call).
The gettime functions almost always involve a system call which slows things down.
Your code is a little tricky as it involves 2 system calls (send/recv), but in general it is better to call sleep(0) before the first measurement to ensure that the very short measurement doesn't receive a context switch. Of course, the time measurement (and Sleep()) code should be disabled/enabled via macros in performance-sensitive functions.
Some operating systems can be tricked into raising your process priority by having your process release its execution time window (e.g. sleep(0)). On the next schedule tick, the OS (not all OSes) will raise the priority of your process since it didn't finish running its execution time quota.
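A sketch of what the TSC-based measurement looks like with GCC/Clang on x86 (the send/recv pair stands in for the code being measured; converting ticks to seconds requires knowing the TSC frequency):

    #include <x86intrin.h>               /* __rdtsc() */

    unsigned long long t0 = __rdtsc();
    send(sock, send_buf, pkt_size, 0);
    recv(sock, recv_buf, pkt_size, 0);
    unsigned long long t1 = __rdtsc();
    /* (t1 - t0) is in TSC ticks; on modern CPUs the TSC runs at a constant
       rate, but out-of-order execution means __rdtscp() or a fence gives
       sharper measurement boundaries. */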
