Nanosecond Precision Variable Timer for Kernel 4.X using CLOCK_REALTIME

Nanosecond Precision Variable Timer for Kernel 4.X using CLOCK_REALTIME - c

I have been working on timers for a while but still unable to get a promising solution for my situation.
Basically, I want to send packets at specific time.
For example :
1st PACKET at 1486500720.000000000
-> wait -> nanosleep(1000000000)
2nd PACKET at 1486500721.000000000
-> wait -> nanosleep(1000000000)
3rd PACKET at 1486500722.000000000
-> wait -> nanosleep(1000000000)
4th PACKET at 1486500723.000000000
The time gap between them is exactly 1.000000000 second but when I send packet, each time it takes different time.
For example for 1st Packet, it takes 0.005025045 seconds to send it and then the nanosleep start.
So, my second packet is sent at 486500721.005025045 instead of 1486500721.000000000.
So everytime I have to adjust the nanosleep value by using clockgettime(CLOCK_REALTIME) by subtracting the time remaining with including offset of getime command overhead.
As I have to this in a loop with nanosecond precision (I know it is not possible but I want it to be as specific as possible), I use simple for loop.
My question is there any better way to do this with more precision ? I am on Kernel 4.4, so are you aware of any approach which is working for newer kernels or any other approach likely to be more precise than mine ?

You should use clock_nanosleep, which allows an absolute time to wait until, rather than nanosleep, which only uses relative times. However there's no way you're going to get the degree of precision you're asking for. Just returning to userspace after the clock expires is going to take at least a few hundred nanoseconds if not a thousand or more. And then there's unpredictable latency to send the packet after you wake, too.
At best you might be able to tune your wake time too be "close enough" for your purposes.

Related

Measuring Elapsed Time In Linux (CLOCK_MONOTONIC vs. CLOCK_MONOTONIC_RAW)

I am currently trying to talk to a piece of hardware in userspace (underneath the hood, everything is using the spidev kernel driver, but that's a different story).
The hardware will tell me that a command has been completed by indicating so with a special value in a register, that I am reading from. The hardware also has a requirement to get back to me in a certain time, otherwise the command has failed. Different commands take different times.
As a result, I am implementing a way to set a timeout and then check for that timeout using clock_gettime(). In my "set" function, I take the current time and add the time interval I should wait for (usually this anywhere from a few ms to a couple of seconds). I then store this value for safe keeping later.
In my "check" function, I once again, get the current time and then compare it against the time I have saved. This seems to work as I had hoped.
Given my use case, should I be using CLOCK_MONOTONIC or CLOCK_MONOTONIC_RAW? I'm assuming CLOCK_MONOTONIC_RAW is better suited, since I have short intervals that I am checking. I am worried that such a short interval might represent a system-wide outlier, in which NTP was doing alot of adjusting. Note that my target system is only Linux kernels 4.4 and newer.
Thanks in advance for the help.
Edited to add: given my use case, I need "wall clock" time, not CPU time. That is, I am checking to see if the hardware has responded in some wall clock time interval.
References:
Rutgers Course Notes
What is the difference between CLOCK_MONOTONIC & CLOCK_MONOTONIC_RAW?
Elapsed Time in C Tutorial

how to calculate the received packets rate on a linux based pc?like pps or fps

I am writing a network program which can calculate accurate data packet rate (packet per second, frame per second, bps). Now i have a device called testcenter which can send accurate flow to a specific pc (protocol is UDP/IP) on Linux, i like to know the accurate pps(packets per second) with my program , i have considered the gettimeofday(&start,NULL)function before i call recvfrom() and update the counter for packets, after that call gettimeofday(&end,NULL) and get the pps rate. I hope there is better solution than this since the user/kernel barrier is traversed on system calls.
Best regards.

I think you should use clock_gettime() with CLOCK_MONOTONIC_COARSE. But it will only be accurate till the last tick .. So may be off by 10s of millisec. But its definitely faster that using it with CLOCK_MONOTONIC_RAW. You can also use gettimeofday but clock_gettime with CLOCK_MONOTONIC_RAW is slightly faster and higher resolution than gettimeofday.
Also gettimeofday() gives wall clock time, which might change even for daylight saving ... I don't think you should use it to measure traffic rate.

Your observation that gettimeofday switches to kernel mode is incorrect for Linux on a few popular architectures due to the use of vsyscalls. Clearly using gettimeofday here is not a bad option. You should however consider using a monotonic clock, see man 3 clock_gettime. Note that clock_gettime is not yet converted to vsyscall for as many architectures as gettimeofday.
Beyond this option you may be able to set the SO_TIMESTAMP socket option and obtain precise timestamps via recvmsg.

How to generate requests at a "requests/sec" target rate?

Say I have a target of x requests/sec that I want to generate continuously. My goal is to start these requests at roughly the same interval, rather than just generating x requests and then waiting until 1 second has elapsed and repeating the whole thing over and over again. I'm not making any assumptions about these requests, some might take much longer than others, which is why my scheduler thread will not perform the requests (or wait for them to finish), but hand them over to a sufficiently sized Thread Pool.
Now if x is in the range of hundreds or less, I might get by with .net's Timers or Thread.Sleep and checking actually elapsed time using Stopwatch.
But if I want to go into the thousands or tens of thousands, I could try going high-resolution timer to maintain my roughly the same interval approach. But this would (in most programming environments on a general OS) imply some amount of hand-coding with spin waiting and so forth, and I'm not sure it's worthwhile to take this route.
Extending the initial approach, I could instead use a Timer to sleep and do y requests on each Timer event, monitor the actual requests per second achieved doing this and fine-tune y at runtime. The effect is somewhere in between "put all x requests and wait until 1 second elapsed since start", which I'm trying not to do, and "wait more or less exactly 1/x seconds before starting the next request".
The latter seems like a good compromise, but is there anything that's easier while still spreading the requests somewhat evenly over time? This must have been implemented hundreds of times by different people, but I can't find good references on the issue.
So what's the easiest way to implement this?

One way to do it:
First find (good luck on Windows) or implement a usleep or nanosleep function. As a first step, this could be (on .net) a simple Thread.SpinWait() / Stopwatch.Elapsed > x combo. If you want to get fancier, do Thread.Sleep() if the time span is large enough and only do the fine-tuning using Thread.SpinWait().
That done, just take the inverse of the rate and you have the time interval you need to sleep between each event. Your basic loop, which you do on one dedicated thread, then goes
Fire event
Sleep(sleepTime)
Then every, say, 250ms (or more for faster rates), check the actually achieved rate and adjust the sleepTime interval, perhaps with some smoothing to dampen wild temporary swings, like this
newRate = max(1, sleepTime / targetRate * actualRate)
sleepTime = 0.3 * sleepTime + 0.7 * newRate
This adjusts to what is actually going on in your program and on your system, and makes up for the time spent to invoke the event callback, and whatever the callback is doing on that same thread etc. Without this, you will probably not be able to get high accuracy.
Needless to say, if your rate is so high that you cannot use Sleep but always have to spin, one core will be spinning continuously. The good news: We get ever more cores on our machines, so one core matters less and less :) More serious though, as you mentioned in the comment, if your program does actual work, your event generator will have less time (and need) to waste cycles.
Check out https://github.com/EugenDueck/EventCannon for a proof of concept implementation in .net. It's implemented roughly as described above and done as a library, so you can embed that in your program if you use .net.

Why does the measured network latency change if I use a sleep?

I'm trying to determine the time that it takes for a machine to receive a packet, process it and give back an answer.
This machine, that I'll call 'server', runs a very simple program, which receives a packet (recv(2)) in a buffer, copies the received content (memcpy(3)) to another buffer and sends the packet back (send(2)). The server runs NetBSD 5.1.2.
My client measures the round-trip time a number of times (pkt_count):
struct timespec start, end;
for(i = 0; i < pkt_count; ++i)
{
printf("%d ", i+1);
clock_gettime(CLOCK_MONOTONIC, &start);
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
//struct timespec nsleep = {.tv_sec = 0, .tv_nsec = 100000};
//nanosleep(&nsleep, NULL);
printf("%.3f ", timespec_diff_usec(&end, &start));
}
I removed error checks and other minor things for clarity. The client runs on an Ubuntu 12.04 64-bit. Both programs run in real-time priority, although only the Ubuntu kernel is real time (-rt). The connection between the programs is TCP. This works fine and gives me an average of 750 microseconds.
However, if I enable the commented out nanosleep call (with a sleep of 100 µs), my measures drop 100 µs, giving an average of 650 µs. If I sleep for 200 µs, the measures drop to 550 µs, and so on. This goes up until a sleep of 600 µs, giving an average of 150 µs. Then, if I raise the sleep to 700 µs, my measures go way up to 800 µs in average. I confirmed my program's measures with Wireshark.
I can't figure out what is happening. I already set the TCP_NODELAY socket option in both client and server, no difference. I used UDP, no difference (same behavior). So I guess this behavior is not due to the Nagle algorithm. What could it be?
[UPDATE]
Here's a screenshot of the output of the client together with Wireshark. Now, I ran my server in another machine. I used the same OS with the same configuration (as it is a Live System in a pen drive), but the hardware is different. This behaviour didn't show up, everything worked as expected. But the question remains: why does it happen in the previous hardware?
[UPDATE 2: More info]
As I said before, I tested my pair of programs (client/server) in two different server computers. I plotted the two results obtained.
The first server (the weird one) is a RTD Single Board Computer, with a 1Gbps Ethernet interface. The second server (the normal one) is a Diamond Single Board Computer with a 100Mbps Ethernet interface. Both of them run the SAME OS (NetBSD 5.1.2) from the SAME pendrive.
From these results, I do believe that this behaviour is due either to the driver or the to NIC itself, although I still can't imagine why it happens...

OK, I reached a conclusion.
I tried my program using Linux, instead of NetBSD, in the server. It ran as expected, i.e., no matter how much I [nano]sleep in that point of the code, the result is the same.
This fact tells me that the problem might lie in the NetBSD's interface driver. To identify the driver, I read the dmesg output. This is the relevant part:
wm0 at pci0 dev 25 function 0: 82801I mobile (AMT) LAN Controller, rev. 3
wm0: interrupting at ioapic0 pin 20
wm0: PCI-Express bus
wm0: FLASH
wm0: Ethernet address [OMMITED]
ukphy0 at wm0 phy 2: Generic IEEE 802.3u media interface
ukphy0: OUI 0x000ac2, model 0x000b, rev. 1
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
So, as you can see, my interface is called wm0. According to this (page 9) I should check which driver is loaded by consulting the file sys/dev/pci/files.pci, line 625 (here). It shows:
# Intel i8254x Gigabit Ethernet
device wm: ether, ifnet, arp, mii, mii_bitbang
attach wm at pci
file dev/pci/if_wm.c wm
Then, searching through the driver source code (dev/pci/if_wm.c, here), I found a snippet of code that might change the driver behavior:
/*
* For N interrupts/sec, set this value to:
* 1000000000 / (N * 256). Note that we set the
* absolute and packet timer values to this value
* divided by 4 to get "simple timer" behavior.
*/
sc->sc_itr = 1500; /* 2604 ints/sec */
CSR_WRITE(sc, WMREG_ITR, sc->sc_itr);
Then I changed this 1500 value to 1 (trying to increase the number of interrupts per second allowed) and to 0 (trying to eliminate the interrupt throttling altogether), but both of these values produced the same result:
Without nanosleep: latency of ~400 us
With a nanosleep of 100 us: latency of ~230 us
With a nanosleep of 200 us: latency of ~120 us
With a nanosleep of 260 us: latency of ~70 us
With a nanosleep of 270 us: latency of ~60 us (minimum latency I could achieve)
With a nanosleep of anything above 300 us: ~420 us
This is, at least better behaved than the previous situation.
Therefore, I concluded that the behavior is due to the interface driver of the server. I am not willing to investigate it further in order to find other culprits, as I am moving from NetBSD to Linux for the project involving this Single Board Computer.

This is a (hopefully educated) guess, but I think it might explain what you're seeing.
I'm not sure how real time the linux kernel is. It might not be fully pre-emptive... So, with that disclaimer, continuing :)...
Depending on the scheduler, a task will possibly have what is called a "quanta", which is just an ammount of time it can run for before another task of same priority will be scheduled in. If the kernel is not fully pre-emptive, this might also be the point where a higher priority task can run. This depends on the details of the scheduler which I don't know enough about.
Anywhere between your first gettime and second gettime your task may be pre-empted. This just means that it is "paused" and another task gets to use the CPU for a certain amount of time.
The loop without the sleep might go something like this
clock_gettime(CLOCK_MONOTONIC, &start);
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
printf("%.3f ", timespec_diff_usec(&end, &start));
clock_gettime(CLOCK_MONOTONIC, &start);
<----- PREMPTION .. your tasks quanta has run out and the scheduler kicks in
... another task runs for a little while
<----- PREMPTION again and your back on the CPU
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
// Because you got pre-empted, your time measurement is artifically long
printf("%.3f ", timespec_diff_usec(&end, &start));
clock_gettime(CLOCK_MONOTONIC, &start);
<----- PREMPTION .. your tasks quanta has run out and the scheduler kicks in
... another task runs for a little while
<----- PREMPTION again and your back on the CPU
and so on....
When you put the nanosecond sleep in, this is most likely a point where the scheduler is able to run before the current task's quanta expires (the same would apply to recv() too, which blocks). So perhaps what you get is something like this
clock_gettime(CLOCK_MONOTONIC, &start);
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
struct timespec nsleep = {.tv_sec = 0, .tv_nsec = 100000};
nanosleep(&nsleep, NULL);
<----- PREMPTION .. nanosleep allows the scheduler to kick in because this is a pre-emption point
... another task runs for a little while
<----- PREMPTION again and your back on the CPU
// Now it so happens that because your task got prempted where it did, the time
// measurement has not been artifically increased. Your task then can fiish the rest of
// it's quanta
printf("%.3f ", timespec_diff_usec(&end, &start));
clock_gettime(CLOCK_MONOTONIC, &start);
... and so on
Some kind of interleaving will then occur where sometimes you are prempted between the two gettime()'s and sometimes outside of them because of the nanosleep. Depending on x, you might hit a sweet spot where you happen (by chance) to get your pre-emption point, on average, to be outside your time measurement block.
Anyway, that's my two-pennies worth, hope it helps explain things :)
A little note on "nanoseconds" to finish with...
I think one needs to be cautious with "nanoseconds" sleep. The reason I say this is that I think it is unlikely that an average computer can actually do this unless it uses special hardware.
Normally an OS will have a regular system "tick", generated at perhaps 5ms. This is an interrupt generated by say a RTC (Real Time Clock - just a bit of hardware). Using this "tick" the system then generates it's internal time representation. Thus, the average OS will only have a time resolution of a few milliseconds. The reason that this tick is not faster is that there is a balance to be achieved between keeping a very accurate time and not swamping the system with timer interrupts.
Not sure if I'm a little out of date with your average modern PC... I think some of them do have higher res timers, but still not into the nanosecond range and they might even struggle at 100uS.
So, in summary, keep in mind that the best time resolution you're likely to get is normally in the milliseconds range.
EDIT: Just revisiting this and thought I'd add the following... doesn't explain what your seeing but might provide another avenue for investigation...
As mentioned the timing accuracy of the nanosleep is unlikely to be better than milliseconds. Also your task can be pre-empted which will also cause timing problems. There is also the problem that the time taken for a packet to go up the protocol stack can vary, as well as network delay.
One thing you could try is, if your NIC supports, IEEE1588 (aka PTP). If your NIC supports it, it can timestamp PTP event packets as they leave and enter the PHY. This will give you the bes tpossible estimate of network delay. This eliminates any problems you might have with software pre-emption etc etc. I know squat about Linux PTP I'm afraid, but you could try http://linuxptp.sourceforge.net/

I think 'quanta' is the best theory for explanation.
On linux it is context switch frequency.
Kernel gives to process quanta time. But process is preempted in two situations:
Process call system procedure
quanta time is ended
hardware interrupt is comming (from network, hdd, usb, clock, etc...)
Unused quanta time is assigned to another ready to run process, using priorities/rt etc.
Actually context switch frequency is configured at 10000 times per second, it gives about 100us for quanta. but content switching takes a time, it is cpu depended, see this:
http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
i dont understad, why content swith frequency is that high but it is discussion for linux kernel forum.
partially similar problem you can find here:
https://serverfault.com/questions/14199/how-many-context-switches-is-normal-as-a-function-of-cpu-cores-or-other

If the amount of data being sent by the application is large and fast enough, it could be filling the kernel buffers, which leads to a delay on each send(). Since the sleep is outside the measured section, it would then be eating the time that would otherwise be spent blocking on the send() call.
One way to help check for this case would be to run with a relatively small number of iterations, and then a moderate number of iterations. If the problem occurs with a small number of iterations (say 20) with small packet sizes (say <1k), then this is likely an incorrect diagnoses.
Keep in mind that your process and the kernel can easily overwhelm the network adapter and the wire-speed of the ethernet (or other media type) if sending data in a tight loop like this.
I'm having trouble reading the screen shots. If wireshark shows a constant rate of transmission on the wire, then it suggests this is the correct diagnoses. Of course, doing the math - dividing the wirespeed by the packet size (+ header) - should give an idea of the maximum rate at which the packets can be sent.
As for the 700 microseconds leading to increased delay, that's harder to determine. I don't have any thoughts on that one.

I have an advice on how to create a more accurate performance measurement.
Use the RDTSC instruction (or even better the intrinsic __rdtsc() function). This involves reading a CPU counter without leaving ring3 (no system call).
The gettime functions almost always involve a system call which slows things down.
Your code is a little tricky as it involves 2 system calls (send/recv), but in general it is better to call sleep(0) before the first measurement to ensure that the very short measurement doesn't receive a context switch. Off course the time measurement (and Sleep()) code should be disabled/enabled via macros in performance sensitive functions.
Some operating systems can be tricked into raising your process priority by having your process release it execution time window (e.g. sleep(0)). On the next schedule tick, the OS (not all) will raise the priority of your process since it didn't finish running its execution time quota.

Calculating CAN bus speed

I need to validate and characterize CAN bus traffic for our product (call it the Unit Under Test, UUT). I have a machine that sends a specified number of can frames to our product. Our product is running a Linux based custom kernel. The CAN frames are pre-built in software on the sender machine using a specific algorithm. The UUT uses the algorithm to verify the received frames.
Also, and here is where my questions lie, I am trying to calculate some timing data in the UUT software. So I basically do a read loop as fast as possible. I have a pre-allocated buffer to store the frames, so I just call read and increment the pointer to the buffer:
clock_gettime(clocK_PROCESS_CPUTIME_ID, timespec_start_ptr);
while ((frames_left--) > 0)
read(can_sock_fd, frame_mem_ptr++, sizeof(struct can_frame));
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, timespec_stop_ptr);
My question has to do with the times I get when I calculate the difference in these two timespecs (the calculation I use is correct I have verified it, it is GNUs algorithm).
Also, running the program under the time utility agrees with my times. For example, my program is called tcan, so I might run
[prompt]$ time ./tcan can1 -nf 10000
to run on can1 socket with 10000 frames. (This is FlexCAN, socket based interface, BTW)
Then, I use the time difference to calculate the data transfer speed I obtained. I received num_frames in the time span, so I calculate the frames/sec and the bits/sec
I am getting bus speeds that are 10 times the CAN bus speed of 250000 bits per sec. How can this be? I only get 2.5% CPU utilization according to both my program and the time program (and the top utility as well).
Are the values I am calculating meaningful? Is there something better I could do? I am assuming that since time reports real times that are much greater than user+sys, there must be some time-accounting lost somewhere. Another possibility is that maybe it's correct, I don't know, it's puzzling.

This is kind of a long shot, but what if read() is returning early because otherwise it would have to wait for incoming data? The fastest data to read is none at all :)
It would mess up the timings, but have you tried doing this loop whilst error checking? Or implement the loop via a recv() which should block unless you have asked it not to?
Hopefully this helps.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight