I need to add a delay into my code of n CPU cycles (~30).
My current solution is the one below, which works but isn't very elegant.
Also, the delay has to be known at compile time. I can work with this, but it would be ideal if I could change the delay at runtime.
(It is OK if there is some overhead, but I need the 1 cycle resolution.)
I do not have any peripheral timers left, that I could use, so it needs to be a software solution.
do_something();
#define NUMBER_OF_NOPS (SOME_DELAY + 3)
#include "nops.h"
#undef NUMBER_OF_NOPS
do_the_next_thing();
nops.h:
#if NUMBER_OF_NOPS > 0
__ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 1
__ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 2
__ASM volatile ("nop");
#endif
...
In the cortex devices NOP is something which literally means nothing. There is no guarantee that the NOP will consume any time.They are used for padding only. I you will have several consecutive NOPs they will just be flushed from the pipeline.
For more information refer to the Cortex-M0 documentation. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CHDJJGFB.html
software delays are quite tricky in the Cortex devices and you should use other instructions + possibly barrier instructions instead.
use ISB instructions 4 clocks + flash access time which depend what speed the core is running. For very precise delays place this part of code in the SRAM
Edit: There is a better answer from another SO Q&A here. However it is in assembly, AFAIK using a counter like SysTick is the only way to guarantee any semblance of cycle accuracy.
Edit 2: To avoid a counter overflow, which would result in a very, very long delay, clear the SysTick counter before use, ie. SysTick->VAL = 0;
Original:
Cortex-Ms have a built in timer called SysTick which can be used for cycle accurate timing purposes.
First enable the timer:
SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk |
SysTick_CTRL_ENABLE_Msk;
Then you can read the current count using the VAL register. You can then implement a cycle accurate delay this way:
int count = SysTick->VAL;
while(SysTick->VAL < (count+30));
Note that this will introduce some overhead because of the load, compare and branch in the loop so the final cycle count will be a little off, no more than a few ticks in my estimation.
You can use a free-running up-counter as follows:
uint32_t t = <periph>.count;
while ((<periph>.count - t) < delay);
As long as delay is less than half the period of the counter, this is unaffected by wrapping of the counter value - the unsigned arithmetic produces the correct time delta.
Note that since you don't need to control the counter's value in any way, you can use any such counter in the system - even if it's being used for another purpose (as long, of course, as it really is running continuously and freely, and at a rate that gives you the timing resolution that you require).
Related
I'm trying to establish what practical jitter I can achieve by using clock_nanosleep() in a loop and through experimentation I'm observing something I'm not confident I understand.
I'm using code posted in this SO question by another user to benchmark performance, targeting a 250ms interval. I've observed that on my system the sleep function returns very consistently 10us late with only about 2us jitter the vast majority of the time (fairly narrow statistical distribution).
NOTE: I haven't collected data to present a plot of statistical distribution but casual qualitative description should hopefully suffice.
I decided to subtract the 10us offset from the target wakeup time to compensate for it, and this caused the average error to be approximately zero as expected, however the jitter increased dramatically - I would estimate most wakeups are >100us early/late, and much more widely distributed.
Why is this?
My theory is that with the 10us correction the target waketimes are less nicely aligned with the underlying hardware clock, but it would be helpful to get confirmation. If this is true, is there a method to synchronize the phase of the target waketimes with the hardware clock?
Manpages for clock_nanosleep(2) say: "Furthermore, after the
sleep completes, there may still be a delay before the CPU
becomes free to once again execute the calling thread."
I tried to comprehend your question. For this I created the source code below based on the reference at SO which you provided. I include the source code such that you or someone else can check it, test it, play with it.
The debug print refers to a sleep of exactly 1 second. The debug print is shorter than the print in the comments - and the debug print will always refer to the deviation from 1 second, no matter which wakeTime has been defined. Thus, it is possible, to try a reduced wakeTime (wakeTime.tv_nsec-= some_value;) to achieve the target of 1 second.
Conclusions:
I would generally agree to all you (davegravy) write about it in your post, except that I am seeing much higher delays and deviations.
There are minor changes in the delay between a non-loaded and a heavy loaded system (all CPUs 100% load). On heavy loaded system scattering of delay reduces and the average delay also reduces (on my system - but not very significant).
As expected, the delay changes quite a bit when I try it on another machine (as expected raspberry pi is worse :o).
For a specific machine and moment it is possible to define a correction value of nanoseconds to bring the average sleep closer to the target. Anyway, the correction value is not necessarily equal to the delay error without correction. And the correction value might be different for different machines.
Idea: As the provided code can measure how good it is. There might be the chance, that the code does a few loops from which it can derive an optimized delay correction value by itself. (This auto-correction might be interesting just from a theoretical point of view. Well, it is an idea.)
Idea 2: Or some correction values can be created just to avoid a long-term shift when considering many intervals, one after another.
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#define CLOCK CLOCK_MONOTONIC
//#define CLOCK CLOCK_REALTIME
//#define CLOCK CLOCK_TAI
//#define CLOCK CLOCK_BOOTTIME
static long calcTimeDiff(struct timespec const* t1, struct timespec const* t2)
{
long diff = t1->tv_nsec - t2->tv_nsec;
diff += 1000000000 * (t1->tv_sec - t2->tv_sec);
return diff;
}
static void* tickThread()
{
struct timespec sleepStart;
struct timespec currentTime;
struct timespec wakeTime;
long sleepTime;
long wakeDelay;
while(1)
{
clock_gettime(CLOCK, &wakeTime);
wakeTime.tv_sec += 1;
wakeTime.tv_nsec -= 0; // Value to play with for delay "correction"
clock_gettime(CLOCK, &sleepStart);
clock_nanosleep(CLOCK, TIMER_ABSTIME, &wakeTime, NULL);
clock_gettime(CLOCK, ¤tTime);
sleepTime = calcTimeDiff(¤tTime, &sleepStart);
wakeDelay = calcTimeDiff(¤tTime, &wakeTime);
{
/*printf("sleep req=%-ld.%-ld start=%-ld.%-ld curr=%-ld.%-ld sleep=%-ld delay=%-ld\n",
(long) wakeTime.tv_sec, (long) wakeTime.tv_nsec,
(long) sleepStart.tv_sec, (long) sleepStart.tv_nsec,
(long) currentTime.tv_sec, (long) currentTime.tv_nsec,
sleepTime, wakeDelay);*/
// Debug Short Print with respect to target sleep = 1 sec. = 1000000000 ns
long debugTargetDelay=sleepTime-1000000000;
printf("sleep=%-ld delay=%-ld targetdelay=%-ld\n",
sleepTime, wakeDelay, debugTargetDelay);
}
}
}
int main(int argc, char*argv[])
{
tickThread();
}
Some output with wakeTime.tv_nsec -= 0;
sleep=1000095788 delay=96104 targetdelay=95788
sleep=1000078989 delay=79155 targetdelay=78989
sleep=1000080717 delay=81023 targetdelay=80717
sleep=1000068001 delay=68251 targetdelay=68001
sleep=1000080475 delay=80519 targetdelay=80475
sleep=1000110925 delay=110977 targetdelay=110925
sleep=1000082415 delay=82561 targetdelay=82415
sleep=1000079572 delay=79713 targetdelay=79572
sleep=1000098609 delay=98664 targetdelay=98609
and with wakeTime.tv_nsec -= 65000;
sleep=1000031711 delay=96987 targetdelay=31711
sleep=1000009400 delay=74611 targetdelay=9400
sleep=1000015867 delay=80912 targetdelay=15867
sleep=1000015612 delay=80708 targetdelay=15612
sleep=1000030397 delay=95592 targetdelay=30397
sleep=1000015299 delay=80475 targetdelay=15299
sleep=999993542 delay=58614 targetdelay=-6458
sleep=1000031263 delay=96310 targetdelay=31263
sleep=1000002029 delay=67169 targetdelay=2029
sleep=1000031671 delay=96821 targetdelay=31671
sleep=999998462 delay=63608 targetdelay=-1538
Anyway, the delays change all the time. I tried different CLOCK definitions and different compiler options, but without any special results.
Some statistics from further testing, sample size = 100 in both cases.
targetdelay from wakeTime.tv_nsec -= 0;
Mean value = 97503 Standard deviation = 27536
targetdelay from wakeTime.tv_nsec -= 97508;
Mean value = -1909 Standard deviation = 32682
In both cases, there were a few massive outliers, such that even this result from 100 samples might not quite be representative.
I wrote a basic code to find out the amount of clock cycles taken by nop. We know nop takes one clock cycle.
#include <stdio.h>
#include <string.h>
#include <stdint.h>
int main(void)
{
uint32_t low1, low2, high1, high2;
uint64_t timestamp1, timestamp2;
asm volatile ("rdtsc" : "=a"(low1), "=d"(high1));
asm("nop");
asm volatile ("rdtsc" : "=a"(low2), "=d"(high2));
timestamp1 = ((uint64_t)high1 << 32) | low1;
timestamp2 = ((uint64_t)high2 << 32) | low2;
printf("Diff:%lu\n", timestamp2 - timestamp1);
return 0;
}
But the output is not 1.
It is sometimes 14 or 16.
May i know the reason behind this. Am i missing anything
We know nop takes one clock cycle.
A modern CPU can be thought of as a pipeline of stages; where the front end might fetch and decode multiple instructions in parallel and put the resulting micro-ops into a buffer where they wait for their dependencies to be satisfied (before being taken by an execution unit, where multiple micro-ops can be executed at the same time by multiple execution units).
A NOP has no micro-ops - it's simply discarded by the front end. It doesn't cost 1 cycle.
But the output is not 1.
It probably takes 14 or 16 cycles for the instructions the compiler generates to deal with the outputs of the first rdtsc, then prepare things for the second rdtsc, then the second rdtsc itself.
Note that rdtsc probably counts the cycles of a fixed frequency timer that has nothing the CPU's current (variable) clock frequency; so 14 or 16 "time cycles" might be (e.g.) 7 or 8 CPU cycles.
My problem is that turning on and off my GPIO pin takes way too long, despite using good timekeeping functionality, including both ndelay from linux/delay.h and my own accurate_ndelay which (shown below) uses ktime_get_ns() from linux/ktime.h.
My kernel version is 4.19.38 with Armbian, running on an OrangePi Zero.
static inline void accurate_ndelay(uint16_t ns){
uint64_t s = ktime_get_ns();
uint64_t e = s + ns;
while(ktime_get_ns() < e);
}
static inline void unsafe_bit2812(struct WS2812* ws2812, uint8_t b){
if(b){
gpio_set_value(ws2812->pin, 1);
accurate_ndelay(ws2812->t0h);
gpio_set_value(ws2812->pin, 0);
accurate_ndelay(ws2812->t0l);
} else {
gpio_set_value(ws2812->pin, 1);
accurate_ndelay(ws2812->t1h);
gpio_set_value(ws2812->pin, 0);
accurate_ndelay(ws2812->t1l);
}
}
When I measure the real-world delay (as shown by my oscilloscope, not bad software). The delay is not the expected 350ns, but 920ns. Which for the WS2812 is 770ns too much!
That's some pretty tight timing. The OrangePi Zero run at 1.2 GHz, so 150 ns is 180 clock cycles. That doesn't give you time to do much.
The first thing to do is use ktime_get_ns() to just measure how long the gpio_set_value() call takes. Or remove the delay and measure it with the scope. You might know the answer already, if you delayed for 350ns and measured 920ns, then it takes about 600ns.
You're calling gpio_set_value(), which pulls in the safe, portable Linux gpio library. The maximum possible performance would be to write your own driver for the GPIO that goes right to the HW registers and sets the two states, with the delay, as a single action.
Even with a custom driver, you'll have delays introduced by the clock driving the gpio and the rise and fall times of the device.
I'm writing code for an embedded device (no OS so no system calls or anything) and I need to have a delay but the compiler doesn't supply time.h. What other options do I have?
Depending on the clock of your system you may implement delays using the NOP (no operation) assembler instruction. You may calculate the time of one NOP depending on the MIPS of your system, so for example if 1 NOP is 1[us], then you could implement something like:
void delay(int ms)
{
int i;
for (i = 0; i < ms*1000; i++)
{
asm(NOP);
}
}
Depends on the device. Can you enable a stable timer interrupt? You might only be able to busy-wait and wait for a timer interrupt. How accurate this is likely to be (and how accurate it needs to be) is unclear.
For short fixed time delays, a do-nothing loop will meet the need, but of course, will need calibration.
void Delay_ms(unsigned d /* ms */) {
while (d-- > 0) {
unsigned i;
i = 2800; // Calibrate this value
// Recommend that the flowing while loop's asm code is check for tightness.
while (--i);
/* add multiple _nop_() here should you want precise calibration */
}
}
The following piece of code was given to us from our instructor so we could measure some algorithms performance:
#include <stdio.h>
#include <unistd.h>
static unsigned cyc_hi = 0, cyc_lo = 0;
static void access_counter(unsigned *hi, unsigned *lo) {
asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
: "=r" (*hi), "=r" (*lo)
: /* No input */
: "%edx", "%eax");
}
void start_counter() {
access_counter(&cyc_hi, &cyc_lo);
}
double get_counter() {
unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
double result;
access_counter(&ncyc_hi, &ncyc_lo);
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
return result;
}
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
int main(void)
{
double c1, c2;
start_counter();
c1 = get_counter();
sleep(1);
c2 = get_counter();
printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
return 0;
}
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.
On a friend's machine (MacBook) it is working to some extent; I mean, the result is bigger than 0 but it's variable because the CPU frequency is not fixed (we tried to fix it but for some reason we are not able to do it). He has a different machine which is running Linux (Ubuntu) as host and it also reports 0. This rules out the problem being on the virtual machine, which I thought it was the issue at first.
Any ideas why this is happening and how can I fix it?
Okay, since the other answer wasn't helpful, I'll try to explain on more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:
rdtsc
push 1
call sleep
rdtsc
Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:
rdtsc
rdtsc
push 1
call sleep
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).
I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do 64-bit math carries your self, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm, you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
#include <stdint.h>
uint64_t rdtsc() {
uint64_t ret;
# if __WORDSIZE == 64
asm ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax;"
: "=A"(ret)
: /* no input */
: "%edx"
);
#else
asm ("rdtsc"
: "=A"(ret)
);
#endif
return ret;
}
Also you should consider printing out the values you're getting out of this so you can see if you're getting out 0s, or something else.
As for VMWare, take a look at the time keeping spec (PDF Link), as well as this thread. TSC instructions are (depending on the guest OS):
Passed directly to the real hardware (PV guest)
Count cycles while the VM is executing on the host processor (Windows / etc)
Note, in #2 the while the VM is executing on the host processor. The same phenomenon would go for Xen, as well, if I recall correctly. In essence, you can expect that the code should work as expected on a paravirtualized guest. If emulated, its entirely unreasonable to expect hardware like consistency.
You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and #Mysticial's answer there has working GNU C inline asm, which I'll quote here:
// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
This works correctly and efficiently for 32 and 64-bit code.
hmmm I'm not positive but I suspect the problem may be inside this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm suspicious if you can safely carry out such huge multiplications in an "unsigned"... isn't that often a 32-bit number? ...just the fact that you couldn't safely multiply by 2^32 and had to append it as an extra "* 4" added to the 2^30 at the end already hints at this possibility... you might need to convert each sub-component hi and lo to a double (instead of a single one at the very end) and do the multiplication using the two doubles