I'm writing code for an embedded device (no OS so no system calls or anything) and I need to have a delay but the compiler doesn't supply time.h. What other options do I have?
Depending on your system's clock, you can implement delays using the NOP (no-operation) assembler instruction. You can work out how long one NOP takes from the MIPS rating of your system; for example, if one NOP takes 1 us, you could implement something like:
void delay(int ms)
{
    int i;
    for (i = 0; i < ms * 1000; i++)
    {
        asm volatile ("nop");   /* one NOP per loop iteration */
    }
}
It depends on the device. Can you enable a stable timer interrupt? You might only be able to busy-wait until a timer interrupt arrives. How accurate this is likely to be (and how accurate it needs to be) is unclear.
For short fixed time delays, a do-nothing loop will meet the need, but of course, will need calibration.
void Delay_ms(unsigned d /* ms */) {
    while (d-- > 0) {
        volatile unsigned i;   /* volatile keeps the loop from being optimized away */
        i = 2800;              // Calibrate this value
        // Recommend checking the generated asm for the following loop for tightness.
        while (--i);
        /* add multiple _nop_() here should you want precise calibration */
    }
}
I noticed the io_uring kernel side uses CLOCK_MONOTONIC, so for the first timer I get the time with both CLOCK_REALTIME and CLOCK_MONOTONIC, adjust the nanoseconds as below, and use the IORING_TIMEOUT_ABS flag for io_uring_prep_timeout (see iorn/clock.c at master · hnakamur/iorn).
const long sec_in_nsec = 1000000000;

static int queue_timeout(iorn_queue_t *queue) {
    iorn_timeout_op_t *op = calloc(1, sizeof(*op));
    if (op == NULL) {
        return -ENOMEM;
    }

    struct timespec rts;
    int ret = clock_gettime(CLOCK_REALTIME, &rts);
    if (ret < 0) {
        fprintf(stderr, "clock_gettime CLOCK_REALTIME error: %s\n", strerror(errno));
        return -errno;
    }
    long nsec_diff = sec_in_nsec - rts.tv_nsec;

    ret = clock_gettime(CLOCK_MONOTONIC, &op->ts);
    if (ret < 0) {
        fprintf(stderr, "clock_gettime CLOCK_MONOTONIC error: %s\n", strerror(errno));
        return -errno;
    }

    op->handler = on_timeout;
    op->ts.tv_sec++;
    op->ts.tv_nsec += nsec_diff;
    if (op->ts.tv_nsec > sec_in_nsec) {
        op->ts.tv_sec++;
        op->ts.tv_nsec -= sec_in_nsec;
    }
    op->count = 1;
    op->flags = IORING_TIMEOUT_ABS;

    ret = iorn_prep_timeout(queue, op);
    if (ret < 0) {
        return ret;
    }
    return iorn_submit(queue);
}
From the second timer on, I just increment tv_sec and again use the IORING_TIMEOUT_ABS flag for io_uring_prep_timeout.
Here is the output from my example program. The timeouts are meant to fire on whole seconds, but each one fires about 400 microseconds late.
on_timeout time=2020-05-10T14:49:42.000442
on_timeout time=2020-05-10T14:49:43.000371
on_timeout time=2020-05-10T14:49:44.000368
on_timeout time=2020-05-10T14:49:45.000372
on_timeout time=2020-05-10T14:49:46.000372
on_timeout time=2020-05-10T14:49:47.000373
on_timeout time=2020-05-10T14:49:48.000373
Could you tell me a better way than this?
Thanks for your comments! I'd like to update the current time for logging, like ngx_time_update(). I modified my example to use just CLOCK_REALTIME, but it is still about 400 microseconds late. github.com/hnakamur/iorn/commit/… Does it mean clock_gettime takes about 400 nanoseconds on my machine?
Yes, that sounds about right, sort of. But, if you're on an x86 PC under linux, 400 ns for clock_gettime overhead may be a bit high (order of magnitude higher--see below). If you're on an arm CPU (e.g. Raspberry Pi, nvidia Jetson), it might be okay.
I don't know how you're getting 400 microseconds. But, I've had to do a lot of realtime stuff under linux, and 400 us is similar to what I've measured as the overhead to do a context switch and/or wakeup a process/thread after a syscall suspends it.
I never use gettimeofday anymore. I now just use clock_gettime(CLOCK_REALTIME,...) because it's the same except you get nanoseconds instead of microseconds.
Just so you know, although clock_gettime is a syscall, nowadays, on most systems, it uses the VDSO layer. The kernel injects special code into the userspace app, so that it is able to access the time directly without the overhead of a syscall.
If you're interested, you could run under gdb and disassemble the code to see that it just accesses some special memory locations instead of doing a syscall.
I don't think you need to worry about this too much. Just use clock_gettime(CLOCK_MONOTONIC,...) and set flags to 0. The overhead doesn't factor into this for the purposes of the io_uring call as your iorn layer is using it.
When I do this sort of thing, and I want/need to calculate the overhead of clock_gettime itself, I call clock_gettime in a loop (e.g. 1000 times), and try to keep the total time below a [possible] timeslice. I use the minimum diff between times in each iteration. That compensates for any [possible] timeslicing.
The minimum is the overhead of the call itself [on average].
There are additional tricks that you can do to minimize latency in userspace (e.g. raising process priority, clamping CPU affinity and I/O interrupt affinity), but they can involve a few more things, and, if you're not very careful, they can produce worse results.
Before you start taking extraordinary measures, you should have a solid methodology to measure timing/benchmarking to prove that your results can not meet your timing/throughput/latency requirements. Otherwise, you're doing complicated things for no real/measurable/necessary benefit.
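For reference, the priority/affinity tricks mentioned above usually boil down to something like the following sketch (Linux-specific; the CPU number and priority are illustrative, and both calls typically require root or the corresponding rlimits):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int go_realtime(int cpu, int prio)
{
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = prio };

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) < 0) {   /* pin to one CPU */
        perror("sched_setaffinity");
        return -1;
    }
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {    /* realtime scheduling class */
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}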
Below is some code I just created, simplified, but based on code I already have/use to calibrate the overhead:
#include <stdio.h>
#include <time.h>

#define ITERMAX 10000

typedef long long tsc_t;

// tscget -- get time in nanoseconds
static inline tsc_t
tscget(void)
{
    struct timespec ts;
    tsc_t tsc;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    tsc = ts.tv_sec;
    tsc *= 1000000000;
    tsc += ts.tv_nsec;

    return tsc;
}

// tscsec -- convert nanoseconds to fractional seconds
double
tscsec(tsc_t tsc)
{
    double sec;

    sec = tsc;
    sec /= 1e9;

    return sec;
}

tsc_t
calibrate(void)
{
    tsc_t tscbeg;
    tsc_t tscold;
    tsc_t tscnow;
    tsc_t tscdif;
    tsc_t tscmin;
    int iter;

    tscmin = 1LL << 62;
    tscbeg = tscget();
    tscold = tscbeg;

    for (iter = ITERMAX; iter > 0; --iter) {
        tscnow = tscget();

        tscdif = tscnow - tscold;
        if (tscdif < tscmin)
            tscmin = tscdif;

        tscold = tscnow;
    }

    tscdif = tscnow - tscbeg;

    printf("MIN:%.9f TOT:%.9f AVG:%.9f\n",
        tscsec(tscmin), tscsec(tscdif), tscsec(tscnow - tscbeg) / ITERMAX);

    return tscmin;
}

int
main(void)
{
    calibrate();
    return 0;
}
On my system, a 2.67GHz Core i7, the output is:
MIN:0.000000019 TOT:0.000254999 AVG:0.000000025
So, I'm getting 25 ns overhead [and not 400 ns]. But, again, each system can be different to some extent.
UPDATE:
Note that x86 processors have "speed step". The OS can adjust the CPU frequency up or down semi-automatically. Lower speeds conserve power. Higher speeds are maximum performance.
This is done with a heuristic (e.g. if the OS detects that the process is a heavy CPU user, it will up the speed).
To force maximum speed, linux has this directory:
/sys/devices/system/cpu/cpuN/cpufreq
Where N is the cpu number (e.g. 0-7)
Under this directory, there are a number of files of interest. They should be self explanatory.
In particular, look at scaling_governor. It has either ondemand [kernel will adjust as needed] or performance [kernel will force maximum CPU speed].
To force maximum speed, as root, set this [once] to performance (e.g.):
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Do this for all cpus.
However, I just did this on my system, and it had little effect. So, the kernel's heuristic may have improved.
As to the 400us, when a process has been waiting on something, when it is "woken up", this is a two step process.
The process is marked "runnable".
At some point, the system/CPU does a reschedule. The process will be run, based upon the scheduling policy and the process priority in effect.
For many syscalls, the reschedule [only] occurs on the next system timer/clock tick/interrupt. So, for some, there can be a delay of up to a full clock tick (i.e. for an HZ value of 1000, this can be up to 1 ms (1000 us) later).
On average, this is half a tick, or 500 us.
For some syscalls, when the process is marked runnable, a reschedule is done immediately. If the process has a higher priority, it will be run immediately.
When I first looked at this [circa 2004], I looked at all code paths in the kernel, and the only syscall that did the immediate reschedule was SysV IPC, for msgsnd/msgrcv. That is, when process A did msgsnd, any process B waiting for the given message would be run.
But, others did not (e.g. futex). They would wait for the timer tick. A lot has changed since then, and now, more syscalls will do the immediate reschedule. For example, I recently measured futex [invoked via pthread_mutex_*], and it seemed to do the quick reschedule.
Also, the kernel scheduler has changed. The newer scheduler can wakeup/run some things on a fraction of a clock tick.
So, for you, the 400 us is [possibly] the alignment to the next clock tick.
But, it could just be the overhead of doing the syscall. To test that, I modified my test program to open /dev/null [and/or /dev/zero], and added read(fd,buf,1) to the test loop.
I got a MIN: value of 529 us. So, the delay you're getting could just be the amount of time it takes to do the task switch.
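The modification described above might look like the following sketch (reusing tsc_t, tscget(), and ITERMAX from the program above; the helper name is mine):

#include <fcntl.h>
#include <unistd.h>

tsc_t
calibrate_read(void)
{
    char buf[1];
    tsc_t tscold;
    tsc_t tscnow;
    tsc_t tscdif;
    tsc_t tscmin;
    int iter;
    int fd = open("/dev/null", O_RDONLY);   /* or /dev/zero */

    tscmin = 1LL << 62;
    tscold = tscget();

    for (iter = ITERMAX; iter > 0; --iter) {
        read(fd, buf, 1);                   /* the syscall being measured */
        tscnow = tscget();

        tscdif = tscnow - tscold;
        if (tscdif < tscmin)
            tscmin = tscdif;

        tscold = tscnow;
    }

    close(fd);
    return tscmin;                          /* ~ read() + clock_gettime() overhead */
}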
This is what I would call "good enough for now".
To get "razor's edge" response, you'd probably have to write a custom kernel driver and have the driver do this. This is what embedded systems would do if (e.g.) they had to toggle a GPIO pin on every interval.
But, if all you're doing is printf, the overhead of printf and the underlying write(1,...) tends to swamp the actual delay.
Also, note that when you do printf, it builds the output buffer and when the buffer in FILE *stdout is full, it flushes via write.
For best performance, it's better to do int len = sprintf(buf,"current time is ..."); write(1,buf,len);
Also, when you do this, if the kernel buffers for TTY I/O get filled [which is quite possible given the high frequency of messages you're doing], the process will be suspended until the I/O has been sent to the TTY device.
To do this well, you'd have to watch how much space is available, and skip some messages if there isn't enough space to [wholly] contain them.
You'd need to do: ioctl(1,TIOCOUTQ,...) to get the available space and skip some messages if it is less than the size of the message you want to output (e.g. the len value above).
For your usage, you're probably more interested in the latest time message rather than outputting all messages [which would eventually produce a lag].
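A sketch of that check might look like this (OUTPUT_LIMIT and the message format are assumptions, not values from the discussion above):

#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define OUTPUT_LIMIT 4096                /* assumed usable TTY output buffer size */

static void try_print_time(const char *stamp)
{
    char buf[128];
    int len = snprintf(buf, sizeof(buf), "current time is %s\n", stamp);
    int queued = 0;

    if (ioctl(1, TIOCOUTQ, &queued) < 0)
        queued = 0;                      /* stdout is not a TTY? just write it */

    if (queued + len <= OUTPUT_LIMIT)    /* enough room, so write() won't block */
        write(1, buf, len);
    /* otherwise drop the message rather than stall the timing loop */
}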
My problem is that turning my GPIO pin on and off takes way too long, despite using good timekeeping functionality: both ndelay from linux/delay.h and my own accurate_ndelay (shown below), which uses ktime_get_ns() from linux/ktime.h.
My kernel version is 4.19.38 with Armbian, running on an OrangePi Zero.
static inline void accurate_ndelay(uint16_t ns){
    uint64_t s = ktime_get_ns();
    uint64_t e = s + ns;
    while(ktime_get_ns() < e);
}

static inline void unsafe_bit2812(struct WS2812* ws2812, uint8_t b){
    if(b){
        /* '1' bit: T1H high, then T1L low (WS2812 datasheet naming) */
        gpio_set_value(ws2812->pin, 1);
        accurate_ndelay(ws2812->t1h);
        gpio_set_value(ws2812->pin, 0);
        accurate_ndelay(ws2812->t1l);
    } else {
        /* '0' bit: T0H high, then T0L low */
        gpio_set_value(ws2812->pin, 1);
        accurate_ndelay(ws2812->t0h);
        gpio_set_value(ws2812->pin, 0);
        accurate_ndelay(ws2812->t0l);
    }
}
When I measure the real-world delay with my oscilloscope (not software), the delay is not the expected 350 ns but 920 ns, which for the WS2812 is 770 ns too much!
That's some pretty tight timing. The OrangePi Zero runs at 1.2 GHz, so 150 ns is 180 clock cycles. That doesn't give you time to do much.
The first thing to do is use ktime_get_ns() to just measure how long the gpio_set_value() call takes. Or remove the delay and measure it with the scope. You might know the answer already, if you delayed for 350ns and measured 920ns, then it takes about 600ns.
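A rough in-kernel measurement sketch, reusing the struct WS2812 from the question (the helper name and iteration count are mine):

#include <linux/gpio.h>
#include <linux/ktime.h>
#include <linux/math64.h>

static u64 measure_gpio_set(struct WS2812 *ws2812)
{
    int i;
    u64 total;
    u64 start = ktime_get_ns();

    for (i = 0; i < 1000; i++)
        gpio_set_value(ws2812->pin, i & 1);   /* toggle 1000 times */

    total = ktime_get_ns() - start;
    return div_u64(total, 1000);              /* approximate ns per call */
}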
You're calling gpio_set_value(), which pulls in the safe, portable Linux gpio library. The maximum possible performance would be to write your own driver for the GPIO that goes right to the HW registers and sets the two states, with the delay, as a single action.
Even with a custom driver, you'll have delays introduced by the clock driving the gpio and the rise and fall times of the device.
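If you do go the custom-driver route, the core of it is mapping the SoC's PIO block and hitting the data register directly. A very rough sketch follows; PIO_BASE and PA_DATA_OFF are placeholders that must be checked against the Allwinner H2+/H3 datasheet, and this ignores the pinmux and claim handling that gpiolib normally does for you:

#include <linux/io.h>
#include <linux/bits.h>
#include <linux/errno.h>

#define PIO_BASE    0x01C20800UL   /* assumed PIO controller base */
#define PA_DATA_OFF 0x10           /* assumed port A data register offset */

static void __iomem *pio;

static int fast_gpio_init(void)
{
    pio = ioremap(PIO_BASE, 0x400);
    return pio ? 0 : -ENOMEM;
}

static inline void fast_gpio_set(u32 bit, int value)
{
    u32 reg = readl(pio + PA_DATA_OFF);

    if (value)
        reg |= BIT(bit);
    else
        reg &= ~BIT(bit);
    writel(reg, pio + PA_DATA_OFF);
}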
I need to add a delay into my code of n CPU cycles (~30).
My current solution is the one below, which works but isn't very elegant.
Also, the delay has to be known at compile time. I can work with this, but it would be ideal if I could change the delay at runtime.
(It is OK if there is some overhead, but I need the 1 cycle resolution.)
I do not have any peripheral timers left, that I could use, so it needs to be a software solution.
do_something();
#define NUMBER_OF_NOPS (SOME_DELAY + 3)
#include "nops.h"
#undef NUMBER_OF_NOPS
do_the_next_thing();
nops.h:
#if NUMBER_OF_NOPS > 0
__ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 1
__ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 2
__ASM volatile ("nop");
#endif
...
On Cortex devices, NOP is something that literally means nothing; there is no guarantee that a NOP will consume any time. NOPs are intended for padding only: if you have several consecutive NOPs, they may simply be flushed from the pipeline.
For more information refer to the Cortex-M0 documentation. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CHDJJGFB.html
Software delays are quite tricky on Cortex devices, and you should use other instructions, possibly combined with barrier instructions, instead.
For example, use ISB instructions: each one takes 4 clocks plus the flash access time, which depends on the speed the core is running at. For very precise delays, place this part of the code in SRAM.
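A rough sketch of that ISB approach (it assumes a CMSIS header providing the __ISB() intrinsic; note that the loop itself adds a couple of cycles per iteration, so for cycle-exact counts you would still unroll it, as in the nops.h approach):

/* delay_isb -- approximate busy delay built from ISB instructions.
 * __ISB() comes from the CMSIS core header, pulled in by your device header. */
static inline void delay_isb(unsigned int n)
{
    while (n--)
        __ISB();
}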
Edit: There is a better answer from another SO Q&A here; however, it is in assembly. AFAIK, using a counter like SysTick is the only way to guarantee any semblance of cycle accuracy.
Edit 2: To avoid a counter overflow, which would result in a very, very long delay, clear the SysTick counter before use, i.e. SysTick->VAL = 0;
Original:
Cortex-M parts have a built-in timer called SysTick which can be used for cycle-accurate timing purposes.
First set the reload value, clear the counter, and enable the timer:
SysTick->LOAD = SysTick_LOAD_RELOAD_Msk;   /* count down over the full 24-bit range */
SysTick->VAL  = 0;                         /* clear the current value */
SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk |
                SysTick_CTRL_ENABLE_Msk;
Then you can read the current count using the VAL register. You can then implement a cycle-accurate delay this way:
uint32_t start = SysTick->VAL;                        /* SysTick counts DOWN */
while (((start - SysTick->VAL) & 0xFFFFFF) < 30);     /* elapsed cycles, wrap-safe for the 24-bit counter */
Note that this will introduce some overhead because of the load, compare and branch in the loop so the final cycle count will be a little off, no more than a few ticks in my estimation.
You can use a free-running up-counter as follows:
uint32_t t = <periph>.count;
while ((<periph>.count - t) < delay);
As long as delay is less than half the period of the counter, this is unaffected by wrapping of the counter value - the unsigned arithmetic produces the correct time delta.
Note that since you don't need to control the counter's value in any way, you can use any such counter in the system - even if it's being used for another purpose (as long, of course, as it really is running continuously and freely, and at a rate that gives you the timing resolution that you require).
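As a concrete instance of this pattern, here is a sketch using the DWT cycle counter, assuming a Cortex-M3 or later (Cortex-M0/M0+ parts do not have DWT->CYCCNT, so there you would substitute whatever free-running counter you do have):

#include <stdint.h>
#include "stm32f4xx.h"   /* assumption: your device header, which pulls in the CMSIS core definitions */

static void cyccnt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the DWT block */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start cycle counting */
}

static inline void delay_cycles(uint32_t delay)
{
    uint32_t t = DWT->CYCCNT;

    /* unsigned subtraction gives the correct delta even across a counter wrap */
    while ((DWT->CYCCNT - t) < delay)
        ;
}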
I am trying to create a short tune using timers on the 8051. I am trying to send a square wave with a specified frequency to create the notes.
However, with my current code all I am getting is one infinite note, that never stops playing. Any help figuring out how to stop the note, and create a duration function would be greatly appreciated.
#include <reg932.h>

sbit speaker = P1^7;

void tone(unsigned char, unsigned char);

void main()
{
    P1M1 = 0;
    P1M2 = 0;
    tone(0xC8, 0xF3);
}

void tone(unsigned char highval, unsigned char lowval)
{
    TMOD = 0x01;
    TL0 = lowval;
    TH0 = highval;
    TR0 = 1;
    while (TF0 == 0);
    speaker = 0;
    TR0 = 0;
    TF0 = 0;
}
I haven't programmed 8051 devices in a long time, but here's what I'd do:
1.a. Figure out if it's tone() that never exits.
1.b. If it is, make sure the while loop is actually there (check the disassembly of tone()); if it's not, the compiler has optimized the check out and that needs fixing (e.g. by declaring TF0 as volatile).
1.c. See if the check is correct (the right bit in the right register, etc.).
2. Write an assembly routine to waste N CPU clocks, using the slowest instruction (was MUL or DIV the slowest?) in a loop, or simply repeated M times, so that you get roughly a 10 ms delay; then write a C function that calls that routine as many times as necessary (e.g. 100 times for 1 second), as sketched below. (You could use a timer here, but this may be the simplest.)
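A hedged sketch of item 2 (the routine name and the 10 ms granularity are made up for illustration):

/* waste_10ms() would be the hand-written assembly routine that burns roughly 10 ms. */
extern void waste_10ms(void);

void delay_ms(unsigned int ms)
{
    unsigned int i;

    for (i = 0; i < ms / 10; i++)      /* e.g. 100 calls for 1 second */
        waste_10ms();
}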
Okay, so I've got some C code to perform a mathematical operation which could, pretty much, take any length of time (depending on the operands supplied to it, of course). I was wondering if there is a way to register some kind of method which will be called every n seconds which can analyse the state of the operation, i.e. what iteration it is currently at, possibly using a hardware timer interrupt or something?
The reason I ask this is because I know the common way to implement this is to be keeping track of the current iteration in a variable; say, an integer called progress and have an IF statement like this in the code:
if ((progress % 10000) == 0)
    printf("Currently at iteration %d\n", progress);
but I believe that a mod operation takes a relatively long time to execute, so the idea of having it inside a loop which will be run many, many times scares me, from an optimisation point of view.
So I get the feeling that having an external way of signalling a progress print is nice and efficient. Are there any great ways to perform this, or is the simple 'mod check' the best (in terms of optimising)?
I'd go with the mod check, but maybe with subtractions instead :-)
icount = 0;
progress = 10000;
/* ... */
if (--progress == 0) {
    progress = 10000;
    printf("Currently at iteration %d0000\n", ++icount);
}
/* ... */
While mod operations are usually slow, the compiler and the CPU's branch predictor should handle this really well, mis-predicting only once every 10,000 ifs and burning one mod operation and ~20 cycles (for the mis-prediction) on it, which is fine. So you are trying to optimize away one mod operation every 10,000 iterations. Of course this assumes you are running it on a modern and typical CPU, and not some embedded system with unknown specs. This should even be faster than having a counter variable.
Suggestion: Test it with and without the timing code, and figure out a complex solution if there is really a problem.
Premature optimisation is the root of all evil. -Knuth
mod is about the same speed as division, on most CPU's these days that means about 5-10 cycles... in other words hardly anything, slower than multiply/add/subtract, but not enough to really worry about.
However, you are right to want to avoid sitting in a loop spinning if you're doing work in another thread or something like that. If you're on a unixish system there's timer_create(), or on Linux the much easier to use timerfd_create().
But for single threaded, just putting that if in is enough.
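For completeness, the timerfd_create() route mentioned above might look roughly like this (Linux-only sketch; the counter and helper names are mine):

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/timerfd.h>

static volatile long current_iteration;   /* updated by the compute loop */

static int make_progress_timer(int seconds)
{
    struct itimerspec its = {
        .it_value    = { .tv_sec = seconds },
        .it_interval = { .tv_sec = seconds },
    };
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);

    if (fd >= 0)
        timerfd_settime(fd, 0, &its, NULL);
    return fd;
}

/* thread entry (e.g. for pthread_create): block on the timerfd and report */
static void *progress_thread(void *arg)
{
    int fd = *(int *)arg;
    uint64_t expirations;

    while (read(fd, &expirations, sizeof expirations) == sizeof expirations)
        printf("Currently at iteration %ld\n", current_iteration);
    return NULL;
}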
Use alarm or setitimer to raise SIGALRM signals at regular intervals.
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

struct itimerval interval;

void handler( int x ) {
    write( STDOUT_FILENO, ".", 1 ); /* Defined in POSIX, not in C */
}

int main() {
    signal( SIGALRM, &handler );

    interval.it_value.tv_sec = 5;    /* display after 5 seconds */
    interval.it_interval.tv_sec = 5; /* then display every 5 seconds */
    setitimer( ITIMER_REAL, &interval, NULL );

    /* do computations */

    interval.it_value.tv_sec = 0;    /* don't display progress any more */
    interval.it_interval.tv_sec = 0;
    setitimer( ITIMER_REAL, &interval, NULL );
    printf( "\n" ); /* done with the dots! */
}
Note, only a smattering of functions (the async-signal-safe ones) are OK to call inside handler; they are listed partway down this page. If you want to communicate anything for a fancier printout, do it through a sig_atomic_t variable.
You could have a global variable for the iteration count, which you could monitor from an external thread.
while (1) {
    printf("Currently at iteration %d\n", iteration);
    sleep(1);   /* or nanosleep() for finer control */
}
You may need to watch out for data races though.