What is the best way to create periodic Linux threads in C?

For my application I have the requirement of accurate periodic threads with relatively low cycle times (500 µs).
In particular, the application is the runtime system of a PLC.
Its purpose is to run an application developed by the PLC user.
Such applications are organised in programs and periodic tasks - each task with its own cycle time and priority.
Usually the application runs on systems with real-time OSs (e.g. VxWorks or Linux with the RT patch).
Currently the periodic tasks are implemented via clock_nanosleep.
Unfortunately the actual sleep time of clock_nanosleep is disturbed by other threads - even those with lower priority.
Once every second, the sleep time is exceeded by about 50 ms.
I've observed this on Debian 9.5, on a Raspberry Pi, and also on an ARM Linux with Preempt-RT.
Here's a sample, which shows this behavior:
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>   /* clock_gettime, clock_nanosleep */
typedef void* ThreadFun(void* param);
#define SCHEDULER_POLICY SCHED_FIFO
#define CLOCK CLOCK_MONOTONIC
#define INTERVAL_NS (10 * 1000 * 1000)
static long tickCnt = 0;
static long calcTimeDiff(struct timespec const* t1, struct timespec const* t2)
{
long diff = t1->tv_nsec - t2->tv_nsec;
diff += 1000000000 * (t1->tv_sec - t2->tv_sec);
return diff;
}
static void updateWakeTime(struct timespec* time)
{
uint64_t nanoSec = time->tv_nsec;
struct timespec currentTime;
clock_gettime(CLOCK, &currentTime);
while (calcTimeDiff(time, &currentTime) <= 0)
{
nanoSec = time->tv_nsec;
nanoSec += INTERVAL_NS;
time->tv_nsec = nanoSec % 1000000000;
time->tv_sec += nanoSec / 1000000000;
}
}
static void* tickThread(void *param)
{
struct timespec sleepStart;
struct timespec currentTime;
struct timespec wakeTime;
long sleepTime;
long wakeDelay;
clock_gettime(CLOCK, &wakeTime);
wakeTime.tv_sec += 2;
wakeTime.tv_nsec = 0;
while (1)
{
clock_gettime(CLOCK, &sleepStart);
clock_nanosleep(CLOCK, TIMER_ABSTIME, &wakeTime, NULL);
clock_gettime(CLOCK, &currentTime);
sleepTime = calcTimeDiff(&currentTime, &sleepStart);
wakeDelay = calcTimeDiff(&currentTime, &wakeTime);
if (wakeDelay > INTERVAL_NS)
{
printf("sleep req=%-ld.%-ld start=%-ld.%-ld curr=%-ld.%-ld sleep=%-ld delay=%-ld\n",
(long) wakeTime.tv_sec, (long) wakeTime.tv_nsec,
(long) sleepStart.tv_sec, (long) sleepStart.tv_nsec,
(long) currentTime.tv_sec, (long) currentTime.tv_nsec,
sleepTime, wakeDelay);
}
tickCnt += 1;
updateWakeTime(&wakeTime);
}
}
static void* workerThread(void *param)
{
while (1)
{
}
}
static int createThread(char const* funcName, ThreadFun* func, int prio)
{
pthread_t tid = 0;
pthread_attr_t threadAttr;
struct sched_param schedParam;
printf("thread create func=%s prio=%d\n", funcName, prio);
pthread_attr_init(&threadAttr);
pthread_attr_setschedpolicy(&threadAttr, SCHEDULER_POLICY);
pthread_attr_setinheritsched(&threadAttr, PTHREAD_EXPLICIT_SCHED);
schedParam.sched_priority = prio;
pthread_attr_setschedparam(&threadAttr, &schedParam);
if (pthread_create(&tid, &threadAttr, func, NULL) != 0)
{
return -1;
}
printf("thread created func=%s prio=%d\n", funcName, prio);
return 0;
}
#define CREATE_THREAD(func,prio) createThread(#func,func,prio)
int main(int argc, char*argv[])
{
int minPrio = sched_get_priority_min(SCHEDULER_POLICY);
int maxPrio = sched_get_priority_max(SCHEDULER_POLICY);
int prioRange = maxPrio - minPrio;
CREATE_THREAD(tickThread, maxPrio);
CREATE_THREAD(workerThread, minPrio + prioRange / 4);
sleep(10);
printf("%ld ticks\n", tickCnt);
}
Is something wrong in my code sample?
Is there a better (more reliable) way to create periodic threads?

For my application I have the requirement of accurate periodic threads with relatively low cycle times (500 µs)
Probably too strong a requirement. Linux is not a hard real-time OS.
I would suggest having fewer threads (perhaps a small fixed set - only 2 or 3 - organized in a thread pool; see this for an explanation, remembering that a Raspberry Pi 3B+ has only 4 cores). You might prefer a single thread (think of a design around an event loop, inspired by continuation-passing style).
You probably don't need periodic threads. You need some periodic activity. They could all happen in the same thread. (The kernel reschedules tasks perhaps every 50 or 100 ms, even if it is capable of sleeping for shorter intervals, and if tasks get rescheduled very frequently - e.g. every millisecond - their scheduling has a cost.)
So read time(7) carefully.
Consider using timer_create(2), or even better timerfd_create(2) used in an event loop around poll(2).
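As a rough sketch of that timerfd approach (not taken from the original answer; the 500 µs period, the single-threaded loop and the minimal error handling are simplifying assumptions):
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    /* Timer that delivers expirations through a file descriptor. */
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (tfd < 0) { perror("timerfd_create"); return 1; }

    /* First expiration after 500 µs, then every 500 µs. */
    struct itimerspec its = {
        .it_value    = { .tv_sec = 0, .tv_nsec = 500 * 1000 },
        .it_interval = { .tv_sec = 0, .tv_nsec = 500 * 1000 },
    };
    if (timerfd_settime(tfd, 0, &its, NULL) < 0) { perror("timerfd_settime"); return 1; }

    struct pollfd pfd = { .fd = tfd, .events = POLLIN };
    for (;;) {
        if (poll(&pfd, 1, -1) < 0) { perror("poll"); break; }
        uint64_t expirations;
        if (read(tfd, &expirations, sizeof expirations) != (ssize_t)sizeof expirations) break;
        /* expirations > 1 means ticks were missed; run the periodic activity here. */
    }
    close(tfd);
    return 0;
}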
On a Raspberry Pi, you won't get guaranteed 500 µs delays. This is probably impossible (the hardware might not be powerful enough, and the Linux OS is not hard real-time). I feel your expectations are not reasonable.

Related

Sampling Loop coded in C is inaccurate

I have a piece of code which I am using to create a sampling loop with high accuracy. It is in C and runs on a Raspberry Pi 3.
The sampling loop specifies 5 ms and is printing 5 ms. But occasionally, I am seeing a number other than 5 ms printed on the terminal. It does not happen often, but it is still a crucial issue for my application.
Attached is a video of what I mean. It happens around the 40-second mark.
Sampling Loop Occasionally shows a number other than 5 ms:
https://youtu.be/SNcLf3Zg3_I?t=40
I would like to get some help to debug the issue. Is there something wrong with the code?
The code is as below:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>   /* usleep */
#include <sys/time.h>
#define DT 0.005 //5ms
int mymillis();
int timeval_subtract(struct timeval *result, struct timeval *t2, struct timeval *t1);
int main() {
int startInt = mymillis();
struct timeval tvBegin, tvEnd, tvDiff;
gettimeofday(&tvBegin, NULL);
while (1)
{
startInt = mymillis();
//Each loop should be at least 5ms.
while (mymillis() - startInt < (DT * 1000))
{
usleep(100);
}
printf("Loop Time %d\n", mymillis() - startInt);
}
return 0;
}
int mymillis()
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (tv.tv_sec) * 1000 + (tv.tv_usec) / 1000;
}
int timeval_subtract(struct timeval *result, struct timeval *t2, struct timeval *t1)
{
long int diff = (t2->tv_usec + 1000000 * t2->tv_sec) - (t1->tv_usec + 1000000 * t1->tv_sec);
result->tv_sec = diff / 1000000;
result->tv_usec = diff % 1000000;
return (diff<0);
}
Thank You.
man usleep specifies:
The usleep() function suspends execution of the calling thread for (at least) usec microseconds. The sleep may be lengthened slightly by any system activity or by the time spent processing the call or by the granularity of system timers.
There's no guarantee that the sleep will last exactly that long. Even less can it be guaranteed that two time samples taken with a usleep in between will be exactly 5 ms apart.
Additionally, you might sample the lengths of the runs that report 5 ms to see if the oversleeps happen randomly, which would probably finally close the case.
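As a hedged sketch of that sampling idea (not part of the original answer), one could time each iteration with clock_gettime(CLOCK_MONOTONIC) instead of the millisecond-granular mymillis(), so the occasional oversleep shows up with microsecond detail:
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Returns t2 - t1 in microseconds. */
static long diff_us(const struct timespec *t2, const struct timespec *t1)
{
    return (t2->tv_sec - t1->tv_sec) * 1000000L + (t2->tv_nsec - t1->tv_nsec) / 1000L;
}

int main(void)
{
    struct timespec start, now;
    for (int i = 0; i < 10000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
            usleep(100);                       /* same coarse wait as the original loop */
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while (diff_us(&now, &start) < 5000);
        long actual = diff_us(&now, &start);
        if (actual > 5500)                     /* flag iterations that overslept by more than 0.5 ms */
            printf("iteration %d took %ld us\n", i, actual);
    }
    return 0;
}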

Run code for x amount of time

To preface, I am on a Unix (linux) system using gcc.
What I am stuck on is how to accurately implement a way to run a section of code for a certain amount of time.
Here is an example of something I have been working with:
struct timeb start, check;
int64_t duration = 10000;
int64_t elapsed = 0;
ftime(&start);
while ( elapsed < duration ) {
// do a set of tasks
ftime(&check);
elapsed += ((check.time - start.time) * 1000) + (check.millitm - start.millitm);
}
I was thinking this would have carried on for 10000 ms or 10 seconds, but it didn't; it finished almost instantly. I was basing this off other questions such as How to get the time elapsed in C in milliseconds? (Windows). But then I thought that if, upon the first call of ftime, the struct is time = 1, millitm = 999 and on the second call time = 2, millitm = 01, it would be calculating the elapsed time as being 1002 milliseconds. Is there something I am missing?
Also the suggestions in the various Stack Overflow questions, ftime() and gettimeofday(), are listed as deprecated or legacy.
I believe I could convert the start time into milliseconds, and the check time into milliseconds, then subtract start from check. But milliseconds since the epoch requires 42 bits and I'm trying to keep everything in the loop as efficient as possible.
What approach could I take towards this?
The code calculates the elapsed time incorrectly.
// elapsed += ((check.time - start.time) * 1000) + (check.millitm - start.millitm);
elapsed = ((check.time - start.time) * (int64_t)1000) + (check.millitm - start.millitm);
There is some concern about check.millitm - start.millitm. Given the declaration of struct timeb below, millitm can be expected to be promoted to int before the subtraction occurs, so the difference will be in the range [-1000 ... 1000].
struct timeb {
time_t time;
unsigned short millitm;
short timezone;
short dstflag;
};
IMO, more robust code would handle ms conversion in a separate helper function. This matches OP's "I believe I could convert the start time into milliseconds, and the check time into millseconds, then subtract start from check."
int64_t timeb_to_ms(struct timeb *t) {
return (int64_t)t->time * 1000 + t->millitm;
}
struct timeb start, check;
ftime(&start);
int64_t start_ms = timeb_to_ms(&start);
int64_t duration = 10000 /* ms */;
int64_t elapsed = 0;
while (elapsed < duration) {
// do a set of tasks
struct timeb check;
ftime(&check);
elapsed = timeb_to_ms(&check) - start_ms;
}
If you want efficiency, let the system send you a signal when a timer expires.
Traditionally, you can set a timer with a resolution in seconds with the alarm(2) syscall.
The system then sends you a SIGALRM when the timer expires. The default disposition of that signal is to terminate.
If you handle the signal, you can longjmp(3) from the handler to another place.
I don't think it gets much more efficient than SIGALRM + longjmp (with an asynchronous timer, your code basically runs undisturbed without having to do any extra checks or calls).
Below is an example for you:
#define _XOPEN_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <setjmp.h>
static jmp_buf jmpbuf;
void hndlr();
void loop();
int main(){
/*sysv_signal handlers get reset after a signal is caught and handled*/
if(SIG_ERR==sysv_signal(SIGALRM,hndlr)){
perror("couldn't set SIGALRM handler");
return 1;
}
/*the handler will jump you back here*/
setjmp(jmpbuf);
if(0>alarm(3/*seconds*/)){
perror("couldn't set alarm");
return 1;
}
loop();
return 0;
}
void hndlr(){
puts("Caught SIGALRM");
puts("RESET");
longjmp(jmpbuf,1);
}
void loop(){
int i;
for(i=0; ; i++){
//print every 100-millionth iteration
if(0==i%100000000){
printf("%d\n", i);
}
}
}
If alarm(2) isn't enough, you can use timer_create(2) as EOF suggests.
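For completeness, here is a minimal sketch of the timer_create(2) route under the stated assumption that a SIGALRM handler just sets a flag checked by the work loop (link with -lrt on older glibc):
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static volatile sig_atomic_t expired = 0;

static void on_alarm(int sig)
{
    (void)sig;
    expired = 1;              /* only set a flag; the work loop polls it */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0) { perror("sigaction"); return 1; }

    /* One-shot POSIX timer that sends SIGALRM when it expires. */
    struct sigevent sev;
    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGALRM;
    timer_t timer;
    if (timer_create(CLOCK_MONOTONIC, &sev, &timer) != 0) { perror("timer_create"); return 1; }

    struct itimerspec its;
    memset(&its, 0, sizeof its);
    its.it_value.tv_sec = 10;  /* run the work below for 10 seconds */
    if (timer_settime(timer, 0, &its, NULL) != 0) { perror("timer_settime"); return 1; }

    long iterations = 0;
    while (!expired) {
        /* do a set of tasks */
        iterations++;
    }
    printf("stopped after %ld iterations\n", iterations);
    timer_delete(timer);
    return 0;
}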

Variable performance of busy wait loop?

I am evaluating the performance of a busy wait loop for firing events at consistent intervals. I have noticed some odd behavior using the following code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
int timespec_subtract(struct timespec *, struct timespec, struct timespec);
int main(int argc, char *argv[]) {
int iterations = atoi(argv[1])+1;
struct timespec t[2], diff;
for (int i = 0; i < iterations; i++) {
clock_gettime(CLOCK_MONOTONIC, &t[0]);
static volatile int i;
for (i = 0; i < 200000; i++)
;
clock_gettime(CLOCK_MONOTONIC, &t[1]);
timespec_subtract(&diff, t[1], t[0]);
printf("%ld\n", diff.tv_sec * 1000000000 + diff.tv_nsec);
}
}
/* Definition of the helper declared above (omitted from the original post); assumed to compute t2 - t1. */
int timespec_subtract(struct timespec *result, struct timespec t2, struct timespec t1)
{
long long d = (long long)(t2.tv_sec - t1.tv_sec) * 1000000000LL + (t2.tv_nsec - t1.tv_nsec);
result->tv_sec = d / 1000000000LL;
result->tv_nsec = d % 1000000000LL;
return d < 0;
}
On the test machine (dual 14-core E5-2683 v3 @ 2.00 GHz, 256 GB DDR4), 200k iterations of the for loop take approximately 1 ms. Or maybe not:
1030854
1060237
1012797
1011479
1025307
1017299
1011001
1038725
1017361
... (about 700 lines later)
638466
638546
638446
640422
638468
638457
638468
638398
638493
640242
... (about 200 lines later)
606460
607013
606449
608813
606542
606484
606990
606436
606491
606466
... (about 3000 lines later)
404367
404307
404309
404306
404270
404370
404280
404395
404342
406005
When the times shift down the third time, they stay mostly consistent (within about 2 or 3 microseconds), except for occasionally jumping up to about 450us for a few hundred iterations. This behavior is repeatable on similar machines and over many runs.
I understand that busy loops can be optimized out by the compiler, but I don't think that's the issue here. I don't think cache should be affecting it, because no invalidation should be taking place, and wouldn't explain the sudden optimization. I also tried using a register int for the loop counter, with no noticeable effect.
Any thoughts on what is going on, and how to make this (more) consistent?
EDIT: For information, running this program with usleep, nanosleep, or the shown busy wait for 10k iterations all show ~20000 involuntary context switches with time -v.
I'd make two points:
- Due to context switching, sleep/usleep may sleep for more time than expected.
- Moreover, if there is some higher-priority task, such as interrupt handling, there may come a situation where your sleep is not serviced on time at all.
Thus, if you want an exact delay in your application, you can use gettimeofday to calculate the time gap, which can then be subtracted from the delay passed to the sleep/usleep call.
One big issue with busy waiting is that, besides using up CPU resources, the amount of time you wait will be highly dependent on the CPU clock speed. So the same loop can run for wildly different times on different machines.
The problem with any method of sleeping is that due to OS scheduling you may end up sleeping for longer than intended. The man page for nanosleep says that it will use the rem argument to tell you the remaining time in case you received a signal, but it says nothing about waiting too long.
You need to grab the timestamp after each call to usleep so you know how long you actually slept for. If you slept too short, you add the deficit. If you slept too long, you subtract the overage.
Here's an example of how I did this in UFTP, a multicast file transfer application, in order to send packets at a consistent speed:
int64_t diff_usec(struct timeval t2, struct timeval t1)
{
return (t2.tv_usec - t1.tv_usec) +
(int64_t)1000000 * (t2.tv_sec - t1.tv_sec);
}
...
int32_t packet_wait = 10000;
int64_t overage = 0, tdiff;
struct timeval current_sent, last_sent;
gettimeofday(&last_sent, NULL);
while(...) {
...
if (packet_wait > overage) {
usleep(packet_wait - (int32_t)overage);
}
gettimeofday(&current_sent, NULL);
tdiff = diff_usec(current_sent, last_sent);
overage += tdiff - packet_wait;
last_sent = current_sent;
...
}

Why is the multithreaded version of this program slower?

I am trying to learn pthreads and I have been experimenting with a program that tries to detect changes to an array. Function array_modifier() picks a random element and toggles its value (1 to 0 and vice versa) and then sleeps for some time (big enough so race conditions do not appear; I know this is bad practice). change_detector() scans the array and, when an element doesn't match its prior value and is equal to 1, the change is detected and the diff array is updated with the detection delay.
When there is one change_detector() thread (NTHREADS==1) it has to scan the whole array. When there are more threads each is assigned a portion of the array. Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#define TIME_INTERVAL 100
#define CHANGES 5000
#define UNUSED(x) ((void) x)
typedef struct {
unsigned int tid;
} parm;
static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;
static unsigned long int diff[NTHREADS] = {0};
void* array_modifier(void* args);
void* change_detector(void* arg);
int main(int argc, char** argv) {
if (argc < 2) {
exit(1);
}
N = (unsigned int)strtoul(argv[1], NULL, 0);
my_array = calloc(N, sizeof(int));
time_array = malloc(N * sizeof(struct timeval));
old_value = calloc(N, sizeof(int));
parm* p = malloc(NTHREADS * sizeof(parm));
pthread_t generator_thread;
pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));
for (unsigned int i = 0; i < NTHREADS; i++) {
p[i].tid = i;
pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
}
pthread_create(&generator_thread, NULL, array_modifier, NULL);
pthread_join(generator_thread, NULL);
usleep(500);
for (unsigned int i = 0; i < NTHREADS; i++) {
pthread_cancel(detector_thread[i]);
}
for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
fprintf(stderr, "\n");
_exit(0);
}
void* array_modifier(void* arg) {
UNUSED(arg);
srand(time(NULL));
unsigned int changing_signals = CHANGES;
while (changing_signals--) {
usleep(TIME_INTERVAL);
const unsigned int r = rand() % N;
gettimeofday(&time_array[r], NULL);
my_array[r] ^= 1;
}
pthread_exit(NULL);
}
void* change_detector(void* arg) {
const parm* p = (parm*) arg;
const unsigned int tid = p->tid;
const unsigned int start = tid * (N / NTHREADS) +
(tid < N % NTHREADS ? tid : N % NTHREADS);
const unsigned int end = start + (N / NTHREADS) +
(tid < N % NTHREADS);
unsigned int r = start;
while (1) {
unsigned int tmp;
while ((tmp = my_array[r]) == old_value[r]) {
r = (r < end - 1) ? r + 1 : start;
}
old_value[r] = tmp;
if (tmp) {
struct timeval tv;
gettimeofday(&tv, NULL);
// detection time in usec
diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
}
}
}
when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100
I get:
665
but when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100
I get:
152 190 164 242
(this sums up to 748).
So, the delay for the multithreaded program is larger.
My cpu has 6 cores.
Short Answer
You are sharing memory between threads, and sharing memory between threads is slow.
Long Answer
Your program uses one thread to write to my_array and a number of other threads to read from it. Effectively my_array is shared by a number of threads.
Now let's assume you are benchmarking on a multicore machine; you are probably hoping that the OS will assign different cores to each thread.
Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance, CPUs have multi-level caches. The fastest cache is the small L1 cache. A core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20-30 cycles.
Now, in many CPU architectures each core has its own L1 cache but the L2 cache is shared. This means any data that is shared between threads (cores) has to go through the L2 cache, which is much slower than the L1 cache. This means that shared memory access tends to be quite slow.
Bottom line is that if you want your multithreaded programs to perform well you need to ensure that threads do not share memory. Sharing memory is slow.
Aside
Never rely on volatile to do the correct thing when sharing memory between threads; either use your library's atomic operations or use mutexes. This is because some CPUs allow out-of-order reads and writes that may do strange things if you do not know what you are doing.
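As an illustration of that aside (a generic sketch, not a rewrite of the program above), sharing a value through C11 <stdatomic.h> instead of volatile could look like this:
#include <stdatomic.h>

/* One slot written by a modifier thread and read by a detector thread. */
static _Atomic unsigned int shared_value;

/* Writer: publish a new value; the release ordering makes everything written
 * before this store (e.g. a timestamp) visible to an acquiring reader. */
void publish(unsigned int v)
{
    atomic_store_explicit(&shared_value, v, memory_order_release);
}

/* Reader: spin until the value changes, pairing with the writer's release. */
unsigned int wait_for_change(unsigned int last_seen)
{
    unsigned int v;
    do {
        v = atomic_load_explicit(&shared_value, memory_order_acquire);
    } while (v == last_seen);
    return v;
}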
It is rare that a multithreaded program scales perfectly with the number of threads. In your case you measured a speed-up factor of about 0.9 (665/748) with 4 threads. That is not so good.
Here are some factors to consider:
The overhead of starting threads and dividing the work. For small jobs the cost of starting additional threads can be considerably larger than the actual work. Not applicable to this case, since the overhead isn't included in the time measurements.
"Random" variations. Your threads varied between 152 and 242. You should run the test multiple times and use either the mean or the median values.
The size of the test. Generally you get more reliable measurements on larger tests (more data). However, you need to consider how having more data affects the caching in the L1/L2/L3 caches. And if the data is too large to fit into RAM you need to factor in disk I/O. This usually works against multithreaded implementations, because they want to work on more data at a time, but in rare instances they can be faster, a phenomenon called super-linear speedup.
Overhead caused by inter-thread communication. Maybe not a factor in your case, since you don't have much of that.
Overhead caused by resource locking. Usually has a low impact on cpu utilization but may have a large impact on the total real time used.
Hardware optimizations. Some CPUs change the clock frequency depending on how many cores you use.
The cost of the measurement itself. In your case a change will be detected within 25 (100/4) iterations of the for loop. Each iteration takes but a few clock cycles. Then you call gettimeofday which probably costs thousands of clock cycles. So what you are actually measuring is more or less the cost of calling gettimeofday.
I would increase the number of values to check and the cost to check each value. I would also consider turning off compiler optimizations, since these can cause the program to do unexpected things (or skip some things entirely).

clock_gettime alternative in Mac OS X

When compiling a program I wrote on Mac OS X after installing the necessary libraries through MacPorts, I get this error:
In function 'nanotime':
error: 'CLOCK_REALTIME' undeclared (first use in this function)
error: (Each undeclared identifier is reported only once
error: for each function it appears in.)
It appears that clock_gettime is not implemented in Mac OS X. Is there an alternative means of getting the epoch time in nanoseconds? Unfortunately gettimeofday is in microseconds.
After hours of perusing different answers, blogs, and headers, I found a portable way to get the current time:
#include <time.h>
#include <sys/time.h>
#ifdef __MACH__
#include <mach/clock.h>
#include <mach/mach.h>
#endif
struct timespec ts;
#ifdef __MACH__ // OS X does not have clock_gettime, use clock_get_time
clock_serv_t cclock;
mach_timespec_t mts;
host_get_clock_service(mach_host_self(), CALENDAR_CLOCK, &cclock);
clock_get_time(cclock, &mts);
mach_port_deallocate(mach_task_self(), cclock);
ts.tv_sec = mts.tv_sec;
ts.tv_nsec = mts.tv_nsec;
#else
clock_gettime(CLOCK_REALTIME, &ts);
#endif
or check out this gist: https://gist.github.com/1087739
Hope this saves someone time. Cheers!
None of the solutions above answers the question. Either they don't give you absolute Unix time, or their accuracy is 1 microsecond. The most popular solution by jbenet is slow (~6000 ns) and does not count in nanoseconds even though its return type suggests so. Below is a test of the two solutions suggested by jbenet and Dmitri B, plus my take on this. You can run the code without changes.
The 3rd solution does count in nanoseconds and gives you absolute Unix time reasonably fast (~90 ns). So if someone finds it useful - please let us all know here :-). I will stick to the one from Dmitri B (solution #1 in the code) - it fits my needs better.
I needed a commercial-quality alternative to clock_gettime() to make pthread_…timed.. calls, and found this discussion very helpful. Thanks guys.
/*
Ratings of alternatives to clock_gettime() to use with pthread timed waits:
Solution 1 "gettimeofday":
Complexity : simple
Portability : POSIX 1
timespec : easy to convert from timeval to timespec
granularity : 1000 ns,
call : 120 ns,
Rating : the best.
Solution 2 "host_get_clock_service, clock_get_time":
Complexity : simple (error handling?)
Portability : Mac specific (is it always available?)
timespec : yes (struct timespec return)
granularity : 1000 ns (don't be fooled by timespec format)
call time : 6000 ns
Rating : the worst.
Solution 3 "mach_absolute_time + gettimeofday once":
Complexity : simple..average (requires initialisation)
Portability : Mac specific. Always available
timespec : system clock can be converted to timespec without float-math
granularity : 1 ns.
call time : 90 ns unoptimised.
Rating : not bad, but do we really need nanoseconds timeout?
References:
- OS X is UNIX System 3 [U03] certified
http://www.opengroup.org/homepage-items/c987.html
- UNIX System 3 <--> POSIX 1 <--> IEEE Std 1003.1-1988
http://en.wikipedia.org/wiki/POSIX
http://www.unix.org/version3/
- gettimeofday() is mandatory on U03,
clock_..() functions are optional on U03,
clock_..() are part of POSIX Realtime extensions
http://www.unix.org/version3/inttables.pdf
- clock_gettime() is not available on MacMini OS X
(Xcode > Preferences > Downloads > Command Line Tools = Installed)
- OS X recommends to use gettimeofday to calculate values for timespec
https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/pthread_cond_timedwait.3.html
- timeval holds microseconds, timespec - nanoseconds
http://www.gnu.org/software/libc/manual/html_node/Elapsed-Time.html
- microtime() is used by kernel to implement gettimeofday()
http://ftp.tw.freebsd.org/pub/branches/7.0-stable/src/sys/kern/kern_time.c
- mach_absolute_time() is really fast
http://www.opensource.apple.com/source/Libc/Libc-320.1.3/i386/mach/mach_absolute_time.c
- Only 9 decimal digits have meaning when int nanoseconds are converted to double seconds
Tutorial: Performance and Time post uses .12 precision for nanoseconds
http://www.macresearch.org/tutorial_performance_and_time
Example:
Three ways to prepare absolute time 1500 milliseconds in the future to use with pthread timed functions.
Output, N = 3, stock MacMini, OSX 10.7.5, 2.3GHz i5, 2GB 1333MHz DDR3:
inittime.tv_sec = 1390659993
inittime.tv_nsec = 361539000
initclock = 76672695144136
get_abs_future_time_0() : 1390659994.861599000
get_abs_future_time_0() : 1390659994.861599000
get_abs_future_time_0() : 1390659994.861599000
get_abs_future_time_1() : 1390659994.861618000
get_abs_future_time_1() : 1390659994.861634000
get_abs_future_time_1() : 1390659994.861642000
get_abs_future_time_2() : 1390659994.861643671
get_abs_future_time_2() : 1390659994.861643877
get_abs_future_time_2() : 1390659994.861643972
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h> /* gettimeofday */
#include <mach/mach_time.h> /* mach_absolute_time */
#include <mach/mach.h> /* host_get_clock_service, mach_... */
#include <mach/clock.h> /* clock_get_time */
#define BILLION 1000000000L
#define MILLION 1000000L
#define NORMALISE_TIMESPEC( ts, uint_milli ) \
do { \
ts.tv_sec += uint_milli / 1000u; \
ts.tv_nsec += (uint_milli % 1000u) * MILLION; \
ts.tv_sec += ts.tv_nsec / BILLION; \
ts.tv_nsec = ts.tv_nsec % BILLION; \
} while (0)
static mach_timebase_info_data_t timebase = { 0, 0 }; /* numer = 0, denom = 0 */
static struct timespec inittime = { 0, 0 }; /* nanoseconds since 1-Jan-1970 to init() */
static uint64_t initclock; /* ticks since boot to init() */
void init()
{
struct timeval micro; /* microseconds since 1 Jan 1970 */
if (mach_timebase_info(&timebase) != 0)
abort(); /* very unlikely error */
if (gettimeofday(&micro, NULL) != 0)
abort(); /* very unlikely error */
initclock = mach_absolute_time();
inittime.tv_sec = micro.tv_sec;
inittime.tv_nsec = micro.tv_usec * 1000;
printf("\tinittime.tv_sec = %ld\n", inittime.tv_sec);
printf("\tinittime.tv_nsec = %ld\n", inittime.tv_nsec);
printf("\tinitclock = %ld\n", (long)initclock);
}
/*
* Get absolute future time for pthread timed calls
* Solution 1: microseconds granularity
*/
struct timespec get_abs_future_time_coarse(unsigned milli)
{
struct timespec future; /* ns since 1 Jan 1970 to 1500 ms in the future */
struct timeval micro = {0, 0}; /* 1 Jan 1970 */
(void) gettimeofday(&micro, NULL);
future.tv_sec = micro.tv_sec;
future.tv_nsec = micro.tv_usec * 1000;
NORMALISE_TIMESPEC( future, milli );
return future;
}
/*
* Solution 2: via clock service
*/
struct timespec get_abs_future_time_served(unsigned milli)
{
struct timespec future;
clock_serv_t cclock;
mach_timespec_t mts;
host_get_clock_service(mach_host_self(), CALENDAR_CLOCK, &cclock);
clock_get_time(cclock, &mts);
mach_port_deallocate(mach_task_self(), cclock);
future.tv_sec = mts.tv_sec;
future.tv_nsec = mts.tv_nsec;
NORMALISE_TIMESPEC( future, milli );
return future;
}
/*
* Solution 3: nanosecond granularity
*/
struct timespec get_abs_future_time_fine(unsigned milli)
{
struct timespec future; /* ns since 1 Jan 1970 to 1500 ms in future */
uint64_t clock; /* ticks since init */
uint64_t nano; /* nanoseconds since init */
clock = mach_absolute_time() - initclock;
nano = clock * (uint64_t)timebase.numer / (uint64_t)timebase.denom;
future = inittime;
future.tv_sec += nano / BILLION;
future.tv_nsec += nano % BILLION;
NORMALISE_TIMESPEC( future, milli );
return future;
}
#define N 3
int main()
{
int i, j;
struct timespec time[3][N];
struct timespec (*get_abs_future_time[])(unsigned milli) =
{
&get_abs_future_time_coarse,
&get_abs_future_time_served,
&get_abs_future_time_fine
};
init();
for (j = 0; j < 3; j++)
for (i = 0; i < N; i++)
time[j][i] = get_abs_future_time[j](1500); /* now() + 1500 ms */
for (j = 0; j < 3; j++)
for (i = 0; i < N; i++)
printf("get_abs_future_time_%d() : %10ld.%09ld\n",
j, time[j][i].tv_sec, time[j][i].tv_nsec);
return 0;
}
In effect, it seems not to be implemented for macOS before Sierra 10.12. You may want to look at this blog entry. The main idea is in the following code snippet:
#include <mach/mach_time.h>
#define ORWL_NANO (+1.0E-9)
#define ORWL_GIGA UINT64_C(1000000000)
static double orwl_timebase = 0.0;
static uint64_t orwl_timestart = 0;
struct timespec orwl_gettime(void) {
// be more careful in a multithreaded environment
if (!orwl_timestart) {
mach_timebase_info_data_t tb = { 0 };
mach_timebase_info(&tb);
orwl_timebase = tb.numer;
orwl_timebase /= tb.denom;
orwl_timestart = mach_absolute_time();
}
struct timespec t;
double diff = (mach_absolute_time() - orwl_timestart) * orwl_timebase;
t.tv_sec = diff * ORWL_NANO;
t.tv_nsec = diff - (t.tv_sec * ORWL_GIGA);
return t;
}
#if defined(__MACH__) && !defined(CLOCK_REALTIME)
#include <sys/time.h>
#define CLOCK_REALTIME 0
// clock_gettime is not implemented on older versions of OS X (< 10.12).
// If implemented, CLOCK_REALTIME will have already been defined.
int clock_gettime(int clk_id, struct timespec* t) {
(void)clk_id; /* unused: this fallback always behaves like CLOCK_REALTIME */
struct timeval now;
int rv = gettimeofday(&now, NULL);
if (rv) return rv;
t->tv_sec = now.tv_sec;
t->tv_nsec = now.tv_usec * 1000;
return 0;
}
#endif
Everything you need is described in Technical Q&A QA1398: Mach Absolute Time Units; basically the function you want is mach_absolute_time.
Here's a slightly earlier version of the sample code from that page that does everything using Mach calls (the current version uses AbsoluteToNanoseconds from CoreServices). In current OS X (i.e., on Snow Leopard on x86_64) the absolute time values are actually in nanoseconds and so don't actually require any conversion at all. So, if you're good and writing portable code, you'll convert, but if you're just doing something quick and dirty for yourself, you needn't bother.
FWIW, mach_absolute_time is really fast.
uint64_t GetPIDTimeInNanoseconds(void)
{
uint64_t start;
uint64_t end;
uint64_t elapsed;
uint64_t elapsedNano;
static mach_timebase_info_data_t sTimebaseInfo;
// Start the clock.
start = mach_absolute_time();
// Call getpid. This will produce inaccurate results because
// we're only making a single system call. For more accurate
// results you should call getpid multiple times and average
// the results.
(void) getpid();
// Stop the clock.
end = mach_absolute_time();
// Calculate the duration.
elapsed = end - start;
// Convert to nanoseconds.
// If this is the first time we've run, get the timebase.
// We can use denom == 0 to indicate that sTimebaseInfo is
// uninitialised because it makes no sense to have a zero
// denominator in a fraction.
if ( sTimebaseInfo.denom == 0 ) {
(void) mach_timebase_info(&sTimebaseInfo);
}
// Do the maths. We hope that the multiplication doesn't
// overflow; the price you pay for working in fixed point.
elapsedNano = elapsed * sTimebaseInfo.numer / sTimebaseInfo.denom;
printf("multiplier %u / %u\n", sTimebaseInfo.numer, sTimebaseInfo.denom);
return elapsedNano;
}
Note that macOS Sierra 10.12 now supports clock_gettime():
#include <stdio.h>
#include <time.h>
int main() {
struct timespec res;
struct timespec time;
clock_getres(CLOCK_REALTIME, &res);
clock_gettime(CLOCK_REALTIME, &time);
printf("CLOCK_REALTIME: res.tv_sec=%lu res.tv_nsec=%lu\n", res.tv_sec, res.tv_nsec);
printf("CLOCK_REALTIME: time.tv_sec=%lu time.tv_nsec=%lu\n", time.tv_sec, time.tv_nsec);
}
It does provide nanoseconds; however, the reported resolution is 1000 ns, so it is effectively limited to microseconds:
CLOCK_REALTIME: res.tv_sec=0 res.tv_nsec=1000
CLOCK_REALTIME: time.tv_sec=1475279260 time.tv_nsec=525627000
You will need Xcode 8 or later to be able to use this feature. Code compiled to use this feature will not run on older versions of Mac OS X (10.11 or earlier).
Thanks for your posts. I think you can add the following lines:
#ifdef __MACH__
#include <mach/mach_time.h>
#define CLOCK_REALTIME 0
#define CLOCK_MONOTONIC 0
int clock_gettime(int clk_id, struct timespec *t){
mach_timebase_info_data_t timebase;
mach_timebase_info(&timebase);
uint64_t time;
time = mach_absolute_time();
double nseconds = ((double)time * (double)timebase.numer)/((double)timebase.denom);
double seconds = ((double)time * (double)timebase.numer)/((double)timebase.denom * 1e9);
t->tv_sec = seconds;
t->tv_nsec = nseconds - ((double)t->tv_sec * 1e9); /* keep only the sub-second remainder */
return 0;
}
#else
#include <time.h>
#endif
Let me know what you get for latency and granularity
Maristic has the best answer here to date. Let me simplify and add a remark. #include and Init():
#include <mach/mach_time.h>
double conversion_factor;
void Init() {
mach_timebase_info_data_t timebase;
mach_timebase_info(&timebase);
conversion_factor = (double)timebase.numer / (double)timebase.denom;
}
Use as:
uint64_t t1, t2;
Init();
t1 = mach_absolute_time();
/* profiled code here */
t2 = mach_absolute_time();
double duration_ns = (double)(t2 - t1) * conversion_factor;
Such a timer has a latency of 65 ns +/- 2 ns (2 GHz CPU). Use this if you need the "time evolution" of a single execution. Otherwise loop your code 10000 times and profile it even with gettimeofday(), which is portable (POSIX) and has a latency of 100 ns +/- 0.5 ns (though only 1 µs granularity).
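A minimal sketch of that loop-and-average profiling with gettimeofday (the loop count and the profiled work() are placeholders, and an optimising compiler may elide an empty body):
#include <stdio.h>
#include <sys/time.h>

/* Placeholder for the code under test. */
static void work(void) { /* ... */ }

int main(void)
{
    enum { N = 10000 };
    struct timeval t1, t2;
    gettimeofday(&t1, NULL);
    for (int i = 0; i < N; i++)
        work();
    gettimeofday(&t2, NULL);
    long long total_us = (t2.tv_sec - t1.tv_sec) * 1000000LL + (t2.tv_usec - t1.tv_usec);
    /* The ~1 µs granularity of gettimeofday is amortised over N calls. */
    printf("avg = %.3f us per call\n", (double)total_us / N);
    return 0;
}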
I tried the version with clock_get_time, and did cache the host_get_clock_service call. It's way slower than gettimeofday; it takes several microseconds per invocation. And, what's worse, the return value has steps of 1000, i.e. it's still microsecond granularity.
I'd advise using gettimeofday, and multiplying tv_usec by 1000.
Based on the open source mach_absolute_time.c we can see that the line extern mach_port_t clock_port; tells us there's a mach port already initialized for monotonic time. This clock port can be accessed directly without having to resort to calling mach_absolute_time then converting back to a struct timespec. Bypassing a call to mach_absolute_time should improve performance.
I created a small Github repo (PosixMachTiming) with the code based on the extern clock_port and a similar thread. PosixMachTiming emulates clock_gettime for CLOCK_REALTIME and CLOCK_MONOTONIC. It also emulates the function clock_nanosleep for absolute monotonic time. Please give it a try and see how the performance compares. Maybe you might want to create comparative tests or emulate other POSIX clocks/functions?
At least as far back as Mountain Lion, mach_absolute_time() returns nanoseconds and not raw absolute time units (which used to be the number of bus cycles).
The following code on my MacBook Pro (2 GHz Core i7) showed that the time to call mach_absolute_time() averaged 39 ns over 10 runs (min 35, max 45), which is essentially the time between the returns of the two calls to mach_absolute_time(), i.e. the cost of roughly one invocation:
#include <stdint.h>
#include <mach/mach_time.h>
#include <iostream>
using namespace std;
int main()
{
uint64_t now, then;
uint64_t abs;
then = mach_absolute_time(); // return nanoseconds
now = mach_absolute_time();
abs = now - then;
cout << "nanoseconds = " << abs << endl;
}
void clock_get_uptime(uint64_t *result);
void clock_get_system_microtime( uint32_t *secs,
uint32_t *microsecs);
void clock_get_system_nanotime( uint32_t *secs,
uint32_t *nanosecs);
void clock_get_calendar_microtime( uint32_t *secs,
uint32_t *microsecs);
void clock_get_calendar_nanotime( uint32_t *secs,
uint32_t *nanosecs);
For macOS you can find good information on the Apple developer pages:
https://developer.apple.com/library/content/documentation/Darwin/Conceptual/KernelProgramming/services/services.html
I found another portable solution.
Declare in some header file (or even in your source file):
/* If compiled on DARWIN/Apple platforms. */
#ifdef DARWIN
#define CLOCK_REALTIME 0x2d4e1588
#define CLOCK_MONOTONIC 0x0
#endif /* DARWIN */
And the add the function implementation:
#ifdef DARWIN
/*
* Below we provide an alternative for clock_gettime,
* which is not implemented in Mac OS X.
*/
static inline int clock_gettime(int clock_id, struct timespec *ts)
{
struct timeval tv;
if (clock_id != CLOCK_REALTIME)
{
errno = EINVAL;
return -1;
}
if (gettimeofday(&tv, NULL) < 0)
{
return -1;
}
ts->tv_sec = tv.tv_sec;
ts->tv_nsec = tv.tv_usec * 1000;
return 0;
}
#endif /* DARWIN */
Don't forget to include <time.h>, <sys/time.h> (for gettimeofday) and <errno.h> (for EINVAL).
