Variable performance of busy wait loop? - c

I am evaluating the performance of a busy wait loop for firing events at consistent intervals. I have noticed some odd behavior using the following code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
int timespec_subtract(struct timespec *, struct timespec, struct timespec);
int main(int argc, char *argv[]) {
int iterations = atoi(argv[1])+1;
struct timespec t[2], diff;
for (int i = 0; i < iterations; i++) {
clock_gettime(CLOCK_MONOTONIC, &t[0]);
static volatile int i;
for (i = 0; i < 200000; i++)
;
clock_gettime(CLOCK_MONOTONIC, &t[1]);
timespec_subtract(&diff, t[1], t[0]);
printf("%ld\n", diff.tv_sec * 1000000000 + diff.tv_nsec);
}
}
On the test machine (dual 14-core E5-2683 v3 # 2.00Ghz, 256GB DDR4), 200k iterations of the for loop is approximately 1ms. Or maybe not:
1030854
1060237
1012797
1011479
1025307
1017299
1011001
1038725
1017361
... (about 700 lines later)
638466
638546
638446
640422
638468
638457
638468
638398
638493
640242
... (about 200 lines later)
606460
607013
606449
608813
606542
606484
606990
606436
606491
606466
... (about 3000 lines later)
404367
404307
404309
404306
404270
404370
404280
404395
404342
406005
When the times shift down the third time, they stay mostly consistent (within about 2 or 3 microseconds), except for occasionally jumping up to about 450us for a few hundred iterations. This behavior is repeatable on similar machines and over many runs.
I understand that busy loops can be optimized out by the compiler, but I don't think that's the issue here. I don't think cache should be affecting it, because no invalidation should be taking place, and wouldn't explain the sudden optimization. I also tried using a register int for the loop counter, with no noticeable effect.
Any thoughts on what is going on, and how to make this (more) consistent?
EDIT: For information, running this program with usleep, nanosleep, or the shown busy wait for 10k iterations all show ~20000 involuntary context switches with time -v.

I'd make 2 points
- Due to context swtiching sleep/usleep may sleep for more time than expected
- Moreover if there is some higher priority task like interrupts, there may come a situation when sleep may not be executed at all.
Thus if you want exact delay in your application you can use gettimeofday to calculate the time gap which can be subtracted from the delay in sleep/usleep call

One big issue with busy waiting is that, besides using up CPU resources, the amount of time you wait will be highly dependent on the CPU block speed. So the same loop can run for wildly different times on different machines.
The problem with any method of sleeping is that due to OS scheduling you may end up sleeping for longer than intended. The man pages for nanosleep says that it will use the rem argument to tell you the remaining time in case you received a signal, but it says nothing about waiting too long.
You need to grab the timestamp after each call to usleep so you know how long you actually slept for. If you slept too short, you add the deficit. If you slept too long, you subtract the overage.
Here's an example of how I did this in UFTP, a multicast file transfer application, in order to send packets at a consistent speed:
int64_t diff_usec(struct timeval t2, struct timeval t1)
{
return (t2.tv_usec - t1.tv_usec) +
(int64_t)1000000 * (t2.tv_sec - t1.tv_sec);
}
...
int32_t packet_wait = 10000;
int64_t overage = 0, tdiff;
struct timeval current_sent, last_sent;
gettimeofday(&last_sent, NULL);
while(...) {
...
if (packet_wait > overage) {
usleep(packet_wait - (int32_t)overage);
}
gettimeofday(&current_sent, NULL);
tdiff = diff_usec(current_sent, last_sent);
overage += tdiff - packet_wait;
last_sent = current_sent;
...
}

Related

Practical jitter with clock_nanosleep()

I'm trying to establish what practical jitter I can achieve by using clock_nanosleep() in a loop and through experimentation I'm observing something I'm not confident I understand.
I'm using code posted in this SO question by another user to benchmark performance, targeting a 250ms interval. I've observed that on my system the sleep function returns very consistently 10us late with only about 2us jitter the vast majority of the time (fairly narrow statistical distribution).
NOTE: I haven't collected data to present a plot of statistical distribution but casual qualitative description should hopefully suffice.
I decided to subtract the 10us offset from the target wakeup time to compensate for it, and this caused the average error to be approximately zero as expected, however the jitter increased dramatically - I would estimate most wakeups are >100us early/late, and much more widely distributed.
Why is this?
My theory is that with the 10us correction the target waketimes are less nicely aligned with the underlying hardware clock, but it would be helpful to get confirmation. If this is true, is there a method to synchronize the phase of the target waketimes with the hardware clock?
Manpages for clock_nanosleep(2) say: "Furthermore, after the
sleep completes, there may still be a delay before the CPU
becomes free to once again execute the calling thread."
I tried to comprehend your question. For this I created the source code below based on the reference at SO which you provided. I include the source code such that you or someone else can check it, test it, play with it.
The debug print refers to a sleep of exactly 1 second. The debug print is shorter than the print in the comments - and the debug print will always refer to the deviation from 1 second, no matter which wakeTime has been defined. Thus, it is possible, to try a reduced wakeTime (wakeTime.tv_nsec-= some_value;) to achieve the target of 1 second.
Conclusions:
I would generally agree to all you (davegravy) write about it in your post, except that I am seeing much higher delays and deviations.
There are minor changes in the delay between a non-loaded and a heavy loaded system (all CPUs 100% load). On heavy loaded system scattering of delay reduces and the average delay also reduces (on my system - but not very significant).
As expected, the delay changes quite a bit when I try it on another machine (as expected raspberry pi is worse :o).
For a specific machine and moment it is possible to define a correction value of nanoseconds to bring the average sleep closer to the target. Anyway, the correction value is not necessarily equal to the delay error without correction. And the correction value might be different for different machines.
Idea: As the provided code can measure how good it is. There might be the chance, that the code does a few loops from which it can derive an optimized delay correction value by itself. (This auto-correction might be interesting just from a theoretical point of view. Well, it is an idea.)
Idea 2: Or some correction values can be created just to avoid a long-term shift when considering many intervals, one after another.
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#define CLOCK CLOCK_MONOTONIC
//#define CLOCK CLOCK_REALTIME
//#define CLOCK CLOCK_TAI
//#define CLOCK CLOCK_BOOTTIME
static long calcTimeDiff(struct timespec const* t1, struct timespec const* t2)
{
long diff = t1->tv_nsec - t2->tv_nsec;
diff += 1000000000 * (t1->tv_sec - t2->tv_sec);
return diff;
}
static void* tickThread()
{
struct timespec sleepStart;
struct timespec currentTime;
struct timespec wakeTime;
long sleepTime;
long wakeDelay;
while(1)
{
clock_gettime(CLOCK, &wakeTime);
wakeTime.tv_sec += 1;
wakeTime.tv_nsec -= 0; // Value to play with for delay "correction"
clock_gettime(CLOCK, &sleepStart);
clock_nanosleep(CLOCK, TIMER_ABSTIME, &wakeTime, NULL);
clock_gettime(CLOCK, &currentTime);
sleepTime = calcTimeDiff(&currentTime, &sleepStart);
wakeDelay = calcTimeDiff(&currentTime, &wakeTime);
{
/*printf("sleep req=%-ld.%-ld start=%-ld.%-ld curr=%-ld.%-ld sleep=%-ld delay=%-ld\n",
(long) wakeTime.tv_sec, (long) wakeTime.tv_nsec,
(long) sleepStart.tv_sec, (long) sleepStart.tv_nsec,
(long) currentTime.tv_sec, (long) currentTime.tv_nsec,
sleepTime, wakeDelay);*/
// Debug Short Print with respect to target sleep = 1 sec. = 1000000000 ns
long debugTargetDelay=sleepTime-1000000000;
printf("sleep=%-ld delay=%-ld targetdelay=%-ld\n",
sleepTime, wakeDelay, debugTargetDelay);
}
}
}
int main(int argc, char*argv[])
{
tickThread();
}
Some output with wakeTime.tv_nsec -= 0;
sleep=1000095788 delay=96104 targetdelay=95788
sleep=1000078989 delay=79155 targetdelay=78989
sleep=1000080717 delay=81023 targetdelay=80717
sleep=1000068001 delay=68251 targetdelay=68001
sleep=1000080475 delay=80519 targetdelay=80475
sleep=1000110925 delay=110977 targetdelay=110925
sleep=1000082415 delay=82561 targetdelay=82415
sleep=1000079572 delay=79713 targetdelay=79572
sleep=1000098609 delay=98664 targetdelay=98609
and with wakeTime.tv_nsec -= 65000;
sleep=1000031711 delay=96987 targetdelay=31711
sleep=1000009400 delay=74611 targetdelay=9400
sleep=1000015867 delay=80912 targetdelay=15867
sleep=1000015612 delay=80708 targetdelay=15612
sleep=1000030397 delay=95592 targetdelay=30397
sleep=1000015299 delay=80475 targetdelay=15299
sleep=999993542 delay=58614 targetdelay=-6458
sleep=1000031263 delay=96310 targetdelay=31263
sleep=1000002029 delay=67169 targetdelay=2029
sleep=1000031671 delay=96821 targetdelay=31671
sleep=999998462 delay=63608 targetdelay=-1538
Anyway, the delays change all the time. I tried different CLOCK definitions and different compiler options, but without any special results.
Some statistics from further testing, sample size = 100 in both cases.
targetdelay from wakeTime.tv_nsec -= 0;
Mean value = 97503 Standard deviation = 27536
targetdelay from wakeTime.tv_nsec -= 97508;
Mean value = -1909 Standard deviation = 32682
In both cases, there were a few massive outliers, such that even this result from 100 samples might not quite be representative.

How to implement a timer for every second with zero nanosecond with liburing?

I noticed the io_uring kernel side uses CLOCK_MONOTONIC at CLOCK_MONOTONIC, so for the first timer, I get the time with both CLOCK_REALTIME and CLOCK_MONOTONIC and adjust the nanosecond like below and use IORING_TIMEOUT_ABS flag for io_uring_prep_timeout. iorn/clock.c at master · hnakamur/iorn
const long sec_in_nsec = 1000000000;
static int queue_timeout(iorn_queue_t *queue) {
iorn_timeout_op_t *op = calloc(1, sizeof(*op));
if (op == NULL) {
return -ENOMEM;
}
struct timespec rts;
int ret = clock_gettime(CLOCK_REALTIME, &rts);
if (ret < 0) {
fprintf(stderr, "clock_gettime CLOCK_REALTIME error: %s\n", strerror(errno));
return -errno;
}
long nsec_diff = sec_in_nsec - rts.tv_nsec;
ret = clock_gettime(CLOCK_MONOTONIC, &op->ts);
if (ret < 0) {
fprintf(stderr, "clock_gettime CLOCK_MONOTONIC error: %s\n", strerror(errno));
return -errno;
}
op->handler = on_timeout;
op->ts.tv_sec++;
op->ts.tv_nsec += nsec_diff;
if (op->ts.tv_nsec > sec_in_nsec) {
op->ts.tv_sec++;
op->ts.tv_nsec -= sec_in_nsec;
}
op->count = 1;
op->flags = IORING_TIMEOUT_ABS;
ret = iorn_prep_timeout(queue, op);
if (ret < 0) {
return ret;
}
return iorn_submit(queue);
}
From the second time, I just increment the second part tv_sec and use IORING_TIMEOUT_ABS flag for io_uring_prep_timeout.
Here is the output from my example program. The millisecond part is zero but it is about 400 microsecond later than just second.
on_timeout time=2020-05-10T14:49:42.000442
on_timeout time=2020-05-10T14:49:43.000371
on_timeout time=2020-05-10T14:49:44.000368
on_timeout time=2020-05-10T14:49:45.000372
on_timeout time=2020-05-10T14:49:46.000372
on_timeout time=2020-05-10T14:49:47.000373
on_timeout time=2020-05-10T14:49:48.000373
Could you tell me a better way than this?
Thanks for your comments! I'd like to update the current time for logging like ngx_time_update(). I modified my example to use just CLOCK_REALTIME, but still about 400 microseconds late. github.com/hnakamur/iorn/commit/… Does it mean clock_gettime takes about 400 nanoseconds on my machine?
Yes, that sounds about right, sort of. But, if you're on an x86 PC under linux, 400 ns for clock_gettime overhead may be a bit high (order of magnitude higher--see below). If you're on an arm CPU (e.g. Raspberry Pi, nvidia Jetson), it might be okay.
I don't know how you're getting 400 microseconds. But, I've had to do a lot of realtime stuff under linux, and 400 us is similar to what I've measured as the overhead to do a context switch and/or wakeup a process/thread after a syscall suspends it.
I never use gettimeofday anymore. I now just use clock_gettime(CLOCK_REALTIME,...) because it's the same except you get nanoseconds instead of microseconds.
Just so you know, although clock_gettime is a syscall, nowadays, on most systems, it uses the VDSO layer. The kernel injects special code into the userspace app, so that it is able to access the time directly without the overhead of a syscall.
If you're interested, you could run under gdb and disassemble the code to see that it just accesses some special memory locations instead of doing a syscall.
I don't think you need to worry about this too much. Just use clock_gettime(CLOCK_MONOTONIC,...) and set flags to 0. The overhead doesn't factor into this, for the purposes of the ioring call as your iorn layer is using it.
When I do this sort of thing, and I want/need to calculate the overhead of clock_gettime itself, I call clock_gettime in a loop (e.g. 1000 times), and try to keep the total time below a [possible] timeslice. I use the minimum diff between times in each iteration. That compensates for any [possible] timeslicing.
The minimum is the overhead of the call itself [on average].
There are additional tricks that you can do to minimize latency in userspace (e.g. raising process priority, clamping CPU affinity and I/O interrupt affinity), but they can involve a few more things, and, if you're not very careful, they can produce worse results.
Before you start taking extraordinary measures, you should have a solid methodology to measure timing/benchmarking to prove that your results can not meet your timing/throughput/latency requirements. Otherwise, you're doing complicated things for no real/measurable/necessary benefit.
Below is some code I just created, simplified, but based on code I already have/use to calibrate the overhead:
#include <stdio.h>
#include <time.h>
#define ITERMAX 10000
typedef long long tsc_t;
// tscget -- get time in nanoseconds
static inline tsc_t
tscget(void)
{
struct timespec ts;
tsc_t tsc;
clock_gettime(CLOCK_MONOTONIC,&ts);
tsc = ts.tv_sec;
tsc *= 1000000000;
tsc += ts.tv_nsec;
return tsc;
}
// tscsec -- convert nanoseconds to fractional seconds
double
tscsec(tsc_t tsc)
{
double sec;
sec = tsc;
sec /= 1e9;
return sec;
}
tsc_t
calibrate(void)
{
tsc_t tscbeg;
tsc_t tscold;
tsc_t tscnow;
tsc_t tscdif;
tsc_t tscmin;
int iter;
tscmin = 1LL << 62;
tscbeg = tscget();
tscold = tscbeg;
for (iter = ITERMAX; iter > 0; --iter) {
tscnow = tscget();
tscdif = tscnow - tscold;
if (tscdif < tscmin)
tscmin = tscdif;
tscold = tscnow;
}
tscdif = tscnow - tscbeg;
printf("MIN:%.9f TOT:%.9f AVG:%.9f\n",
tscsec(tscmin),tscsec(tscdif),tscsec(tscnow - tscbeg) / ITERMAX);
return tscmin;
}
int
main(void)
{
calibrate();
return 0;
}
On my system, a 2.67GHz Core i7, the output is:
MIN:0.000000019 TOT:0.000254999 AVG:0.000000025
So, I'm getting 25 ns overhead [and not 400 ns]. But, again, each system can be different to some extent.
UPDATE:
Note that x86 processors have "speed step". The OS can adjust the CPU frequency up or down semi-automatically. Lower speeds conserve power. Higher speeds are maximum performance.
This is done with a heuristic (e.g. if the OS detects that the process is a heavy CPU user, it will up the speed).
To force maximum speed, linux has this directory:
/sys/devices/system/cpu/cpuN/cpufreq
Where N is the cpu number (e.g. 0-7)
Under this directory, there are a number of files of interest. They should be self explanatory.
In particular, look at scaling_governor. It has either ondemand [kernel will adjust as needed] or performance [kernel will force maximum CPU speed].
To force maximum speed, as root, set this [once] to performance (e.g.):
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Do this for all cpus.
However, I just did this on my system, and it had little effect. So, the kernel's heuristic may have improved.
As to the 400us, when a process has been waiting on something, when it is "woken up", this is a two step process.
The process is marked "runnable".
At some point, the system/CPU does a reschedule. The process will be run, based upon the scheduling policy and the process priority in effect.
For many syscalls, the reschedule [only] occurs on the next system timer/clock tick/interrupt. So, for some, there can be a delay of up to a full clock tick (i.e.) for HZ value of 1000, this can be up to 1ms (1000 us) later.
On average, this is one half of HZ or 500 us.
For some syscalls, when the process is marked runnable, a reschedule is done immediately. If the process has a higher priority, it will be run immediately.
When I first looked at this [circa 2004], I looked at all code paths in the kernel, and the only syscall that did the immediate reschedule was SysV IPC, for msgsnd/msgrcv. That is, when process A did msgsnd, any process B waiting for the given message would be run.
But, others did not (e.g. futex). They would wait for the timer tick. A lot has changed since then, and now, more syscalls will do the immediate reschedule. For example, I recently measured futex [invoked via pthread_mutex_*], and it seemed to do the quick reschedule.
Also, the kernel scheduler has changed. The newer scheduler can wakeup/run some things on a fraction of a clock tick.
So, for you, the 400 us, is [possibly] the alignment to the next clock tick.
But, it could just be the overhead of doing the syscall. To test that, I modified my test program to open /dev/null [and/or /dev/zero], and added read(fd,buf,1) to the test loop.
I got a MIN: value of 529 us. So, the delay you're getting could just be the amount of time it takes to do the task switch.
This is what I would call "good enough for now".
To get "razor's edge" response, you'd probably have to write a custom kernel driver and have the driver do this. This is what embedded systems would do if (e.g.) they had to toggle a GPIO pin on every interval.
But, if all you're doing is printf, the overhead of printf and the underlying write(1,...) tends to swamp the actual delay.
Also, note that when you do printf, it builds the output buffer and when the buffer in FILE *stdout is full, it flushes via write.
For best performance, it's better to do int len = sprintf(buf,"current time is ..."); write(1,buf,len);
Also, when you do this, if the kernel buffers for TTY I/O get filled [which is quite possible given the high frequency of messages you're doing], the process will be suspended until the I/O has been sent to the TTY device.
To do this well, you'd have to watch how much space is available, and skip some messages if there isn't enough space to [wholy] contain them.
You'd need to do: ioctl(1,TIOCOUTQ,...) to get the available space and skip some messages if it is less than the size of the message you want to output (e.g. the len value above).
For your usage, you're probably more interested in the latest time message, rather than outputting all messages [which would eventually produce a lag]

Run code for x amount of time

To preface, I am on a Unix (linux) system using gcc.
What I am stuck on is how to accurately implement a way to run a section of code for a certain amount of time.
Here is an example of something I have been working with:
struct timeb start, check;
int64_t duration = 10000;
int64_t elapsed = 0;
ftime(&start);
while ( elapsed < duration ) {
// do a set of tasks
ftime(&check);
elapsed += ((check.time - start.time) * 1000) + (check.millitm - start.millitm);
}
I was thinking this would have carried on for 10000ms or 10 seconds, but it didn't, almost instantly. I was basing this off other questions such as How to get the time elapsed in C in milliseconds? (Windows) . But then I thought that if upon the first call of ftime, the struct is time = 1, millitm = 999 and on the second call time = 2, millitm = 01 it would be calculating the elapsed time as being 1002 milliseconds. Is there something I am missing?
Also the suggestions in the various stackoverflow questions, ftime() and gettimeofday(), are listed as deprecated or legacy.
I believe I could convert the start time into milliseconds, and the check time into millseconds, then subtract start from check. But milliseconds since the epoch requires 42 bits and I'm trying to keep everything in the loop as efficient as possible.
What approach could I take towards this?
Code is incorrect calculating elapsed time.
// elapsed += ((check.time - start.time) * 1000) + (check.millitm - start.millitm);
elapsed = ((check.time - start.time) * (int64_t)1000) + (check.millitm - start.millitm);
There is some concern about check.millitm - start.millitm. On systems with struct timeb *tp, it can be expected that the millitm will be promoted to int before subtraction occurs. So the difference will be in the range [-1000 ... 1000].
struct timeb {
time_t time;
unsigned short millitm;
short timezone;
short dstflag;
};
IMO, more robust code would handle ms conversion in a separate helper function. This matches OP's "I believe I could convert the start time into milliseconds, and the check time into millseconds, then subtract start from check."
int64_t timeb_to_ms(struct timeb *t) {
return (int64_t)t->time * 1000 + t->millitm;
}
struct timeb start, check;
ftime(&start);
int64_t start_ms = timeb_to_ms(&start);
int64_t duration = 10000 /* ms */;
int64_t elapsed = 0;
while (elapsed < duration) {
// do a set of tasks
struct timeb check;
ftime(&check);
elapsed = timeb_to_ms(&check) - start_ms;
}
If you want efficiency, let the system send you a signal when a timer expires.
Traditionally, you can set a timer with a resolution in seconds with the alarm(2) syscall.
The system then sends you a SIGALRM when the timer expires. The default disposition of that signal is to terminate.
If you handle the signal, you can longjmp(2) from the handler to another place.
I don't think it gets much more efficient than SIGALRM + longjmp (with an asynchronous timer, your code basically runs undisturbed without having to do any extra checks or calls).
Below is an example for you:
#define _XOPEN_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <setjmp.h>
static jmp_buf jmpbuf;
void hndlr();
void loop();
int main(){
/*sisv_signal handlers get reset after a signal is caught and handled*/
if(SIG_ERR==sysv_signal(SIGALRM,hndlr)){
perror("couldn't set SIGALRM handler");
return 1;
}
/*the handler will jump you back here*/
setjmp(jmpbuf);
if(0>alarm(3/*seconds*/)){
perror("couldn't set alarm");
return 1;
}
loop();
return 0;
}
void hndlr(){
puts("Caught SIGALRM");
puts("RESET");
longjmp(jmpbuf,1);
}
void loop(){
int i;
for(i=0; ; i++){
//print each 100-milionth iteration
if(0==i%100000000){
printf("%d\n", i);
}
}
}
If alarm(2) isn't enough, you can use timer_create(2) as EOF suggests.

Simple <Time.h> program takes large amount CPU

I was trying to familiarize myself with the C time.h library by writing something simple in VS. The following code simply prints the value of x added to itself every two seconds:
int main() {
time_t start = time(NULL);
time_t clock = time(NULL);
time_t clockTemp = time(NULL); //temporary clock
int x = 1;
//program will continue for a minute (60 sec)
while (clock <= start + 58) {
clockTemp = time(NULL);
if (clockTemp >= clock + 2) { //if 2 seconds has passed
clock = clockTemp;
x = ADD(x);
printf("%d at %d\n", x, timeDiff(start, clock));
}
}
}
int timeDiff(int start, int at) {
return at - start;
}
My concern is with the amount of CPU that this program takes, about 22%. I figure this problem stems from the constant updating of the clockTemp (just below the while statement), but I'm not sure how to fix this issue. Is it possible that this is a visual studio problem, or is there a special way to check for time?
Solution
the code needed the sleep function so that it wouldn't need to run constantly.
I added sleep with #include <windows.h> and put Sleep (2000) //2 second sleep at the end of the while
while (clock <= start + 58) {
...
Sleep(2000); }
The problem is not in the way you are checking the current time. The problem is that there is nothing to limit the frequency with which the loop runs. Your program continues to execute statements as quickly as it can, and eats up a ton of processor time. (In the absence of other programs, on a single-threaded CPU, it would use 100% of your processor time.)
You need to add a "sleep" method inside your loop, which will indicate to the processor that it can stop processing your program for a short period of time. There are many ways to do this; this question has some examples.

Run code for exactly one second

I would like to know how I can program something so that my program runs as long as a second lasts.
I would like to evaluate parts of my code and see where the time is spend most so I am analyzing parts of it.
Here's the interesting part of my code :
int size = 256
clock_t start_benching = clock();
for (uint32_t i = 0;i < size; i+=4)
{
myarray[i];
myarray[i+1];
myarray[i+2];
myarray[i+3];
}
clock_t stop_benching = clock();
This just gives me how long the function needed to perform all the operations.
I want to run the code for one second and see how many operations have been done.
This is the line to print the time measurement:
printf("Walking through buffer took %f seconds\n", (double)(stop_benching - start_benching) / CLOCKS_PER_SEC);
A better approach to benchmarking is to know the % of time spent on each section of the code.
Instead of making your code run for exactly 1 second, make stop_benchmarking - start_benchmarking the total run time - Take the time spent on any part of the code and divide by the total runtime to get a value between 0 and 1. Multiply this value by 100 and you have the % of time consumed at that specific section.
Non-answer advice: Use an actual profiler to profile the performance of code sections.
On *nix you can set an alarm(2) with a signal handler that sets a global flag to indicate the elapsed time. The Windows API provides something similar with SetTimer.
#include <unistd.h>
#include <signal.h>
int time_elapsed = 0;
void alarm_handler(int signal) {
time_elapsed = 1;
}
int main() {
signal(SIGALRM, &alarm_handler);
alarm(1); // set alarm time-out to 1 second
do {
// stuff...
} while (!time_elapsed);
return 0;
}
In more complicated cases you can use setitimer(2) instead of alarm(2), which lets you
use microsecond precision and
choose between counting
wall clock time,
user CPU time, or
user and system CPU time.

Resources