Execution time and gettimeofday on ubuntu / i7

Execution time and gettimeofday on ubuntu / i7 - c

To measure execution time of a function in C,how accurate is the POSIX function gettimeofday() in Ubuntu 12.04 on an Intel i7 ? And why ? If its hard to say,then how to find out ? I can't find a straight answer on this.

If you are building a stopwatch, you want a consistent clock that does not adjust to any time servers.
#include <time.h>
...
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
If CLOCK_MONOTONIC_RAW does not exist, use CLOCK_MONOTONIC instead. If you want time of day, you still should not use gettimeofday.
clock_gettime(CLOCK_REALTIME, &ts);
The timer for such clocks is high resolution, but it shifts. The timespec supports resolution as small as nanoseconds, but your computer will likely only be able to measure somewhere in the microseconds.
gettimeofday returns a timeval, which has a resolution in microseconds.

gettimeofday should be accurate enough, at least for comparing relative performance. But, in order to smooth out inaccuracies due to context switching and IO related delays you should run the function, say, at least hundreds of times and get the average.
You can tune the number of times you need to execute the function by making sure you get approximately the same result each time you benchmark the function.
Also see this answer to measure the accuracy of gettimeofday in your system.

You can use gettimeofday() at start and end of code to find the difference between starting time and finishing time of program. like
//start = gettimeofday
//your code here
//end = gettimeofday
Execution time will be difference between the 2
But it depends on how much load is one the machine is at that point. So not a reliable method.
You can also use:
time ./a.out
assuming your program is a.out

Related

Precise Linux Timing - What Determines the Resolution of clock_gettime()?

I need to do precision timing to the 1 us level to time a change in duty cycle of a pwm wave.
Background
I am using a Gumstix Over Water COM (https://www.gumstix.com/store/app.php/products/265/) that has a single core ARM Cortex-A8 processor running at 499.92 BogoMIPS (the Gumstix page claims up to 1Ghz with 800Mhz recommended) according to /proc/cpuinfo. The OS is an Angstrom Image version of Linux based of kernel version 2.6.34 and it is stock on the Gumstix Water COM.
The Problem
I have done a fair amount of reading about precise timing in Linux (and have tried most of it) and the consensus seems to be that using clock_gettime() and referencing CLOCK_MONOTONIC is the best way to do it. (I would have liked to use the RDTSC register for timing since I have one core with minimal power saving abilities but this is not an Intel processor.) So here is the odd part, while clock_getres() returns 1, suggesting resolution at 1 ns, actual timing tests suggest a minimum resolution of 30517ns or (it can't be coincidence) exactly the time between a 32.768KHz clock ticks. Here's what I mean:
// Stackoverflow example
#include <stdio.h>
#include <time.h>
#define SEC2NANOSEC 1000000000
int main( int argc, const char* argv[] )
{
// //////////////// Min resolution test //////////////////////
struct timespec resStart, resEnd, ts;
ts.tv_sec = 0; // s
ts.tv_nsec = 1; // ns
int iters = 100;
double resTime,sum = 0;
int i;
for (i = 0; i<iters; i++)
{
clock_gettime(CLOCK_MONOTONIC, &resStart); // start timer
// clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, &ts);
clock_gettime(CLOCK_MONOTONIC, &resEnd); // end timer
resTime = ((double)resEnd.tv_sec*SEC2NANOSEC + (double)resEnd.tv_nsec
- ((double)resStart.tv_sec*SEC2NANOSEC + (double)resStart.tv_nsec);
sum = sum + resTime;
printf("resTime = %f\n",resTime);
}
printf("Average = %f\n",sum/(double)iters);
}
(Don't fret over the double casting, tv_sec in a time_t and tv_nsec is a long.)
Compile with:
gcc soExample.c -o runSOExample -lrt
Run with:
./runSOExample
With the nanosleep commented out as shown, the result is either 0ns or 30517ns with the majority being 0ns. This leads me to believe that CLOCK_MONOTONIC is updated at 32.768kHz and most of the time the clock has not been updated before the second clock_gettime() call is made and in cases where the result is 30517ns the clock has been updated between calls.
When I do the same thing on my development computer (AMD FX(tm)-6100 Six-Core Processor running at 1.4 GHz) the minimum delay is a more constant 149-151ns with no zeros.
So, let's compare those results to the CPU speeds. For the Gumstix, that 30517ns (32.768kHz) equates to 15298 cycles of the 499.93MHz cpu. For my dev computer that 150ns equates to 210 cycles of the 1.4Ghz CPU.
With the clock_nanosleep() call uncommented the average results are these:
Gumstix: Avg value = 213623 and the result varies, up and down, by multiples of that min resolution of 30517ns
Dev computer: 57710-68065 ns with no clear trend. In the case of the dev computer I expect the resolution to actually be at the 1 ns level and the measured ~150ns truly is the time elapsed between the two clock_gettime() calls.
So, my question's are these:
What determines that minimum resolution?
Why is the resolution of the dev computer 30000X better than the Gumstix when the processor is only running ~2.6X faster?
Is there a way to change how often CLOCK_MONOTONIC is updated and where? In the kernel?
Thanks! If you need more info or clarification just ask.

As I understand, the difference between two environments(Gumstix and your Dev-computer) might be the underlying timer h/w they are using.
Commented nanosleep() case:
You are using clock_gettime() twice. To give you a rough idea of what this clock_gettime() will ultimately get mapped to(in kernel):
clock_gettime -->clock_get() -->posix_ktime_get_ts -->ktime_get_ts() -->timekeeping_get_ns()
-->clock->read()
clock->read() basically reads the value of the counter provided by underlying timer driver and corresponding h/w. A simple difference with stored value of the counter in the past and current counter value and then nanoseconds conversion mathematics will yield you the nanoseconds elapsed and will update the time-keeping data structures in kernel.
For example , if you have a HPET timer which gives you a 10 MHz clock, the h/w counter will get updated at 100 ns time interval.
Lets say, on first clock->read(), you get a counter value of X.
Linux Time-keeping data structures will read this value of X, get the difference 'D'compared to some old stored counter value.Do some counter-difference 'D' to nanoseconds 'n' conversion mathematics, update the data-structure by 'n'
Yield this new time value to the user space.
When second clock->read() is issued, it will again read the counter and update the time.
Now, for a HPET timer, this counter is getting updated every 100ns and hence , you will see this difference being reported to the user-space.
Now, Let's replace this HPET timer with a slow 32.768 KHz clock. Now , clock->read()'s counter will updated only after 30517 ns seconds, so, if you second call to clock_gettime() is before this period, you will get 0(which is majority of the cases) and in some cases, your second function call will be placed after counter has incremented by 1, i.e 30517 ns has elapsed. Hence , the value of 30517 ns sometimes.
Uncommented Nanosleep() case:
Let's trace the clock_nanosleep() for monotonic clocks:
clock_nanosleep() -->nsleep --> common_nsleep() -->hrtimer_nanosleep() -->do_nanosleep()
do_nanosleep() will simply put the current task in INTERRUPTIBLE state, will wait for the timer to expire(which is 1 ns) and then set the current task in RUNNING state again. You see, there are lot of factors involved now, mainly when your kernel thread (and hence the user space process) will be scheduled again. Depending on your OS, you will always face some latency when your doing a context-switch and this is what we observe with the average values.
Now Your questions:
What determines that minimum resolution?
I think the resolution/precision of your system will depend on the underlying timer hardware being used(assuming your OS is able to provide that precision to the user space process).
*Why is the resolution of the dev computer 30000X better than the Gumstix when the processor is only running ~2.6X faster?*
Sorry, I missed you here. How it is 30000x faster? To me , it looks like something 200x faster(30714 ns/ 150 ns ~ 200X ? ) .But anyway, as I understand, CPU speed may or may not have to do with the timer resolution/precision. So, this assumption may be right in some architectures(when you are using TSC H/W), though, might fail in others(using HPET, PIT etc).
Is there a way to change how often CLOCK_MONOTONIC is updated and where? In the kernel?
you can always look into the kernel code for details(that's how i looked into it).
In linux kernel code , look for these source files and Documentation:
kernel/posix-timers.c
kernel/hrtimer.c
Documentation/timers/hrtimers.txt

I do not have gumstix on hand, but it looks like your clocksource is slow.
run:
$ dmesg | grep clocksource
If you get back
[ 0.560455] Switching to clocksource 32k_counter
This might explain why your clock is so slow.
In the recent kernels there is a directory /sys/devices/system/clocksource/clocksource0 with two files: available_clocksource and current_clocksource. If you have this directory, try switching to a different source by echo'ing its name into second file.

measuring time of multi-threading program

I measured time with function clock() but it gave bad results. I mean it gives the same results for program with one thread and for the same program running with OpenMP with many threads. But in fact, I notice with my watch that with many threads program counts faster.
So I need some wall-clock timer...
My question is: What is better function for this issue?
clock_gettime() or mb gettimeofday() ? or mb something else?
if clock_gettime(),then with which clock? CLOCK_REALTIME or CLOCK_MONOTONIC?
using mac os x (snow leopard)

If you want wall-clock time, and clock_gettime() is available, it's a good choice. Use it with CLOCK_MONOTONIC if you're measuring intervals of time, and CLOCK_REALTIME to get the actual time of day.
CLOCK_REALTIME gives you the actual time of day, but is affected by adjustments to the system time -- so if the system time is adjusted while your program runs that will mess up measurements of intervals using it.
CLOCK_MONOTONIC doesn't give you the correct time of day, but it does count at the same rate and is immune to changes to the system time -- so it's ideal for measuring intervals, but useless when correct time of day is needed for display or for timestamps.

I think clock() counts the total CPU usage among all threads, I had this problem too...
The choice of wall-clock timing method is personal preference. I use an inline wrapper function to take time-stamps (take the difference of 2 time-stamps to time your processing). I've used floating point for convenience (units are in seconds, don't have to worry about integer overflow). With multi-threading, there are so many asynchronous events that in my opinion it doesn't make sense to time below 1 microsecond. This has worked very well for me so far :)
Whatever you choose, a wrapper is the easiest way to experiment
inline double my_clock(void) {
struct timeval t;
gettimeofday(&t, NULL);
return (1.0e-6*t.tv_usec + t.tv_sec);
}
usage:
double start_time, end_time;
start_time = my_clock();
//some multi-threaded processing
end_time = my_clock();
printf("time is %lf\n", end_time-start_time);

Use clock() to count program execution time

I'm using something like this to count how long does it takes my program from start to finish:
int main(){
clock_t startClock = clock();
.... // many codes
clock_t endClock = clock();
printf("%ld", (endClock - startClock) / CLOCKS_PER_SEC);
}
And my question is, since there are multiple process running at the same time, say if for x amount of time my process is in idle, durning that time will clock tick within my program?
So basically my concern is, say there's 1000 clock cycle passed by, but my process only uses 500 of them, will I get 500 or 1000 from (endClock - startClock)?
Thanks.

This depends on the OS. On Windows, clock() measures wall-time. On Linux/Posix, it measures the combined CPU time of all the threads.
If you want wall-time on Linux, you should use gettimeofday().
If you want CPU-time on Windows, you should use GetProcessTimes().
EDIT:
So if you're on Windows, clock() will measure idle time.
On Linux, clock() will not measure idle time.

clock on POSIX measures cpu time, but it usually has extremely poor resolution. Instead, modern programs should use clock_gettime with the CLOCK_PROCESS_CPUTIME_ID clock-id. This will give up to nanosecond-resolution results, and usually it's really just about that good.

As per the definition on the man page (in Linux),
The clock() function returns an approximation of processor time used
by the program.
it will try to be as accurate a possible, but as you say, some time (process switching, for example) is difficult to account to a process, so the numbers will be as accurate as possible, but not perfect.

faster equivalent of gettimeofday

In trying to build a very latency sensitive application, that needs to send 100s of messages a seconds, each message having the time field, we wanted to consider optimizing gettimeofday.
Out first thought was rdtsc based optimization. Any thoughts ? Any other pointers ?
Required accurancy of the time value returned is in milliseconds, but it isn't a big deal if the value is occasionally out of sync with the receiver for 1-2 milliseconds.
Trying to do better than the 62 nanoseconds gettimeofday takes

POSIX Clocks
I wrote a benchmark for POSIX clock sources:
time (s) => 3 cycles
ftime (ms) => 54 cycles
gettimeofday (us) => 42 cycles
clock_gettime (ns) => 9 cycles (CLOCK_MONOTONIC_COARSE)
clock_gettime (ns) => 9 cycles (CLOCK_REALTIME_COARSE)
clock_gettime (ns) => 42 cycles (CLOCK_MONOTONIC)
clock_gettime (ns) => 42 cycles (CLOCK_REALTIME)
clock_gettime (ns) => 173 cycles (CLOCK_MONOTONIC_RAW)
clock_gettime (ns) => 179 cycles (CLOCK_BOOTTIME)
clock_gettime (ns) => 349 cycles (CLOCK_THREAD_CPUTIME_ID)
clock_gettime (ns) => 370 cycles (CLOCK_PROCESS_CPUTIME_ID)
rdtsc (cycles) => 24 cycles
These numbers are from an Intel Core i7-4771 CPU # 3.50GHz on Linux 4.0. These measurements were taken using the TSC register and running each clock method thousands of times and taking the minimum cost value.
You'll want to test on the machines you intend to run on though as how these are implemented varies from hardware and kernel version. The code can be found here. It relies on the TSC register for cycle counting, which is in the same repo (tsc.h).
TSC
Access the TSC (processor time-stamp counter) is the most accurate and cheapest way to time things. Generally, this is what the kernel is using itself. It's also quite straight-forward on modern Intel chips as the TSC is synchronized across cores and unaffected by frequency scaling. So it provides a simple, global time source. You can see an example of using it here with a walkthrough of the assembly code here.
The main issue with this (other than portability) is that there doesn't seem to be a good way to go from cycles to nanoseconds. The Intel docs as far as I can find state that the TSC runs at a fixed frequency, but that this frequency may differ from the processors stated frequency. Intel doesn't appear to provide a reliable way to figure out the TSC frequency. The Linux kernel appears to solve this by testing how many TSC cycles occur between two hardware timers (see here).
Memcached
Memcached bothers to do the cache method. It may simply be to make sure the performance is more predictable across platforms, or scale better with multiple cores. It may also no be a worthwhile optimization.

Have you actually benchmarked, and found gettimeofday to be unacceptably slow?
At the rate of 100 messages a second, you have 10ms of CPU time per message. If you have multiple cores, assuming it can be fully parallelized, you can easily increase that by 4-6x - that's 40-60ms per message! The cost of gettimeofday is unlikely to be anywhere near 10ms - I'd suspect it to be more like 1-10 microseconds (on my system, microbenchmarking it gives about 1 microsecond per call - try it for yourself). Your optimization efforts would be better spent elsewhere.
While using the TSC is a reasonable idea, modern Linux already has a userspace TSC-based gettimeofday - where possible, the vdso will pull in an implementation of gettimeofday that applies an offset (read from a shared kernel-user memory segment) to rdtsc's value, thus computing the time of day without entering the kernel. However, some CPU models don't have a TSC synchronized between different cores or different packages, and so this can end up being disabled. If you want high performance timing, you might first want to consider finding a CPU model that does have a synchronized TSC.
That said, if you're willing to sacrifice a significant amount of resolution (your timing will only be accurate to the last tick, meaning it could be off by tens of milliseconds), you could use CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE with clock_gettime. This is also implemented with the vdso as well, and guaranteed not to call into the kernel (for recent kernels and glibc).

Like bdonian says, if you're only sending a few hundred messages per second, gettimeofday is going to be fast enough.
However, if you were sending millions of messages per second, it might be different (but you should still measure that it is a bottleneck). In that case, you might want to consider something like this:
have a global variable, giving the current timestamp in your desired accuracy
have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time, then have the thread sleep some fraction of T and then update the timestamp; use real-time features if you need to)
all other threads (or the main process, if you don't use threads otherwise) just reads the global variable
The C language does not guarantee that you can read the timestamp value if it is larger than sig_atomic_t. You could use locking to deal with that, but locking is heavy. Instead, you could use a volatile sig_atomic_t typed variable to index an array of timestamps: the background thread updates the next element in the array, and then updates the index. The other threads read the index, and then read the array: they might get a tiny bit out-of-date timestamp (but they get the right one next time), but they do not run into the problem where they read the timestamp at the same time it is being updated, and get some bytes of the old value and some of the new value.
But all this is much overkill for just hundreds of messages per second.

Below is a benchmark. I see about 30ns. printTime() from rashad How to get current time and date in C++?
#include <string>
#include <iostream>
#include <sys/time.h>
using namespace std;
void printTime(time_t now)
{
struct tm tstruct;
char buf[80];
tstruct = *localtime(&now);
strftime(buf, sizeof(buf), "%Y-%m-%d.%X", &tstruct);
cout << buf << endl;
}
int main()
{
timeval tv;
time_t tm;
gettimeofday(&tv,NULL);
printTime((time_t)tv.tv_sec);
for(int i=0; i<100000000; i++)
gettimeofday(&tv,NULL);
gettimeofday(&tv,NULL);
printTime((time_t)tv.tv_sec);
printTime(time(NULL));
for(int i=0; i<100000000; i++)
tm=time(NULL);
printTime(time(NULL));
return 0;
}
3 sec for 100,000,000 calls or 30ns;
2014-03-20.09:23:35
2014-03-20.09:23:38
2014-03-20.09:23:38
2014-03-20.09:23:41

Do you need the millisecond precision? If not you could simply use time() and deal with the unix timestamp.

How can I find the execution time of a section of my program in C?

I'm trying to find a way to get the execution time of a section of code in C. I've already tried both time() and clock() from time.h, but it seems that time() returns seconds and clock() seems to give me milliseconds (or centiseconds?) I would like something more precise though. Is there a way I can grab the time with at least microsecond precision?
This only needs to be able to compile on Linux.

You referred to clock() and time() - were you looking for gettimeofday()?
That will fill in a struct timeval, which contains seconds and microseconds.
Of course the actual resolution is up to the hardware.

For what it's worth, here's one that's just a few macros:
#include <time.h>
clock_t startm, stopm;
#define START if ( (startm = clock()) == -1) {printf("Error calling clock");exit(1);}
#define STOP if ( (stopm = clock()) == -1) {printf("Error calling clock");exit(1);}
#define PRINTTIME printf( "%6.3f seconds used by the processor.", ((double)stopm-startm)/CLOCKS_PER_SEC);
Then just use it with:
main() {
START;
// Do stuff you want to time
STOP;
PRINTTIME;
}
From http://ctips.pbwiki.com/Timer

You want a profiler application.
Search keywords at SO and search engines: linux profiling

Have a look at gettimeofday,
clock_*, or get/setitimer.

Try "bench.h"; it lets you put a START_TIMER; and STOP_TIMER("name"); into your code, allowing you to arbitrarily benchmark any section of code (note: only recommended for short sections, not things taking dozens of milliseconds or more). Its accurate to the clock cycle, though in some rare cases it can change how the code in between is compiled, in which case you're better off with a profiler (though profilers are generally more effort to use for specific sections of code).
It only works on x86.

You might want to google for an instrumentation tool.

You won't find a library call which lets you get past the clock resolution of your platform. Either use a profiler (man gprof) as another poster suggested, or - quick & dirty - put a loop around the offending section of code to execute it many times, and use clock().

gettimeofday() provides you with a resolution of microseconds, whereas clock_gettime() provides you with a resolution of nanoseconds.
int clock_gettime(clockid_t clk_id, struct timespec *tp);
The clk_id identifies the clock to be used. Use CLOCK_REALTIME if you want a system-wide clock visible to all processes. Use CLOCK_PROCESS_CPUTIME_ID for per-process timer and CLOCK_THREAD_CPUTIME_ID for a thread-specific timer.

It depends on the conditions.. Profilers are nice for general global views however if you really need an accurate view my recommendation is KISS. Simply run the code in a loop such that it takes a minute or so to complete. Then compute a simple average based on the total run time and iterations executed.
This approach allows you to:
Obtain accurate results with low resolution timers.
Not run into issues where instrumentation interferes with high speed caches (l2,l1,branch..etc) close to the processor. However running the same code in a tight loop can also provide optimistic results that may not reflect real world conditions.

Don't know which enviroment/OS you are working on, but your timing may be inaccurate if another thread, task, or process preempts your timed code in the middle. I suggest exploring mechanisms such as mutexes or semaphores to prevent other threads from preemting your process.

If you are developing on x86 or x64 why not use the Time Stamp Counter: RDTSC.
It will be more reliable then Ansi C functions like time() or clock() as RDTSC is an atomic function. Using C functions for this purpose can introduce problems as you have no guarantee that the thread they are executing in will not be switched out and as a result the value they return will not be an accurate description of the actual execution time you are trying to measure.
With RDTSC you can better measure this. You will need to convert the tick count back into a human readable time H:M:S format which will depend on the processors clock frequency but google around and I am sure you will find examples.
However even with RDTSC you will be including the time your code was switched out of execution, while a better solution than using time()/clock() if you need an exact measurement you will have to turn to a profiler that will instrument your code and take into account when your code is not actually executing due to context switches or whatever.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight