Using Time stamp counter to get the time stamp - timer

I have used the below code to get the clock cycle of the processor
unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
I get some value, say 43, but what is the unit here? Is it in microseconds or nanoseconds?
I used the command below to get the frequency of my board.
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
1700000
I also used the command below to find my processor speed
dmidecode -t processor | grep "Speed"
Max Speed: 3700 MHz
Current Speed: 3700 MHz
Now how do I use the above frequency to convert the value to microseconds or milliseconds?

A simple answer to the stated question, "how do I convert the TSC frequency to microseconds or milliseconds?" is: You do not. The actual TSC (Time Stamp Counter) frequency depends on the hardware, and on some hardware it can vary at run time. To measure real time in Linux, use clock_gettime(CLOCK_REALTIME) or clock_gettime(CLOCK_MONOTONIC).
As Peter Cordes mentioned in a comment (Aug 2018), on most current x86-64 architectures the Time Stamp Counter (accessed by the RDTSC instruction and the __rdtsc() function declared in <x86intrin.h>) counts reference clock cycles, not CPU clock cycles. His answer to a similar question about C++ is also valid for C in Linux on x86-64, because the compiler provides the underlying built-in when compiling C or C++, and the rest of the answer deals with the hardware details. I recommend reading that one, too.
The rest of this answer assumes the underlying issue is microbenchmarking code, to find out how two implementations of some function compare to each other.
On x86 (Intel 32-bit) and x86-64 (AMD64, Intel and AMD 64-bit) architectures, you can use __rdtsc() from <x86intrin.h> to find out the number of TSC clock cycles elapsed. This can be used to measure and compare the number of cycles used by different implementations of some function, typically by repeating the measurement a large number of times.
Do note that there are hardware differences in how the TSC clock is related to the CPU clock. The abovementioned more recent answer goes into some detail on that. For practical purposes in Linux, it is sufficient to use cpufreq-set to disable frequency scaling (to ensure the relationship between the CPU and TSC frequencies does not change during microbenchmarking), and optionally taskset to restrict the microbenchmark to specific CPU core(s). That ensures that the results gathered in that microbenchmark can be compared to each other.
(As Peter Cordes commented, we also want to add _mm_lfence() from <emmintrin.h> (included by <immintrin.h>). This ensures that the CPU does not internally reorder the RDTSC operation compared to the function to be benchmarked. You can use -DNO_LFENCE at compile time to omit those, if you want.)
Let's say you have functions void foo(void); and void bar(void); that you wish to compare:
#include <stdlib.h>
#include <x86intrin.h>
#include <stdio.h>
#ifdef NO_LFENCE
#define lfence()
#else
#include <emmintrin.h>
#define lfence() _mm_lfence()
#endif
static int cmp_ull(const void *aptr, const void *bptr)
{
const unsigned long long a = *(const unsigned long long *)aptr;
const unsigned long long b = *(const unsigned long long *)bptr;
return (a < b) ? -1 :
(a > b) ? +1 : 0;
}
unsigned long long *measure_cycles(size_t count, void (*func)())
{
unsigned long long *elapsed, started, finished;
size_t i;
elapsed = malloc((count + 2) * sizeof elapsed[0]);
if (!elapsed)
return NULL;
/* Call func() count times, measuring the TSC cycles for each call. */
for (i = 0; i < count; i++) {
/* First, let's ensure our CPU executes everything thus far. */
lfence();
/* Start timing. */
started = __rdtsc();
/* Ensure timing starts before we call the function. */
lfence();
/* Call the function. */
func();
/* Ensure everything has been executed thus far. */
lfence();
/* Stop timing. */
finished = __rdtsc();
/* Ensure we have the counter value before proceeding. */
lfence();
elapsed[i] = finished - started;
}
/* The very first call is likely the cold-cache case,
so in case that measurement might contain useful
information, we put it at the end of the array.
We also terminate the array with a zero. */
elapsed[count] = elapsed[0];
elapsed[count + 1] = 0;
/* Sort the cycle counts. */
qsort(elapsed, count, sizeof elapsed[0], cmp_ull);
/* This function returns all cycle counts, in sorted order,
although the median, elapsed[count/2], is the one
I personally use. */
return elapsed;
}
void benchmark(const size_t count)
{
unsigned long long *foo_cycles, *bar_cycles;
if (count < 1)
return;
printf("Measuring run time in Time Stamp Counter cycles:\n");
fflush(stdout);
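foo_cycles = measure_cycles(count, foo);
bar_cycles = measure_cycles(count, bar);
if (!foo_cycles || !bar_cycles) {
/* measure_cycles() returns NULL when out of memory; free(NULL) is a no-op. */
fprintf(stderr, "Out of memory.\n");
free(bar_cycles);
free(foo_cycles);
return;
}
printf("foo(): %llu cycles (median of %zu calls)\n", foo_cycles[count/2], count);
printf("bar(): %llu cycles (median of %zu calls)\n", bar_cycles[count/2], count);
free(bar_cycles);
free(foo_cycles);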
}
Note that the above results are very specific to the compiler and compiler options used, and of course to the hardware it is run on. The median number of cycles can be interpreted as "the typical number of TSC cycles taken", because the measurement is not completely reliable (it may be affected by events outside the process; for example, by context switches, or by migration to another core on some CPUs). For the same reason, I don't trust the minimum, maximum, or average values.
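To illustrate why the median is preferred here, the following small sketch computes both the median and the mean of a set of cycle counts. The helper names (cmp_ull_asc, median_ull, mean_ull) are made up for this illustration and are not part of the code above.

```c
#include <stdlib.h>

/* Comparison callback for qsort(): ascending unsigned long long. */
static int cmp_ull_asc(const void *aptr, const void *bptr)
{
    const unsigned long long a = *(const unsigned long long *)aptr;
    const unsigned long long b = *(const unsigned long long *)bptr;
    return (a < b) ? -1 : (a > b) ? +1 : 0;
}

/* Sorts samples[] in place and returns the middle element
   (the upper median when n is even). n must be greater than zero. */
unsigned long long median_ull(unsigned long long *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_ull_asc);
    return samples[n / 2];
}

/* Arithmetic mean of the same samples, for comparison. */
double mean_ull(const unsigned long long *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}
```

With the invented samples 100, 101, 99, 100 and 5000 (the last one standing in for a call that was interrupted), the median is 100 while the mean is 1080: a single disturbed measurement dominates the mean but leaves the median untouched.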
However, the two implementations' (foo() and bar()) cycle counts above can be compared to find out how their performance compares to each other, in a microbenchmark. Just remember that microbenchmark results may not extend to real work tasks, because of how complex tasks' resource use interactions are. One function might be superior in all microbenchmarks, but poorer than others in real world, because it is only efficient when it has lots of CPU cache to use, for example.
In Linux in general, you can use the CLOCK_REALTIME clock to measure real time (wall clock time) used, in the very same manner as above. CLOCK_MONOTONIC is even better, because it is not affected by direct changes to the realtime clock that the administrator might make (say, if they noticed the system clock is ahead or behind); only drift adjustments due to NTP etc. are applied. Daylight saving time or changes thereof do not affect the measurements, using either clock. Again, the median of a number of measurements is the result I seek, because events outside the measured code itself can affect the result.
For example:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#ifdef NO_LFENCE
#define lfence()
#else
#include <emmintrin.h>
#define lfence() _mm_lfence()
#endif
static int cmp_double(const void *aptr, const void *bptr)
{
const double a = *(const double *)aptr;
const double b = *(const double *)bptr;
return (a < b) ? -1 :
(a > b) ? +1 : 0;
}
double median_seconds(const size_t count, void (*func)())
{
struct timespec started, stopped;
double *seconds, median;
size_t i;
seconds = malloc(count * sizeof seconds[0]);
if (!seconds)
return -1.0;
for (i = 0; i < count; i++) {
lfence();
clock_gettime(CLOCK_MONOTONIC, &started);
lfence();
func();
lfence();
clock_gettime(CLOCK_MONOTONIC, &stopped);
lfence();
seconds[i] = (double)(stopped.tv_sec - started.tv_sec)
+ (double)(stopped.tv_nsec - started.tv_nsec) / 1000000000.0;
}
qsort(seconds, count, sizeof seconds[0], cmp_double);
median = seconds[count / 2];
free(seconds);
return median;
}
static double realtime_precision(void)
{
struct timespec t;
/* Query the same clock that median_seconds() measures with. */
if (clock_getres(CLOCK_MONOTONIC, &t) == 0)
return (double)t.tv_sec
+ (double)t.tv_nsec / 1000000000.0;
return 0.0;
}
void benchmark(const size_t count)
{
double median_foo, median_bar;
if (count < 1)
return;
printf("Median wall clock times over %zu calls:\n", count);
fflush(stdout);
median_foo = median_seconds(count, foo);
median_bar = median_seconds(count, bar);
printf("foo(): %.3f ns\n", median_foo * 1000000000.0);
printf("bar(): %.3f ns\n", median_bar * 1000000000.0);
printf("(Measurement unit is approximately %.3f ns)\n", 1000000000.0 * realtime_precision());
fflush(stdout);
}
In general, I personally prefer to compile the benchmarked function in a separate compilation unit (to a separate object file), and also benchmark a do-nothing function to estimate the function call overhead. That tends to overestimate the overhead, because part of the function call overhead is latency rather than time actually spent, and the real functions can do useful work during those latencies.
It is important to remember that the above measurements should only be used as indications, because in a real world application, things like cache locality (especially on current machines, with multi-level caching, and lots of memory) hugely affect the time used by different implementations.
For example, you might compare the speeds of a quicksort and a radix sort. Depending on the size of the keys, the radix sort requires rather large extra arrays (and uses a lot of cache). If the real application the sort routine is used in does not simultaneously use a lot of other memory (and thus the sorted data is basically what is cached), then a radix sort will be faster if there is enough data (and the implementation is sane). However, if the application is multithreaded, and the other threads shuffle (copy or transfer) a lot of memory around, then the radix sort using a lot of cache will evict other data also cached; even though the radix sort function itself does not show any serious slowdown, it may slow down the other threads and therefore the overall program, because the other threads have to wait for their data to be re-cached.
This means that the only "benchmarks" you should trust are wall clock measurements made on the actual hardware, running actual work tasks with actual work data. Everything else is subject to many conditions, and is more or less suspect: indications, yes, but not very reliable.

Related

Practical jitter with clock_nanosleep()

I'm trying to establish what practical jitter I can achieve by using clock_nanosleep() in a loop and through experimentation I'm observing something I'm not confident I understand.
I'm using code posted in this SO question by another user to benchmark performance, targeting a 250ms interval. I've observed that on my system the sleep function returns very consistently 10us late with only about 2us jitter the vast majority of the time (fairly narrow statistical distribution).
NOTE: I haven't collected data to present a plot of statistical distribution but casual qualitative description should hopefully suffice.
I decided to subtract the 10us offset from the target wakeup time to compensate for it, and this caused the average error to be approximately zero as expected, however the jitter increased dramatically - I would estimate most wakeups are >100us early/late, and much more widely distributed.
Why is this?
My theory is that with the 10us correction the target waketimes are less nicely aligned with the underlying hardware clock, but it would be helpful to get confirmation. If this is true, is there a method to synchronize the phase of the target waketimes with the hardware clock?
Manpages for clock_nanosleep(2) say: "Furthermore, after the
sleep completes, there may still be a delay before the CPU
becomes free to once again execute the calling thread."
I tried to comprehend your question. For this I created the source code below based on the reference at SO which you provided. I include the source code such that you or someone else can check it, test it, play with it.
The debug print refers to a sleep of exactly 1 second. The debug print is shorter than the print in the comments, and it will always refer to the deviation from 1 second, no matter which wakeTime has been defined. Thus, it is possible to try a reduced wakeTime (wakeTime.tv_nsec -= some_value;) to achieve the target of 1 second.
Conclusions:
I would generally agree to all you (davegravy) write about it in your post, except that I am seeing much higher delays and deviations.
There are minor changes in the delay between an unloaded and a heavily loaded system (all CPUs at 100% load). On the heavily loaded system the scatter of the delay decreases, and the average delay also decreases (on my system, though not very significantly).
As expected, the delay changes quite a bit when I try it on another machine (as expected raspberry pi is worse :o).
For a specific machine and moment it is possible to define a correction value of nanoseconds to bring the average sleep closer to the target. Anyway, the correction value is not necessarily equal to the delay error without correction. And the correction value might be different for different machines.
Idea: Since the provided code can measure its own error, it could run a few loops first and derive an optimized delay correction value by itself. (This auto-correction might be interesting just from a theoretical point of view. Well, it is an idea.)
Idea 2: Or some correction values can be created just to avoid a long-term shift when considering many intervals, one after another.
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h> /* clock_gettime(), clock_nanosleep(), struct timespec */
#define CLOCK CLOCK_MONOTONIC
//#define CLOCK CLOCK_REALTIME
//#define CLOCK CLOCK_TAI
//#define CLOCK CLOCK_BOOTTIME
static long calcTimeDiff(struct timespec const* t1, struct timespec const* t2)
{
long diff = t1->tv_nsec - t2->tv_nsec;
diff += 1000000000 * (t1->tv_sec - t2->tv_sec);
return diff;
}
static void* tickThread()
{
struct timespec sleepStart;
struct timespec currentTime;
struct timespec wakeTime;
long sleepTime;
long wakeDelay;
while(1)
{
clock_gettime(CLOCK, &wakeTime);
wakeTime.tv_sec += 1;
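wakeTime.tv_nsec -= 0; // Value to play with for delay "correction"
if (wakeTime.tv_nsec < 0) { // keep tv_nsec valid after subtracting a correction
wakeTime.tv_nsec += 1000000000;
wakeTime.tv_sec -= 1;
}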
clock_gettime(CLOCK, &sleepStart);
clock_nanosleep(CLOCK, TIMER_ABSTIME, &wakeTime, NULL);
clock_gettime(CLOCK, &currentTime);
sleepTime = calcTimeDiff(&currentTime, &sleepStart);
wakeDelay = calcTimeDiff(&currentTime, &wakeTime);
{
/*printf("sleep req=%-ld.%-ld start=%-ld.%-ld curr=%-ld.%-ld sleep=%-ld delay=%-ld\n",
(long) wakeTime.tv_sec, (long) wakeTime.tv_nsec,
(long) sleepStart.tv_sec, (long) sleepStart.tv_nsec,
(long) currentTime.tv_sec, (long) currentTime.tv_nsec,
sleepTime, wakeDelay);*/
// Debug Short Print with respect to target sleep = 1 sec. = 1000000000 ns
long debugTargetDelay=sleepTime-1000000000;
printf("sleep=%-ld delay=%-ld targetdelay=%-ld\n",
sleepTime, wakeDelay, debugTargetDelay);
}
}
}
int main(int argc, char*argv[])
{
tickThread();
}
Some output with wakeTime.tv_nsec -= 0;
sleep=1000095788 delay=96104 targetdelay=95788
sleep=1000078989 delay=79155 targetdelay=78989
sleep=1000080717 delay=81023 targetdelay=80717
sleep=1000068001 delay=68251 targetdelay=68001
sleep=1000080475 delay=80519 targetdelay=80475
sleep=1000110925 delay=110977 targetdelay=110925
sleep=1000082415 delay=82561 targetdelay=82415
sleep=1000079572 delay=79713 targetdelay=79572
sleep=1000098609 delay=98664 targetdelay=98609
and with wakeTime.tv_nsec -= 65000;
sleep=1000031711 delay=96987 targetdelay=31711
sleep=1000009400 delay=74611 targetdelay=9400
sleep=1000015867 delay=80912 targetdelay=15867
sleep=1000015612 delay=80708 targetdelay=15612
sleep=1000030397 delay=95592 targetdelay=30397
sleep=1000015299 delay=80475 targetdelay=15299
sleep=999993542 delay=58614 targetdelay=-6458
sleep=1000031263 delay=96310 targetdelay=31263
sleep=1000002029 delay=67169 targetdelay=2029
sleep=1000031671 delay=96821 targetdelay=31671
sleep=999998462 delay=63608 targetdelay=-1538
Anyway, the delays change all the time. I tried different CLOCK definitions and different compiler options, but without any special results.
Some statistics from further testing, sample size = 100 in both cases.
targetdelay from wakeTime.tv_nsec -= 0;
Mean value = 97503 Standard deviation = 27536
targetdelay from wakeTime.tv_nsec -= 97508;
Mean value = -1909 Standard deviation = 32682
In both cases, there were a few massive outliers, such that even this result from 100 samples might not quite be representative.
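The second idea above (avoiding a long-term shift over many intervals) can be sketched as follows. Instead of reading the clock again in every iteration, each deadline is derived from the previous deadline, so a late wakeup does not push the following ones back. The names timespec_add_ns and run_periodic are made up for this sketch, and it assumes CLOCK_MONOTONIC and a period of well under two seconds; it is not a measured or tuned implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Adds ns (which may be negative) to *t and keeps tv_nsec in [0, 1e9). */
void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_sec += ns / 1000000000L;
    t->tv_nsec += ns % 1000000000L;
    if (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    } else if (t->tv_nsec < 0) {
        t->tv_nsec += 1000000000L;
        t->tv_sec -= 1;
    }
}

/* Sleeps until `ticks` consecutive absolute deadlines, period_ns apart.
   Stores the mean wakeup lateness (in ns) in *mean_late_ns.
   Returns 0 on success, -1 on error. */
int run_periodic(int ticks, long period_ns, long *mean_late_ns)
{
    struct timespec deadline, now;
    long long total_late = 0;
    int i, rc;

    if (ticks < 1 || clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return -1;
    for (i = 0; i < ticks; i++) {
        /* The next deadline depends only on the previous deadline. */
        timespec_add_ns(&deadline, period_ns);
        /* clock_nanosleep() returns an error number directly; retry if
           a signal handler interrupted the sleep. */
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
        } while (rc == EINTR);
        if (rc != 0 || clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;
        total_late += (long long)(now.tv_sec - deadline.tv_sec) * 1000000000LL
                    + (now.tv_nsec - deadline.tv_nsec);
    }
    *mean_late_ns = (long)(total_late / ticks);
    return 0;
}
```

The mean lateness reported by a short calibration run could then be fed back as a negative offset through timespec_add_ns(), which is roughly what the first idea suggests. Whether that reduces or increases the jitter on a given machine has to be measured.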

Equivalent to Arduino millis()

I am currently working on the integration of a "shunt" type sensor on an electronic board. I chose a Linear part (LTC2947); unfortunately it only has an Arduino driver. I have to translate everything to C under Linux to be compatible with my microprocessor (APQ8009 ARM Cortex-A7). I have a small question about one of the functions:
int16_t LTC2947_wake_up() //Wake up LTC2947 from shutdown mode and measure the wakeup time
{
byte data[1];
unsigned long wakeupStart = millis(), wakeupTime;
LTC2947_WR_BYTE(LTC2947_REG_OPCTL, 0);
do
{
delay(1);
LTC2947_RD_BYTE(LTC2947_REG_OPCTL, data);
wakeupTime = millis() - wakeupStart;
if (data[0] == 0) //! check if we are in idle mode
{
return wakeupTime;
}
if (wakeupTime > 200)
{
//! failed to wake up due to timeout, return -1
return -1;
}
}
while (true);
}
After finding usleep() as an equivalent for delay(), I cannot find one for millis() in C. Can you help me translate this function please?
Arduino millis() is based on a timer that trips an overflow interrupt at very close to 1 kHz, i.e. once per millisecond. To achieve the same thing, I suggest you set up a timer on the ARM platform and update a volatile unsigned long variable with a counter. That will be the equivalent of millis().
Here is what millis() is doing behind the scenes:
SIGNAL(TIMER0_OVF_vect)
{
// copy these to local variables so they can be stored in registers
// (volatile variables must be read from memory on every access)
unsigned long m = timer0_millis;
unsigned char f = timer0_fract;
m += MILLIS_INC;
f += FRACT_INC;
if (f >= FRACT_MAX) {
f -= FRACT_MAX;
m += 1;
}
timer0_fract = f;
timer0_millis = m;
timer0_overflow_count++;
}
unsigned long millis()
{
unsigned long m;
uint8_t oldSREG = SREG;
// disable interrupts while we read timer0_millis or we might get an
// inconsistent value (e.g. in the middle of a write to timer0_millis)
cli();
m = timer0_millis;
SREG = oldSREG;
return m;
}
Coming from the embedded world, arguably the first thing you should do when starting a project on a new platform is establish clocks and get a timer interrupt going at a prescribed rate. That is the "Hello World" of embedded systems. ;) If you choose to do this at 1 kHz, you're most of the way there.
#include <time.h>
unsigned int millis () {
struct timespec t ;
clock_gettime ( CLOCK_MONOTONIC_RAW , & t ) ; // change CLOCK_MONOTONIC_RAW to CLOCK_MONOTONIC on non linux computers
return t.tv_sec * 1000 + ( t.tv_nsec + 500000 ) / 1000000 ;
}
or
#include <sys/time.h>
unsigned int millis () {
struct timeval t ;
gettimeofday ( & t , NULL ) ;
return t.tv_sec * 1000 + ( t.tv_usec + 500 ) / 1000 ;
}
The gettimeofday() version needs a POSIX system; it is not part of standard C, so it probably does not work on non-Unix computers.
The clock_gettime() version probably does not work with old C compilers and libraries.
The Arduino millis() returns unsigned long, which on Arduino is a 32-bit
unsigned integer. On 32-bit and 64-bit computers unsigned int is normally
already 32 bits wide, so there is no need to use long there (unlike on
Arduino boards where int is 16 bits), and these versions return unsigned
int. A 32-bit millisecond counter wraps around after about 49.7 days, so if
you want to measure a longer period in milliseconds, or if you want the
number of milliseconds since the Unix epoch in 1970, you need a long long
(64-bit) integer.
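For intervals shorter than that wrap-around period, a 32-bit counter is still safe as long as the elapsed time is computed with unsigned subtraction, as the Arduino code does with millis() - wakeupStart. A minimal sketch, with function names invented for this illustration:

```c
#include <stdint.h>

/* Elapsed milliseconds between two readings of a 32-bit millisecond
   counter. Unsigned subtraction is modulo 2^32, so the result stays
   correct across one wrap-around, provided the true interval is
   shorter than 2^32 ms. */
uint32_t elapsed_ms32(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Nonzero once at least timeout_ms have passed since start. */
int timed_out(uint32_t start, uint32_t now, uint32_t timeout_ms)
{
    return elapsed_ms32(start, now) >= timeout_ms;
}
```

For example, with start = 0xFFFFFF00 (just before the wrap) and now = 0x00000010 (just after it), elapsed_ms32() returns 272, which is the real distance between the two readings.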
If a computer clock has the incorrect time, the operating system, the system
administrator, or a program which synchronizes the computer clock with internet
clocks may change the computer clock to the correct time. This will affect
these functions, especially the gettimeofday() version. Usually there is a
big change in the computer clock when the computer boots, connects to the
network, and synchronizes the computer clock with the network time server.
But most programs are not running this early in the boot process, and thus
are not affected. Usually other changes to the computer clock are very
small, and the effect on other programs is very small. So usually changes
to the computer clock are not a problem.
clock_gettime() requires a clock id.
CLOCK_MONOTONIC is not affected by discontinuous jumps in the system time,
but is affected by incremental adjustments, and does not count the time the
computer is suspended.
CLOCK_MONOTONIC_RAW is Linux only, is not affected by discontinuous jumps in
the system time or by incremental adjustments, and does not count the time
the computer is suspended.
CLOCK_BOOTTIME is Linux only, is not affected by discontinuous jumps in the
system time, but is affected by incremental adjustments, and does count the
time the computer is suspended. It counts the time since the computer booted.
CLOCK_REALTIME is affected by discontinuous jumps in the system time, and by
incremental adjustments. It does count the time the computer is suspended.
It counts standard Unix time (time since the Unix epoch in 1970).
I think CLOCK_MONOTONIC_RAW is the best choice for Linux, and
CLOCK_MONOTONIC is the best choice for non-Linux systems. Usually millisecond
time is used to measure short periods of time, like how long it takes for
part of a computer program to run. In a short period of time, there will
probably be no changes to the computer clock, and the computer will probably
not be suspended, so any clock id will work, and the choice of clock id is
not important.
Precise time measurements are unreliable on multitasking computers because
the time measurement might be interrupted. Errors are usually small.
Sometimes this is a problem, and sometimes it isn't. If you need more
precise time measurements, you need dedicated hardware which cannot be
interrupted. Some computers have such hardware built in. For example, if a
program uses software pwm, changes to the output will be delayed if the
computer is interrupted at the time the computer needs to change the output.
But if the program uses hardware pwm, the hardware pwm controller cannot be
interrupted, and will change the output at the correct time.
Tested on a raspberry pi.
I hope this is useful. It works for me under Lubuntu 20.04 LTS.
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>
/* Reference point: millis() counts from the moment init_millis() was called. */
static struct timeval millis_start;
void init_millis(void) {
gettimeofday(&millis_start, NULL);
}
unsigned long int millis(void) {
long mtime, seconds, useconds;
struct timeval end;
gettimeofday(&end, NULL);
seconds = end.tv_sec - millis_start.tv_sec;
useconds = end.tv_usec - millis_start.tv_usec;
/* Adding 0.5 rounds to the nearest millisecond instead of truncating. */
mtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
return mtime;
}
int main(void)
{
init_millis();
/* millis() returns unsigned long, so the matching format is %lu. */
printf("Elapsed time: %lu milliseconds\n", millis());
return 0;
}
Note:
Based on the discussion in comments (with dear @MarcCompere), I must mention that the conversion of seconds and useconds to mtime in the millis function is rounded by adding 0.5 (read the comments to understand how); the 0.5 can be removed, depending on your application. If you are using millis for accurate time measurement, add it to statistically lower the mean squared error (MSE) of the conversion. But if you need timing for general logic-based decisions (or behavior closer to that of Arduino), then flooring (the natural behaviour when casting in this case) can be considered the better option, so do not add the 0.5.
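Putting the pieces together for the original question, the polling loop of LTC2947_wake_up() could be sketched under Linux roughly as follows. This is only a sketch under stated assumptions: read_opctl_fn and fake_read_opctl are hypothetical stand-ins for the real register access (whatever replaces LTC2947_RD_BYTE on the target bus), millis32 and delay_1ms are names invented here, and the initial register write from the original function is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Milliseconds from CLOCK_MONOTONIC, truncated to 32 bits like millis(). */
static uint32_t millis32(void)
{
    struct timespec t;

    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint32_t)((uint64_t)t.tv_sec * 1000u + (uint64_t)t.tv_nsec / 1000000u);
}

/* Sleeps for (at least) one millisecond; stands in for Arduino's delay(1). */
static void delay_1ms(void)
{
    struct timespec d = { 0, 1000000L };

    nanosleep(&d, NULL);
}

/* HYPOTHETICAL register-read callback: returns the OPCTL register byte. */
typedef uint8_t (*read_opctl_fn)(void *ctx);

/* Polls until the register reads 0 (idle) or more than timeout_ms have
   elapsed. Returns the wake-up time in ms, or -1 on timeout. */
int wait_until_idle(read_opctl_fn read_opctl, void *ctx, uint32_t timeout_ms)
{
    const uint32_t start = millis32();

    for (;;) {
        uint8_t value;
        uint32_t waited;

        delay_1ms();
        value = read_opctl(ctx);
        waited = millis32() - start;   /* wrap-safe unsigned subtraction */
        if (value == 0)
            return (int)waited;
        if (waited > timeout_ms)
            return -1;
    }
}

/* HYPOTHETICAL stub for trying the loop without hardware: ctx points to an
   int holding the number of polls until the device reports idle. */
uint8_t fake_read_opctl(void *ctx)
{
    int *polls_left = ctx;

    if (*polls_left > 0)
        (*polls_left)--;
    return (*polls_left > 0) ? 1 : 0;
}
```

Calling wait_until_idle(fake_read_opctl, &polls, 200) with polls set to 3 returns after three polls, a few milliseconds later. In a real port the callback would perform the bus read, and the 200 ms limit matches the timeout in the Arduino driver.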

Measuring processor ticks in C

I wanted to calculate the difference in execution time when executing the same code inside a function. To my surprise, however, sometimes the clock difference is 0 when I use clock()/clock_t for the start and stop timer. Does this mean that clock()/clock_t does not actually return the number of clicks the processor spent on the task?
After a bit of searching, it seemed to me that clock_gettime() would return more fine-grained results. And indeed it does, but I instead end up with an arbitrary number of nano(?)seconds. It gives a hint of the difference in execution time, but it's hardly accurate as to exactly how many clicks difference it amounts to. What would I have to do to find this out?
#include <math.h>
#include <stdio.h>
#include <time.h>
#define M_PI_DOUBLE (M_PI * 2)
void rotatetest(const float *x, const float *c, float *result) {
float rotationfraction = *x / *c;
*result = M_PI_DOUBLE * rotationfraction;
}
int main() {
int i;
long test_total = 0;
int test_count = 1000000;
struct timespec test_time_begin;
struct timespec test_time_end;
float r = 50.f;
float c = 2 * M_PI * r;
float x = 3.f;
float result_inline = 0.f;
float result_function = 0.f;
for (i = 0; i < test_count; i++) {
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_begin);
float rotationfraction = x / c;
result_inline = M_PI_DOUBLE * rotationfraction;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_end);
/* Include tv_sec so the difference stays correct when the seconds roll over. */
test_total += (test_time_end.tv_sec - test_time_begin.tv_sec) * 1000000000L
+ (test_time_end.tv_nsec - test_time_begin.tv_nsec);
}
printf("Inline clocks %li, avg %f (result is %f)\n", test_total, test_total / (float)test_count,result_inline);
test_total = 0; /* start the second measurement from zero */
for (i = 0; i < test_count; i++) {
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_begin);
rotatetest(&x, &c, &result_function);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_end);
test_total += (test_time_end.tv_sec - test_time_begin.tv_sec) * 1000000000L
+ (test_time_end.tv_nsec - test_time_begin.tv_nsec);
}
/* Print the function's own result, not result_inline. */
printf("Function clocks %li, avg %f (result is %f)\n", test_total, test_total / (float)test_count, result_function);
return 0;
}
I am using gcc version 4.8.4 on Linux 3.13.0-37-generic (Linux Mint 16)
First of all: as already mentioned in the comments, timing single executions one after the other will probably do you no good. In the worst case, the call for getting the time might actually take longer than the execution of the operation itself.
Please clock multiple runs of the operation (including a warm up phase so everything is swapped in) and calculate the average running times.
clock() isn't guaranteed to be monotonic. It also isn't the number of processor clicks (whatever you define this to be) the program has run. The best way to describe the result from clock() is probably "a best effort estimation of the time any one of the CPUs has spent on calculation for the current process". For benchmarking purposes clock() is thus mostly useless.
As per specification:
The clock() function returns the implementation's best approximation to the processor time used by the process since the beginning of an implementation-dependent time related only to the process invocation.
And additionally
To determine the time in seconds, the value returned by clock() should be divided by the value of the macro CLOCKS_PER_SEC.
So, if you call clock() more often than the resolution, you are out of luck.
For profiling/benchmarking, you should --if possible-- use one of the performance clocks that are available on modern hardware. The prime candidates are probably
The HPET
The TSC
Edit: The question now references CLOCK_PROCESS_CPUTIME_ID, which on Linux measures the CPU time consumed by the process (older glibc versions implemented it on top of the TSC on x86).
Whether either (or both) is available depends on the hardware and is also operating-system specific.
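The advice above to time many runs after a warm-up phase, rather than single calls, can be sketched like this. The sketch assumes CLOCK_MONOTONIC as the time source (the answer itself does not prescribe one), and avg_ns_per_call and example_work are names made up for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Average wall-clock nanoseconds per call of func(), measured over `runs`
   calls inside a single timed region, after `warmup` untimed calls.
   Returns a negative value if runs is zero or the clock cannot be read. */
double avg_ns_per_call(void (*func)(void), unsigned long warmup, unsigned long runs)
{
    struct timespec t0, t1;
    unsigned long i;

    if (runs == 0)
        return -1.0;
    for (i = 0; i < warmup; i++)
        func();                      /* warm caches and branch predictors */
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;
    for (i = 0; i < runs; i++)
        func();
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;
    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9
            + (double)(t1.tv_nsec - t0.tv_nsec)) / (double)runs;
}

/* Example workload, invented for this sketch. The volatile sink keeps the
   compiler from optimizing the work away. */
static volatile double sink;

void example_work(void)
{
    sink = sink * 1.0000001 + 1.0;
}
```

Because only two clock reads surround the whole batch, the cost of reading the clock is spread over all runs instead of dominating each measurement. The result still includes the loop and call overhead, so it is best used to compare alternatives rather than as an absolute figure.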
After googling a little bit I can see that the clock() function can be used as a standard mechanism to find the time taken for execution, but be aware that the time will vary from run to run depending on the load of your processor.
You can just use the below code for calculation
clock_t begin, end;
double time_spent;
begin = clock();
/* here, do your time-consuming job */
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;

L1 Cache Line Size

I am trying to determine the L1 cache line size through C code on a platform where the L1 instruction and data caches are 32 KB each and the L2 cache is 2 MB.
#include<stdio.h>
#include<stdlib.h>
#include<sys/time.h>
#include<time.h>
#define SIZE 100
long long wall_clock_time();
int main()
{
int *arr=calloc(SIZE,sizeof(int));
register int r,i;
long long before,after;
double time_elapsed;
for(i=0;i<SIZE;i++)
{
before=wall_clock_time();
r=arr[i];
after=wall_clock_time();
time_elapsed=((float)(after - before))/1000000000;
printf("Element Index = %d, Time Taken = %1.4f\n",i,time_elapsed);
}
free(arr);
return 0;
}
long long wall_clock_time() {
#ifdef __linux__
struct timespec tp;
clock_gettime(CLOCK_REALTIME, &tp);
return (long long)(tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll);
#else
struct timeval tv;
gettimeofday(&tv, NULL);
return (long long)(tv.tv_usec * 1000 + (long long)tv.tv_sec * 1000000000ll);
#endif
}
Above is a small code snippet that I am using to access elements of an array and trying to determine the jump in access delay at cache line boundaries. However, when I execute the code I get all the timing outputs as 0.000. I have read several threads on stackoverflow regarding this topic but couldn't understand much, hence attempted to write this code.
Can anybody explain to me whether there is an error conceptually or syntactically?
The 0.00 should have hinted that you're measuring something too small. The overhead of calling the measurement function is several orders of magnitude higher than what you measure.
Instead, measure the overall time it takes you to pass the array, and divide by SIZE to amortize it. Since SIZE is also rather small, you should probably repeat this action several hundreds of times and amortize over the entire thing.
Note that this still won't give you the latency, but rather the throughput of accesses. You'll need to come up with a way to derive the line size from that (try reading from the second-level cache, and use the fact that reads to the same line would hit in the L1. By increasing your step, you'll be able to see when your bandwidth stops degrading and stays constant).
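A rough sketch of the stride experiment described above: time a large, fixed number of reads in one go, with a configurable step between them, and compare the average cost per read for different steps. The function name ns_per_access and its parameters are invented for this illustration, and hardware prefetching may well blur the picture, so treat the output as a hint rather than a measurement of the line size.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Performs `accesses` one-byte reads from buf, `stride` bytes apart,
   wrapping around at len (which must be a power of two), and returns the
   average nanoseconds per read. Returns a negative value on bad arguments
   or if the clock cannot be read. */
double ns_per_access(const unsigned char *buf, size_t len, size_t stride,
                     unsigned long accesses)
{
    struct timespec t0, t1;
    volatile unsigned char sink = 0;   /* forces every read to happen */
    size_t pos = 0;
    unsigned long i;

    if (!buf || len == 0 || (len & (len - 1)) != 0 || accesses == 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;
    for (i = 0; i < accesses; i++) {
        sink = buf[pos];                    /* the read being measured */
        pos = (pos + stride) & (len - 1);   /* wrap within the buffer */
    }
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;
    (void)sink;
    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9
            + (double)(t1.tv_nsec - t0.tv_nsec)) / (double)accesses;
}
```

One way to use it is to allocate a buffer that is larger than the 32 KB L1 but fits in the 2 MB L2 (1 MiB, say), write to every byte first so the pages are really backed by memory, and then call the function with strides 1, 2, 4, ... up to 256. Following the reasoning above, the cost per read would be expected to grow with the stride and level off once the stride reaches the line size, because from then on every read touches a new line.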

Measuring time in millisecond precision

My program is going to race different sorting algorithms against each other, both in time and space. I've got space covered, but measuring time is giving me some trouble. Here is the code that runs the sorts:
void test(short* n, short len) {
short i, j, a[1024];
for(i=0; i<2; i++) { // Loop over each sort algo
memused = 0; // Initialize memory marker
for(j=0; j<len; j++) // Copy scrambled list into fresh array
a[j] = n[j]; // (Sorting algos are in-place)
// ***Point A***
switch(i) { // Pick sorting algo
case 0:
selectionSort(a, len);
break; // without break, case 0 falls through and also runs quicksort
case 1:
quicksort(a, len);
break;
}
// ***Point B***
spc[i][len] = memused; // Record how much mem was used
}
}
(I removed some of the sorting algos for simplicity)
Now, I need to measure how much time the sorting algo takes. The most obvious way to do this is to record the time at point (a) and then subtract that from the time at point (b). But none of the C time functions are good enough:
time() gives me time in seconds, but the algos are faster than that, so I need something more accurate.
clock() gives me CPU ticks since the program started, but seems to round to the nearest 10,000; still not small enough
The time shell command works well enough, except that I need to run over 1,000 tests per algorithm, and I need the individual time for each one.
I have no idea what getrusage() returns, but it's also too long.
What I need is time in units (significantly, if possible) smaller than the run time of the sorting functions: about 2ms. So my question is: Where can I get that?
gettimeofday() has microsecond resolution and is easy to use.
A pair of useful timer functions is:
#include <stdio.h>
#include <sys/time.h>
static struct timeval tm1;
static inline void start()
{
gettimeofday(&tm1, NULL);
}
static inline void stop()
{
struct timeval tm2;
gettimeofday(&tm2, NULL);
unsigned long long t = 1000 * (tm2.tv_sec - tm1.tv_sec) + (tm2.tv_usec - tm1.tv_usec) / 1000;
printf("%llu ms\n", t);
}
For measuring time, use clock_gettime with CLOCK_MONOTONIC (or CLOCK_MONOTONIC_RAW if it is available). Where possible, avoid using gettimeofday. It is specifically deprecated in favor of clock_gettime, and the time returned from it is subject to adjustments from time servers, which can throw off your measurements.
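A minimal sketch of that recommendation, as a start/stop pair built on clock_gettime(CLOCK_MONOTONIC). The names timespec_diff_ms, bench_start and bench_stop_ms are made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Milliseconds from *start to *end. This also works when end->tv_nsec is
   smaller than start->tv_nsec, because the seconds term absorbs the
   negative nanosecond difference. */
double timespec_diff_ms(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1000.0
         + (double)(end->tv_nsec - start->tv_nsec) / 1.0e6;
}

static struct timespec bench_t0;

/* Records the starting time. */
void bench_start(void)
{
    clock_gettime(CLOCK_MONOTONIC, &bench_t0);
}

/* Returns the milliseconds elapsed since the last bench_start(). */
double bench_stop_ms(void)
{
    struct timespec t1;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    return timespec_diff_ms(&bench_t0, &t1);
}
```

Placing bench_start() at Point A and bench_stop_ms() at Point B of the question's loop gives the per-sort time with sub-millisecond resolution. As a check of the arithmetic, the interval from 1.9 s to 2.1 s comes out as 200 ms.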
You can get the total user + kernel time (or choose just one) using getrusage as follows:
#include <sys/time.h>
#include <sys/resource.h>
double get_process_time() {
struct rusage usage;
if( 0 == getrusage(RUSAGE_SELF, &usage) ) {
return (double)(usage.ru_utime.tv_sec + usage.ru_stime.tv_sec) +
(double)(usage.ru_utime.tv_usec + usage.ru_stime.tv_usec) / 1.0e6;
}
return 0;
}
I elected to create a double containing fractional seconds...
double t_begin, t_end;
t_begin = get_process_time();
// Do some operation...
t_end = get_process_time();
printf( "Elapsed time: %.6f seconds\n", t_end - t_begin );
The Time Stamp Counter could be helpful here:
static unsigned long long rdtsctime() {
unsigned int eax, edx;
unsigned long long val;
__asm__ __volatile__("rdtsc":"=a"(eax), "=d"(edx));
val = edx;
val = val << 32;
val += eax;
return val;
}
Though there are some caveats to this. The timestamps for different processor cores may be different, and changing clock speeds (due to power saving features and the like) can cause erroneous results.

Resources