Precise Linux Timing - What Determines the Resolution of clock_gettime()? - c

I need to do precision timing to the 1 us level to time a change in duty cycle of a pwm wave.
Background
I am using a Gumstix Over Water COM (https://www.gumstix.com/store/app.php/products/265/) that has a single core ARM Cortex-A8 processor running at 499.92 BogoMIPS (the Gumstix page claims up to 1Ghz with 800Mhz recommended) according to /proc/cpuinfo. The OS is an Angstrom Image version of Linux based of kernel version 2.6.34 and it is stock on the Gumstix Water COM.
The Problem
I have done a fair amount of reading about precise timing in Linux (and have tried most of it) and the consensus seems to be that using clock_gettime() and referencing CLOCK_MONOTONIC is the best way to do it. (I would have liked to use the RDTSC register for timing since I have one core with minimal power saving abilities but this is not an Intel processor.) So here is the odd part, while clock_getres() returns 1, suggesting resolution at 1 ns, actual timing tests suggest a minimum resolution of 30517ns or (it can't be coincidence) exactly the time between a 32.768KHz clock ticks. Here's what I mean:
// Stackoverflow example
#include <stdio.h>
#include <time.h>
#define SEC2NANOSEC 1000000000
int main( int argc, const char* argv[] )
{
// //////////////// Min resolution test //////////////////////
struct timespec resStart, resEnd, ts;
ts.tv_sec = 0; // s
ts.tv_nsec = 1; // ns
int iters = 100;
double resTime,sum = 0;
int i;
for (i = 0; i<iters; i++)
{
clock_gettime(CLOCK_MONOTONIC, &resStart); // start timer
// clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, &ts);
clock_gettime(CLOCK_MONOTONIC, &resEnd); // end timer
resTime = ((double)resEnd.tv_sec*SEC2NANOSEC + (double)resEnd.tv_nsec
- ((double)resStart.tv_sec*SEC2NANOSEC + (double)resStart.tv_nsec);
sum = sum + resTime;
printf("resTime = %f\n",resTime);
}
printf("Average = %f\n",sum/(double)iters);
}
(Don't fret over the double casting, tv_sec in a time_t and tv_nsec is a long.)
Compile with:
gcc soExample.c -o runSOExample -lrt
Run with:
./runSOExample
With the nanosleep commented out as shown, the result is either 0ns or 30517ns with the majority being 0ns. This leads me to believe that CLOCK_MONOTONIC is updated at 32.768kHz and most of the time the clock has not been updated before the second clock_gettime() call is made and in cases where the result is 30517ns the clock has been updated between calls.
When I do the same thing on my development computer (AMD FX(tm)-6100 Six-Core Processor running at 1.4 GHz) the minimum delay is a more constant 149-151ns with no zeros.
So, let's compare those results to the CPU speeds. For the Gumstix, that 30517ns (32.768kHz) equates to 15298 cycles of the 499.93MHz cpu. For my dev computer that 150ns equates to 210 cycles of the 1.4Ghz CPU.
With the clock_nanosleep() call uncommented the average results are these:
Gumstix: Avg value = 213623 and the result varies, up and down, by multiples of that min resolution of 30517ns
Dev computer: 57710-68065 ns with no clear trend. In the case of the dev computer I expect the resolution to actually be at the 1 ns level and the measured ~150ns truly is the time elapsed between the two clock_gettime() calls.
So, my question's are these:
What determines that minimum resolution?
Why is the resolution of the dev computer 30000X better than the Gumstix when the processor is only running ~2.6X faster?
Is there a way to change how often CLOCK_MONOTONIC is updated and where? In the kernel?
Thanks! If you need more info or clarification just ask.

As I understand, the difference between two environments(Gumstix and your Dev-computer) might be the underlying timer h/w they are using.
Commented nanosleep() case:
You are using clock_gettime() twice. To give you a rough idea of what this clock_gettime() will ultimately get mapped to(in kernel):
clock_gettime -->clock_get() -->posix_ktime_get_ts -->ktime_get_ts() -->timekeeping_get_ns()
-->clock->read()
clock->read() basically reads the value of the counter provided by underlying timer driver and corresponding h/w. A simple difference with stored value of the counter in the past and current counter value and then nanoseconds conversion mathematics will yield you the nanoseconds elapsed and will update the time-keeping data structures in kernel.
For example , if you have a HPET timer which gives you a 10 MHz clock, the h/w counter will get updated at 100 ns time interval.
Lets say, on first clock->read(), you get a counter value of X.
Linux Time-keeping data structures will read this value of X, get the difference 'D'compared to some old stored counter value.Do some counter-difference 'D' to nanoseconds 'n' conversion mathematics, update the data-structure by 'n'
Yield this new time value to the user space.
When second clock->read() is issued, it will again read the counter and update the time.
Now, for a HPET timer, this counter is getting updated every 100ns and hence , you will see this difference being reported to the user-space.
Now, Let's replace this HPET timer with a slow 32.768 KHz clock. Now , clock->read()'s counter will updated only after 30517 ns seconds, so, if you second call to clock_gettime() is before this period, you will get 0(which is majority of the cases) and in some cases, your second function call will be placed after counter has incremented by 1, i.e 30517 ns has elapsed. Hence , the value of 30517 ns sometimes.
Uncommented Nanosleep() case:
Let's trace the clock_nanosleep() for monotonic clocks:
clock_nanosleep() -->nsleep --> common_nsleep() -->hrtimer_nanosleep() -->do_nanosleep()
do_nanosleep() will simply put the current task in INTERRUPTIBLE state, will wait for the timer to expire(which is 1 ns) and then set the current task in RUNNING state again. You see, there are lot of factors involved now, mainly when your kernel thread (and hence the user space process) will be scheduled again. Depending on your OS, you will always face some latency when your doing a context-switch and this is what we observe with the average values.
Now Your questions:
What determines that minimum resolution?
I think the resolution/precision of your system will depend on the underlying timer hardware being used(assuming your OS is able to provide that precision to the user space process).
*Why is the resolution of the dev computer 30000X better than the Gumstix when the processor is only running ~2.6X faster?*
Sorry, I missed you here. How it is 30000x faster? To me , it looks like something 200x faster(30714 ns/ 150 ns ~ 200X ? ) .But anyway, as I understand, CPU speed may or may not have to do with the timer resolution/precision. So, this assumption may be right in some architectures(when you are using TSC H/W), though, might fail in others(using HPET, PIT etc).
Is there a way to change how often CLOCK_MONOTONIC is updated and where? In the kernel?
you can always look into the kernel code for details(that's how i looked into it).
In linux kernel code , look for these source files and Documentation:
kernel/posix-timers.c
kernel/hrtimer.c
Documentation/timers/hrtimers.txt

I do not have gumstix on hand, but it looks like your clocksource is slow.
run:
$ dmesg | grep clocksource
If you get back
[ 0.560455] Switching to clocksource 32k_counter
This might explain why your clock is so slow.
In the recent kernels there is a directory /sys/devices/system/clocksource/clocksource0 with two files: available_clocksource and current_clocksource. If you have this directory, try switching to a different source by echo'ing its name into second file.

Related

How to write a time difference function to STM32F4

i am working on STM32F4 and pretty new at it. I know basics of C but with more than 1 day research, i still not found a solution of this.
I simply want to make a delay function myself, processor runs at 168MHz ( HCLK ). So my intuition says that it produces 168x10^6 clock cycles at each seconds. So the method should be something like that,
1-Store current clock count to a variable
2-Time diff = ( clock value at any time - stored starting clock value ) / 168000000
This flow should give me time difference in terms of seconds and then i can use it to convert whatever i want.
But, unfortunately, despite it seems so easy, I just cant implement any methods to MCU.
I tried time.h but it did not work properly. For ex, clock() gave same result over and over, and time( the one returns seconds since 1970 ) gave hexadecimal 0xFFFFFFFF ( -1, I guess means error ) .
Thanks.
Edit : While writing i assumed that some func like clock() will return total clock count since the start of program flow, but now i think after 4Billion/168Million secs it will overflow uint32_t size. I am really confused.
The answer depends on the required precision and intervals.
For shorter intervals with sub-microsecond precision there is a cycle counter. Your suspicion is correct, it would overflow after 232/168*106 ~ 25.5 seconds.
For longer intervals there are timers that can be prescaled to support any possible subdivision of the 168 MHz clock. The most commonly used setup is the SysTick timer set to generate an interrupt at 1 kHz frequency, which increments a software counter. Reading this counter would give the number of milliseconds elapsed since startup. As it is usually a 32 bit counter, it would overflow after 49.7 days. The HAL library sets SysTick up this way, the counter can then be queried using the HAL_GetTick() function.
For even longer or more specialized timing requirements you can use the RTC peripheral which keeps calendar time, or the TIM peripherals (basic, general and advanced timers), these have their own prescalers, and they can be arranged in a master-slave setup to give almost arbitrary precision and intervals.

faster equivalent of gettimeofday

In trying to build a very latency sensitive application, that needs to send 100s of messages a seconds, each message having the time field, we wanted to consider optimizing gettimeofday.
Out first thought was rdtsc based optimization. Any thoughts ? Any other pointers ?
Required accurancy of the time value returned is in milliseconds, but it isn't a big deal if the value is occasionally out of sync with the receiver for 1-2 milliseconds.
Trying to do better than the 62 nanoseconds gettimeofday takes
POSIX Clocks
I wrote a benchmark for POSIX clock sources:
time (s) => 3 cycles
ftime (ms) => 54 cycles
gettimeofday (us) => 42 cycles
clock_gettime (ns) => 9 cycles (CLOCK_MONOTONIC_COARSE)
clock_gettime (ns) => 9 cycles (CLOCK_REALTIME_COARSE)
clock_gettime (ns) => 42 cycles (CLOCK_MONOTONIC)
clock_gettime (ns) => 42 cycles (CLOCK_REALTIME)
clock_gettime (ns) => 173 cycles (CLOCK_MONOTONIC_RAW)
clock_gettime (ns) => 179 cycles (CLOCK_BOOTTIME)
clock_gettime (ns) => 349 cycles (CLOCK_THREAD_CPUTIME_ID)
clock_gettime (ns) => 370 cycles (CLOCK_PROCESS_CPUTIME_ID)
rdtsc (cycles) => 24 cycles
These numbers are from an Intel Core i7-4771 CPU # 3.50GHz on Linux 4.0. These measurements were taken using the TSC register and running each clock method thousands of times and taking the minimum cost value.
You'll want to test on the machines you intend to run on though as how these are implemented varies from hardware and kernel version. The code can be found here. It relies on the TSC register for cycle counting, which is in the same repo (tsc.h).
TSC
Access the TSC (processor time-stamp counter) is the most accurate and cheapest way to time things. Generally, this is what the kernel is using itself. It's also quite straight-forward on modern Intel chips as the TSC is synchronized across cores and unaffected by frequency scaling. So it provides a simple, global time source. You can see an example of using it here with a walkthrough of the assembly code here.
The main issue with this (other than portability) is that there doesn't seem to be a good way to go from cycles to nanoseconds. The Intel docs as far as I can find state that the TSC runs at a fixed frequency, but that this frequency may differ from the processors stated frequency. Intel doesn't appear to provide a reliable way to figure out the TSC frequency. The Linux kernel appears to solve this by testing how many TSC cycles occur between two hardware timers (see here).
Memcached
Memcached bothers to do the cache method. It may simply be to make sure the performance is more predictable across platforms, or scale better with multiple cores. It may also no be a worthwhile optimization.
Have you actually benchmarked, and found gettimeofday to be unacceptably slow?
At the rate of 100 messages a second, you have 10ms of CPU time per message. If you have multiple cores, assuming it can be fully parallelized, you can easily increase that by 4-6x - that's 40-60ms per message! The cost of gettimeofday is unlikely to be anywhere near 10ms - I'd suspect it to be more like 1-10 microseconds (on my system, microbenchmarking it gives about 1 microsecond per call - try it for yourself). Your optimization efforts would be better spent elsewhere.
While using the TSC is a reasonable idea, modern Linux already has a userspace TSC-based gettimeofday - where possible, the vdso will pull in an implementation of gettimeofday that applies an offset (read from a shared kernel-user memory segment) to rdtsc's value, thus computing the time of day without entering the kernel. However, some CPU models don't have a TSC synchronized between different cores or different packages, and so this can end up being disabled. If you want high performance timing, you might first want to consider finding a CPU model that does have a synchronized TSC.
That said, if you're willing to sacrifice a significant amount of resolution (your timing will only be accurate to the last tick, meaning it could be off by tens of milliseconds), you could use CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE with clock_gettime. This is also implemented with the vdso as well, and guaranteed not to call into the kernel (for recent kernels and glibc).
Like bdonian says, if you're only sending a few hundred messages per second, gettimeofday is going to be fast enough.
However, if you were sending millions of messages per second, it might be different (but you should still measure that it is a bottleneck). In that case, you might want to consider something like this:
have a global variable, giving the current timestamp in your desired accuracy
have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time, then have the thread sleep some fraction of T and then update the timestamp; use real-time features if you need to)
all other threads (or the main process, if you don't use threads otherwise) just reads the global variable
The C language does not guarantee that you can read the timestamp value if it is larger than sig_atomic_t. You could use locking to deal with that, but locking is heavy. Instead, you could use a volatile sig_atomic_t typed variable to index an array of timestamps: the background thread updates the next element in the array, and then updates the index. The other threads read the index, and then read the array: they might get a tiny bit out-of-date timestamp (but they get the right one next time), but they do not run into the problem where they read the timestamp at the same time it is being updated, and get some bytes of the old value and some of the new value.
But all this is much overkill for just hundreds of messages per second.
Below is a benchmark. I see about 30ns. printTime() from rashad How to get current time and date in C++?
#include <string>
#include <iostream>
#include <sys/time.h>
using namespace std;
void printTime(time_t now)
{
struct tm tstruct;
char buf[80];
tstruct = *localtime(&now);
strftime(buf, sizeof(buf), "%Y-%m-%d.%X", &tstruct);
cout << buf << endl;
}
int main()
{
timeval tv;
time_t tm;
gettimeofday(&tv,NULL);
printTime((time_t)tv.tv_sec);
for(int i=0; i<100000000; i++)
gettimeofday(&tv,NULL);
gettimeofday(&tv,NULL);
printTime((time_t)tv.tv_sec);
printTime(time(NULL));
for(int i=0; i<100000000; i++)
tm=time(NULL);
printTime(time(NULL));
return 0;
}
3 sec for 100,000,000 calls or 30ns;
2014-03-20.09:23:35
2014-03-20.09:23:38
2014-03-20.09:23:38
2014-03-20.09:23:41
Do you need the millisecond precision? If not you could simply use time() and deal with the unix timestamp.

CPU Cycle count based profiling in C/C++ Linux x86_64

I am using the following code to profile my operations to optimize on cpu cycles taken in my functions.
static __inline__ unsigned long GetCC(void)
{
unsigned a, d;
asm volatile("rdtsc" : "=a" (a), "=d" (d));
return ((unsigned long)a) | (((unsigned long)d) << 32);
}
I don't think it is the best since even two consecutive calls gives me a difference of "33".
Any suggestions ?
I personally think the rdtsc instruction is great and usable for a variety of tasks. I do not think that using cpuid is necessary to prepare for rdtsc. Here is how I reason around rdtsc:
Since I use the Watcom compiler I've implemented rdtsc using "#pragma aux" which means that the C compiler will generate the instruction inline, expect the result in edx:eax and also inform its optimizer that the contents of eax and edx have been modified. This is a huge improvement from traditional _asm implementations where the optimizer would stay away from optimizing in _asm's vicinity. I've also implemented a divide_U8_by_U4 using "#pragma aux" so that I won't need to call a lib function when I convert clock_cycles to us or ms.
Every execution of rdtsc will result in some overhead (A LOT more if it is encapsulated as in the author's example) which must be taken more into account the shorter the sequence to measure is. Generally I don't time shorter sequences than 1/30 of the internal clock frequency which usually works out to 1/10^8 seconds (3 GHZ internal clock). I use such measurements as indications, not fact. Knowing this I can leave out cpuid. The more times I measure, the closer to fact I will get.
To measure reliably I would use the 1/100 - 1/300 range i/e 0.03 - 0.1 us. In this range the additional accuracy of using cpuid is practically insignificant. I use this range for short sequence timing. This is my "non-standard" unit since it is dependent on the CPU's internal clock frequency. For example on a 1 GHz machine I would not use 0.03 us because that would put me outside the 1/100 limit and my readings would become indications. Here I would use 0.1 us as the shortest time measurement unit. 1/300 would not be used since it would be too close to 1 us (see below) to make any significant difference.
For even longer processing sequences I divide the difference between two rdtsc reading with say 3000 (for 3 GHz) and will convert the elapsed clock cycles to us. Actually I use (diff+1500)/3000 where 1500 is half of 3000. For I/O waits I use milliseconds => (diff+1500000)/3000000. These are my "standard" units. I very seldom use seconds.
Sometimes I get unexpectedly slow results and then I must ask myself: is this due to an interrupt or to the code? I measure a few more times to see if it was, indeed, an interrupt. In that case ... well interrupts happen all the time in the real world. If my sequence is short then there is a good possibility that the next measurement won't be interrupted. If the sequence is longer interrupts will occur more often and there isn't much I can do about it.
Measuring long elapsed times very accurately (hour and longer ETs in us or lower) will increase the risk of getting a division exception in divide_U8_by_U4, so I think through when to use us and when to use ms.
I also have code for basic statistics. Using this I log min and max values and I can calculate mean and standard deviation. This code is non-trivial so its own ET must be subtracted from the measured ETs.
If the compiler is doing extensive optimizations and your readings are stored in local variables the compiler may determine ("correctly") that the code can be omitted. One way to avoid this is to store the results in public (non-static, non-stack-based) variables.
Programs running in real-world conditions should be measured in real-world conditions, there's no way around that.
As to the question of time stamp counter being accurate I would say that assuming the tsc on different cores are synchronized (which is the norm) there is the problem of CPU throttling during periods of low activity to reduce energy consumption. It is always possible to inhibit the functionality when testing. If you're executing an instruction at 1 GHz or at 10 Mhz on the same processor the elapsed cycle count will be the same even though the former completed in 1% of the time compred to the latter.
Trying to count the cycles of an individual execution of a function is not really the right way to go. The fact that your process can be interrupted at any time, along with delays caused by cache misses and branch mispredictions means that there can be considerable deviation in the number of cycles taken from call to call.
The right way is either:
Count the number of cycles or CPU time (with clock()) taken for a large number of calls to the function, then average them; or
Use a cycle-level emulating profiler like Callgrind / kcachegrind.
By the way, you need to execute a serialising instruction before RDTSC. Typically CPUID is used.
You are on the right track1, but you need to do two things:
Run cpuid instruction before rdtsc to flush the CPU pipeline (makes measurement more reliable). As far as I recall it clobbers registers from eax to edx.
Measure real time. There is a lot more to execution time, than just CPU cycles (locking contention, context switches and other overhead you don't control). Calibrate TSC ticks with real time. You can do it in a simple loop that takes differences in measurements of, say, gettimeofday (Linux, since you didn't mentioned the platform) calls and rdtsc output. Then you can tell how much time each TSC tick takes. Another consideration is synchronization of TSC across CPUs, because each core may have its own counter. In Linux you can see it in /proc/cpuinfo, your CPU should have a constant_tsc flag. Most newer Intel CPUs I've seen have this flag.
1I have personally found rdtsc to be more accurate than system calls like gettimeofday() for fine-grained measurements.
Another thing you might need to worry about is if you are running on a multi-core machine the program could be moved to a different core, which will have a different rdtsc counter. You may be able to pin the process to one core via a system call, though.
If I were trying to measure something like this I would probably record the time stamps to an array and then come back and examine this array after the code being benchmarked had completed. When you are examining the data recorded to the array of timestamps you should keep in mind that this array will rely on the CPU cache (and possibly paging if your array is big), but you could prefetch or just keep that in mind as you analyze the data. You should see a very regular time delta between time stamps, but with several spikes and possibly a few dips (probably from getting moved to a different core). The regular time delta is probably your best measurement, since it suggests that no outside events effected those measurements.
That being said, if the code you are benchmarking has irregular memory access patterns or run times or relies on system calls (especially IO related ones) then you will have a difficult time separating the noise from the data you are interested in.
The TSC isn't a good measure of time. The only guarantee that the CPU makes about the TSC is that it rises monotonically (that is, if you RDTSC once and then do it again, the second one will return a result that is higher than the first) and that it will take it a very long time to wraparound.
Do I understand correctly that the reason you do this is to bracket other code with it so you can measure how long the other code takes?
I'm sure you know another good way to do that is just loop the other code 10^6 times, stopwatch it, and call it microseconds.
Once you've measured the other code, am I correct to assume you want to know which lines in it are worth optimizing, so as to reduce the time it takes?
If so, you're on well-trod ground. You could use a tool like Zoom or LTProf. Here's my favorite method.
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross architecture wrapper for performance events.
This answer is basically the same as the one for this C++ question: How to get the CPU cycle count in x86_64 from C++? see that answer for more details.
perf_event_open.c
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}

getting elapsed time since process start

I need a way to get the elapsed time (wall-clock time) since a program started, in a way that is resilient to users meddling with the system clock.
On windows, the non standard clock() implementation doesn't do the trick, as it appears to work just by calculating the difference with the time sampled at start up, so that I get negative values if I "move the clock hands back".
On UNIX, clock/getrusage refer to system time, whereas using function such as gettimeofday to sample timestamps has the same problem as using clock on windows.
I'm not really interested in precision, and I've hacked a solution by having a half a second resolution timer spinning in the background countering the clock skews when they happen
(if the difference between the sampled time and the expected exceeds 1 second i use the expected timer for the new baseline) but I think there must be a better way.
I guess you can always start some kind of timer. For example under Linux a thread
that would have a loop like this :
static void timer_thread(void * arg)
{
struct timespec delay;
unsigned int msecond_delay = ((app_state_t*)arg)->msecond_delay;
delay.tv_sec = 0;
delay.tv_nsec = msecond_delay * 1000000;
while(1) {
some_global_counter_increment();
nanosleep(&delay, NULL);
}
}
Where app_state_t is an application structure of your choice were you store variables. If you want to prevent tampering, you need to be sure no one killed your thread
For POSIX, use clock_gettime() with CLOCK_MONOTONIC.
I don't think you'll find a cross-platform way of doing that.
On Windows what you need is GetTickCount (or maybe QueryPerformanceCounter and QueryPerformanceFrequency for a high resolution timer). I don't have experience with that on Linux, but a search on Google gave me clock_gettime.
Wall clock time can bit calculated with the time() call.
If you have a network connection, you can always acquire the time from an NTP server. This will obviously not be affected in any the local clock.
/proc/uptime on linux maintains the number of seconds that the system has been up (and the number of seconds it has been idle), which should be unaffected by changes to the clock as it's maintained by the system interrupt (jiffies / HZ). Perhaps windows has something similar?

GetLocalTime() API time resolution

I need to find out time taken by a function in my application. Application is a MS VIsual Studio 2005 solution, all C code.
I used thw windows API GetLocalTime(SYSTEMTIME *) to get the current system time before and after the function call which I want to measure time of.
But this has shortcoming that it lowest resolution is only 1msec. Nothing below that. So I cannot get any time granularity in micro seconds.
I know that time() which gives the time elapsed since the epoch time, also has resolution of 1msec (No microseconds)
1.) Is there any other Windows API which gives time in microseconds which I can use to measure the time consumed by my function?
-AD
There are some other possibilities.
QueryPerformanceCounter and QueryPerformanceFrequency
QueryPerformanceCounter will return a "performance counter" which is actually a CPU-managed 64-bit counter that increments from 0 starting with the computer power-on. The frequency of this counter is returned by the QueryPerformanceFrequency. To get the time reference in seconds, divide performance counter by performance frequency. In Delphi:
function QueryPerfCounterAsUS: int64;
begin
if QueryPerformanceCounter(Result) and
QueryPerformanceFrequency(perfFreq)
then
Result := Round(Result / perfFreq * 1000000);
else
Result := 0;
end;
On multiprocessor platforms, QueryPerformanceCounter should return consistent results regardless of the CPU the thread is currently running on. There are occasional problems, though, usually caused by bugs in hardware chips or BIOSes. Usually, patches are provided by motherboard manufacturers. Two examples from the MSDN:
Programs that use the QueryPerformanceCounter function may perform poorly in Windows Server 2003 and in Windows XP
Performance counter value may unexpectedly leap forward
Another problem with QueryPerformanceCounter is that it is quite slow.
RDTSC instruction
If you can limit your code to one CPU (SetThreadAffinity), you can use RDTSC assembler instruction to query performance counter directly from the processor.
function CPUGetTick: int64;
asm
dw 310Fh // rdtsc
end;
RDTSC result is incremented with same frequency as QueryPerformanceCounter. Divide it by QueryPerformanceFrequency to get time in seconds.
QueryPerformanceCounter is much slower thatn RDTSC because it must take into account multiple CPUs and CPUs with variable frequency. From Raymon Chen's blog:
(QueryPerformanceCounter) counts elapsed time. It has to, since its value is
governed by the QueryPerformanceFrequency function, which returns a number
specifying the number of units per second, and the frequency is spec'd as not
changing while the system is running.
For CPUs that can run at variable speed, this means that the HAL cannot
use an instruction like RDTSC, since that does not correlate with elapsed time.
timeGetTime
TimeGetTime belongs to the Win32 multimedia Win32 functions. It returns time in milliseconds with 1 ms resolution, at least on a modern hardware. It doesn't hurt if you run timeBeginPeriod(1) before you start measuring time and timeEndPeriod(1) when you're done.
GetLocalTime and GetSystemTime
Before Vista, both GetLocalTime and GetSystemTime return current time with millisecond precision, but they are not accurate to a millisecond. Their accuracy is typically in the range of 10 to 55 milliseconds. (Precision is not the same as accuracy)
On Vista, GetLocalTime and GetSystemTime both work with 1 ms resolution.
You can try to use clock() which will provide the number of "ticks" between two points. A "tick" is the smallest unit of time a processor can measure.
As a side note, you can't use clock() to determine the actual time - only the number of ticks between two points in your program.
One caution on multiprocessor systems:
from http://msdn.microsoft.com/en-us/library/ms644904(VS.85).aspx
On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function.
Al Weiner
On Windows you can use the 'high performance counter API'. Check out: QueryPerformanceCounter and QueryPerformanceCounterFrequency for the details.

Resources