QueryPerformance counter and Queryperformance frequency in windows

QueryPerformance counter and Queryperformance frequency in windows - c

#include <windows.h>
#include <stdio.h>
#include <stdint.h>
// assuming we return times with microsecond resolution
#define STOPWATCH_TICKS_PER_US 1
uint64_t GetStopWatch()
{
LARGE_INTEGER t, freq;
uint64_t val;
QueryPerformanceCounter(&t);
QueryPerformanceFrequency(&freq);
return (uint64_t) (t.QuadPart / (double) freq.QuadPart * 1000000);
}
void task()
{
printf("hi\n");
}
int main()
{
uint64_t start = GetStopWatch();
task();
uint64_t stop = GetStopWatch();
printf("Elapsed time (microseconds): %lld\n", stop - start);
}
The above contains a query performance counter function Retrieves the current value of the high-resolution performance counter and query performance frequency function Retrieves the frequency of the high-resolution performance counter. If I am calling the task(); function multiple times then the difference between the start and stop time varies but I should get the same time difference for calling the task function multiple times. could anyone help me to identify the mistake in the above code ??

The thing is, Windows is a pre-emptive multi-tasking operating system. What the hell does that mean, you ask?
'Simple' - windows allocates time-slices to each of the running processes in the system. This gives the illusion of dozens or hundreds of processes running in parallel. In reality, you are limited to 2, 4, 8 or perhaps 16 parallel processes in a typical desktop/laptop. An Intel i3 has 2 physical cores, each of which can give the impression of doing two things at once. (But in reality, there's hardware tricks going on that switch the execution between each of the two threads that each core can handle at once) This is in addition to the software context switching that Windows/Linux/MacOSX do.
These time-slices are not guaranteed to be of the same duration each time. You may find the pc does a sync with windows.time to update your clock, you may find that the virus-scanner decides to begin working, or any one of a number of other things. All of these events may occur after your task() function has begun, yet before it ends.
In the DOS days, you'd get very nearly the same result each and every time you timed a single iteration of task(). Though, thanks to TSR programs, you could still find an interrupt was fired and some machine-time stolen during execution.
It is for just these reasons that a more accurate determination of the time a task takes to execute may be calculated by running the task N times, dividing the elapsed time by N to get the time per iteration.
For some functions in the past, I have used values for N as large as 100 million.
EDIT: A short snippet.
LARGE_INTEGER tStart, tEnd;
LARGE_INTEGER tFreq;
double tSecsElapsed;
QueryPerformanceFrequency(&tFreq);
QueryPerformanceCounter(&tStart);
int i, n = 100;
for (i=0; i<n; i++)
{
// Do Something
}
QueryPerformanceCounter(&tEnd);
tSecsElapsed = (tEnd.QuadPart - tStart.QuadPart) / (double)tFreq.QuadPart;
double tMsElapsed = tSecElapsed * 1000;
double tMsPerIteration = tMsElapsed / (double)n;

Code execution time on modern operating systems and processors is very unpredictable. There is no scenario where you can be sure that the elapsed time actually measured the time taken by your code, your program may well have lost the processor to another process while it was executing. The caches used by the processor play a big role, code is always a lot slower when it is executed the first time when the caches do not yet contain the code and data used by the program. The memory bus is very slow compared to the processor.
It gets especially meaningless when you measure a printf() statement. The console window is owned by another process so there's a significant chunk of process interop overhead whose execution time critically depends on the state of that process. You'll suddenly see a huge difference when the console window needs to be scrolled for example. And most of all, there isn't actually anything you can do about making it faster so measuring it is only interesting for curiosity.
Profile only code that you can improve. Take many samples so you can get rid of the outliers. Never pick the lowest measurement, that just creates unrealistic expectations. Don't pick the average either, that is affected to much by the long delays that other processes can incur on your test. The median value is a good choice.

Related

Is it acceptable to measure time elapsed by iteratively blocking a thread for a fixed period and then multiply said period by the loop count?

I work for a company that produces automatic machines, and I help maintain their software that controls the machines. The software runs on a real-time operating system, and consists of multiple threads running concurrently. The code bases are legacy, and have substantial technical debts. Among all the issues that the code bases exhibit, one stands out as being rather bizarre to me; most of the timing algorithms that involve the computation of time elapsed to realize common timed features such as timeouts, delays, recording time spent in a particular state, and etc., basically take the following form:
unsigned int shouldContinue = 1;
unsigned int blockDuration = 1; // Let's say 1 millisecond.
unsigned int loopCount = 0;
unsigned int elapsedTime = 0;
while (shouldContinue)
{
.
. // a bunch of statements, selections and function calls
.
blockingSystemCall(blockDuration);
.
. // a bunch of statements, selections and function calls
.
loopCount++;
elapsedTime = loopCount * blockDuration;
}
The blockingSystemCall function can be any operating system's API that suspends the current thread for the specified blockDuration. The elapsedTime variable is subsequently computed by basically multiplying loopCount by blockDuration or by any equivalent algorithm.
To me, this kind of timing algorithm is wrong, and is not acceptable under most circumstances. All the instructions in the loop, including the condition of the loop, are executed sequentially, and each instruction requires measurable CPU time to execute. Therefore, the actual time elapsed is strictly greater than the value of elapsedTime in any given instance after the loop starts. Consequently, suppose the CPU time required to execute all the statements in the loop, denoted by d, is constant. Then, elapsedTime lags behind the actual time elapsed by loopCount • d for any loopCount > 0; that is, the deviation grows according to an arithmetic progression. This sets the lower bound of the deviation because, in reality, there will be additional delays caused by thread scheduling and time slicing, depending on other factors.
In fact, not too long ago, while testing a new data-driven predictive maintenance feature which relies on the operation time of a machine, we discovered that the operation time reported by the software lagged behind that of a standard reference clock by a whopping three hours after the machine was in continuous operation for just over two days. It was through this test that I discovered the algorithm outlined above, which I swiftly determined to be the root cause.
Coming from a background where I used to implement timing algorithms on bare-metal systems using timer interrupts, which allows the CPU to carry on with the execution of the business logic while the timer process runs in parallel, it was shocking for me to have discovered that the algorithm outlined in the introduction is used in the industry to compute elapsed time, even more so when a typical operating system already encapsulates the timer functions in the form of various easy-to-use public APIs, liberating the programmer from the hassle of configuring a timer via hardware registers, raising events via interrupt service routines, etc.
The kind of timing algorithm as illustrated in the skeleton code above is found in at least two code bases independently developed by two distinct software engineering teams from two subsidiary companies located in two different cities, albeit within the same state. This makes me wonder whether it is how things are normally done in the industry or it is just an isolated case and is not widespread.
So, the question is, is the algorithm shown above common or acceptable in calculating elapsed time, given that the underlying operating system already provides highly optimized time-management system calls that can be used right out of the box to accurately measure elapsed time or even used as basic building blocks for creating higher-level timing facilities that provide more intuitive methods similar to, e.g., the Timer class in C#?

You're right that calculating elapsed time that way is inaccurate -- since it assumes that the blocking call will take exactly the amount of time indicated, and that everything that happens outside of the blocking system call will take no time at all, which would only be true on an infinitely-fast machine. Since actual machines are not infinitely fast, the elapsed-time calculated this way will always be somewhat less than the actual elapsed time.
As to whether that's acceptable, it's going to depend on how much timing accuracy your program needs. If it's just doing a rough estimate to make sure a function doesn't run for "too long", this might be okay. OTOH if it is trying for accuracy (and in particular accuracy over a long period of time), then this approach won't provide that.
FWIW the more common (and more accurate) way to measure elapsed time would be something like this:
const unsigned int startTime = current_clock_time();
while (shouldContinue)
{
loopCount++;
elapsedTime = current_clock_time() - startTime;
}
This has the advantage of not "drifting away" from the accurate value over time, but it does assume that you have a current_clock_time() type of function available, and that it's acceptable to call it within the loop. (If current_clock_time() is very expensive, or doesn't provide some real-time performance guarantees that the calling routine requires, that might be a reason not to do it this way)

I don't think these loops do what you think they do.
In a RTOS, the purpose of a loop like this is usually to perform a task at regular intervals.
blockingSystemCall(N) probably does not just sleep for N milliseconds like you think it does. It probably sleeps until N milliseconds after the last time your thread woke up.
More accurately, all the sleeps your thread has performed since starting are added to the thread start time to get the time at which the OS will try to wake the thread up. If your thread woke up due to an I/O event, then the last one of those times could be used instead of the thread start time. The point is that the inaccuracies in all these start times are corrected, so your thread wakes up at regular intervals and the elapsed time measurement is perfectly accurate according to the RTOS master clock.
There could also be very good reasons for measuring elapsed time by the RTOS master clock instead of a more accurate wall clock time, in addition to simplicity. This is because all of the guarantees that an RTOS provides (which is the reason you are using a RTOS in the first place) are provided in that time scale. The amount of time taken by one task can affect the amount of time you are guaranteed to have available for other tasks, as measured by this clock.
It may or may not be a problem that your RTOS master clock runs slow by 3 hours every 2 days...

Clock isn't accurate

I'm assigning the value of unistd.h's clock() to two int types, as follows:
int start_time = clock();
for (i = 0; i < 1000000; i++) {
printf("%d\n", i+1);
}
int end_time = clock();
However, when I print their values, the actual time elapsed differs from the time displayed. The POSIX standard declares that CLOCKS_PER_SEC must equal one million, assuming that a clock cycle is a microsecond. Is the clock just not going the speed the standard expects, or is my loop causing some weirdness in the calculation?
I'm trying to measure the speed of different operations in a similar fashion, and an inaccurate clock ruins my experiments.

It takes several seconds to print from 1 to 1,000,000. The value assigned to end_time is usually around 900,000, which in microseconds
Your processor is fast. Your I/O is not so much.
When your code is executed, your processor will get the time using hardware and assign it to start_time. Then, it will go through the loop and put 1 million lines in the output buffer. Putting things in output buffer does not mean processor has done displaying them.
That is why you get less than a second to finish the process, but see the output for several seconds.
Edit: Just to clarify, it seems that the phrase "Putting things in output buffer does not mean processor has done displaying them" has introduced confusion. This is written in the perspective of the executing process, and does not mean that processor puts all the output in the output buffer in one go.
What actually happens (as n.m. has pointed out) is this: the clock() actually returns the process time, not the time of the day. And because the output buffer is small and processor has to wait long waits after flushing the buffer, the process time will be significantly smaller than the actual execution time. Hence, by the perspective of the process it looks like execution happens fast but display of output is very slow.

Generic Microcontroller Delay Function

Come someone please tell me how this function works? I'm using it in code and have an idea how it works, but I'm not 100% sure exactly. I understand the concept of an input variable N incrementing down, but how the heck does it work? Also, if I am using it repeatedly in my main() for different delays (different iputs for N), then do I have to "zero" the function if I used it somewhere else?
Reference: MILLISEC is a constant defined by Fcy/10000, or system clock/10000.
Thanks in advance.
// DelayNmSec() gives a 1mS to 65.5 Seconds delay
/* Note that FCY is used in the computation. Please make the necessary
Changes(PLLx4 or PLLx8 etc) to compute the right FCY as in the define
statement above. */
void DelayNmSec(unsigned int N)
{
unsigned int j;
while(N--)
for(j=0;j < MILLISEC;j++);
}

This is referred to as busy waiting, a concept that just burns some CPU cycles thus "waiting" by keeping the CPU "busy" doing empty loops. You don't need to reset the function, it will do the same if called repeatedly.
If you call it with N=3, it will repeat the while loop 3 times, every time counting with j from 0 to MILLISEC, which is supposedly a constant that depends on the CPU clock.

The original author of the code have timed and looked at the assembler generated to get the exact number of instructions executed per Millisecond, and have configured a constant MILLISEC to match that for the for loop as a busy-wait.
The input parameter N is then simply the number of milliseconds the caller want to wait and the number of times the for-loop is executed.
The code will break if
used on a different or faster micro controller (depending on how Fcy is maintained), or
the optimization level on the C compiler is changed, or
c-compiler version is changed (as it may generate different code)
so, if the guy who wrote it is clever, there may be a calibration program which defines and configures the MILLISEC constant.

This is what is known as a busy wait in which the time taken for a particular computation is used as a counter to cause a delay.
This approach does have problems in that on different processors with different speeds, the computation needs to be adjusted. Old games used this approach and I remember a simulation using this busy wait approach that targeted an old 8086 type of processor to cause an animation to move smoothly. When the game was used on a Pentium processor PC, instead of the rocket majestically rising up the screen over several seconds, the entire animation flashed before your eyes so fast that it was difficult to see what the animation was.
This sort of busy wait means that in the thread running, the thread is sitting in a computation loop counting down for the number of milliseconds. The result is that the thread does not do anything else other than counting down.
If the operating system is not a preemptive multi-tasking OS, then nothing else will run until the count down completes which may cause problems in other threads and tasks.
If the operating system is preemptive multi-tasking the resulting delays will have a variability as control is switched to some other thread for some period of time before switching back.
This approach is normally used for small pieces of software on dedicated processors where a computation has a known amount of time and where having the processor dedicated to the countdown does not impact other parts of the software. An example might be a small sensor that performs a reading to collect a data sample then does this kind of busy loop before doing the next read to collect the next data sample.

How to measure the ACTUAL execution time of a C program under Linux?

I know this question may have been commonly asked before, but it seems most of those questions are regarding the elapsed time (based on wall clock) of a piece of code. The elapsed time of a piece of code is unlikely equal to the actual execution time, as other processes may be executing during the elapsed time of the code of interest.
I used getrusage() to get the user time and system time of a process, and then calculate the actual execution time by (user time + system time). I am running my program on Ubuntu. Here are my questions:
How do I know the precision of getrusage()?
Are there other approaches that can provide higher precision than getrusage()?

You can check the real CPU time of a process on linux by utilizing the CPU Time functionality of the kernel:
#include <time.h>
clock_t start, end;
double cpu_time_used;
start = clock();
... /* Do the work. */
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
Source: http://www.gnu.org/s/hello/manual/libc/CPU-Time.html#CPU-Time
That way, you count the CPU ticks or the real amount of instructions worked upon by the CPU on the process, thus getting the real amount of work time.

The getrusage() function is the only standard/portable way that I know of to get "consumed CPU time".
There isn't a simple way to determine the precision of returned values. I'd be tempted to call the getrusage() once to get an initial value, and the call it repeatedly until the value/s returned are different from the initial value, and then assume the effective precision is the difference between the initial and final values. This is a hack (it would be possible for precision to be higher than this method determines, and the result should probably be considered a worst case estimate) but it's better than nothing.
I'd also be concerned about the accuracy of the values returned. Under some kernels I'd expect that a counter is incremented for whatever code happens to be running when a timer IRQ occurs; and therefore it's possible for a process to be very lucky (and continually block just before the timer IRQ occurs) or very unlucky (and unblock just before the timer IRQ occurs). In this case "lucky" could mean a CPU hog looks like it uses no CPU time, and "unlucky" could means a process that uses very little CPU time looks like a CPU hog.
For specific versions of specific kernels on specific architecture/s (potentially depending on if/when the kernel is compiled with specific configuration options in some cases), there may be higher precision alternatives that aren't portable and aren't standard...

You can use this piece of code :
#include <sys/time.h>
struct timeval start, end;
gettimeofday(&start, NULL);
.
.
.
gettimeofday(&end, NULL);
delta = ((end.tv_sec - start.tv_sec) * 1000000u +
end.tv_usec - start.tv_usec) / 1.e6;
printf("Time is : %f\n",delta);
It will show you the execution time for piece of your code

CPU Cycle count based profiling in C/C++ Linux x86_64

I am using the following code to profile my operations to optimize on cpu cycles taken in my functions.
static __inline__ unsigned long GetCC(void)
{
unsigned a, d;
asm volatile("rdtsc" : "=a" (a), "=d" (d));
return ((unsigned long)a) | (((unsigned long)d) << 32);
}
I don't think it is the best since even two consecutive calls gives me a difference of "33".
Any suggestions ?

I personally think the rdtsc instruction is great and usable for a variety of tasks. I do not think that using cpuid is necessary to prepare for rdtsc. Here is how I reason around rdtsc:
Since I use the Watcom compiler I've implemented rdtsc using "#pragma aux" which means that the C compiler will generate the instruction inline, expect the result in edx:eax and also inform its optimizer that the contents of eax and edx have been modified. This is a huge improvement from traditional _asm implementations where the optimizer would stay away from optimizing in _asm's vicinity. I've also implemented a divide_U8_by_U4 using "#pragma aux" so that I won't need to call a lib function when I convert clock_cycles to us or ms.
Every execution of rdtsc will result in some overhead (A LOT more if it is encapsulated as in the author's example) which must be taken more into account the shorter the sequence to measure is. Generally I don't time shorter sequences than 1/30 of the internal clock frequency which usually works out to 1/10^8 seconds (3 GHZ internal clock). I use such measurements as indications, not fact. Knowing this I can leave out cpuid. The more times I measure, the closer to fact I will get.
To measure reliably I would use the 1/100 - 1/300 range i/e 0.03 - 0.1 us. In this range the additional accuracy of using cpuid is practically insignificant. I use this range for short sequence timing. This is my "non-standard" unit since it is dependent on the CPU's internal clock frequency. For example on a 1 GHz machine I would not use 0.03 us because that would put me outside the 1/100 limit and my readings would become indications. Here I would use 0.1 us as the shortest time measurement unit. 1/300 would not be used since it would be too close to 1 us (see below) to make any significant difference.
For even longer processing sequences I divide the difference between two rdtsc reading with say 3000 (for 3 GHz) and will convert the elapsed clock cycles to us. Actually I use (diff+1500)/3000 where 1500 is half of 3000. For I/O waits I use milliseconds => (diff+1500000)/3000000. These are my "standard" units. I very seldom use seconds.
Sometimes I get unexpectedly slow results and then I must ask myself: is this due to an interrupt or to the code? I measure a few more times to see if it was, indeed, an interrupt. In that case ... well interrupts happen all the time in the real world. If my sequence is short then there is a good possibility that the next measurement won't be interrupted. If the sequence is longer interrupts will occur more often and there isn't much I can do about it.
Measuring long elapsed times very accurately (hour and longer ETs in us or lower) will increase the risk of getting a division exception in divide_U8_by_U4, so I think through when to use us and when to use ms.
I also have code for basic statistics. Using this I log min and max values and I can calculate mean and standard deviation. This code is non-trivial so its own ET must be subtracted from the measured ETs.
If the compiler is doing extensive optimizations and your readings are stored in local variables the compiler may determine ("correctly") that the code can be omitted. One way to avoid this is to store the results in public (non-static, non-stack-based) variables.
Programs running in real-world conditions should be measured in real-world conditions, there's no way around that.
As to the question of time stamp counter being accurate I would say that assuming the tsc on different cores are synchronized (which is the norm) there is the problem of CPU throttling during periods of low activity to reduce energy consumption. It is always possible to inhibit the functionality when testing. If you're executing an instruction at 1 GHz or at 10 Mhz on the same processor the elapsed cycle count will be the same even though the former completed in 1% of the time compred to the latter.

Trying to count the cycles of an individual execution of a function is not really the right way to go. The fact that your process can be interrupted at any time, along with delays caused by cache misses and branch mispredictions means that there can be considerable deviation in the number of cycles taken from call to call.
The right way is either:
Count the number of cycles or CPU time (with clock()) taken for a large number of calls to the function, then average them; or
Use a cycle-level emulating profiler like Callgrind / kcachegrind.
By the way, you need to execute a serialising instruction before RDTSC. Typically CPUID is used.

You are on the right track1, but you need to do two things:
Run cpuid instruction before rdtsc to flush the CPU pipeline (makes measurement more reliable). As far as I recall it clobbers registers from eax to edx.
Measure real time. There is a lot more to execution time, than just CPU cycles (locking contention, context switches and other overhead you don't control). Calibrate TSC ticks with real time. You can do it in a simple loop that takes differences in measurements of, say, gettimeofday (Linux, since you didn't mentioned the platform) calls and rdtsc output. Then you can tell how much time each TSC tick takes. Another consideration is synchronization of TSC across CPUs, because each core may have its own counter. In Linux you can see it in /proc/cpuinfo, your CPU should have a constant_tsc flag. Most newer Intel CPUs I've seen have this flag.
1I have personally found rdtsc to be more accurate than system calls like gettimeofday() for fine-grained measurements.

Another thing you might need to worry about is if you are running on a multi-core machine the program could be moved to a different core, which will have a different rdtsc counter. You may be able to pin the process to one core via a system call, though.
If I were trying to measure something like this I would probably record the time stamps to an array and then come back and examine this array after the code being benchmarked had completed. When you are examining the data recorded to the array of timestamps you should keep in mind that this array will rely on the CPU cache (and possibly paging if your array is big), but you could prefetch or just keep that in mind as you analyze the data. You should see a very regular time delta between time stamps, but with several spikes and possibly a few dips (probably from getting moved to a different core). The regular time delta is probably your best measurement, since it suggests that no outside events effected those measurements.
That being said, if the code you are benchmarking has irregular memory access patterns or run times or relies on system calls (especially IO related ones) then you will have a difficult time separating the noise from the data you are interested in.

The TSC isn't a good measure of time. The only guarantee that the CPU makes about the TSC is that it rises monotonically (that is, if you RDTSC once and then do it again, the second one will return a result that is higher than the first) and that it will take it a very long time to wraparound.

Do I understand correctly that the reason you do this is to bracket other code with it so you can measure how long the other code takes?
I'm sure you know another good way to do that is just loop the other code 10^6 times, stopwatch it, and call it microseconds.
Once you've measured the other code, am I correct to assume you want to know which lines in it are worth optimizing, so as to reduce the time it takes?
If so, you're on well-trod ground. You could use a tool like Zoom or LTProf. Here's my favorite method.

Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross architecture wrapper for performance events.
This answer is basically the same as the one for this C++ question: How to get the CPU cycle count in x86_64 from C++? see that answer for more details.
perf_event_open.c
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight