I am profiling some code on three different computers with three different frequencies. I need the frequencies to measure GFLOPs/s. I have some code which does this but it does not account for Turboboost. For example on my 2600k CPU it reports 3.4 GHz but I can see when I run CPUz that my CPU is running at 4.3 GHz (overclocked) for my code which uses all cores.
#include "stdint.h"
#include "stdio.h"
#include "omp.h"
int main() {
int64_t cycles = rdtsc(); double dtime = omp_get_wtime();
//run some code which uses all cores for a while (few ms)
dtime = omp_get_wtime() - dtime;
cycles = rdtsc() - cycles;
double freq = (double)cycles/dtime*1E-9;
printf("freq %.2f GHz\n", freq);
}
__int64 rdtsc() {
#ifdef _WIN32
return __rdtsc();
#else
uint64_t t;
asm volatile ("rdtsc" : "=A"(t));
return t;
#endif
}
I know this question has been asked various times with various answers, but it's still not clear to me whether this can be done. I don't care about hackers trying to change timers; this code is only for myself. Is it possible to get the actual frequency in code? How is this done on Linux? Every example I have found on Linux gives the base frequency (or maybe the max), not the operating frequency under load the way CPUz does.
Edit:
I found a program, Powertop, for Linux which appears to show the actual operating frequency. Since the source code is available maybe it's possible to figure out how to get the actual frequency in my own code.
I finally solved this problem. It is possible to measure the actual operating frequency in code without needing device drivers or reading special counters.
Basically, you time a loop over an operation with a loop-carried dependency which always takes the same latency. For example:
    for (int i = 0; i < spinCount; i++) {
        x = _mm_add_ps(x, _mm_set1_ps(1.0f));
    }
You run this loop in threads bound to each physical (not logical) core. This requires that no other threads in the system besides these take any significant CPU time, so the method won't always give the correct answer, but in my case it works quite well. I get results that deviate less than 0.5% from the correct turbo frequency, for one thread and for many threads, on Nehalem, Ivy Bridge, and Haswell, on single-socket and multi-socket systems. I described the details at how-can-i-programmatically-find-the-cpu-frequency-with-c, so I won't repeat them all here; a sketch of the idea follows.
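Below is a minimal sketch of the technique, not the exact code from the linked answer. ADD_LATENCY is an assumption (3 cycles for addps on the microarchitectures mentioned above), SPIN_COUNT is arbitrary, and pinning threads to physical cores (e.g. via OMP_PROC_BIND/OMP_PLACES) is left out; compile with optimization and OpenMP enabled.
#include <immintrin.h>
#include <omp.h>
#include <stdio.h>

#define ADD_LATENCY 3          // assumed latency of _mm_add_ps in cycles
#define SPIN_COUNT  100000000

int main(void) {
    #pragma omp parallel
    {
        __m128 x = _mm_set1_ps(0.0f);
        double t = omp_get_wtime();
        for (int i = 0; i < SPIN_COUNT; i++)
            x = _mm_add_ps(x, _mm_set1_ps(1.0f));  // loop-carried dependency
        t = omp_get_wtime() - t;
        volatile float sink = _mm_cvtss_f32(x);    // keep the chain from being optimized away
        (void)sink;
        double ghz = (double)SPIN_COUNT * ADD_LATENCY / t * 1e-9;
        #pragma omp critical
        printf("thread %d: ~%.2f GHz\n", omp_get_thread_num(), ghz);
    }
    return 0;
}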
Related
What's the relationship between the real CPU frequency and the clock_t (the unit is clock tick) in C?
Let's say I have the below piece of C code which measures the time that CPU consumed for running a for loop.
But since CLOCKS_PER_SEC is a constant value (1,000,000 in practice) in the C standard library, I wonder how the clock function measures the real CPU cycles consumed by the program while it runs on different computers with different CPU frequencies (for my laptop, it is 2.6 GHz).
And if they are not relevant, how does the CPU timer work in the mentioned scenario?
#include <time.h>
#include <stdio.h>

int main(void) {
    clock_t start_time = clock();
    for (int i = 0; i < 10000; i++) {}
    clock_t end_time = clock();
    printf("%fs\n", (double)(end_time - start_time) / CLOCKS_PER_SEC);
    return 0;
}
Effectively, clock_t values are unrelated to the CPU frequency.
See longer explanation here.
While clock_t-type values could, in theory, have represented actual physical CPU clock ticks, in practice they do not: POSIX mandates that CLOCKS_PER_SEC be equal to 1,000,000 (one million). Thus clock() returns a value in microseconds.
There is no such thing as "the real CPU frequency". Not in everyone's laptop, at any rate.
On many systems, the OS can lower and raise the CPU clock speed as it sees fit. On some systems there is more than one kind of central processor or core, each with a different speed. Some CPUs are clockless (asynchronous).
Because of all this and for other reasons, most computers measure time with a separate clock device, independent from the CPU clock (if any).
For providing the information used in the shown code, measuring/knowing/using the CPU cycles is not relevant.
For providing the elapsed time, it is only necessary to measure the time.
Reading a hardware timer would be one way to do so.
Most computers (even non-embedded ones) do contain timers which are especially counting ticks of a clock with known constant frequency. (They are specifically not "CPU timers".)
Such a timer can be read and yields a value which increases once per tick (of constant period). Here "known period" means a period known to some appropriate driver for that timer; simplified: "known to the clock() function, not necessarily known to you".
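As a concrete illustration (a minimal Linux sketch, not something the C standard itself mandates), reading such a constant-rate timer from user space can look like this; CLOCK_MONOTONIC is backed by whatever clocksource the kernel has selected:
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... work to be timed ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %.9f s\n", elapsed);
    return 0;
}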
Note that even if the number of used CPU cycles were known, calculating the elapsed time from that info is near impossible nowadays, in the presence of:
pipelines
parallelisms
interrupts
branch prediction
More things influencing/preventing the calculation, from comment contribution:
frequency-scaling, temperature throttling and power settings
(David C. Rankin)
#include <windows.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

// assuming we return times with microsecond resolution
#define STOPWATCH_TICKS_PER_US 1

uint64_t GetStopWatch()
{
    LARGE_INTEGER t, freq;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&freq);
    return (uint64_t)(t.QuadPart / (double)freq.QuadPart * 1000000);
}

void task()
{
    printf("hi\n");
}

int main()
{
    uint64_t start = GetStopWatch();
    task();
    uint64_t stop = GetStopWatch();
    printf("Elapsed time (microseconds): %" PRIu64 "\n", stop - start);
}
The code above uses QueryPerformanceCounter, which retrieves the current value of the high-resolution performance counter, and QueryPerformanceFrequency, which retrieves the frequency of that counter. When I call the task() function multiple times, the difference between the start and stop times varies, but I would expect roughly the same time difference for each call. Could anyone help me identify the mistake in the above code?
The thing is, Windows is a pre-emptive multi-tasking operating system. What the hell does that mean, you ask?
'Simple' - Windows allocates time-slices to each of the running processes in the system. This gives the illusion of dozens or hundreds of processes running in parallel. In reality, you are limited to 2, 4, 8 or perhaps 16 parallel processes in a typical desktop/laptop. An Intel i3 has 2 physical cores, each of which can give the impression of doing two things at once. (In reality, there are hardware tricks going on that switch execution between the two threads that each core can handle at once.) This is in addition to the software context switching that Windows/Linux/MacOSX do.
These time-slices are not guaranteed to be of the same duration each time. You may find the pc does a sync with windows.time to update your clock, you may find that the virus-scanner decides to begin working, or any one of a number of other things. All of these events may occur after your task() function has begun, yet before it ends.
In the DOS days, you'd get very nearly the same result each and every time you timed a single iteration of task(). Though, thanks to TSR programs, you could still find an interrupt was fired and some machine-time stolen during execution.
It is for just these reasons that a more accurate measurement of the time a task takes to execute is obtained by running the task N times and dividing the elapsed time by N to get the time per iteration.
For some functions in the past, I have used values for N as large as 100 million.
EDIT: A short snippet.
LARGE_INTEGER tStart, tEnd;
LARGE_INTEGER tFreq;
double tSecsElapsed;

QueryPerformanceFrequency(&tFreq);
QueryPerformanceCounter(&tStart);

int i, n = 100;
for (i = 0; i < n; i++)
{
    // Do Something
}

QueryPerformanceCounter(&tEnd);

tSecsElapsed = (tEnd.QuadPart - tStart.QuadPart) / (double)tFreq.QuadPart;
double tMsElapsed = tSecsElapsed * 1000;
double tMsPerIteration = tMsElapsed / (double)n;
Code execution time on modern operating systems and processors is very unpredictable. There is no scenario where you can be sure that the elapsed time actually measures the time taken by your code; your program may well have lost the processor to another process while it was executing. The caches used by the processor play a big role: code is always a lot slower the first time it is executed, when the caches do not yet contain the code and data used by the program. The memory bus is very slow compared to the processor.
It gets especially meaningless when you measure a printf() statement. The console window is owned by another process so there's a significant chunk of process interop overhead whose execution time critically depends on the state of that process. You'll suddenly see a huge difference when the console window needs to be scrolled for example. And most of all, there isn't actually anything you can do about making it faster so measuring it is only interesting for curiosity.
Profile only code that you can improve. Take many samples so you can get rid of the outliers. Never pick the lowest measurement; that just creates unrealistic expectations. Don't pick the average either; it is affected too much by the long delays that other processes can impose on your test. The median value is a good choice.
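A rough sketch of that advice follows; median_of_samples is a made-up helper name, and the handful of sample values stands in for real measurements you would collect (in microseconds here):
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double median_of_samples(double *samples, int n)
{
    qsort(samples, n, sizeof(double), cmp_double);
    return (n % 2) ? samples[n / 2]
                   : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

int main(void)
{
    // stand-in values; in practice fill this from repeated measurements
    double samples[] = { 12.1, 11.8, 55.0, 12.0, 12.3 };
    int n = sizeof samples / sizeof samples[0];
    printf("median: %.1f us\n", median_of_samples(samples, n));
    return 0;
}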
I'm working on a timing system and I'll implement a timer class.
#include <windows.h>
#include <stdio.h>
#include <time.h>

int main()
{
    clock_t t1, t2;
    t1 = clock();
    Sleep(10);
    t2 = clock();
    printf("%i\n", (int)(t2 - t1));
    return 0;
}
This program should print "10", but it prints "15" or "16". I need better accuracy, finer than 1 ms! Suggestions? (Maybe with select()'s timeout?)
NOTE: I've run this program on Windows 7 Ultimate x86. Program compiled with MinGW (C/C++) x86.
Sleep() is accurate to the operating system's clock interrupt rate, which by default on Windows ticks 64 times per second, or once every 15.625 msec, as you found out.
You can increase that rate by calling timeBeginPeriod(10); use timeEndPeriod(10) when you're done. You are still subject to normal thread scheduling latencies, so you still don't have a guarantee that your thread will resume running after 10 msec, and it won't when the machine is heavily loaded. Use SetThreadPriority() to boost the priority, increasing the odds that it will.
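A minimal sketch of wrapping the question's Sleep() call with that pattern (link against winmm.lib; the period value follows the suggestion above and affects the whole system while it is active):
#include <windows.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    timeBeginPeriod(10);   // request 10 ms timer granularity (1 is the finest commonly available)
    clock_t t1 = clock();
    Sleep(10);
    clock_t t2 = clock();
    timeEndPeriod(10);     // restore the default rate
    printf("%i\n", (int)(t2 - t1));
    return 0;
}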
Your problem is that the "clock" only ticks 64 times per second (I've seen systems that can go up to 2000, but no higher). If you are creating a timer class, you probably want to have much higher resolution than even 2000. Use QueryPerformanceCounter to get higher resolution. See QueryPerformanceCounter and overflows for an example.
Note that if you want to sleep for very short intervals, you will have to call QueryPerformanceCounter in a tight loop.
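For example, a busy-wait along these lines is one way to wait for sub-millisecond intervals (a sketch; spin_wait_us is a made-up helper name, and it burns a full core while it waits):
#include <windows.h>
#include <stdint.h>

void spin_wait_us(int64_t us)
{
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    int64_t target = start.QuadPart + (us * freq.QuadPart) / 1000000;
    do {
        QueryPerformanceCounter(&now);
    } while (now.QuadPart < target);
}

int main(void)
{
    spin_wait_us(500);   // wait roughly half a millisecond
    return 0;
}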
Sleep() is not accurate in the way you want it to be.
It will cause the thread to sleep for AT LEAST the length of time you specify, but there's no guarantee that the OS will return control to your thread at exactly that specified time.
I need to do precision timing to the 1 us level to time a change in duty cycle of a pwm wave.
Background
I am using a Gumstix Overo Water COM (https://www.gumstix.com/store/app.php/products/265/) that has a single-core ARM Cortex-A8 processor running at 499.92 BogoMIPS (the Gumstix page claims up to 1 GHz, with 800 MHz recommended) according to /proc/cpuinfo. The OS is an Angstrom image of Linux based on kernel version 2.6.34, and it is stock on the Gumstix Water COM.
The Problem
I have done a fair amount of reading about precise timing in Linux (and have tried most of it), and the consensus seems to be that using clock_gettime() with CLOCK_MONOTONIC is the best way to do it. (I would have liked to use the RDTSC register for timing, since I have one core with minimal power-saving abilities, but this is not an Intel processor.) So here is the odd part: while clock_getres() returns 1, suggesting a resolution of 1 ns, actual timing tests suggest a minimum resolution of 30517 ns, or (it can't be a coincidence) exactly the time between ticks of a 32.768 kHz clock. Here's what I mean:
// Stackoverflow example
#include <stdio.h>
#include <time.h>

#define SEC2NANOSEC 1000000000

int main(int argc, const char* argv[])
{
    // //////////////// Min resolution test //////////////////////
    struct timespec resStart, resEnd, ts;
    ts.tv_sec  = 0;  // s
    ts.tv_nsec = 1;  // ns
    int iters = 100;
    double resTime, sum = 0;
    int i;
    for (i = 0; i < iters; i++)
    {
        clock_gettime(CLOCK_MONOTONIC, &resStart);      // start timer
        // clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, &ts);
        clock_gettime(CLOCK_MONOTONIC, &resEnd);        // end timer
        resTime = ((double)resEnd.tv_sec*SEC2NANOSEC + (double)resEnd.tv_nsec)
                - ((double)resStart.tv_sec*SEC2NANOSEC + (double)resStart.tv_nsec);
        sum = sum + resTime;
        printf("resTime = %f\n", resTime);
    }
    printf("Average = %f\n", sum/(double)iters);
}
(Don't fret over the double casting; tv_sec is a time_t and tv_nsec is a long.)
Compile with:
gcc soExample.c -o runSOExample -lrt
Run with:
./runSOExample
With the nanosleep commented out as shown, the result is either 0 ns or 30517 ns, with the majority being 0 ns. This leads me to believe that CLOCK_MONOTONIC is updated at 32.768 kHz, and most of the time the clock has not been updated before the second clock_gettime() call is made; in cases where the result is 30517 ns, the clock has been updated between the calls.
When I do the same thing on my development computer (AMD FX(tm)-6100 Six-Core Processor running at 1.4 GHz) the minimum delay is a more constant 149-151ns with no zeros.
So, let's compare those results to the CPU speeds. For the Gumstix, that 30517 ns (32.768 kHz) equates to 15298 cycles of the 499.93 MHz CPU. For my dev computer, that 150 ns equates to 210 cycles of the 1.4 GHz CPU.
With the clock_nanosleep() call uncommented the average results are these:
Gumstix: Avg value = 213623 and the result varies, up and down, by multiples of that min resolution of 30517ns
Dev computer: 57710-68065 ns with no clear trend. In the case of the dev computer I expect the resolution to actually be at the 1 ns level and the measured ~150ns truly is the time elapsed between the two clock_gettime() calls.
So, my questions are these:
What determines that minimum resolution?
Why is the resolution of the dev computer 30000X better than the Gumstix when the processor is only running ~2.6X faster?
Is there a way to change how often CLOCK_MONOTONIC is updated and where? In the kernel?
Thanks! If you need more info or clarification just ask.
As I understand it, the difference between the two environments (the Gumstix and your dev computer) might be the underlying timer hardware they are using.
Commented nanosleep() case:
You are using clock_gettime() twice. To give you a rough idea of what clock_gettime() ultimately gets mapped to in the kernel:
clock_gettime --> clock_get() --> posix_ktime_get_ts --> ktime_get_ts() --> timekeeping_get_ns() --> clock->read()
clock->read() basically reads the value of the counter provided by the underlying timer driver and the corresponding hardware. Taking the difference between the current counter value and a previously stored value, and then doing the nanoseconds-conversion arithmetic, yields the nanoseconds elapsed and updates the timekeeping data structures in the kernel.
For example, if you have an HPET timer which gives you a 10 MHz clock, the hardware counter is updated at 100 ns intervals.
Let's say, on the first clock->read(), you get a counter value of X.
The Linux timekeeping data structures read this value X, get the difference 'D' compared to some previously stored counter value, do the counter-difference-'D'-to-nanoseconds-'n' conversion arithmetic, and update the data structures by 'n'.
This new time value is yielded to user space.
When the second clock->read() is issued, it again reads the counter and updates the time.
Now, for an HPET timer, this counter is updated every 100 ns, and hence you will see that difference being reported to user space.
Now, let's replace the HPET timer with a slow 32.768 kHz clock. The counter read by clock->read() now advances only every 30517 ns, so if your second call to clock_gettime() lands before that period has elapsed you get 0 (which is the majority of cases), and in some cases the second call lands after the counter has incremented by 1, i.e. 30517 ns have elapsed. Hence the occasional value of 30517 ns.
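A toy sketch of that counter-to-nanoseconds arithmetic (the kernel actually uses precomputed mult/shift factors rather than a plain division, so this only illustrates the idea):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t timer_hz = 32768;                              // 32.768 kHz clocksource
    uint64_t delta    = 1;                                  // counter ticks between two reads
    uint64_t ns       = delta * 1000000000ULL / timer_hz;   // ~30517 ns per tick
    printf("%llu ns\n", (unsigned long long)ns);
    return 0;
}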
Uncommented Nanosleep() case:
Let's trace the clock_nanosleep() for monotonic clocks:
clock_nanosleep() -->nsleep --> common_nsleep() -->hrtimer_nanosleep() -->do_nanosleep()
do_nanosleep() will simply put the current task in the INTERRUPTIBLE state, wait for the timer to expire (which is 1 ns here) and then set the current task in the RUNNING state again. You see, there are a lot of factors involved now, mainly when your kernel thread (and hence the user-space process) will be scheduled again. Depending on your OS, you will always face some latency when doing a context switch, and this is what we observe in the average values.
Now Your questions:
What determines that minimum resolution?
I think the resolution/precision of your system will depend on the underlying timer hardware being used (assuming your OS is able to provide that precision to the user-space process).
Why is the resolution of the dev computer 30000X better than the Gumstix when the processor is only running ~2.6X faster?
Sorry, I missed you here. How is it 30000x faster? To me, it looks like something 200x faster (30714 ns / 150 ns ~ 200X?). But anyway, as I understand it, CPU speed may or may not have anything to do with the timer resolution/precision. So this assumption may be right on some architectures (when you are using the TSC hardware), though it might fail on others (using HPET, PIT, etc.).
Is there a way to change how often CLOCK_MONOTONIC is updated and where? In the kernel?
You can always look into the kernel code for details (that's how I looked into it).
In the Linux kernel code, look at these source files and documentation:
kernel/posix-timers.c
kernel/hrtimer.c
Documentation/timers/hrtimers.txt
I do not have a Gumstix on hand, but it looks like your clocksource is slow.
run:
$ dmesg | grep clocksource
If you get back
[ 0.560455] Switching to clocksource 32k_counter
This might explain why your clock is so slow.
In recent kernels there is a directory /sys/devices/system/clocksource/clocksource0 with two files: available_clocksource and current_clocksource. If you have this directory, try switching to a different source by echoing its name into the second file.
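For example, a small C sketch of reading those sysfs files and (as root) switching the source; "hpet" is only a placeholder name, use one of the sources your system actually lists:
#include <stdio.h>

int main(void)
{
    char buf[256];
    FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/available_clocksource", "r");
    if (f) {
        if (fgets(buf, sizeof buf, f))
            printf("available: %s", buf);
        fclose(f);
    }
    f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "w");
    if (f) {
        fputs("hpet\n", f);   // placeholder name; requires root
        fclose(f);
    }
    return 0;
}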
I am using the following code to profile my operations in order to optimize the CPU cycles taken in my functions.
static __inline__ unsigned long GetCC(void)
{
    unsigned a, d;
    asm volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}
I don't think it is the best, since even two consecutive calls give me a difference of "33".
Any suggestions ?
I personally think the rdtsc instruction is great and usable for a variety of tasks. I do not think that using cpuid is necessary to prepare for rdtsc. Here is how I reason around rdtsc:
Since I use the Watcom compiler I've implemented rdtsc using "#pragma aux" which means that the C compiler will generate the instruction inline, expect the result in edx:eax and also inform its optimizer that the contents of eax and edx have been modified. This is a huge improvement from traditional _asm implementations where the optimizer would stay away from optimizing in _asm's vicinity. I've also implemented a divide_U8_by_U4 using "#pragma aux" so that I won't need to call a lib function when I convert clock_cycles to us or ms.
Every execution of rdtsc incurs some overhead (A LOT more if it is encapsulated as in the author's example), which matters more the shorter the measured sequence is. Generally I don't time sequences shorter than 1/30 of the internal clock frequency, which usually works out to 1/10^8 seconds (3 GHz internal clock). I use such measurements as indications, not fact. Knowing this, I can leave out cpuid. The more times I measure, the closer to fact I will get.
To measure reliably I would use the 1/100 - 1/300 range, i.e. 0.03 - 0.1 us. In this range the additional accuracy of using cpuid is practically insignificant. I use this range for short-sequence timing. This is my "non-standard" unit since it is dependent on the CPU's internal clock frequency. For example, on a 1 GHz machine I would not use 0.03 us because that would put me outside the 1/100 limit and my readings would become indications. Here I would use 0.1 us as the shortest time measurement unit. 1/300 would not be used since it would be too close to 1 us (see below) to make any significant difference.
For even longer processing sequences I divide the difference between two rdtsc reading with say 3000 (for 3 GHz) and will convert the elapsed clock cycles to us. Actually I use (diff+1500)/3000 where 1500 is half of 3000. For I/O waits I use milliseconds => (diff+1500000)/3000000. These are my "standard" units. I very seldom use seconds.
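A small sketch of that rounded conversion, assuming the 3 GHz clock mentioned above (the function names are mine, not the original divide_U8_by_U4):
#include <stdint.h>
#include <stdio.h>

static uint64_t cycles_to_us(uint64_t diff)
{
    return (diff + 1500) / 3000;         // +1500 rounds to the nearest microsecond
}

static uint64_t cycles_to_ms(uint64_t diff)
{
    return (diff + 1500000) / 3000000;   // same idea at millisecond scale
}

int main(void)
{
    printf("%llu us\n", (unsigned long long)cycles_to_us(4500));      // -> 2
    printf("%llu ms\n", (unsigned long long)cycles_to_ms(7500000));   // -> 3
    return 0;
}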
Sometimes I get unexpectedly slow results and then I must ask myself: is this due to an interrupt or to the code? I measure a few more times to see if it was, indeed, an interrupt. In that case ... well interrupts happen all the time in the real world. If my sequence is short then there is a good possibility that the next measurement won't be interrupted. If the sequence is longer interrupts will occur more often and there isn't much I can do about it.
Measuring long elapsed times very accurately (hour and longer ETs in us or lower) will increase the risk of getting a division exception in divide_U8_by_U4, so I think through when to use us and when to use ms.
I also have code for basic statistics. Using this I log min and max values and I can calculate mean and standard deviation. This code is non-trivial so its own ET must be subtracted from the measured ETs.
If the compiler is doing extensive optimizations and your readings are stored in local variables the compiler may determine ("correctly") that the code can be omitted. One way to avoid this is to store the results in public (non-static, non-stack-based) variables.
Programs running in real-world conditions should be measured in real-world conditions, there's no way around that.
As to the question of the time stamp counter being accurate, I would say that, assuming the TSCs on different cores are synchronized (which is the norm), there is the problem of CPU throttling during periods of low activity to reduce energy consumption. It is always possible to inhibit that functionality when testing. If you're executing an instruction at 1 GHz or at 10 MHz on the same processor, the elapsed cycle count will be the same, even though the former completed in 1% of the time compared to the latter.
Trying to count the cycles of an individual execution of a function is not really the right way to go. The fact that your process can be interrupted at any time, along with delays caused by cache misses and branch mispredictions means that there can be considerable deviation in the number of cycles taken from call to call.
The right way is either:
Count the number of cycles or CPU time (with clock()) taken for a large number of calls to the function, then average them; or
Use a cycle-level emulating profiler like Callgrind / kcachegrind.
By the way, you need to execute a serialising instruction before RDTSC. Typically CPUID is used.
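A minimal GCC/Clang x86-64 sketch of that pattern (CPUID to serialize, then RDTSC; an LFENCE-based fence is a common alternative on newer CPUs):
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xor %%eax, %%eax\n\t"
                      "cpuid\n\t"              // serialize the pipeline
                      "rdtsc"                  // then read the TSC
                      : "=a"(lo), "=d"(hi)
                      :
                      : "rbx", "rcx");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = rdtsc_serialized();
    /* ... code to measure ... */
    uint64_t t1 = rdtsc_serialized();
    printf("%llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}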
You are on the right track¹, but you need to do two things:
Run the cpuid instruction before rdtsc to flush the CPU pipeline (this makes the measurement more reliable). As far as I recall, it clobbers registers from eax to edx.
Measure real time. There is a lot more to execution time than just CPU cycles (locking contention, context switches and other overhead you don't control). Calibrate TSC ticks against real time. You can do it in a simple loop that takes differences between measurements of, say, gettimeofday (Linux, since you didn't mention the platform) calls and rdtsc output; then you can tell how much time each TSC tick takes (see the calibration sketch below). Another consideration is synchronization of the TSC across CPUs, because each core may have its own counter. In Linux you can check /proc/cpuinfo; your CPU should have a constant_tsc flag. Most newer Intel CPUs I've seen have this flag.
¹ I have personally found rdtsc to be more accurate than system calls like gettimeofday() for fine-grained measurements.
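Below is a minimal calibration sketch along those lines (Linux, x86-64, GCC/Clang); it uses clock_gettime(CLOCK_MONOTONIC) rather than gettimeofday for the wall-clock side, assumes a constant_tsc CPU, and is not the answerer's exact code:
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   // __rdtsc()

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();

    // busy-wait roughly 100 ms of wall time
    long long elapsed_ns;
    do {
        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
    } while (elapsed_ns < 100000000LL);

    uint64_t c1 = __rdtsc();
    double ns_per_tick = (double)elapsed_ns / (double)(c1 - c0);
    printf("%.4f ns per TSC tick (~%.3f GHz)\n", ns_per_tick, 1.0 / ns_per_tick);
    return 0;
}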
Another thing you might need to worry about is if you are running on a multi-core machine the program could be moved to a different core, which will have a different rdtsc counter. You may be able to pin the process to one core via a system call, though.
If I were trying to measure something like this, I would probably record the time stamps to an array and then come back and examine that array after the code being benchmarked had completed. When you examine the data recorded to the array of timestamps, keep in mind that the array itself will rely on the CPU cache (and possibly paging if your array is big), but you can prefetch or just keep that in mind as you analyze the data. You should see a very regular time delta between time stamps, but with several spikes and possibly a few dips (probably from getting moved to a different core). The regular time delta is probably your best measurement, since it suggests that no outside events affected those measurements.
That being said, if the code you are benchmarking has irregular memory access patterns or run times or relies on system calls (especially IO related ones) then you will have a difficult time separating the noise from the data you are interested in.
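A rough sketch of that record-then-analyze idea (GCC/Clang on x86-64; the workload is a placeholder and N is arbitrary):
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtsc()

#define N 1024

int main(void)
{
    static uint64_t stamps[N];
    for (int i = 0; i < N; i++) {
        stamps[i] = __rdtsc();
        // ... the code being benchmarked would go here ...
    }
    // examine the deltas afterwards, looking for the regular baseline
    for (int i = 1; i < N; i++)
        printf("%d: %llu cycles\n", i, (unsigned long long)(stamps[i] - stamps[i - 1]));
    return 0;
}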
The TSC isn't a good measure of time. The only guarantees the CPU makes about the TSC are that it rises monotonically (that is, if you RDTSC once and then do it again, the second one will return a result that is higher than the first) and that it will take a very long time to wrap around.
Do I understand correctly that the reason you do this is to bracket other code with it so you can measure how long the other code takes?
I'm sure you know another good way to do that is just loop the other code 10^6 times, stopwatch it, and call it microseconds.
Once you've measured the other code, am I correct to assume you want to know which lines in it are worth optimizing, so as to reduce the time it takes?
If so, you're on well-trod ground. You could use a tool like Zoom or LTProf. Here's my favorite method.
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross-architecture wrapper for performance events.
This answer is basically the same as the one for this C++ question: How to get the CPU cycle count in x86_64 from C++? see that answer for more details.
perf_event_open.c
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;
    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    uint64_t n;

    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("%lld\n", count);

    close(fd);
}