Measure CPU frequency (x86 / x64) - c

I'm looking for some kind of a library that gives me accurate CPU frequency values periodically on both Intel and AMD processors, on 32-bit and 64-bit Windows.
The purpose of this is to accuratly measure CPU load on a given computer. The problem is that calling QueryPerformanceCounter() returns clock ticks (used to measure the duration of an activity) but the underlying CPU frequency is not constant because of SpeedStep or TurboBoost. I've found several computers where turning off SpeedStep / TurboBoost in the BIOS and doesn't prevent CPU frequency scaling based on load.
I'm trying to see if there are any libraries available that could be used to detect CPU frequency changes (much like how Throttlestop / CPU-Z or even the Overview tab of Resource Monitor in Windows 7) so that I could query and save this information along with my other measurements. Performance counters don't seem to return reliable information, as I have computers that always return 100% CPU frequency, even when other tools show dynamic frequency changes.
I searched for such libraries but most results come back with gadgets, etc., that are not useful.

You can combine a high-resolution timer with a clock cycle counter to compute the current clock rate. On modern CPUs, the cycle counter can be read with this function:
static inline uint64_t get_cycles()
{
uint64_t t;
asm volatile ("rdtsc" : "=A"(t));
return t;
}
Note that this is per CPU, so if your program gets moved around CPUs, you're in trouble. If you know about CPU pinning techniques on your platform, you might like to try those.
For high resolution time measurement, you can use the tools in <chrono>; here's a semi-useful post of mine on the topic.

Try to focus on what you are trying to do, and not on how to do it.
What is your ultimate goal?
If, as you say, you are trying to "measure CPU load on a given computer", on Windows it may be a good practice using "PdhOpenQuery" and the "Pdh*" family functions.
See this SO answer as well:
How to determine CPU and memory consumption from inside a process?

Consider looking at the __rdtsc intrinsic function (#include "intrin.h" in Visual Studio).
This yields the clock count directly from the processor via the x86/x64 function RDTSC (Read Timestamp).

Related

How to use perf or other utilities to measure the elapsed time of a loop/function

I am working on a project requiring profiling the target applications at first.
What I want to know is the exact time consumed by a loop body/function. The platform is BeagleBone Black board with Debian OS and installed perf_4.9.
gettimeofday() can provide a micro-second resolution but I still want more accurate results. It looks perf can give cycles statistics and thus be a good fit for purposes. However, perf can only analyze the whole application instead of individual loop/functions.
After trying the instructions posted in this Using perf probe to monitor performance stats during a particular function, it does not work well.
I am just wondering if there is any example application in C I can test and use on this board for my purpose. Thank you!
Caveat: This is more of comment than an answer but it's a bit too long for just a comment.
Thanks a lot for advising a new function. I tried that but get a little unsure about its accuracy. Yes, it can offer nanosecond resolution but there is inconsistency.
There will be inconsistency if you use two different clock sources.
What I do is first use clock_gettime() to measure a loop body, the approximate elasped time would be around 1.4us in this way. Then I put GPIO instructions, pull high and pull down, at beginning and end of the loop body, respectively and measure the signal frequency on this GPIO with an oscilloscope.
A scope is useful if you're trying to debug the hardware. It can also show what's on the pins. But, in 40+ years of doing performance measurement/improvement/tuning, I've never used it to tune software.
In fact, I would trust the CPU clock more than I would trust the scope for software performance numbers
For a production product, you may have to measure performance on a system deployed at a customer site [because the issue only shows up on that one customer's machine]. You may have to debug this remotely and can't hook up a scope there. So, you need something that can work without external probe/test rigs.
To my surprise, the frequency is around 1.8MHz, i.e., ~500ns. This inconsistency makes me a little confused... – GeekTao
The difference could be just round off error based on different time bases and latency in getting in/out of the device (GPIO pins). I presume you're just using GPIO in this way to facilitate benchmark/timing. So, in a way, you're not measuring the "real" system, but the system with the GPIO overhead.
In tuning, one is less concerned with absolute values than relative. That is, clock_gettime is ultimately based on number of highres clock ticks (at 1ns/tick or better from the system's free running TSC (time stamp counter)). What the clock frequency actually is doesn't matter as much. If you measure a loop/function and get X duration. Then, you change some code and get X+n, this tells you whether the code got faster or slower.
500ns isn't that large an amount. Almost any system wide action (e.g. timeslicing, syscall, task switch, etc.) could account for that. Unless you've mapped the GPIO registers into app memory, the syscall overhead could dwarf that.
In fact, just the overhead of calling clock_gettime could account for that.
Although the clock_gettime is technically a syscall, linux will map the code directly into the app's code via the VDSO mechanism so there is no syscall overhead. But, even the userspace code has some calculations to do.
For example, I have two x86 PCs. On one system the overhead of the call is 26 ns. On another system, the overhead is 1800 ns. Both these systems are at least 2GHz+
For your beaglebone/arm system, the base clock rate may be less, so overhead of 500 ns may be ballpark.
I usually benchmark the overhead and subtract it out from the calculations.
And, on the x86, the actual code just gets the CPU's TSC value (via the rdtsc instruction) and does some adjustment. For arm, it has a similar H/W register but requires special care to map userspace access to it (a coprocessor instruction, IIRC).
Speaking of arm, I was doing a commercial arm product (an nVidia Jetson to be exact). We were very concerned about latency of incoming video frames.
The H/W engineer didn't trust TSC [or software in general ;-)] and was trying to use a scope, an LED [controlled by a GPIO pin] and when the LED flash/pulse showed up inside the video frame (e.g. the coordinates of the white dot in the video frame were [effectively] a time measurement).
It took a while to convince the engineer, but, eventually I was able to prove that the clock_gettime/TSC approach was more accurate/reliable.
And, certainly, easier to use. We had multiple test/development/SDK boards but could only hook up the scope/LED rig on one at a time.

Embedded Programming, Wait for 12.5 us

I'm programming on the C2000 F28069 Experimenters Kit. I'm toggling a GPIO output every 12.5 microseconds 5 times in a row. I decided I don't want to use interrupts (though I will if I absolutely have to). I want to just wait that amount of times in terms of clock cycles.
My clock is running at 80MHz, so 12.5 us should be 1000 clock cycles. When I use a loop:
for(i=0;i<1000;i++)
I get a result that is way too long (not 12.5 us). What other techniques can I use?
Is sleep(n); something that I can use on a microcontroller? If so, which header file do I need to download and where can I find it? Also, now that I think about it, sleep(n); takes an int input, so that wouldn't even work... any other ideas?
Summary: Use the PWM or Timer peripherals to generate output pulses.
First, the clock speed of the CPU has a complex relationship to actual code execution speed, and in many CPUs there is more than one clock rate involved in different stages of the execution. The chip you reference has several internal clock sources, for instance. Further, each individual instruction will likely take a different number of clocks to execute, and some cores can execute part of (or all of) several instructions simultaneously.
To rigorously create a loop that required 12.5 µs to execute without using a timing interrupt or other hardware device would require careful hand coding in assembly language along with careful accounting of the execution time of each instruction.
But you are writing in C, not assembler.
So the first question you have to ask is what machine code was actually generated for your loop. And the second question is did you enable the optimizer, and to what level.
As written, a decent optimizer will determine that the loop for (i=0; i<1000; i++) ; has no visible side effects, and therefore is just a slow way of writing ;, and can be completely removed.
If it does compile the loop, it could be written naively using perhaps as many as 5 instructions, or as few as one or two. I am not personally familiar with this particular TI CPU architecture, so I won't attempt to guess at the best possible implementation.
All that said, learning about the CPU architecture and its efficiency is important to building reliable and efficient embedded systems. But given that the chip has peripheral devices built-in that provide hardware support for PWM (pulse width modulated) outputs as well as general purpose hardware timer/counters you would be far better off learning to use the hardware to generate the waveform for you.
I would start by collecting every document available on the CPU core and its peripherals, especially app notes and sample code.
The C compiler will have an option to emit and preserve an assembly language source file. I would use that as a guide to study the structure of the code generated for critical loops and other bottlenecks, as well as the effects of the compiler's various optimization levels.
The tool suite should have a mechanism for profiling your running code. Before embarking on heroic measures in pursuit of optimizations, use that first to identify the actual bottlenecks. Even if it lacks decent profiling, you are likely to have spare GPIO pins that can be toggled around critical sections of code and measured with a logic analyzer or oscilloscope.
The chip you refer has PWM (pulse width modulation) hardware declared as one of major winning features. You should rely on this. Please refer to appropriate application guide. Generally you cannot guarantee 12.5uS periods from application layer (and should not try to do so). Even if you managed to do so directly from application layer it's bad idea. Any change in your firmware code can break this.
If you use a timer peripheral with PWM output capability as suggested by #RBerteig already, then you can generate an accurate timing signal with zero software overhead. If you need to do other work synchronously with the clock, then you can use the timer interrupt to trigger that too. However if you process interrupts at an interval of 12.5us you may find that your processor spends a great deal of time context switching rather than performing useful work.
If you simply want an accurate delay, then you should still use a hardware timer and poll its reload flag rather than process its interrupt. This allows consistent timing independent of the compiler's code generation or processor speed and allows you to add other code within the loop without extending the total loop time. You would poll it in a loop during which you might do other work as well. The timing jitter and determinism will depend on what other work you do in the loop, but for an empty loop, reaction to the timer even will probably be faster than the latency on an interrupt handler.

ARM926EJ-S cycle-counter

Im using an ARM926EJ-S and am trying to figure out whether the ARM can give (e.g. a readable register) the CPU's cycle-counter. I guess a # that will represent the number of cycles since the CPU has been powered.
In my system i have only Low-Res external RTC/Timers. I would like to be able to achieve a Hi-Res timer.
Many thanks in advance!
You probably have only two choices:
Use an instruction-cycle accurate simulator; the problem here is that effectively simulating peripherals and external stimulus can be complex or impossible.
Use a peripheral hardware timer. In most cases you will not be able to run such a timer at the typical core clock rate of an ARM9, and there will be an over head in servicing the timer either side of the period being timed, but it can be used to give execution time over larger or longer running sections of code, which may be of more practical use than cycle count.
While cycle count may be somewhat scalable to different clock rates, it remains constrained by memory and I/O wait states, so is perhaps not as useful as it may seem as a performance metric, except at the micro-level of analysis, and larger performance gains are typically to be had by taking a wider view.
The arm-9 is not equipped with an PMU (Performance Monitoring Unit) as included in the Cortex-family. The PMU is described here. The linux kernel comes equipped with support for using the PMU for benchmarking performance. See here for documentation of the perf tool-set.
Bit unsure about the arm-9, need to dig a bit more...

Throughput calculation using cycle count

Is it possible to determine the throughput of an application on a processor from the cycle counts (Processor instruction cycles) consumed by the application ? If yes, how to calculate it ?
If the process is entirely CPU bound, then you divide the processor speed by the number of cycles to get the throughput.
In reality, few processes are entirely CPU bound though, in which case you have to take other factors (disk speed, memory speed, serialization, etc.) into account.
Simple:
#include <time.h>
clock_t c;
c = clock(); // c holds clock ticks value
c = c / CLOCKS_PER_SEC; // real time, if you need it
Note that the value you get is an approximation, for more info see the clock() man page.
Some CPUs have internal performance registers which enable you to collect all sorts of interesting statistics, such as instruction cycles (sometimes even on a per execution unit basis), cache misses, # of cache/memory reads/writes, etc. You can access these directly, but depending on what CPU and OS you are using there may well be existing tools which manage all the details for you via a GUI. Often a good profiling tool will have support for performance registers and allow you to collect statistics using them.
If you use the Cortex-M3 from TI/Luminary Micro, you can make use of the driverlib delivered by TI/Luminary Micro.
Using the SysTick functions you can set the SysTickPeriod to 1 processor cycle: So you have 1 processor clock between interrupts. By counting the number of interrupts you should get a "near enough estimation" on how much time a function or function block take.

CPU clock frequency and thus QueryPerformanceCounter wrong?

I am using QueryPerformanceCounter to time some code. I was shocked when the code starting reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed as the wall clock time of the running program did not change although the time reported using QPC did change (it dropped by 2.7/4.1).
MyComputer->Properties shows:
Intel(R)
Pentium(R)
4 CPU 2.80 GHz; 4.11 GHz;
1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets nor to I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and not affected by changes in the CPU clock. The strange thing in my situation is that the FREQUENCY reported by QPF changed to an incorrect value and this changed frequency was also reported in MyComputer -> Properties which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have those chipset. Additionally some dual core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and
QueryPerformanceFrequency() offer a
bit better resolution, but have
different issues. For example in
Windows XP, all AMD Athlon X2 dual
core CPUs return the PC of either of
the cores "randomly" (the PC sometimes
jumps a bit backwards), unless you
specially install AMD dual core driver
package to fix the issue. We haven't
noticed any other dual+ core CPUs
having similar issues (p4 dual, p4 ht,
core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected, it uses a RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) millisecond's worth of occasional phase adjustments, and are willing to perform some math in order to perform the said phase adjustment by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on RTC low-resolution time (you can do this on-demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs.)
You can try to use the Stopwatch class from .NET, it could help with your problem since it abstracts from all this low-lever stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it
does not matter which processor the
thread runs on. However, because of
bugs in the BIOS or the Hardware
Abstraction Layer (HAL), you can get
different timing results on different
processors. To specify processor
affinity for a thread, use the
ProcessThread..::.ProcessorAffinity
method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.

Resources