What is this method of timing code on Intel processors, referred to by Agnor fod as "core clock cycles":
http://gcc.gnu.org/ml/gcc/2008-07/msg00424.html
My test results, referred to above, uses the "core clock cycles"
performance counter on Intel and RDTSC on AMD. It's the highest
resolution you can get.
I always thought RDTSC was the best way to measure on Intel CPUs too? What is this other technique he speaks about and how do you measure using it?
He is referring to the Intel PCM, the Performance Counter Monitor.
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
By the way in the original post linked article, http://www.agner.org/optimize/optimizing_cpp.pdf
the poster explicity refers to it:
The time stamp counter is a little inaccurate on microprocessors that can change the clock frequency (Intel SpeedStepĀ® technology). A more accurate measurement can be obtained with a performance monitor counter for "core clock cycles", using the test program mentioned above
Related
Intro:
I have written a Linux kernel module for performance counter monitoring on an ARM v7 platform with cortex A-15 and A-7 processors (Odroid XU3). One counter I am trying to use in my research is cycle counts, which from the ARM technical reference manuals has its own dedicated counter. I have checked my code against other implementations and ARM references found online; here is a snippet of the part that enables the CPU counters:
Resources Used:
How to measure program execution time in ARM Cortex-A8 processor?
http://neocontra.blogspot.se/2013/05/user-mode-performance-counters-for.html
https://pietrotech.wordpress.com/2016/09/28/sample-performance-counters-on-little-and-big-cluster-on-odroid-xu3-processor-exynos-5422/
ARM Reference Manual (Ch. 11, PMU)
Problem:
When I print the cycles elapsed over a fixed sampling period (100ms) for a fixed CPU frequency (1.4GHz in the case of core 0), I see a huge amount of variance in the values returned by the module. See the chart below for an example of this. Not only does the variance seem very high, but the number of cycles measured does not reflect the number of cycles I would expect to see recorded given the sample time and fixed frequency (for the given scenario I expected 1.4e8 cycles on each sample). What could be causing such divergence from the expected number of cycles?
Variability of measured cycles for kernel module running across all cores and across just core 0.
After further though and discussions with colleagues, I believe the discrepancy between measured and expected cycles is cpuidle: it is a subsystem in the Linux kernel that places a CPU core into a lower-power state when the core is not doing anything. Some of the lowest states shut down the clock, which likely causes the cycle counter to stop incrementing. This article gives a nice description of cpuidle and how it works: https://lwn.net/Articles/384146/
I am trying to implement my own version of clock() using asm and rdtsc. However I am quite unsure about its return value. Is it cycles? Oder is it micro seconds?
I am also confused about CLOCKS_PER_SEC. How can this be constant?
Is there any kind of formula which sets these values into relation?
You can find a rdtsc reference implementation here:
https://github.com/LITMUS-RT/liblitmus/blob/master/arch/x86/include/asm/cycles.h
TSC counts the number of cycles since reset. If you need a time value unit in seconds, you also need to read the CPU clock frequency and divide TSC value by frequency. However, this may not be accurate if the CPU frequency scaling is enabled. Recent Intel processors include a constant rate TSC (identified by the "constant_tsc" flag in Linux's /proc/cpuinfo). With these processors, the TSC ticks at the processor's nominal frequency, regardless of the actual CPU clock frequency due to turbo or power saving states.
https://en.wikipedia.org/wiki/Time_Stamp_Counter
As far as I understand to measure the actual operating CPU frequency I need access to the model specific registers (MSR) IA32_APERF and IA32_MPERF (Assembly CPU frequency measuring algorithm).
However, access to the MSR registers is privileged (through the rdmsr instruction). Is there another way this can be done? I mean, for example, through a device driver/library which I could call in my code. It seems strange to me that reading the registers is privileged. I would think only writing to them would be privileged.
Note: the rdtsc instruction does not account for turbo boost and thus cannot report the actual operating frequency
Edit:
I'm interested in solutions for Linux and/or Windows.
You are right, the proper way to find average cpu frequency described in 2nd answer in your link.
To read msrs on linux you can use tool RDMSR.
The only thing that maybe missleading in that answer, is maxfrequency. It should be not maxfrequency, but nominal frequency (max non-turbo frequency), as MPERF counter counts in max-non turbo frequency. You can get this frequency from MSR 0xCE bits 8:15 (ref)
In Linux world, to get nano seconds precision timer/clockticks one can use :
#include <sys/time.h>
int foo()
{
timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);
//--snip--
}
This answer suggests an asm approach to directly query for the cpu clock with the RDTSC instruction.
In a multi-core, multi-processor architecture, how is this clock ticks/timer value synchronized across multiple cores/processors? My understanding is that there in inherent fencing being done. Is this understanding correct?
Can you suggest some documentation that would explain this in detail? I am interested in Intel Nehalem and Sandy Bridge microarchitectures.
EDIT
Limiting the process to a single core or cpu is not an option as the process is really huge(in terms of resources consumed) and would like to optimally utilize all the resources in the machine that includes all the cores and processors.
Edit
Thanks for the confirmation that the TSC is synced across cores and processors. But my original question is how is this synchronization done ? is it with some kind of fencing ? do you know of any public documentation ?
Conclusion
Thanks for all the inputs: Here's the conclusion for this discussion: The TSCs are synchronized at the initialization using a RESET that happens across the cores and processors in a multi processor/multi core system. And after that every Core is on their own. The TSCs are kept invariant with a Phase Locked Loop that would normalize the frequency variations and thus the clock variations within a given Core and that is how the TSC remain in sync across cores and processors.
Straight from Intel, here's an explanation of how recent processors maintain a TSC that ticks at a constant rate, is synchronous between cores and packages on a multi-socket motherboard, and may even continue ticking when the processor goes into a deep sleep C-state, in particular see the explanation by Vipin Kumar E K (Intel):
http://software.intel.com/en-us/articles/best-timing-function-for-measuring-ipp-api-timing/
Here's another reference from Intel discussing the synchronization of the TSC across cores, in this case they mention the fact that rdtscp allows you to read both the TSC and the processor id atomically, this is important in tracing applications... suppose you want to trace the execution of a thread that might migrate from one core to another, if you do that in two separate instructions (non-atomic) then you don't have certainty of which core the thread was in at the time it read the clock.
http://software.intel.com/en-us/articles/intel-gpa-tip-cannot-sychronize-cpu-timestamps/
All sockets/packages on a motherboard receive two external common signals:
RESET
Reference CLOCK
All sockets see RESET at the same time when you power the motherboard, all processor packages receive a reference clock signal from an external crystal oscillator and the internal clocks in the processor are kept in phase (although usually with a high multiplier, like 25x) with circuitry called a Phase Locked Loop (PLL). Recent processors will clock the TSC at the highest frequency (multiplier) that the processor is rated (so called constant TSC), regardless of the multiplier that any individual core may be using due to temperature or power management throttling (so called invariant TSC). Nehalem processors like the X5570 released in 2008 (and newer Intel processors) support a "Non-stop TSC" that will continue ticking even when conserving power in a deep power down C-state (C6). See this link for more information on the different power down states:
http://www.anandtech.com/show/2199
Upon further research I came across a patent Intel filed on 12/22/2009 and was published on 6/23/2011 entitled "Controlling Time Stamp Counter (TSC) Offsets For Mulitple Cores And Threads"
http://www.freepatentsonline.com/y2011/0154090.html
Google's page for this patent application (with link to USPTO page)
http://www.google.com/patents/US20110154090
From what I gather there is one TSC in the uncore (the logic in a package surrounding the cores but not part of any core) which is incremented on every external bus clock by the value in the field of the machine specific register specified by Vipin Kumar in the link above (MSR_PLATFORM_INFO[15:8]). The external bus clock runs at 133.33MHz. In addition each core has it's own TSC register, clocked by a clock domain that is shared by all cores and may be different from the clock for any one core - therefore there must be some kind of buffer when the core TSC is read by the RDTSC (or RDTSCP) instruction running in a core. For example, MSR_PLATFORM_INFO[15:8] may be set to 25 on a package, every bus clock the uncore TSC increments by 25, there is a PLL that multiplies the bus clock by 25 and provides this clock to each of the cores to clock their local TSC register, thereby keeping all TSC registers in synch. So to map the terminology to actual hardware
Constant TSC is implemented by using the external bus clock running at 133.33 MHz which is multiplied by a constant multiplier specified in MSR_PLATFORM_INFO[15:8]
Invariant TSC is implemented by keeping the TSC in each core on a separate clock domain
Non-stop TSC is implemented by having an uncore TSC that is incremented by MSR_PLATFORM_INFO[15:8] ticks on every bus clock, that way a multi-core package can go into deep power down (C6 state) and can shutdown the PLL... there is no need to keep a clock at the higher multiplier. When a core is resumed from C6 state its internal TSC will get initialized to the value of the uncore TSC (the one that didn't go to sleep) with an offset adjustment in case software has written a value to the TSC, the details of which are in the patent. If software does write to the TSC then the TSC for that core will be out of phase with other cores, but at a constant offset (the frequency of the TSC clocks are all tied to the bus reference clock by a constant multiplier).
On newer CPUs (i7 Nehalem+ IIRC) the TSC is synchronzied across all cores and runs a constant rate.
So for a single processor, or more than one processor on a single package or mainboard(!) you can rely on a synchronzied TSC.
From the Intel System Manual 16.12.1
The time stamp counter in newer processors may support an enhancement,
referred to as invariant TSC. Processors support for invariant TSC is
indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a
constant rate in all ACPI P-, C-. and T-states. This is the
architectural behavior moving forward.
On older processors you can not rely on either constant rate or synchronziation.
Edit: At least on multiple processors in a single package or mainboard the invariant TSC is synchronized. The TSC is reset to zero at a /RESET and then ticks onward at a constant rate on each processor, without drift. The /RESET signal is guaranteed to arrive at each processor at the same time.
RTDSC is not synchronized across CPUs. Thus, you cannot rely on it in a multi-processor systems. The only workaround I can think of for Linux would be to actually restricting the process to run on a single CPU by settings its affinity. This can be done externally using using taskset utility or "internally" using sched_setaffinity or pthread_setaffinity_np functions.
This manual, chapter 17.12, describes the invariant TSC used in the newest processors. Available with Nehalem this time stamp, along with the rtscp instruction, allows one to read a timestamp (not affected by wait-states, etc) and a processor signature in one atomic operation.
It is said to be suitable for calculating wall-clock time, but it obviously doesn't expect the value to be the same across processors. The stated idea is that you can see if successive reads are to the same CPU's clock, or to adjust for multiple CPU reads. "It can also be used to adjust for per-CPU differences in TSC values in a NUMA system."
See also rdtsc accuracy across CPU cores
However, I'm not sure that the final consistency conclusion in the accepted answer follows from the statement that the tsc can be used for wall clock time. If it was consistent, what reason would there be for atomically determining the CPU source of the time.
N.B. The TSC information has moved from chapter 11 to chapter 17 in that Intel manual.
I'm looking for some kind of a library that gives me accurate CPU frequency values periodically on both Intel and AMD processors, on 32-bit and 64-bit Windows.
The purpose of this is to accuratly measure CPU load on a given computer. The problem is that calling QueryPerformanceCounter() returns clock ticks (used to measure the duration of an activity) but the underlying CPU frequency is not constant because of SpeedStep or TurboBoost. I've found several computers where turning off SpeedStep / TurboBoost in the BIOS and doesn't prevent CPU frequency scaling based on load.
I'm trying to see if there are any libraries available that could be used to detect CPU frequency changes (much like how Throttlestop / CPU-Z or even the Overview tab of Resource Monitor in Windows 7) so that I could query and save this information along with my other measurements. Performance counters don't seem to return reliable information, as I have computers that always return 100% CPU frequency, even when other tools show dynamic frequency changes.
I searched for such libraries but most results come back with gadgets, etc., that are not useful.
You can combine a high-resolution timer with a clock cycle counter to compute the current clock rate. On modern CPUs, the cycle counter can be read with this function:
static inline uint64_t get_cycles()
{
uint64_t t;
asm volatile ("rdtsc" : "=A"(t));
return t;
}
Note that this is per CPU, so if your program gets moved around CPUs, you're in trouble. If you know about CPU pinning techniques on your platform, you might like to try those.
For high resolution time measurement, you can use the tools in <chrono>; here's a semi-useful post of mine on the topic.
Try to focus on what you are trying to do, and not on how to do it.
What is your ultimate goal?
If, as you say, you are trying to "measure CPU load on a given computer", on Windows it may be a good practice using "PdhOpenQuery" and the "Pdh*" family functions.
See this SO answer as well:
How to determine CPU and memory consumption from inside a process?
Consider looking at the __rdtsc intrinsic function (#include "intrin.h" in Visual Studio).
This yields the clock count directly from the processor via the x86/x64 function RDTSC (Read Timestamp).