Clock, rdtsc and CLOCKS_PER_SEC - c

I am trying to implement my own version of clock() using asm and rdtsc. However, I am quite unsure about its return value. Is it cycles? Or is it microseconds?
I am also confused about CLOCKS_PER_SEC. How can this be constant?
Is there any kind of formula which sets these values into relation?

You can find a rdtsc reference implementation here:
https://github.com/LITMUS-RT/liblitmus/blob/master/arch/x86/include/asm/cycles.h
TSC counts the number of cycles since reset. If you need a time value in seconds, you also need to read the CPU clock frequency and divide the TSC value by that frequency. However, this may not be accurate if CPU frequency scaling is enabled. Recent Intel processors include a constant-rate TSC (identified by the "constant_tsc" flag in Linux's /proc/cpuinfo). With these processors, the TSC ticks at the processor's nominal frequency, regardless of the actual CPU clock frequency due to turbo or power-saving states.
https://en.wikipedia.org/wiki/Time_Stamp_Counter
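A minimal sketch of that cycles-to-seconds conversion, assuming a constant_tsc CPU, GCC or Clang on x86, and a placeholder nominal frequency that you would have to determine for your own machine:
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc() intrinsic */

int main(void) {
    const double tsc_hz = 3.0e9;       /* placeholder nominal TSC frequency */
    uint64_t t0 = __rdtsc();
    /* ... code under test ... */
    uint64_t t1 = __rdtsc();
    uint64_t cycles = t1 - t0;
    printf("%llu cycles = %.9f s at %.1f GHz\n",
           (unsigned long long)cycles, cycles / tsc_hz, tsc_hz / 1e9);
    return 0;
}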

Related

Measure time (in clock ticks) with contiki

I tried to measure the time of some operations on the TelosB platform. For that I wanted to count the clock ticks of the processor with the clock() function from time.h, but it does not compile on Contiki.
Are there mechanisms to measure elapsed time, preferably in actual clock ticks, on Contiki?
Regards
The latest timer documentation is here: https://github.com/contiki-ng/contiki-ng/wiki/Documentation:-Timers
You can use the clock_time() function. However, its resolution is quite low (1/128 of a second). If you want to measure shorter time intervals, use rtimers: RTIMER_NOW() returns the time as a 16-bit integer, with platform-specific resolution. On most platforms the rtimer clock has 32768 ticks per second, but on CC26xx/CC13xx platforms it has 65536 ticks per second.
See also: Contiki difference between RTIMER_NOW() and clock_time()
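A minimal sketch of the rtimer approach, assuming a Contiki/Contiki-NG platform where RTIMER_NOW() and RTIMER_SECOND are available:
#include "contiki.h"
#include "sys/rtimer.h"
#include <stdio.h>

/* Sketch: call this from a process context on the target platform. */
static void measure_operation(void)
{
  rtimer_clock_t t0 = RTIMER_NOW();
  /* ... operation to measure ... */
  rtimer_clock_t t1 = RTIMER_NOW();

  printf("%lu rtimer ticks (out of %lu ticks per second)\n",
         (unsigned long)(t1 - t0), (unsigned long)RTIMER_SECOND);
}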

Performance Counting for Cycles Inconsistent and not Reflecting CPU Frequency

Intro:
I have written a Linux kernel module for performance counter monitoring on an ARMv7 platform with Cortex-A15 and Cortex-A7 processors (Odroid XU3). One counter I am trying to use in my research is the cycle count, which, according to the ARM technical reference manuals, has its own dedicated counter. I have checked my code against other implementations and ARM references found online; here is a snippet of the part that enables the CPU counters:
Resources Used:
How to measure program execution time in ARM Cortex-A8 processor?
http://neocontra.blogspot.se/2013/05/user-mode-performance-counters-for.html
https://pietrotech.wordpress.com/2016/09/28/sample-performance-counters-on-little-and-big-cluster-on-odroid-xu3-processor-exynos-5422/
ARM Reference Manual (Ch. 11, PMU)
Problem:
When I print the cycles elapsed over a fixed sampling period (100ms) for a fixed CPU frequency (1.4GHz in the case of core 0), I see a huge amount of variance in the values returned by the module. See the chart below for an example of this. Not only does the variance seem very high, but the number of cycles measured does not reflect the number of cycles I would expect to see recorded given the sample time and fixed frequency (for the given scenario I expected 1.4e8 cycles on each sample). What could be causing such divergence from the expected number of cycles?
[Chart: variability of measured cycles for the kernel module running across all cores and across just core 0.]
After further thought and discussions with colleagues, I believe the discrepancy between measured and expected cycles is caused by cpuidle: a subsystem in the Linux kernel that places a CPU core into a lower-power state when the core is not doing anything. Some of the lowest states shut down the clock, which likely causes the cycle counter to stop incrementing. This article gives a nice description of cpuidle and how it works: https://lwn.net/Articles/384146/

Concept of clock tick and clock cycles

I have written a very small code to measure the time taken by my multiplication algorithm :
clock_t begin, end;
float time_spent;
begin = clock();
a = b*c;
end = clock();
time_spent = (float)(end - begin)/CLOCKS_PER_SEC;
I am working with mingw under Windows.
I am guessing that end = clock() will give me the clock ticks at that particular moment. Subtracting begin from it will give me the clock ticks consumed by the multiplication. When I divide by CLOCKS_PER_SEC, I will get the total amount of time.
My first question is: Is there a difference between clock ticks and clock cycle?
My algorithm here is so small that the difference end-begin is 0. Does this mean that my code execution time was less than 1 tick and that's why I am getting zero?
My first question is: Is there a difference between clock ticks and clock cycle?
Yes. A clock tick could be 1 millisecond or 1 microsecond, while a clock cycle could be 0.3 nanoseconds. On POSIX systems CLOCKS_PER_SEC must be defined as 1000000 (1 million). Note that if the CPU measurement cannot be obtained with microsecond resolution, then the smallest jump in the return value from clock() will be larger than one.
My algorithm here is so small that the difference end-begin is 0. Does this mean that my code execution time was less than 1 tick and that's why I am getting zero?
Yes. To get a better reading I suggest that you loop enough iterations so that you measure over several seconds.
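A minimal sketch of that suggestion; the iteration count is arbitrary and should be tuned until the loop runs for several seconds:
#include <stdio.h>
#include <time.h>

int main(void) {
    volatile double a = 0.0, b = 1.0000001, c = 0.9999999;
    const long iters = 100000000L;   /* arbitrary; tune for a multi-second run */

    clock_t begin = clock();
    for (long i = 0; i < iters; i++)
        a = b * c;                   /* the operation being measured */
    clock_t end = clock();

    double total = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("total: %.3f s, per multiplication: %.3g s (a = %f)\n",
           total, total / iters, a);
    return 0;
}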
Answering the difference between clock tick and clock cycle from a systems perspective
Every processor is accompanied by a physical clock (usually a quartz crystal oscillator) that oscillates at a certain frequency (vibrations/sec). The processor keeps track of time with the help of interrupts generated from the physical clock, which interrupt the processor every time period T. This interrupt is called a 'clock tick'. The CPU counts the number of interrupts it has seen since the system started, and returns that value when you call clock(). By taking the difference between two clock tick values (obtained from clock()), you get how many interrupts were seen between those two points in time.
Most modern operating systems program the T value to be 1 microsecond, i.e. the physical clock interrupts every 1 microsecond; this is the lowest clock granularity widely supported by most physical clocks. With 1 microsecond as T, the tick rate works out to 1,000,000 ticks per second. With this information, you can calculate the time elapsed from the difference of two clock tick values, i.e. the difference between the two tick counts times the tick period.
NOTE: the tick rate defined by the OS has to be <= the vibrations/sec of the physical clock, otherwise there will be a loss of precision.
For your first question: clock ticks refer to the main system clock. It is the smallest unit of time recognized by the device. A clock cycle is the time taken for a full processor pulse to complete. You can recognize this from your CPU speed given in Hz; a 2 GHz processor performs 2,000,000,000 clock cycles per second.
For your second question: probably yes.
A clock cycle is a clock tick.
A clock cycle is the speed of a computer processor, or CPU, and is determined by the amount of time between two pulses of an oscillator. Generally speaking, the higher the number of pulses per second, the faster the computer processor will be able to process information.

Timing code on Intel CPUs using "core clock cycles"?

What is this method of timing code on Intel processors, referred to by Agner Fog as "core clock cycles":
http://gcc.gnu.org/ml/gcc/2008-07/msg00424.html
My test results, referred to above, uses the "core clock cycles"
performance counter on Intel and RDTSC on AMD. It's the highest
resolution you can get.
I always thought RDTSC was the best way to measure on Intel CPUs too? What is this other technique he speaks about and how do you measure using it?
He is referring to the Intel PCM, the Performance Counter Monitor.
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
By the way, in the article linked from the original post, http://www.agner.org/optimize/optimizing_cpp.pdf,
the poster explicitly refers to it:
The time stamp counter is a little inaccurate on microprocessors that can change the clock frequency (Intel SpeedStep® technology). A more accurate measurement can be obtained with a performance monitor counter for "core clock cycles", using the test program mentioned above

Assembly CPU frequency measuring algorithm

What are the common algorithms being used to measure the processor frequency?
Intel CPUs after Core Duo support two Model-Specific registers called IA32_MPERF and IA32_APERF.
MPERF counts at the maximum frequency the CPU supports, while APERF counts at the actual current frequency.
The actual frequency is given by:
actual_freq = max_freq * (aperf_t1 - aperf_t0) / (mperf_t1 - mperf_t0)
You can read them with this flow:
; read MPERF
mov ecx, 0xe7
rdmsr
mov mperf_var_lo, eax
mov mperf_var_hi, edx
; read APERF
mov ecx, 0xe8
rdmsr
mov aperf_var_lo, eax
mov aperf_var_hi, edx
but note that rdmsr is a privileged instruction and can run only in ring 0.
I don't know if the OS provides an interface to read these, though their main usage is for power management, so it might not provide such an interface.
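On Linux, one such interface is the msr driver (modprobe msr), which exposes MSRs to root through /dev/cpu/N/msr with the MSR number as the file offset. A hedged sketch, assuming that driver is present, which samples IA32_MPERF/IA32_APERF twice and prints the actual-to-maximum frequency ratio described above:
/* Hedged sketch: read IA32_MPERF (0xe7) and IA32_APERF (0xe8) as root via
   Linux's msr driver; the MSR number is used as the file offset. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

static uint64_t rdmsr_on_cpu0(int fd, uint32_t msr)
{
    uint64_t value = 0;
    if (pread(fd, &value, sizeof value, msr) != sizeof value)
        perror("pread msr");
    return value;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/cpu/0/msr (is the msr module loaded?)");
        return 1;
    }

    uint64_t m0 = rdmsr_on_cpu0(fd, 0xe7), a0 = rdmsr_on_cpu0(fd, 0xe8);
    sleep(1);
    uint64_t m1 = rdmsr_on_cpu0(fd, 0xe7), a1 = rdmsr_on_cpu0(fd, 0xe8);
    close(fd);

    /* APERF delta / MPERF delta = actual frequency / maximum frequency */
    printf("actual/max frequency ratio over 1 s: %.3f\n",
           (double)(a1 - a0) / (double)(m1 - m0));
    return 0;
}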
I'm gonna date myself with various details in this answer, but what the heck...
I had to tackle this problem years ago on Windows-based PCs, so I was dealing with Intel x86 series processors like 486, Pentium and so on. The standard algorithm in that situation was to do a long series of DIVide instructions, because those are typically the most CPU-bound single instructions in the Intel set. So memory prefetch and other architectural issues do not materially affect the instruction execution time -- the prefetch queue is always full and the instruction itself does not touch any other memory.
You would time it using the highest resolution clock you could get access to in the environment you are running in. (In my case I was running near boot time on a PC compatible, so I was directly programming the timer chips on the motherboard. Not recommended in a real OS, usually there's some appropriate API to call these days).
The main problem you have to deal with is different CPU types. At that time there was Intel, AMD and some smaller vendors like Cyrix making x86 processors. Each model had its own performance characteristics vis-a-vis that DIV instruction. My assembly timing function would just return a number of clock cycles taken by a certain fixed number of DIV instructions done in a tight loop.
So what I did was to gather some timings (raw return values from that function) from actual PCs running each processor model I wanted to time, and record those in a spreadsheet against the known processor speed and processor type. I actually had a command-line tool that was just a thin shell around my timing function, and I would take a disk into computer stores and get the timings off of display models! (I worked for a very small company at the time).
Using those raw timings, I could plot a theoretical graph of what timings I should get for any known speed of that particular CPU.
Here was the trick: I always hated when you would run a utility and it would announce that your CPU was 99.8 MHz or whatever. Clearly it was 100 MHz and there was just a small round-off error in the measurement. In my spreadsheet I recorded the actual speeds that were sold by each processor vendor. Then I would use the plot of actual timings to estimate projected timings for any known speed. But I would build a table of points along the line where the timings should round to the next speed.
In other words, if 100 ticks to do all that repeated dividing meant 500 MHz, and 200 ticks meant 250 MHz, then I would build a table that said that anything below 150 was 500 MHz, and anything above that was 250 MHz. (Assuming those were the only two speeds available from that chip vendor). It was nice because even if some odd piece of software on the PC was throwing off my timings, the end result would often still be dead on.
Of course now, in these days of overclocking, dynamic clock speeds for power management, and other such trickery, such a scheme would be much less practical. At the very least you'd need to do something to make sure the CPU was in its highest dynamically chosen speed first before running your timing function.
OK, I'll go back to shooing kids off my lawn now.
One way on x86 Intel CPUs since the Pentium is to take two samples of the RDTSC instruction around a delay loop of known wall-clock time, e.g.:
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

uint64_t rdtsc(void) {
    uint64_t result;
    __asm__ __volatile__ ("rdtsc" : "=A" (result));
    return result;
}

int main(void) {
    uint64_t ts0, ts1;

    ts0 = rdtsc();
    sleep(1);
    ts1 = rdtsc();

    printf("clock frequency = %llu\n", ts1 - ts0);
    return 0;
}
(on 32-bit platforms with GCC)
RDTSC is available in ring 3 as long as the TSD (time stamp disable) flag in CR4 is not set, which is the common case but not guaranteed. One shortcoming of this method is that it is vulnerable to frequency scaling changes affecting the result if they happen inside the delay. To mitigate that you can execute code that keeps the CPU busy and constantly polls the system time to see if your delay period has expired, which keeps the CPU in the highest frequency state available.
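A hedged sketch of that busy-polling approach (Linux with GCC or Clang on x86; clock_gettime() stands in for "the system time", and the one-second window is arbitrary):
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>                /* __rdtsc() */

int main(void)
{
    struct timespec t0, t1;
    double elapsed;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();

    do {                              /* busy-poll instead of sleeping */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    } while (elapsed < 1.0);

    uint64_t c1 = __rdtsc();
    /* on constant_tsc parts this is the TSC rate, not necessarily the core clock */
    printf("TSC rate: ~%.0f MHz\n", (double)(c1 - c0) / elapsed / 1e6);
    return 0;
}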
I use the following (pseudo)algorithm:
basetime = time();           /* time() returns seconds */
while (time() == basetime);  /* wait for the next second boundary */
stclk = rdtsc();             /* rdtsc is an assembly instruction */
basetime = time();
while (time() == basetime);  /* wait one full second */
endclk = rdtsc();
nclks = endclk - stclk;
At this point you might assume that you've determined the clock frequency, but even though it appears correct, it can be improved.
All PCs contain a PIT (Programmable Interval Timer) device, which contains counters that are (or used to be) used for the system clock among other things. It was fed with a frequency of 1193182 Hz. The system clock counter was set to the highest countdown value (65536), resulting in a system clock tick frequency of 1193182/65536 => 18.2065 Hz, or one tick every 54.925 milliseconds.
The number of ticks necessary for the clock to increment to the next second will therefore vary: usually 18 ticks are required, sometimes 19. This can be handled by performing the algorithm above twice and storing both results. The two results will either correspond to two 18-tick sequences, or to one 18-tick and one 19-tick sequence; two 19s in a row won't occur. So by taking the smaller of the two results you will have an 18-tick second. Adjust this result by multiplying by 18.2065 and dividing by 18.0 or, using integer arithmetic, multiply by 182065, add 90000 and divide by 180000 (90000 is one half of 180000 and is there for rounding). If you choose the integer calculation route, make sure you are using 64-bit multiplication and division.
You will now have a CPU clock speed x in Hz, which can be converted to kHz ((x+500)/1000) or MHz ((x+500000)/1000000). The 500 and 500000 are one half of 1000 and 1000000 respectively and are there for rounding. To calculate MHz do not go via the kHz value because rounding issues may arise; use the Hz value and the second formula.
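A minimal sketch of that integer adjustment, assuming a hypothetical measure_one_second() helper that runs the pseudoalgorithm above once and returns nclks:
#include <stdint.h>

/* Hypothetical helper: runs the rdtsc-between-second-boundaries loop above
   once and returns the number of elapsed TSC clocks. */
uint64_t measure_one_second(void);

uint64_t estimate_cpu_hz(void)
{
    uint64_t a = measure_one_second();
    uint64_t b = measure_one_second();
    uint64_t nclks = (a < b) ? a : b;   /* the smaller result is the 18-tick second */

    /* scale by 18.2065/18.0 in 64-bit integer math, with rounding */
    return (nclks * 182065ULL + 90000ULL) / 180000ULL;
}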
That was the intention of things like BogoMIPS, but CPUs are a lot more complicated nowadays. Superscalar CPUs can issue multiple instructions per clock, making any measurement based on counting clock cycles to execute a block of instructions highly inaccurate.
CPU frequencies are also variable based on offered load and/or temperature. The fact that the CPU is currently running at 800 MHz does not mean it will always be running at 800 MHz, it might throttle up or down as needed.
If you really need to know the clock frequency, it should be passed in as a parameter. An EEPROM on the board would supply the base frequency, and if the clock can vary you'd need to be able to read the CPU's power state registers (or make an OS call) to find out the frequency at that instant.
With all that said, there may be other ways to accomplish what you're trying to do. For example if you want to make high-precision measurements of how long a particular codepath takes, the CPU likely has performance counters running at a fixed frequency which are a better measure of wall-clock time than reading a tick count register.
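As a concrete form of the "make an OS call" option mentioned above, here is a minimal Linux sketch that reads the cpufreq sysfs interface (assuming a cpufreq driver is active; the reported value can lag behind the hardware's effective frequency):
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    unsigned long khz;

    if (f == NULL) {
        perror("cpufreq sysfs not available");
        return 1;
    }
    if (fscanf(f, "%lu", &khz) == 1)       /* value is reported in kHz */
        printf("cpu0 current frequency: %lu MHz\n", khz / 1000);
    fclose(f);
    return 0;
}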
"lmbench" provides a cpu frequency algorithm portable for different architecture.
It runs some different loops and the processor's clock speed is the greatest common divisor of the execution frequencies of the various loops.
this method should always work when we are able to get loops with cycle counts that are relatively prime.
http://www.bitmover.com/lmbench/
One option is to sense the CPU frequency by running code with a known number of instructions per loop.
This functionality is contained in 7-Zip, since about v9.20 I think.
> 7z b
7-Zip 9.38 beta Copyright (c) 1999-2014 Igor Pavlov 2015-01-03
CPU Freq: 4266 4000 4266 4000 2723 4129 3261 3644 3362
The final number is meant to be correct (and on my PC and many others, I have found it to be quite accurate). The test runs very quickly, so turbo may not kick in, and servers set to Balanced/Power Save modes will most likely give readings of around 1 GHz.
The source code is at GitHub (Official source is a download from 7-zip.org)
With the most significant portion being:
#define YY1 sum += val; sum ^= val;
#define YY3 YY1 YY1 YY1 YY1
#define YY5 YY3 YY3 YY3 YY3
#define YY7 YY5 YY5 YY5 YY5

static const UInt32 kNumFreqCommands = 128;

EXTERN_C_BEGIN

static UInt32 CountCpuFreq(UInt32 sum, UInt32 num, UInt32 val)
{
  for (UInt32 i = 0; i < num; i++)
  {
    YY7
  }
  return sum;
}
EXTERN_C_END
On Intel CPUs, a common method to get the current (average) CPU frequency is to calculate it from a few CPU counters:
CPU_freq = tsc_freq * (aperf_t1 - aperf_t0) / (mperf_t1 - mperf_t0)
The TSC (Time Stamp Counter) can be read from userspace with dedicated x86 instructions, but its frequency has to be determined by calibration against a clock. The best approach is to get the TSC frequency from the kernel (which already has done the calibration).
The aperf and mperf counters are model-specific registers (MSRs) that require root privileges for access. Again, there are dedicated x86 instructions for accessing the MSRs.
Since the mperf counter rate is directly proportional to the TSC rate and the aperf rate is directly proportional to the CPU frequency you get the CPU frequency with the above equation.
Of course, if the CPU frequency changes in your t0 - t1 time delta (e.g. due to frequency scaling), you get the average CPU frequency with this method.
I wrote a small utility cpufreq which can be used to test this method.
See also:
[PATCH] x86: Calculate MHz using APERF/MPERF for cpuinfo and scaling_cur_freq. 2016-04-01, LKML
Frequency-invariant utilization tracking for x86. 2020-04-02, LWN.net
I'm not sure why you need assembly for this. If you're on a machine that has the /proc filesystem, then running:
> cat /proc/cpuinfo
might give you what you need.
A quick Google search on AMD and Intel shows that CPUID should give you access to the CPU's max frequency.
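A hedged sketch of that CPUID route with GCC or Clang on x86: leaf 0x16 reports nominal base/max/bus frequencies in MHz, but only on newer Intel CPUs, hence the max-leaf check:
#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_max(0, NULL) < 0x16) {
        puts("CPUID leaf 0x16 not supported on this CPU");
        return 1;
    }
    __cpuid(0x16, eax, ebx, ecx, edx);
    /* low 16 bits of each register hold the frequency in MHz */
    printf("base %u MHz, max %u MHz, bus %u MHz\n",
           eax & 0xffff, ebx & 0xffff, ecx & 0xffff);
    return 0;
}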
