RDTSCP versus RDTSC + CPUID

I'm doing some Linux kernel timings, specifically in the interrupt handling path. I've been using RDTSC for timings, but I recently learned it's not necessarily accurate, since the instructions could be executing out of order.
I then tried:
RDTSC + CPUID (in reverse order, here) to flush the pipeline, and incurred up to a 60x overhead (!) on a Virtual Machine (my working environment) due to hypercalls and whatnot. This is both with and without HW Virtualization enabled.
Most recently I've come across the RDTSCP* instruction, which seems to do what RDTSC+CPUID did, but more efficiently as it's a newer instruction - only a 1.5x-2x overhead, relatively.
My question: Is RDTSCP truly accurate as a point of measurement, and is it the "correct" way of doing the timing?
Also to be more clear, my timing is essentially like this, internally:
Save the current cycle counter value
Perform one type of benchmark (i.e.: disk, network)
Add the delta of the current and previous cycle counter to an accumulator value and increment a counter, per individual interrupt
At the end, divide the accumulator by the number of interrupts to get the average cycle cost per interrupt.
*http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf page 27

A full discussion of the overhead you're seeing from the cpuid instruction is available at this stackoverflow thread. When using rdtsc, you need to use cpuid to ensure that no additional instructions are in the execution pipeline. The rdtscp instruction flushes the pipeline intrinsically. (The referenced SO thread also discusses these salient points, but I addressed them here because they're part of your question as well).
You only "need" to use cpuid+rdtsc if your processor does not support rdtscp. Otherwise, rdtscp is what you want, and will accurately give you the information you are after.
Both instructions provide you with a 64-bit, monotonically increasing counter that represents the number of cycles on the processor. If this is your pattern:
uint64_t s, e;
s = rdtscp();
do_interrupt();
e = rdtscp();
atomic_add(e - s, &acc);
atomic_add(1, &counter);
You may still have an off-by-one in your average measurement depending on where your read happens. For instance:
      T1                            T2
t0    atomic_add(e - s, &acc);
t1                                  a = atomic_read(&acc);
t2                                  c = atomic_read(&counter);
t3    atomic_add(1, &counter);
t4                                  avg = a / c;
It's unclear whether "[a]t the end" references a time that could race in this fashion. If so, you may want to calculate a running average or a moving average in-line with your delta.
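For example, a minimal sketch of keeping the average in-line with each delta (the function name and the 1/8 weight are illustrative choices, not part of your kernel code):
#include <stdint.h>

/* Exponentially weighted moving average updated inside the measurement
 * path, so the final read never races the counter increment. */
static int64_t ewma_cycles;

static inline void record_interrupt_cycles(uint64_t start, uint64_t end)
{
    int64_t delta = (int64_t)(end - start);
    /* new_avg = old_avg + (delta - old_avg) / 8 */
    ewma_cycles += (delta - ewma_cycles) / 8;
}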
Side-points:
If you do use cpuid+rdtsc, you need to subtract out the cost of the cpuid instruction, which may be difficult to ascertain if you're in a VM (depending on how the VM implements this instruction). This is really why you should stick with rdtscp.
Executing rdtscp inside a loop is usually a bad idea. I somewhat frequently see microbenchmarks that do things like
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
    s = rdtscp();
    loop_body();
    e = rdtscp();
    acc += e - s;
}
printf("%"PRIu64"\n", (acc / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
While this will give you a decent idea of the overall cost, in cycles, of whatever is in loop_body(), it defeats processor optimizations such as pipelining. In a microbenchmark the processor will do a pretty good job of branch prediction in the loop, so measuring the loop overhead is fine; the real problem with the version above is that you end up with two pipeline stalls per loop iteration. Thus:
s = rdtscp();
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
    loop_body();
}
e = rdtscp();
printf("%"PRIu64"\n", ((e - s) / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
This will be more efficient and probably more accurate in terms of what you'll see in real life versus what the previous benchmark would tell you.

The 2010 Intel paper How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures can be considered outdated in its recommendation to combine RDTSC/RDTSCP with CPUID.
Current Intel reference documentation recommends fencing instructions as more efficient alternatives to CPUID:
Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.
(Intel® 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 8.2.5, September 2016)
If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.
(Intel RDTSC)
Thus, to get the TSC start value you execute this instruction sequence:
mfence
lfence
rdtsc
shl rdx, 0x20
or rax, rdx
At the end of your benchmark, to get the TSC stop value:
rdtscp
lfence
shl rdx, 0x20
or rax, rdx
Note that in contrast to CPUID, the lfence instruction doesn't clobber any registers, so it isn't necessary to save and restore the EDX:EAX registers around the serializing instruction.
Relevant documentation snippet:
If software requires RDTSCP to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute LFENCE immediately after RDTSCP
(Intel RDTSCP)
As an example of how to integrate this into a C program, see also my GCC inline assembler implementations of the above operations.
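A minimal GCC inline-assembly sketch of the two sequences (assuming x86-64 and GCC/Clang; the helper names are mine, not the implementations linked above):
#include <stdint.h>

/* Start sequence: mfence;lfence;rdtsc */
static inline uint64_t tsc_start(void)
{
    uint64_t lo, hi;
    __asm__ __volatile__ (
        "mfence\n\t"
        "lfence\n\t"
        "rdtsc"
        : "=a" (lo), "=d" (hi) : : "memory");
    return (hi << 32) | lo;
}

/* Stop sequence: rdtscp;lfence */
static inline uint64_t tsc_stop(void)
{
    uint64_t lo, hi;
    __asm__ __volatile__ (
        "rdtscp\n\t"
        "lfence"
        : "=a" (lo), "=d" (hi)
        :
        : "rcx", "memory");   /* rdtscp also writes IA32_TSC_AUX into ecx */
    return (hi << 32) | lo;
}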

Is RDTSCP truly accurate as a point of measurement, and is it the "correct" way of doing the timing?
Modern x86 CPUs can dynamically adjust their frequency to save power by underclocking (e.g. Intel's SpeedStep) and to boost performance under heavy load by overclocking (e.g. Intel's Turbo Boost). The time stamp counter on these modern processors, however, counts at a constant rate (e.g. look for the "constant_tsc" flag in Linux's /proc/cpuinfo).
So the answer to your question depends on what you really want to know. Unless dynamic frequency scaling is disabled (e.g. in the BIOS), the time stamp counter can no longer be relied on to determine the number of cycles that have elapsed. However, the time stamp counter can still be relied on to determine the time that has elapsed (with some care - I use clock_gettime in C - see the end of my answer).
To benchmark my matrix multiplication code and compare it to the theoretical best I need to know both the time elapsed and the cycles elapsed (or rather the effective frequency during the test).
Let me present three different methods to determine the number of cycles elapsed.
1. Disable dynamic frequency scaling in the BIOS and use the time stamp counter.
2. For Intel processors, request the core clock cycles from the performance monitor counters.
3. Measure the frequency under load.
The first method is the most reliable but it requires access to BIOS and affects the performance of everything else you run (when I disable dynamic frequency scaling on my i5-4250U it runs at a constant 1.3 GHz instead of a base of 2.6 GHz). It's also inconvenient to change the BIOS only for benchmarking.
The second method is useful when you don't want to disable dynamic frequency scaling and/or for systems you don't have physical access to. However, the performance monitor counters require privileged instructions which only the kernel or device drivers have access to.
The third method is useful on systems where you don't have physical access and do not have privileged access. This is the method I use most in practice. It's in principle the least reliable but in practice it's been as reliable as the second method.
Here is how I determine the time elapsed (in seconds) with C.
#include <time.h>

#define TIMER_TYPE CLOCK_REALTIME

struct timespec time1, time2;
clock_gettime(TIMER_TYPE, &time1);
foo();
clock_gettime(TIMER_TYPE, &time2);
double dtime = time_diff(time1, time2);

double time_diff(struct timespec start, struct timespec end)
{
    struct timespec temp;
    if ((end.tv_nsec - start.tv_nsec) < 0) {
        temp.tv_sec  = end.tv_sec - start.tv_sec - 1;
        temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
    } else {
        temp.tv_sec  = end.tv_sec - start.tv_sec;
        temp.tv_nsec = end.tv_nsec - start.tv_nsec;
    }
    return (double)temp.tv_sec + (double)temp.tv_nsec * 1E-9;
}
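As an illustration of the third method (not the author's exact code), one can time a chain of dependent integer additions: under the assumption that each dependent add retires in roughly one core cycle, the loop below costs about 4*n cycles, and dividing by the elapsed time from time_diff() above gives an estimate of the effective core frequency under load:
#include <stdint.h>
#include <time.h>

/* Sketch only: the 1-cycle-per-add assumption and the function name are
 * mine.  TIMER_TYPE and time_diff() are the definitions shown above. */
double frequency_under_load(uint64_t n)
{
    struct timespec t1, t2;
    uint64_t x = 0;

    clock_gettime(TIMER_TYPE, &t1);
    for (uint64_t i = 0; i < n; i++) {
        /* four dependent adds per iteration; the chain dominates the loop */
        __asm__ volatile ("add $1, %0\n\t"
                          "add $1, %0\n\t"
                          "add $1, %0\n\t"
                          "add $1, %0" : "+r" (x));
    }
    clock_gettime(TIMER_TYPE, &t2);

    return 4.0 * (double)n / time_diff(t1, t2);   /* estimated core Hz */
}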

The following code will ensure that rdtscp executes at exactly the right time.
RDTSCP cannot execute too early, but it can execute too late, because the CPU can move instructions that come after rdtscp so that they execute before it.
To prevent this we create a false dependency chain based on the fact that rdtscp puts its output in edx:eax:
rdtscp          ;rdtscp reads are serialized; it will not execute too early
                ;the code below also ensures it does not execute too late
mov r8, rdx     ;rdtscp changes rdx and rax, force dependency chain on rdx
xor r8, rbx     ;fold in rbx so the save of rbx cannot execute OoO
xor rbx, rdx    ;rbx = r8
xor rbx, r8     ;rbx = 0
push rdx
push rax
mov rax, rbx    ;rax = 0, but computed in a way that excludes OoO execution
cpuid
pop rax
pop rdx
mov rbx, r8
xor rbx, rdx    ;restore rbx
Note that even though this timing is accurate to a single cycle, you still need to run your sample many, many times and take the lowest time of those runs in order to get the actual running time.

Related

_mm_lfence() time overhead is non deterministic?

I am trying to determine the time needed to read an element, to make sure it's a cache hit or a cache miss. To keep the reads in order I use the _mm_lfence() function. I got unexpected results, and after checking I saw that the lfence function's overhead is not deterministic.
So I am executing the program that measures this overhead in a loop of, for example, 100,000 iterations. Sometimes I get results of more than 1000 clock cycles for one iteration, and the next time it's 200. What can be the reason for such a difference between lfence overheads, and if it is so unreliable, how can I judge the latency of cache hits and cache misses correctly? I was trying to use the same approach as in this post: Memory latency measurement with time stamp counter
The code that gives unreliable results is this:
for (int i = 0; i < arr_size; i++) {
    _mm_mfence();
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    arr[i] = t2 - t1;
}
The values in arr vary over different ranges; arr_size is 100,000.
I get results of more than 1000 clock cycle for one iteration and next time it's 200.
Sounds like your CPU ramped up from idle to normal clock speed after the first few iterations.
Remember that RDTSC counts reference cycles (fixed frequency, equal or close to the max non-turbo frequency of the CPU), not core clock cycles (idle / turbo / whatever). Older CPUs had RDTSC count core clock cycles, but for years now CPU vendors have kept the RDTSC frequency fixed, making it useful for clock_gettime(), and have advertised this fact with the invariant-TSC CPUID feature bit. See also Get CPU cycle count?
If you really want to use RDTSC instead of performance counters, disable turbo and use a warm-up loop to get your CPU to its max frequency.
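A rough sketch of such a warm-up (the 200 ms budget and the TSC-frequency parameter are assumptions of this sketch):
#include <stdint.h>
#include <x86intrin.h>

/* Spin on some work for ~200 ms of reference (TSC) time so the core
 * leaves its idle frequency state before the real measurement starts. */
static void warm_up(uint64_t tsc_hz)
{
    uint64_t deadline = __rdtsc() + tsc_hz / 5;
    volatile uint64_t x = 0;
    while (__rdtsc() < deadline)
        x += 1;                     /* keep the core busy */
}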
There are libraries that let you program the HW performance counters, and set permissions so you can run rdpmc in user-space. This actually has lower overhead than rdtsc. See What will be the exact code to get count of last level cache misses on Intel Kaby Lake architecture for a summary of ways to access perf counters in user-space.
I also found a paper about adding user-space rdpmc support to Linux perf (PAPI): ftp://ftp.cs.uoregon.edu/pub/malony/ESPT/Papers/espt-paper-1.pdf. IDK if that made it into mainline kernel/perf code or not.

On a system with an ARM A9 processor, L2 cache and SRAM, is it possible to have a C program get the following performance data

On a system with an ARM A9 processor, L2 cache and SRAM, is it possible to have a C program get the following performance data:
Avg. SRAM data fetch delay.
Avg. instruction fetch delay.
If you have hardware targets to run and measure on, you could create test code that gets cycle counts between different points of its execution, using the Cortex-A9 PMU (see the A9 TRM, chapter 11). Your test code would need to initialize and read from the PMU registers. The PMU will then measure cycle counts and give you other interesting data, e.g. the number of cache misses. That much is doable in software.
However, that resulting performance data may not be as low-level as you may want.
Consider a loop over a block of NOP instructions, with loop counter in a register. L1 instruction cache would fill on the first iteration. PMU can give you a measurement of instruction cycles and total time. That measurement would relate to L1 instruction fetch delay (unless you use a really big block, in which case you might shed light on L2).
Similarly you could construct test code whose execution time will also include effect of data fetch delay.
There is ARM example code which shows how PMU can be used.
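As a rough illustration of the kind of test code meant here (assumes ARMv7-A / Cortex-A9 and either privileged execution or a kernel that has enabled user access via PMUSERENR; the helper names are mine):
#include <stdint.h>

/* Enable the PMU cycle counter (PMCCNTR) and read it. */
static inline void pmu_enable_cycle_counter(void)
{
    uint32_t v;
    /* PMCR: set E (enable) and C (cycle counter reset) */
    __asm__ volatile ("mrc p15, 0, %0, c9, c12, 0" : "=r" (v));
    v |= (1u << 0) | (1u << 2);
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r" (v));
    /* PMCNTENSET: enable the cycle counter (bit 31) */
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r" (1u << 31));
}

static inline uint32_t pmu_read_cycle_counter(void)
{
    uint32_t cycles;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));  /* PMCCNTR */
    return cycles;
}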
You may find processor internals to be complicated. If L2 is your primary interest, the controller e.g. L2C-310 may have its own event counters, although I haven't used such.

CPU Cycle count based profiling in C/C++ Linux x86_64

I am using the following code to profile my operations to optimize on cpu cycles taken in my functions.
static __inline__ unsigned long GetCC(void)
{
    unsigned a, d;
    asm volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}
I don't think it is the best, since even two consecutive calls give me a difference of "33".
Any suggestions ?
I personally think the rdtsc instruction is great and usable for a variety of tasks. I do not think that using cpuid is necessary to prepare for rdtsc. Here is how I reason around rdtsc:
Since I use the Watcom compiler I've implemented rdtsc using "#pragma aux" which means that the C compiler will generate the instruction inline, expect the result in edx:eax and also inform its optimizer that the contents of eax and edx have been modified. This is a huge improvement from traditional _asm implementations where the optimizer would stay away from optimizing in _asm's vicinity. I've also implemented a divide_U8_by_U4 using "#pragma aux" so that I won't need to call a lib function when I convert clock_cycles to us or ms.
Every execution of rdtsc will result in some overhead (a lot more if it is encapsulated as in the author's example), which matters more the shorter the measured sequence is. Generally I don't time sequences shorter than about 1/30 of the internal clock frequency, which usually works out to 1/10^8 seconds (with a 3 GHz internal clock). I use such measurements as indications, not fact. Knowing this I can leave out cpuid. The more times I measure, the closer to fact I will get.
To measure reliably I would use the 1/100 - 1/300 range, i.e. 0.03 - 0.1 us. In this range the additional accuracy of using cpuid is practically insignificant. I use this range for short sequence timing. This is my "non-standard" unit since it depends on the CPU's internal clock frequency. For example, on a 1 GHz machine I would not use 0.03 us because that would put me outside the 1/100 limit and my readings would become indications. Here I would use 0.1 us as the shortest time measurement unit. 1/300 would not be used since it would be too close to 1 us (see below) to make any significant difference.
For even longer processing sequences I divide the difference between two rdtsc readings by, say, 3000 (for 3 GHz) to convert the elapsed clock cycles to us. Actually I use (diff+1500)/3000, where 1500 is half of 3000 and is there for rounding. For I/O waits I use milliseconds => (diff+1500000)/3000000. These are my "standard" units. I very seldom use seconds.
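In C, those rounded conversions look like this (assuming the 3 GHz clock used in the text; the helper names are mine):
#include <stdint.h>

/* Rounded conversion of a cycle delta to microseconds / milliseconds
 * at 3000 cycles per microsecond. */
static inline uint64_t cycles_to_us(uint64_t diff) { return (diff + 1500) / 3000; }
static inline uint64_t cycles_to_ms(uint64_t diff) { return (diff + 1500000) / 3000000; }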
Sometimes I get unexpectedly slow results and then I must ask myself: is this due to an interrupt or to the code? I measure a few more times to see if it was, indeed, an interrupt. In that case ... well interrupts happen all the time in the real world. If my sequence is short then there is a good possibility that the next measurement won't be interrupted. If the sequence is longer interrupts will occur more often and there isn't much I can do about it.
Measuring long elapsed times very accurately (hour and longer ETs in us or lower) will increase the risk of getting a division exception in divide_U8_by_U4, so I think through when to use us and when to use ms.
I also have code for basic statistics. Using this I log min and max values and I can calculate mean and standard deviation. This code is non-trivial so its own ET must be subtracted from the measured ETs.
If the compiler is doing extensive optimizations and your readings are stored in local variables the compiler may determine ("correctly") that the code can be omitted. One way to avoid this is to store the results in public (non-static, non-stack-based) variables.
Programs running in real-world conditions should be measured in real-world conditions, there's no way around that.
As to the question of the time stamp counter being accurate, I would say that, assuming the TSCs on different cores are synchronized (which is the norm), there is the problem of CPU throttling during periods of low activity to reduce energy consumption. It is always possible to inhibit that functionality when testing. If you're executing an instruction at 1 GHz or at 10 MHz on the same processor, the elapsed cycle count will be the same even though the former completed in 1% of the time compared to the latter.
Trying to count the cycles of an individual execution of a function is not really the right way to go. The fact that your process can be interrupted at any time, along with delays caused by cache misses and branch mispredictions means that there can be considerable deviation in the number of cycles taken from call to call.
The right way is either:
Count the number of cycles or CPU time (with clock()) taken for a large number of calls to the function, then average them; or
Use a cycle-level emulating profiler like Callgrind / kcachegrind.
By the way, you need to execute a serialising instruction before RDTSC. Typically CPUID is used.
You are on the right track [1], but you need to do two things:
Run cpuid instruction before rdtsc to flush the CPU pipeline (makes measurement more reliable). As far as I recall it clobbers registers from eax to edx.
Measure real time. There is a lot more to execution time than just CPU cycles (locking contention, context switches and other overhead you don't control). Calibrate TSC ticks with real time. You can do it in a simple loop that takes differences between measurements of, say, gettimeofday (Linux, since you didn't mention the platform) calls and rdtsc output; then you can tell how much time each TSC tick takes (a small calibration sketch is shown below). Another consideration is synchronization of TSC across CPUs, because each core may have its own counter. In Linux you can see it in /proc/cpuinfo; your CPU should have a constant_tsc flag. Most newer Intel CPUs I've seen have this flag.
[1] I have personally found rdtsc to be more accurate than system calls like gettimeofday() for fine-grained measurements.
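A minimal sketch of such a calibration loop (assuming Linux and x86; the helper name, the usleep window, and the use of __rdtsc() are choices of this sketch, not part of the original answer):
#include <stdint.h>
#include <sys/time.h>
#include <unistd.h>
#include <x86intrin.h>

/* Estimate nanoseconds per TSC tick by sampling gettimeofday() and rdtsc
 * around a fixed delay.  A real calibration would repeat this and discard
 * outliers. */
double ns_per_tick(void)
{
    struct timeval tv0, tv1;
    uint64_t c0, c1;

    gettimeofday(&tv0, NULL);
    c0 = __rdtsc();
    usleep(100 * 1000);                       /* ~100 ms window */
    c1 = __rdtsc();
    gettimeofday(&tv1, NULL);

    double ns = (tv1.tv_sec - tv0.tv_sec) * 1e9 +
                (tv1.tv_usec - tv0.tv_usec) * 1e3;
    return ns / (double)(c1 - c0);
}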
Another thing you might need to worry about is if you are running on a multi-core machine the program could be moved to a different core, which will have a different rdtsc counter. You may be able to pin the process to one core via a system call, though.
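On Linux that system call is sched_setaffinity(); a minimal sketch of pinning the current process to one core (the helper name is mine):
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to CPU 0 so all rdtsc readings come from the
 * same core's counter. */
static int pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = this process */
}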
If I were trying to measure something like this, I would probably record the time stamps to an array and then come back and examine this array after the code being benchmarked had completed. When you examine the data recorded in the array of timestamps, keep in mind that reading the array will itself go through the CPU cache (and possibly paging if the array is big); you could prefetch it, or simply keep that in mind as you analyze the data. You should see a very regular time delta between time stamps, with several spikes and possibly a few dips (probably from getting moved to a different core). The regular time delta is probably your best measurement, since it suggests that no outside events affected those measurements.
That being said, if the code you are benchmarking has irregular memory access patterns or run times or relies on system calls (especially IO related ones) then you will have a difficult time separating the noise from the data you are interested in.
The TSC isn't a good measure of time. The only guarantee the CPU makes about the TSC is that it rises monotonically (that is, if you RDTSC once and then do it again, the second one will return a result that is higher than the first) and that it will take a very long time to wrap around.
Do I understand correctly that the reason you do this is to bracket other code with it so you can measure how long the other code takes?
I'm sure you know another good way to do that is just loop the other code 10^6 times, stopwatch it, and call it microseconds.
Once you've measured the other code, am I correct to assume you want to know which lines in it are worth optimizing, so as to reduce the time it takes?
If so, you're on well-trod ground. You could use a tool like Zoom or LTProf. Here's my favorite method.
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross architecture wrapper for performance events.
This answer is basically the same as the one for this C++ question: How to get the CPU cycle count in x86_64 from C++? See that answer for more details.
perf_event_open.c
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    uint64_t n;

    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("%lld\n", count);

    close(fd);
}

Throughput calculation using cycle count

Is it possible to determine the throughput of an application on a processor from the cycle counts (Processor instruction cycles) consumed by the application ? If yes, how to calculate it ?
If the process is entirely CPU bound, then you divide the processor speed (cycles per second) by the number of cycles consumed per item of work to get the throughput (items per second).
In reality, few processes are entirely CPU bound though, in which case you have to take other factors (disk speed, memory speed, serialization, etc.) into account.
Simple:
#include <time.h>
clock_t c;
c = clock(); // c holds clock ticks value
c = c / CLOCKS_PER_SEC; // real time, if you need it
Note that the value you get is an approximation, for more info see the clock() man page.
Some CPUs have internal performance registers which enable you to collect all sorts of interesting statistics, such as instruction cycles (sometimes even on a per execution unit basis), cache misses, # of cache/memory reads/writes, etc. You can access these directly, but depending on what CPU and OS you are using there may well be existing tools which manage all the details for you via a GUI. Often a good profiling tool will have support for performance registers and allow you to collect statistics using them.
If you use the Cortex-M3 from TI/Luminary Micro, you can make use of the driverlib delivered by TI/Luminary Micro.
Using the SysTick functions you can set the SysTick period to 1 processor cycle, so you have 1 processor clock between interrupts. By counting the number of interrupts you should get a "near enough estimation" of how much time a function or function block takes.
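For reference, here is a sketch of a related approach that avoids per-cycle interrupts: run SysTick from the processor clock with the maximum 24-bit reload and take differences of the down-counter. The register addresses are the architectural ARMv7-M ones; the helper names and usage are assumptions of this sketch, not the driverlib API.
#include <stdint.h>

/* ARMv7-M SysTick registers (architecturally fixed addresses). */
#define SYST_CSR  (*(volatile uint32_t *)0xE000E010)  /* control/status */
#define SYST_RVR  (*(volatile uint32_t *)0xE000E014)  /* reload value   */
#define SYST_CVR  (*(volatile uint32_t *)0xE000E018)  /* current value  */

static void systick_init(void)
{
    SYST_RVR = 0x00FFFFFF;        /* maximum 24-bit reload */
    SYST_CVR = 0;                 /* any write clears the counter */
    SYST_CSR = (1u << 2) | 1u;    /* CLKSOURCE = processor clock, ENABLE */
}

static inline uint32_t systick_now(void) { return SYST_CVR; }

/* SysTick counts down, so elapsed cycles = start - end (mod 2^24);
 * only valid for sections shorter than 2^24 cycles. */
static inline uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & 0x00FFFFFF;
}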

Assembly CPU frequency measuring algorithm

What are the common algorithms being used to measure the processor frequency?
Intel CPUs after Core Duo support two Model-Specific registers called IA32_MPERF and IA32_APERF.
MPERF counts at the maximum frequency the CPU supports, while APERF counts at the actual current frequency.
The actual frequency is given by: actual_freq = max_freq * APERF / MPERF (using the deltas of the two counters over the measurement interval).
You can read the two MSRs with this sequence:
; read MPERF
mov ecx, 0xe7
rdmsr
mov mperf_var_lo, eax
mov mperf_var_hi, edx
; read APERF
mov ecx, 0xe8
rdmsr
mov aperf_var_lo, eax
mov aperf_var_hi, edx
but note that rdmsr is a privileged instruction and can run only in ring 0.
I don't know if the OS provides an interface to read these, though their main usage is for power management, so it might not provide such an interface.
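On Linux specifically, the msr kernel module does expose these registers to privileged user space as /dev/cpu/N/msr, where a pread() at the MSR address returns its 64-bit value. A rough sketch (requires root and `modprobe msr`; error handling is minimal):
#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static uint64_t read_msr(int cpu, uint32_t msr)
{
    char path[64];
    uint64_t value = 0;

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        pread(fd, &value, sizeof value, msr);   /* offset = MSR number */
        close(fd);
    }
    return value;
}

/* Usage idea: sample MSR 0xE7 (MPERF) and 0xE8 (APERF) twice and apply
 * the APERF/MPERF ratio to the maximum frequency. */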
I'm gonna date myself with various details in this answer, but what the heck...
I had to tackle this problem years ago on Windows-based PCs, so I was dealing with Intel x86 series processors like 486, Pentium and so on. The standard algorithm in that situation was to do a long series of DIVide instructions, because those are typically the most CPU-bound single instructions in the Intel set. So memory prefetch and other architectural issues do not materially affect the instruction execution time -- the prefetch queue is always full and the instruction itself does not touch any other memory.
You would time it using the highest resolution clock you could get access to in the environment you are running in. (In my case I was running near boot time on a PC compatible, so I was directly programming the timer chips on the motherboard. Not recommended in a real OS, usually there's some appropriate API to call these days).
The main problem you have to deal with is different CPU types. At that time there was Intel, AMD and some smaller vendors like Cyrix making x86 processors. Each model had its own performance characteristics vis-a-vis that DIV instruction. My assembly timing function would just return a number of clock cycles taken by a certain fixed number of DIV instructions done in a tight loop.
So what I did was to gather some timings (raw return values from that function) from actual PCs running each processor model I wanted to time, and record those in a spreadsheet against the known processor speed and processor type. I actually had a command-line tool that was just a thin shell around my timing function, and I would take a disk into computer stores and get the timings off of display models! (I worked for a very small company at the time).
Using those raw timings, I could plot a theoretical graph of what timings I should get for any known speed of that particular CPU.
Here was the trick: I always hated when you would run a utility and it would announce that your CPU was 99.8 MHz or whatever. Clearly it was 100 MHz and there was just a small round-off error in the measurement. In my spreadsheet I recorded the actual speeds that were sold by each processor vendor. Then I would use the plot of actual timings to estimate projected timings for any known speed. But I would build a table of points along the line where the timings should round to the next speed.
In other words, if 100 ticks to do all that repeated dividing meant 500 MHz, and 200 ticks meant 250 MHz, then I would build a table that said that anything below 150 was 500 MHz, and anything above that was 250 MHz. (Assuming those were the only two speeds available from that chip vendor.) It was nice because even if some odd piece of software on the PC was throwing off my timings, the end result would often still be dead on.
Of course now, in these days of overclocking, dynamic clock speeds for power management, and other such trickery, such a scheme would be much less practical. At the very least you'd need to do something to make sure the CPU was in its highest dynamically chosen speed first before running your timing function.
OK, I'll go back to shooing kids off my lawn now.
One way on x86 Intel CPUs since the Pentium would be to use two samplings of the RDTSC instruction around a delay loop of known wall time, e.g.:
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
uint64_t rdtsc(void) {
    uint64_t result;
    __asm__ __volatile__ ("rdtsc" : "=A" (result));
    return result;
}

int main(void) {
    uint64_t ts0, ts1;

    ts0 = rdtsc();
    sleep(1);
    ts1 = rdtsc();
    printf("clock frequency = %llu\n", ts1 - ts0);
    return 0;
}
(on 32-bit platforms with GCC)
RDTSC is available in ring 3 as long as the TSD (time-stamp disable) flag in CR4 is clear, which is the common case but not guaranteed. One shortcoming of this method is that it is vulnerable to frequency scaling changes affecting the result if they happen inside the delay. To mitigate that you could execute code that keeps the CPU busy and constantly poll the system time to see if your delay period has expired, to keep the CPU in the highest frequency state available.
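A sketch of that mitigation, replacing sleep() with a busy poll of clock_gettime() so the core stays loaded for the whole interval (the function name and the 1-second window are choices of this sketch):
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

static uint64_t ticks_per_second(void)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    uint64_t c0 = __rdtsc();
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);   /* busy poll keeps the CPU loaded */
    } while (now.tv_sec - start.tv_sec +
             (now.tv_nsec - start.tv_nsec) * 1e-9 < 1.0);
    return __rdtsc() - c0;
}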
I use the following (pseudo)algorithm:
basetime = time();          /* time() returns seconds */
while (time() == basetime)
    ;
stclk = rdtsc();            /* rdtsc is an assembly instruction */
basetime = time();
while (time() == basetime)
    ;
endclk = rdtsc();
nclks = endclk - stclk;
At this point you might assume that you've determined the clock frequency but even though it appears correct it can be improved.
All PCs contain a PIT (Programmable Interval Timer) device with counters that are (or at least used to be) used for serial ports and the system clock. It was fed with a frequency of 1193182 Hz. The system clock counter was set to the highest countdown value (65536), resulting in a system clock tick frequency of 1193182/65536 => 18.2065 Hz, or once every 54.925 milliseconds.
The number of ticks necessary for the clock to increment to the next second will therefore vary: usually 18 ticks are required, sometimes 19. This can be handled by performing the algorithm above twice and storing both results. The two results will either correspond to two 18-tick sequences, or to one 18-tick and one 19-tick sequence; two 19s in a row won't occur. So by taking the smaller of the two results you will have an 18-tick second. Adjust this result by multiplying by 18.2065 and dividing by 18.0 or, using integer arithmetic, multiply by 182065, add 90000 and divide by 180000. 90000 is one half of 180000 and is there for rounding. If you choose the integer-arithmetic route, make sure you use 64-bit multiplication and division.
You will now have the CPU clock speed x in Hz, which can be converted to kHz ((x+500)/1000) or MHz ((x+500000)/1000000). The 500 and 500000 are one half of 1000 and 1000000 respectively and are there for rounding. To calculate MHz, do not go via the kHz value because rounding issues may arise; use the Hz value and the second formula.
That was the intention of things like BogoMIPS, but CPUs are a lot more complicated nowadays. Superscalar CPUs can issue multiple instructions per clock, making any measurement based on counting clock cycles to execute a block of instructions highly inaccurate.
CPU frequencies are also variable based on offered load and/or temperature. The fact that the CPU is currently running at 800 MHz does not mean it will always be running at 800 MHz, it might throttle up or down as needed.
If you really need to know the clock frequency, it should be passed in as a parameter. An EEPROM on the board would supply the base frequency, and if the clock can vary you'd need to be able to read the CPU's power-state registers (or make an OS call) to find out the frequency at that instant.
With all that said, there may be other ways to accomplish what you're trying to do. For example if you want to make high-precision measurements of how long a particular codepath takes, the CPU likely has performance counters running at a fixed frequency which are a better measure of wall-clock time than reading a tick count register.
"lmbench" provides a cpu frequency algorithm portable for different architecture.
It runs some different loops and the processor's clock speed is the greatest common divisor of the execution frequencies of the various loops.
This method should always work when we are able to get loops with cycle counts that are relatively prime.
http://www.bitmover.com/lmbench/
One option is to sense the CPU frequency, by running code with known instructions per loop
This functionality is contained in 7zip, since about v9.20 I think.
> 7z b
7-Zip 9.38 beta Copyright (c) 1999-2014 Igor Pavlov 2015-01-03
CPU Freq: 4266 4000 4266 4000 2723 4129 3261 3644 3362
The final number is meant to be correct (and on my PC and many others I have found it to be quite accurate; the test runs very quickly, so turbo may not kick in, and servers set to Balanced/Power Save modes will most likely give readings of around 1 GHz).
The source code is at GitHub (Official source is a download from 7-zip.org)
With the most significant portion being:
#define YY1 sum += val; sum ^= val;
#define YY3 YY1 YY1 YY1 YY1
#define YY5 YY3 YY3 YY3 YY3
#define YY7 YY5 YY5 YY5 YY5

static const UInt32 kNumFreqCommands = 128;

EXTERN_C_BEGIN

static UInt32 CountCpuFreq(UInt32 sum, UInt32 num, UInt32 val)
{
  for (UInt32 i = 0; i < num; i++)
  {
    YY7
  }
  return sum;
}

EXTERN_C_END
On Intel CPUs, a common method to get the current (average) CPU frequency is to calculate it from a few CPU counters:
CPU_freq = tsc_freq * (aperf_t1 - aperf_t0) / (mperf_t1 - mperf_t0)
The TSC (Time Stamp Counter) can be read from userspace with dedicated x86 instructions, but its frequency has to be determined by calibration against a clock. The best approach is to get the TSC frequency from the kernel (which already has done the calibration).
The aperf and mperf counters are model specific registers MSRs that require root privileges for access. Again, there are dedicated x86 instructions for accessing the MSRs.
Since the mperf counter rate is directly proportional to the TSC rate and the aperf rate is directly proportional to the CPU frequency you get the CPU frequency with the above equation.
Of course, if the CPU frequency changes in your t0 - t1 time delta (e.g. due to frequency scaling) you get the average CPU frequency with this method.
I wrote a small utility cpufreq which can be used to test this method.
See also:
[PATCH] x86: Calculate MHz using APERF/MPERF for cpuinfo and scaling_cur_freq. 2016-04-01, LKML
Frequency-invariant utilization tracking for x86. 2020-04-02, LWN.net
I'm not sure why you need assembly for this. If you're on a machine that has the /proc filesystem, then running:
> cat /proc/cpuinfo
might give you what you need.
A quick google on AMD and Intel shows that CPUID should give you access to the CPU's max frequency.
