Why is nop not taking one clock cycle - c

I wrote some basic code to find out the number of clock cycles taken by nop. We know nop takes one clock cycle.
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint32_t low1, low2, high1, high2;
    uint64_t timestamp1, timestamp2;

    asm volatile ("rdtsc" : "=a"(low1), "=d"(high1));
    asm("nop");
    asm volatile ("rdtsc" : "=a"(low2), "=d"(high2));

    timestamp1 = ((uint64_t)high1 << 32) | low1;
    timestamp2 = ((uint64_t)high2 << 32) | low2;

    printf("Diff:%lu\n", timestamp2 - timestamp1);
    return 0;
}
But the output is not 1.
It is sometimes 14 or 16.
May I know the reason behind this? Am I missing anything?

We know nop takes one clock cycle.
A modern CPU can be thought of as a pipeline of stages, where the front end might fetch and decode multiple instructions in parallel and put the resulting micro-ops into a buffer, where they wait for their dependencies to be satisfied before being taken by an execution unit (and multiple micro-ops can be executed at the same time by multiple execution units).
A NOP has no micro-ops - it's simply discarded by the front end. It doesn't cost 1 cycle.
But the output is not 1.
It probably takes 14 or 16 cycles for the instructions the compiler generates to deal with the outputs of the first rdtsc, then prepare things for the second rdtsc, then the second rdtsc itself.
Note that rdtsc probably counts the cycles of a fixed-frequency timer that has nothing to do with the CPU's current (variable) clock frequency; so 14 or 16 "time cycles" might be (e.g.) 7 or 8 CPU cycles.
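One way to see how much of the reported difference is pure measurement overhead is to time an empty region, i.e. two TSC reads back to back, and treat that as the baseline. A minimal sketch, assuming x86-64 GCC/Clang inline asm (the lfence is there only to keep rdtsc from drifting relative to the surrounding code):

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc_fenced(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("lfence\n\trdtsc" : "=a"(lo), "=d"(hi) :: "memory");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    /* Empty timed region: whatever this prints is pure measurement
     * overhead, in reference-clock ticks rather than core cycles. */
    uint64_t t1 = rdtsc_fenced();
    uint64_t t2 = rdtsc_fenced();
    printf("Baseline overhead: %llu ticks\n", (unsigned long long)(t2 - t1));
    return 0;
}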

Related

C microbenchmark 'bug' when measuring store latency

I have been trying a few experiments on x86 - namely the effect of mfence on store/load latencies, etc.
Here is what I have started with:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define ARRAY_SIZE 10
#define DUMMY_LOOP_CNT 1000000

int main()
{
    char array[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
        array[i] = 'x'; // This is to force the OS to actually allocate the array
    asm volatile ("mfence\n");
    for (int i = 0; i < DUMMY_LOOP_CNT; i++); // A dummy loop just to warm up the processor
    struct result_tuple {
        uint64_t tsp_start;
        uint64_t tsp_end;
        int offset;
    };
    struct result_tuple* results = calloc(ARRAY_SIZE, sizeof(struct result_tuple));
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        uint64_t *tsp_start, *tsp_end;
        tsp_start = &results[i].tsp_start;
        tsp_end = &results[i].tsp_end;
        results[i].offset = i;
        asm volatile (
            "mfence\n"
            "rdtscp\n"
            "mov %%rdx,%[arg]\n"
            "shl $32,%[arg]\n"
            "or %%rax,%[arg]\n"
            : [arg]"=&r"(*tsp_start)
            :: "rax", "rdx", "rcx", "memory"
        );
        array[i] = 'y'; // A simple store
        asm volatile (
            "mfence\n"
            "rdtscp\n"
            "mov %%rdx,%[arg]\n"
            "shl $32,%[arg]\n"
            "or %%rax,%[arg]\n"
            : [arg]"=&r"(*tsp_end)
            :: "rax", "rdx", "rcx", "memory"
        );
    }
    printf("Offset\tLatency\n");
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        printf("%d\t%lu\n", results[i].offset, results[i].tsp_end - results[i].tsp_start);
    }
    free(results);
}
I compile quite simply with gcc microbenchmark.c -o microbenchmark
My system configuration is as follows:
CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Operating system : GNU/Linux (Linux 5.4.80-2)
My issue is this:
In a single run, all the latencies are similar
When repeating the experiment over and over, I don't get results similar to the previous run!
For instance:
In run 1 I get:
Offset Latency
1 275
2 262
3 262
4 262
5 275
...
252 275
253 275
254 262
255 262
In another run I get:
Offset Latency
1 75
2 75
3 75
4 72
5 72
...
251 72
252 72
253 75
254 75
255 72
This is pretty surprising (the between-run variation is quite high, whereas the within-run variation is negligible)! I am not sure how to explain this. What is the issue with my microbenchmark?
Note: I do understand that a normal store would be a write-allocate store, technically making my measurement that of a load (rather than a store). Also, mfence should flush the store buffer, thereby ensuring that no stores are 'delayed'.
Your warm-up dummy loop only does 1 million iterations, roughly 6 million clock cycles in a -O0 debug build - probably not long enough to get the CPU up to max turbo on a CPU from before Skylake's hardware P-state management. (Idiomatic way of performance evaluation?)
RDTSCP counts fixed-frequency reference cycles, not core clock cycles. Your runs are so short that all the run-to-run variation is probably explained by the CPU frequency being low or high. See How to get the CPU cycle count in x86_64 from C++?
Also, this debug (-O0) build will do extra stores and reloads inside your timed region, but "fortunately" the results[i].offset = i; store plus the mfence before the first rdtscp ensures the result array is also hot in cache before entering the timed region.
Your array is tiny, and you're only doing 1-byte stores (so 64 stores are all in the same cache line.) It's very likely still in MESI Modified state from when you initialized it, so I wouldn't expect an RFO on any of the array[i] = 'y' stores. That already happened for the few lines of stack memory involved before your timed loop. If you want to pre-fault the array without also getting it cached, maybe touch one line per 4k page and leave the other lines untouched. But HW prefetch will get ahead of your stores, especially if you only store 1 byte at a time with 2 slow mfences per store, so again the waiting for off-core memory requests will be outside the timed region. You should expect data to already be in L1d cache or at least L2 in Exclusive state, ready to be flipped to Modified on a store.
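A rough sketch of that pre-faulting idea (assuming 4 KiB pages and an array big enough to span several pages; the page-sized step is the assumption here):

for (size_t off = 0; off < ARRAY_SIZE; off += 4096)
    array[off] = 'x';   /* fault each page in, but leave most cache lines untouched */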
BTW, having an offset member seems pointless; it can be implicit from the array index, e.g. print i instead of results[i].offset. It's also not very useful to store both start and stop absolute TSC values. You could just store a 32-bit difference, then you wouldn't need to shift / OR in your inline asm, just declare a clobber on the unused EDX output.
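A sketch of that simplification (not the original code; it assumes results is changed to a plain uint32_t array):

uint32_t t0, t1;
asm volatile ("mfence\n\trdtscp" : "=a"(t0) : : "rdx", "rcx", "memory");
array[i] = 'y';
asm volatile ("mfence\n\trdtscp" : "=a"(t1) : : "rdx", "rcx", "memory");
results[i] = t1 - t0;   /* a 32-bit difference is plenty for such a short interval */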
Also note that "store latency" typically only matters for performance in real code when mfence is involved. Otherwise the important thing is store->load forwarding, which can happen from the store buffer before the store commits to L1d cache. That's about 6 cycles, or sometimes lower if the reload isn't attempted right away. (It's variable on Sandybridge-family.)

Inconsistent values of ARM PMU cycles counter

I'm trying to measure performance of my code in linux kernel with pmu.
First of all, I wanted to test the PMU, so I created a simple loop of a couple of operations in the kernel. I placed it under a spinlock with interrupts disabled so my test code can't be preempted. Then I printed the cycle counter to check how many CPU cycles this loop takes. But I see very different values on each print: 100, 500, 1000, 200, ...
My question is: why do I see such different values every time?
PS: in contrast to the cycle counter, the PMU's instruction counter is stable and I see the same values every time.
I also tried to use the ARM timer, but it also shows different values, similar to the PMU's cycle counter.
Here is how I use ARM timer to measure performance:
unsigned long long ticks_start, ticks_end;
int i = 0, j;
unsigned long flags;

spin_lock_irqsave(&lock, flags);
while (i++ < 100) {
    j = 0;
    asm volatile("mrs %0, CNTPCT_EL0" : "=r" (ticks_start));
    while (j++ < 10000) {
        asm volatile ("nop");
    }
    asm volatile("mrs %0, CNTPCT_EL0" : "=r" (ticks_end));
    printk("ticks %d are: %llu\n", i, ticks_end - ticks_start);
}
spin_unlock_irqrestore(&lock, flags);
and the output on a real device (Cortex-A57) is:
...
ticks 31 are: 2287
ticks 32 are: 2287
ticks 33 are: 2287
ticks 34 are: 1984
ticks 35 are: 457
ticks 36 are: 1604
ticks 37 are: 2287
...
When using things like timers and the PMU on Arm, you should insert an isb instruction before the read of the counter register. The processor is allowed by the architecture to read the register speculatively, early or late, since the read is not dependent on your inner loop of nops.
So try this:
asm volatile("isb; mrs %0, CNTPCT_EL0" : "=r" (ticks_end));
The isb will flush the pipeline before letting the mrs instruction proceed. It is possible the CPU is also thermally throttling; that should not affect your measurements with the cycle counter, but it would if you were reading the generic timer to measure time.
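Applied to the loop in the question, the timed region would look roughly like this (with isb added before both reads, which is the usual practice):

j = 0;
asm volatile("isb; mrs %0, CNTPCT_EL0" : "=r" (ticks_start));
while (j++ < 10000) {
    asm volatile ("nop");
}
asm volatile("isb; mrs %0, CNTPCT_EL0" : "=r" (ticks_end));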

Best way to add delay/do nothing for n cpu cycles

I need to add a delay into my code of n CPU cycles (~30).
My current solution is the one below, which works but isn't very elegant.
Also, the delay has to be known at compile time. I can work with this, but it would be ideal if I could change the delay at runtime.
(It is OK if there is some overhead, but I need the 1 cycle resolution.)
I do not have any peripheral timers left that I could use, so it needs to be a software solution.
do_something();
#define NUMBER_OF_NOPS (SOME_DELAY + 3)
#include "nops.h"
#undef NUMBER_OF_NOPS
do_the_next_thing();
nops.h:
#if NUMBER_OF_NOPS > 0
__ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 1
__ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 2
__ASM volatile ("nop");
#endif
...
On Cortex devices, NOP is something which literally means nothing. There is no guarantee that the NOP will consume any time; NOPs are used for padding only. If you have several consecutive NOPs, they will just be flushed from the pipeline.
For more information refer to the Cortex-M0 documentation. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CHDJJGFB.html
Software delays are quite tricky on Cortex devices, and you should use other instructions, possibly plus barrier instructions, instead.
Use ISB instructions: 4 clocks plus flash access time, which depends on what speed the core is running at. For very precise delays, place this part of the code in SRAM.
Edit: There is a better answer from another SO Q&A here. However, it is in assembly; AFAIK using a counter like SysTick is the only way to guarantee any semblance of cycle accuracy.
Edit 2: To avoid a counter overflow, which would result in a very, very long delay, clear the SysTick counter before use, i.e. SysTick->VAL = 0;
Original:
Cortex-Ms have a built in timer called SysTick which can be used for cycle accurate timing purposes.
First enable the timer:
SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk |
SysTick_CTRL_ENABLE_Msk;
Then you can read the current count using the VAL register. You can then implement a cycle accurate delay this way:
uint32_t count = SysTick->VAL;           // note: SysTick counts down
while ((count - SysTick->VAL) < 30);     // elapsed ticks since the snapshot
Note that this will introduce some overhead because of the load, compare and branch in the loop so the final cycle count will be a little off, no more than a few ticks in my estimation.
You can use a free-running up-counter as follows:
uint32_t t = <periph>.count;
while ((<periph>.count - t) < delay);
As long as delay is less than half the period of the counter, this is unaffected by wrapping of the counter value - the unsigned arithmetic produces the correct time delta.
Note that since you don't need to control the counter's value in any way, you can use any such counter in the system - even if it's being used for another purpose (as long, of course, as it really is running continuously and freely, and at a rate that gives you the timing resolution that you require).
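A concrete sketch of that pattern (the count register here is a placeholder for whatever free-running 32-bit up-counter your part actually provides):

#include <stdint.h>

/* cnt points at a memory-mapped, free-running 32-bit up-counter
 * (hypothetical register; substitute the real one, e.g. &TIMx->CNT). */
static inline void delay_ticks(volatile uint32_t *cnt, uint32_t delay)
{
    uint32_t t = *cnt;                   /* snapshot the free-running counter */
    while ((uint32_t)(*cnt - t) < delay)
        ;                                /* unsigned subtraction handles wrap-around */
}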

The use of rdtsc() in my program to obtain the number of clock cycles for single- and double-word operations?

In theory, the cost of a double-word addition/subtraction is taken to be 2 times that of a single-word one. Similarly, the cost ratio of single-word multiplication to addition is taken as 3. I have written the following C program using GCC on Ubuntu LTS 14.04 to check the number of clock cycles on my machine, an Intel Sandy Bridge Core i5-2410M. Most of the time the program returns 6 clock cycles for 128-bit addition, but I have taken the best case. I compiled using the command (gcc -o ow -O3 cost.c) and the result is given below.
32-bit Add: Clock cycles = 1 64-bit Add: Clock cycles = 1 64-bit Mult: Clock cycles = 2 128-bit Add: Clock cycles = 5
The program is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define n 500
#define counter 50000

typedef uint64_t utype64;
typedef int64_t type64;
typedef __int128 type128;

__inline__ utype64 rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__ ("xorl %%eax,%%eax \n cpuid"::: "%rax", "%rbx", "%rcx", "%rdx");
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (utype64)hi << 32 | lo;
}

int main(){
    utype64 start, end;
    type64 a[n], b[n], c[n];
    type128 d[n], e[n], f[n];
    int g[n], h[n];
    unsigned short i, j;
    srand(time(NULL));
    for(i=0;i<n;i++){ g[i]=rand(); h[i]=rand(); b[i]=(rand()+2294967295); e[i]=(type128)(rand()+2294967295)*(rand()+2294967295);}
    for(j=0;j<counter;j++){
        start=rdtsc();
        for(i=0;i<n;i++){ a[i]=(type64)g[i]+h[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0)
            printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(g[0])*8, (end-start)/n);
        start=rdtsc();
        for(i=0;i<n;i++){ c[i]=a[i]+b[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0)
            printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(a[0])*8, (end-start)/n);
        start=rdtsc();
        for(i=0;i<n;i++){ d[i]=(type128)c[i]*b[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0)
            printf("%lu-bit Mult: Clock cycles = %lu \t", sizeof(c[0])*8, (end-start)/n);
        start=rdtsc();
        for(i=0;i<n;i++){ f[i]=d[i]+e[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0){
            printf("%lu-bit Add: Clock cycles = %lu \n", sizeof(d[0])*8, (end-start)/n);
            printf("f[%hu]= %ld %ld \n\n", i-7, (type64)(f[i-7]>>64), (type64)(f[i-7]));
        }
    }
    return 0;
}
There are two things in the result that bother me.
1) Can the number of clock cycles for (64-bit) multiplication really be 2?
2) Why is the number of clock cycles for double-word addition more than 2 times that of single-word addition?
I am mainly concerned about case (2). Now the question arises: is it because of my program logic, or is it due to GCC compiler optimization?
In theory we know that the double-word addition/subtraction takes 2 times of a single-word.
No, we don't.
Similarly, the cost ratio of single-word multiplication to addition is taken as 3 because of fast integer multiplier of CPU.
No, it isn't.
You're not measuring instructions; you're measuring statements in your program, which may or may not have any relationship with the instructions your compiler will emit. My compiler, for example, after fixing your code so that it compiles, vectorized some of the loops, adding multiple values per instruction. The first loop itself is still 23 instructions long and is still reported as 1 cycle by your code.
Modern (as in past 25 years) CPUs don't execute one instruction at a time. They'll have multiple instructions in flight at once and can execute them out of order.
Then you have memory accesses. On your CPU there are no instructions that can take a value from memory, add it to another value from memory and then store the result in a third memory location. So there must be multiple instructions executed already. Furthermore, memory accesses cost so much more than arithmetic instructions that anything that touches memory (unless it hits L1 cache all the time) will be dominated by the memory access time.
Furthermore, RDTSC might not even return the actual cycle count. Some CPUs have variable clock rates but still keep TSC going at the same rate regardless of how fast or slow the CPU is actually running because TSC is used by the operating system for time keeping. Others don't.
So you're not measuring what you think you're measuring and whoever told you those things was either oversimplifying vastly or hasn't seen CPU documentation in two decades.
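If the goal is to approximate per-operation cost at all, a somewhat more defensible pattern is to fence the TSC reads and amortize the fixed overhead over many iterations of a loop whose generated assembly you have actually inspected (e.g. with gcc -S). A rough sketch using the GCC/Clang intrinsics, not a fix of the program above:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence */

/* work() stands for whatever operation you want to measure; dividing by
 * iters amortizes the fixed rdtsc/lfence overhead. */
static uint64_t ticks_per_iter(void (*work)(void), uint64_t iters)
{
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        work();
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    return (t1 - t0) / iters;   /* reference-clock ticks, not necessarily core cycles */
}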

Why does my CPU suddenly work twice as fast?

I've been trying to use a simple profiler to measure the efficiency of some C code on a school server, and I'm hitting an odd situation. After a short amount of time (half a second-ish), the processor suddenly starts executing instructions twice as fast. I've tested for just about every possible reason I could think of (caching, load balancing on cores, CPU frequency being altered due to coming out of sleep), but everything seems normal.
For what it's worth, I'm doing this testing on a school linux server, so it's possible there's an unusual configuration I don't know about, but the processor ID being used doesn't change, and (via top) the server was completely idle as I tested.
Test code:
#include <time.h>
#include <stdio.h>

#define MY_CLOCK CLOCK_MONOTONIC_RAW
// no difference if set to CLOCK_THREAD_CPUTIME_ID

typedef struct {
    unsigned int tsc;
    unsigned int proc;
} ans_t;

static ans_t rdtscp(void){
    ans_t ans;
    __asm__ __volatile__ ("rdtscp" : "=a"(ans.tsc), "=c"(ans.proc) : : "edx");
    return ans;
}

static void nop(void){
    __asm__ __volatile__ ("");
}

void test(){
    for(int i=0; i<100000000; i++) nop();
}

int main(){
    int c=10;
    while(c-->0){
        struct timespec tstart, tend;
        ans_t start = rdtscp();
        clock_gettime(MY_CLOCK, &tstart);
        test();
        ans_t end = rdtscp();
        clock_gettime(MY_CLOCK, &tend);
        unsigned int tdiff = (tend.tv_sec-tstart.tv_sec)*1000000000 + tend.tv_nsec - tstart.tv_nsec;
        unsigned int cdiff = end.tsc - start.tsc;
        printf("%u cycles and %u ns (%lf GHz) start proc %u end proc %u\n", cdiff, tdiff, (double)cdiff/tdiff, start.proc, end.proc);
    }
}
Output I see:
351038093 cycles and 125680883 ns (2.793091 GHz) start proc 14 end proc 14
350911246 cycles and 125639359 ns (2.793004 GHz) start proc 14 end proc 14
350959546 cycles and 125656776 ns (2.793001 GHz) start proc 14 end proc 14
351533280 cycles and 125862608 ns (2.792992 GHz) start proc 14 end proc 14
350903833 cycles and 125636787 ns (2.793002 GHz) start proc 14 end proc 14
350924336 cycles and 125644157 ns (2.793002 GHz) start proc 14 end proc 14
349827908 cycles and 125251782 ns (2.792997 GHz) start proc 14 end proc 14
175289886 cycles and 62760404 ns (2.793001 GHz) start proc 14 end proc 14
175283424 cycles and 62758093 ns (2.793001 GHz) start proc 14 end proc 14
175267026 cycles and 62752232 ns (2.793001 GHz) start proc 14 end proc 14
I get similar output (with it taking a different number of tests to double in efficiency) using different optimization levels (-O0 to -O3).
Could it perhaps have something to do with hyperthreading, where two logical cores in a physical core (the server is using Xeon X5560s which may have this effect) can somehow "merge" to form one twice-as-fast processor?
Some systems scale the processor speed depending on the system load. As you justly note, this is particularly annoying when benchmarking.
If your server is running Linux, please type
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
If this outputs ondemand, powersave or userspace, then CPU frequency scaling is active, and you're going to find it very difficult to do benchmarks. If this says performance, then CPU frequency scaling is disabled.
Some CPUs have on-chip optimizations that learn the path your code usually takes. By successfully predicting what the next if statement will do, the pipeline does not need to be flushed and refilled with new operations from scratch. Depending on the chip and the algorithm, it might take 5 to 10 iterations until it successfully predicts the if statements. But there are also reasons that speak against this being the cause of the behaviour here.
Looking at your output I would say this might also just be the scheduling of the OS and/or the CPU frequency governor in use there. Are you very sure the CPU frequency doesn't change during the execution of your code? No CPU boost?
Linux tools like cpufreq are often used to regulate the CPU frequency.
Hyper-threading means replicating the register space, not the actual decode/execution units - so this is not the explanation.
To test the accuracy of the micro-benchmark method I would do the following:
Run the program with high priority
Count the number of instructions to see if it is correct. I would do that using perf stat ./binary - that means you need to have perf. I would do this multiple times and look at the clocks and instructions metrics to see how multiple instructions can execute in a single cycle.
I have some additional remarks:
For each nop you also do a comparison and a conditional jump in the for loop. If you really want to execute NOPs, I'd write a statement like this:
#define NOP5 __asm__ __volatile__ ("nop\n\tnop\n\tnop\n\tnop\n\tnop");
#define NOP25 NOP5 NOP5 NOP5 NOP5 NOP5
#define NOP100 NOP25 NOP25 NOP25 NOP25
#define NOP500 NOP100 NOP100 NOP100 NOP100 NOP100
...
for(int i=0; i<100000000; i++)
{
NOP500 NOP500 NOP500 NOP500
}
This construct will let you actually execute NOPs instead of spending most of the loop comparing i with 100M.
