Speed of printf() - C

I was having some fun in C with the time.h library, trying to measure the number of clock ticks some basic functions take, just to figure out how fast they actually are.
I used the clock() function.
In this case I was measuring the printf() function.
Look at my program:
#include <stdio.h>
#include <time.h>
int main(void)
{
    const int LIMIT = 2000;
    const int LOOP = 20;
    int results[LOOP];
    for (int i = 0; i < LOOP; i++)
    {
        clock_t time01 = clock();
        for (int j = 1; j < LIMIT; j++)
        {
            printf("a");
        }
        clock_t time02 = clock();
        results[i] = (int)(time02 - time01);
    }
    for (int i = 0; i < LOOP; i++)
    {
        printf("\nCLOCK TIME: %d.", results[i]);
    }
    getchar();
    return 0;
}
The program basically measures, 20 times over, how many clock ticks 2000 calls to printf("a") take.
The strange thing I don't understand is the result. Most of the time, even in other tests, I randomly get two groups of results:
CLOCK TIME: 31.
CLOCK TIME: 47.
CLOCK TIME: 47.
CLOCK TIME: 31.
CLOCK TIME: 47.
CLOCK TIME: 31.
CLOCK TIME: 47.
CLOCK TIME: 31.
CLOCK TIME: 47.
CLOCK TIME: 47.
CLOCK TIME: 31.
CLOCK TIME: 47.
CLOCK TIME: 31.
CLOCK TIME: 47.
CLOCK TIME: 47.
CLOCK TIME: 31.
CLOCK TIME: 47.
CLOCK TIME: 31.
CLOCK TIME: 47.
CLOCK TIME: 31.
I don't understand exactly how the compiler handles that function. There is some test for the % character, I guess, but that wouldn't make such a difference. It looks more like something is happening in memory (?). Does anyone know precisely what happens when this code runs, or why the difference mentioned above appears? Or at least a link that would help me?
Thank you.

I can think of at least two possible causes:
Your clock has limited resolution.
printf will occasionally be flushing its buffer.
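If you want to test the buffering hypothesis yourself, here is a minimal sketch (mine, not from the original post): control stdout's buffering explicitly and see whether the two groups of timings collapse into one.
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* Give stdout a large fully-buffered buffer so flushes become rare;
       change _IOFBF to _IONBF to force a flush on every character instead. */
    static char buf[1 << 16];
    setvbuf(stdout, buf, _IOFBF, sizeof buf);

    clock_t t0 = clock();
    for (int i = 1; i < 2000; i++)
        printf("a");
    clock_t t1 = clock();

    /* Report on stderr so the result is not mixed into the buffered stream. */
    fprintf(stderr, "\nCLOCK TIME: %d.\n", (int)(t1 - t0));
    return 0;
}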

Some compilers (in particular recent versions of gcc on recent Linux distributions, when optimizing with -O2) are able to optimize printf("a") into code very similar to putchar('a').
But most of the time is spent in the kernel doing the write system call.

The man page of clock() says that it returns an
approximation of processor time used by the program
This approximation is based on the famous Time Stamp Counter. As Wikipedia says:
It counts the number of cycles since reset
Sadly, nowadays, this counter can vary between cores.
There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized.
So be careful to pin your code to a specific CPU, otherwise you will continue to get strange results. And since you seem to be after precise results, you can use this code instead of the clock() call:
#include <stdint.h>

uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ ( // serialize
        "xorl %%eax,%%eax \n cpuid"
        ::: "%rax", "%rbx", "%rcx", "%rdx");
    /* We cannot use "=A", since this would use %rax on x86_64 and
       return only the lower 32 bits of the TSC */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}
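As for pinning the code to one CPU, a minimal sketch on Linux (my own addition, assuming the glibc sched_setaffinity() wrapper) would be to call something like this once at startup:
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process/thread to the given CPU; returns 0 on success. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* 0 = calling thread */
}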

Related

Why is nop not taking one clock cycle

I wrote a basic program to find out the number of clock cycles taken by nop. We know nop takes one clock cycle.
#include <stdio.h>
#include <string.h>
#include <stdint.h>
int main(void)
{
    uint32_t low1, low2, high1, high2;
    uint64_t timestamp1, timestamp2;

    asm volatile ("rdtsc" : "=a"(low1), "=d"(high1));
    asm("nop");
    asm volatile ("rdtsc" : "=a"(low2), "=d"(high2));

    timestamp1 = ((uint64_t)high1 << 32) | low1;
    timestamp2 = ((uint64_t)high2 << 32) | low2;

    printf("Diff:%lu\n", timestamp2 - timestamp1);
    return 0;
}
But the output is not 1. It is sometimes 14 or 16.
May I know the reason behind this? Am I missing anything?
We know nop takes one clock cycle.
A modern CPU can be thought of as a pipeline of stages; where the front end might fetch and decode multiple instructions in parallel and put the resulting micro-ops into a buffer where they wait for their dependencies to be satisfied (before being taken by an execution unit, where multiple micro-ops can be executed at the same time by multiple execution units).
A NOP has no micro-ops - it's simply discarded by the front end. It doesn't cost 1 cycle.
But the output is not 1.
It probably takes 14 or 16 cycles for the instructions the compiler generates to deal with the outputs of the first rdtsc, then prepare things for the second rdtsc, then the second rdtsc itself.
Note that rdtsc probably counts the cycles of a fixed-frequency timer that has nothing to do with the CPU's current (variable) clock frequency; so 14 or 16 "time cycles" might be (e.g.) 7 or 8 CPU cycles.
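One way to see how much of that is measurement overhead (a rough sketch of mine, not from the answer) is to time an empty back-to-back pair of rdtsc reads first and subtract it:
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc_now(void)
{
    uint32_t lo, hi;
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    /* Empty interval: estimates the cost of the rdtsc pair itself. */
    uint64_t t0 = rdtsc_now();
    uint64_t t1 = rdtsc_now();
    uint64_t overhead = t1 - t0;

    t0 = rdtsc_now();
    asm volatile ("nop");
    t1 = rdtsc_now();
    uint64_t raw = t1 - t0;

    printf("overhead: %llu, raw: %llu, adjusted: %llu\n",
           (unsigned long long)overhead,
           (unsigned long long)raw,
           (unsigned long long)(raw > overhead ? raw - overhead : 0));
    return 0;
}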

Inconsistent values of ARM PMU cycles counter

I'm trying to measure the performance of my code in the Linux kernel with the PMU.
First of all I want to test the PMU, so I created a simple loop of a couple of operations in the kernel. I placed it under a spinlock with interrupts disabled so my test code can't be preempted. Then I printed the cycle counter to check how many CPU cycles this loop takes. But I see very different values on each print: 100, 500, 1000, 200, ...
My question is: why do I see such different values every time?
PS: in contrast to the cycle counter, the PMU's instruction counter is stable and I see the same values every time.
I also tried to use the ARM timer, but it also shows varying values, similar to the PMU's cycle counter.
Here is how I use ARM timer to measure performance:
unsigned long long ticks_start, ticks_end;
int i = 0, j;
unsigned long flags;

spin_lock_irqsave(&lock, flags);
while (i++ < 100) {
    j = 0;
    asm volatile("mrs %0, CNTPCT_EL0" : "=r" (ticks_start));
    while (j++ < 10000) {
        asm volatile ("nop");
    }
    asm volatile("mrs %0, CNTPCT_EL0" : "=r" (ticks_end));
    printk("ticks %d are: %llu\n", i, ticks_end - ticks_start);
}
spin_unlock_irqrestore(&lock, flags);
and the output on a real device (Cortex-A57) is:
...
ticks 31 are: 2287
ticks 32 are: 2287
ticks 33 are: 2287
ticks 34 are: 1984
ticks 35 are: 457
ticks 36 are: 1604
ticks 37 are: 2287
...
When using things like timers and the PMU on Arm, you should insert an isb instruction before the read of the PMU register. The architecture allows the processor to read the register speculatively, early or late, since the read does not depend on your inner loop of nops.
So try this:
asm volatile("isb; mrs %0, CNTPCT_EL0" : "=r" (ticks_end));
The isb will flush the pipeline before letting the mrs instruction proceed. It is possible the CPU is also thermally throttling; that should not affect your measurements using the cycle counter, but it would if you were reading the generic timer to measure time.
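Applied to the loop above, both reads get the barrier (my adaptation of the answer's one-line fix, not verbatim from it):
asm volatile("isb; mrs %0, CNTPCT_EL0" : "=r" (ticks_start));
while (j++ < 10000) {
    asm volatile ("nop");
}
asm volatile("isb; mrs %0, CNTPCT_EL0" : "=r" (ticks_end));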

The use of rdtsc() in my program to obtain the number of clock cycles for single- and double-word operations?

In theory, the cost of a double-word addition/subtraction is taken to be 2 times that of a single-word one. Similarly, the cost ratio of single-word multiplication to addition is taken as 3. I have written the following C program using GCC on Ubuntu LTS 14.04 to check the number of clock cycles on my machine, an Intel Sandy Bridge Core i5-2410M. Most of the time the program returns 6 clock cycles for the 128-bit addition, but I have taken the best case. I compiled using the command (gcc -o ow -O3 cost.c) and the result is given below:
32-bit Add: Clock cycles = 1 64-bit Add: Clock cycles = 1 64-bit Mult: Clock cycles = 2 128-bit Add: Clock cycles = 5
The program is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define n 500
#define counter 50000

typedef uint64_t utype64;
typedef int64_t type64;
typedef __int128 type128;

__inline__ utype64 rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__ ("xorl %%eax,%%eax \n cpuid"::: "%rax", "%rbx", "%rcx", "%rdx");
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (utype64)hi << 32 | lo;
}

int main(){
    utype64 start, end;
    type64 a[n], b[n], c[n];
    type128 d[n], e[n], f[n];
    int g[n], h[n];
    unsigned short i, j;

    srand(time(NULL));
    for(i=0;i<n;i++){ g[i]=rand(); h[i]=rand(); b[i]=(rand()+2294967295); e[i]=(type128)(rand()+2294967295)*(rand()+2294967295);}
    for(j=0;j<counter;j++){
        start=rdtsc();
        for(i=0;i<n;i++){ a[i]=(type64)g[i]+h[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0)
            printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(g[0])*8, (end-start)/n);
        start=rdtsc();
        for(i=0;i<n;i++){ c[i]=a[i]+b[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0)
            printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(a[0])*8, (end-start)/n);
        start=rdtsc();
        for(i=0;i<n;i++){ d[i]=(type128)c[i]*b[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0)
            printf("%lu-bit Mult: Clock cycles = %lu \t", sizeof(c[0])*8, (end-start)/n);
        start=rdtsc();
        for(i=0;i<n;i++){ f[i]=d[i]+e[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0){
            printf("%lu-bit Add: Clock cycles = %lu \n", sizeof(d[0])*8, (end-start)/n);
            printf("f[%hu]= %ld %ld \n\n", i-7, (type64)(f[i-7]>>64), (type64)(f[i-7]));}
    }
    return 0;
}
There are two things in the result that bother me.
1) Can the number of clock cycles for (64-bit) multiplication become 2?
2) Why is the number of clock cycles for the double-word addition more than 2 times that of the single-word addition?
I am mainly concerned about case (2). Is it because of my program logic, or is it due to GCC compiler optimization?
In theory we know that double-word addition/subtraction takes 2 times as long as single-word.
No, we don't.
Similarly, the cost ratio of single-word multiplication to addition is taken as 3 because of the CPU's fast integer multiplier.
No, it isn't.
You're not measuring instructions. You're measuring statements in your program, which may or may not have any relationship with the instructions your compiler will emit. My compiler, for example, after fixing your code so that it compiles, vectorized some of the loops, adding multiple values per instruction. The first loop itself is still 23 instructions long and is still reported as 1 cycle by your code.
Modern (as in past 25 years) CPUs don't execute one instruction at a time. They'll have multiple instructions in flight at once and can execute them out of order.
Then you have memory accesses. On your CPU there are no instructions that can take a value from memory, add it to another value from memory, and then store it in a third memory location. So multiple instructions must be executed already. Furthermore, memory accesses cost so much more than arithmetic instructions that anything that touches memory (unless it hits the L1 cache all the time) will be dominated by the memory access time.
Furthermore, RDTSC might not even return the actual cycle count. Some CPUs have variable clock rates but still keep TSC going at the same rate regardless of how fast or slow the CPU is actually running because TSC is used by the operating system for time keeping. Others don't.
So you're not measuring what you think you're measuring and whoever told you those things was either oversimplifying vastly or hasn't seen CPU documentation in two decades.
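If you want to see what you are actually timing, one practical check (my suggestion, not part of the answer) is to look at the assembly the compiler emits and count the instructions inside each timed loop, for example:
gcc -O3 -S cost.c -o cost.s
or disassemble the compiled binary with objdump -d.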

clock() precision in time.h

I am trying to calculate the number of ticks a function takes to run, and to do so I am using the clock() function like so:
unsigned long time = clock();
myfunction();
unsigned long time2 = clock() - time;
printf("time elapsed : %lu",time2);
But the problem is that the value it returns is a multiple of 10000, which I think is CLOCKS_PER_SEC. Is there a way, or an equivalent function, that is more precise?
I am using Ubuntu 64-bit, but would prefer if the solution can work on other systems like Windows & Mac OS.
There are a number of more accurate timers in POSIX.
gettimeofday() - officially obsolescent, but very widely available; microsecond resolution.
clock_gettime() - the replacement for gettimeofday() (but not necessarily so widely available; on an old version of Solaris, requires -lposix4 to link), with nanosecond resolution.
There are other sub-second timers of greater or lesser antiquity, portability, and resolution, including:
ftime() - millisecond resolution (marked 'legacy' in POSIX 2004; not in POSIX 2008).
clock() - which you already know about. Note that it measures CPU time, not elapsed (wall clock) time.
times() - reports in clock ticks of CLK_TCK or HZ per second. Note that this measures CPU time for parent and child processes.
Do not use ftime() or times() unless there is nothing better. The ultimate fallback, but not meeting your immediate requirements, is
time() - one second resolution.
The clock() function reports in units of CLOCKS_PER_SEC, which is required to be 1,000,000 by POSIX, but the increment may happen less frequently (100 times per second was one common frequency). The return value must be divided by CLOCKS_PER_SEC to get time in seconds.
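A minimal sketch of that conversion (my example, standard C only):
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start = clock();
    /* ... code under test ... */
    clock_t end = clock();
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    printf("CPU time: %f s\n", seconds);
    return 0;
}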
The most precise (but highly non-portable) way to measure time is to count CPU ticks.
For instance on x86
#include <fstream>
#include <sstream>
#include <string>
using namespace std;

unsigned long long int asmx86Time ()
{
    unsigned long long int realTimeClock = 0;
    asm volatile ( "rdtsc\n\t"
                   "salq $32, %%rdx\n\t"
                   "orq %%rdx, %%rax\n\t"
                   "movq %%rax, %0"
                   : "=r" ( realTimeClock )
                   : /* no inputs */
                   : "%rax", "%rdx" );
    return realTimeClock;
}

double cpuFreq ()
{
    ifstream file ( "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq" );
    string sFreq; if ( file ) file >> sFreq;
    stringstream ssFreq ( sFreq ); double freq = 0.;
    if ( ssFreq ) { ssFreq >> freq; freq *= 1000; } // kHz to Hz
    return freq;
}

// Timing
unsigned long long int asmStart = asmx86Time ();
doStuff ();
unsigned long long int asmStop = asmx86Time ();
float asmDuration = ( asmStop - asmStart ) / cpuFreq ();
// Timing
unsigned long long int asmStart = asmx86Time ();
doStuff ();
unsigned long long int asmStop = asmx86Time ();
float asmDuration = ( asmStop - asmStart ) / cpuFreq ();
If you don't have an x86, you'll have to rewrite the assembler code for your CPU. If you need maximum precision, that's unfortunately the only way to go... otherwise use clock_gettime().
Per the clock() manpage, on POSIX platforms the value of the CLOCKS_PER_SEC macro must be 1000000. As you say that the return value you're getting from clock() is a multiple of 10000, that would imply that the resolution is 10 ms.
Also note that clock() on Linux returns an approximation of the processor time used by the program. On Linux, again, scheduler statistics are updated when the scheduler runs, at CONFIG_HZ frequency. So if the periodic timer tick is 100 Hz, you get process CPU time consumption statistics with 10 ms resolution.
Walltime measurements are not bound by this, and can be much more accurate. clock_gettime(CLOCK_MONOTONIC, ...) on a modern Linux system provides nanosecond resolution.
I agree with Jonathan's solution. Here is an example using clock_gettime() with nanosecond precision.
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
    struct timespec ts;
    int ret;

    while (1)
    {
        ret = clock_gettime(CLOCK_MONOTONIC, &ts);
        if (ret)
        {
            perror("clock_gettime");
            return 1;
        }
        ts.tv_nsec += 20000; // sleep for 20000 ns
        printf("Print before sleep tid %ld %ld\n", ts.tv_sec, ts.tv_nsec);
        // printf("going to sleep tid%d\n", turn);
        ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
    }
}
Although it's difficult to achieve ns precision, this can be used to get precision of less than a microsecond (700-900 ns). The printf above is used just to print the thread number (it will easily take 2-3 microseconds just to print a statement).

Timer to find elapsed time in a function call in C

I want to calculate time elapsed during a function call in C, to the precision of 1 nanosecond.
Is there a timer function available in C to do it?
If yes please provide a sample code-snippet.
Pseudo code
Timer.Start()
foo();
Timer.Stop()
Display time elapsed in execution of foo()
Environment details: using the gcc 3.4 compiler on a RHEL machine.
May I ask what kind of processor you're using? If you're using an x86 processor, you can look at the time stamp counter (tsc). This code snippet:
#define rdtsc(low,high) \
__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))
will put the number of cycles the CPU has run in low and high respectively (it expects 2 longs; you can store the result in a long long int) as follows:
inline void getcycles (long long int * cycles)
{
    unsigned long low;
    long high;
    rdtsc(low, high);
    *cycles = high;
    *cycles <<= 32;
    *cycles |= low;
}
Note that this returns the number of cycles your CPU has performed. You'll need to get your CPU speed and then figure out how many cycles per ns in order to get the number of ns elapsed.
To do the above, I've parsed the "cpu MHz" string out of /proc/cpuinfo, and converted it to a decimal. After that, it's just a bit of math, and remember that 1MHz = 1,000,000 cycles per second, and that there are 1 billion ns / sec.
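As a hypothetical example of that arithmetic (my sketch; the frequency value is made up), a CPU reporting "cpu MHz : 2500.000" runs 2.5 cycles per nanosecond, so:
/* Convert a cycle count to nanoseconds, given the frequency in MHz
   parsed from the "cpu MHz" line of /proc/cpuinfo. */
double cycles_to_ns(unsigned long long cycles, double cpu_mhz)
{
    double cycles_per_ns = cpu_mhz / 1000.0;   /* e.g. 2500 MHz -> 2.5 cycles/ns */
    return (double)cycles / cycles_per_ns;
}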
On Intel and compatible processors you can use the rdtsc instruction, which can easily be wrapped in an asm() block of C code. It returns the value of a built-in processor cycle counter that increments on each cycle. You gain high resolution, and such timing is extremely fast.
To find how fast this counter increments you'll need to calibrate: call this instruction twice over a fixed time period, like five seconds. If you do this on a processor that shifts frequency to lower power consumption, you may have problems calibrating.
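A rough calibration sketch (mine, assuming an rdtsc() helper like the ones shown elsewhere on this page):
#include <stdint.h>
#include <unistd.h>

/* Count TSC ticks over a fixed five-second wall-clock interval.
   Assumes an rdtsc() helper as defined in the earlier snippets. */
uint64_t ticks_per_second(void)
{
    uint64_t start = rdtsc();
    sleep(5);
    uint64_t end = rdtsc();
    return (end - start) / 5;
}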
Use clock_gettime(3). For more info, type man 3 clock_gettime. That being said, nanosecond precision is rarely necessary.
Any timer functionality is going to have to be platform-specific, especially with that precision requirement.
The standard solution in POSIX systems is gettimeofday(), but it has only microsecond precision.
If this is for performance benchmarking, the standard way is to make the code under test take enough time to make the precision requirement less severe. In other words, run your test code for a whole second (or more).
There is no timer in C which has guaranteed 1-nanosecond precision. You may want to look into clock() or, better yet, the POSIX gettimeofday().
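A minimal gettimeofday() sketch (my example), good to about a microsecond:
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    /* foo(); */
    gettimeofday(&t1, NULL);
    long usec = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("elapsed: %ld microseconds\n", usec);
    return 0;
}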
We all waste our time recreating this test sample. Why not post something compile-ready? Anyway, here is mine, with results.
CLOCK_PROCESS_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 4194304 iterations : 459.427311 msec 0.110 microsec / call
CLOCK_MONOTONIC resolution: 0 sec 1 nano
clock_gettime 4194304 iterations : 64.498347 msec 0.015 microsec / call
CLOCK_REALTIME resolution: 0 sec 1 nano
clock_gettime 4194304 iterations : 65.494828 msec 0.016 microsec / call
CLOCK_THREAD_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 4194304 iterations : 427.133157 msec 0.102 microsec / call
rdtsc 4194304 iterations : 115.427895 msec 0.028 microsec / call
Dummy 16110479703957395943
rdtsc in milliseconds 4194304 iterations : 197.259866 msec 0.047 microsec / call
Dummy 4.84682e+08 UltraHRTimerMs 197 HRTimerMs 197.26
#include <time.h>
#include <cstdio>
#include <cstdint>
#include <string>
#include <iostream>
#include <chrono>
#include <thread>

enum { TESTRUNS = 1024 * 1024 * 4 };

class HRCounter
{
private:
    timespec start, tmp;
public:
    HRCounter(bool init = true)
    {
        if (init)
            SetStart();
    }

    void SetStart()
    {
        clock_gettime(CLOCK_MONOTONIC, &start);
    }

    double GetElapsedMs()
    {
        clock_gettime(CLOCK_MONOTONIC, &tmp);
        return (double)(tmp.tv_nsec - start.tv_nsec) / 1000000 + (tmp.tv_sec - start.tv_sec) * 1000;
    }
};

__inline__ uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ ( // serialize
        "xorl %%eax,%%eax \n cpuid"
        ::: "%rax", "%rbx", "%rcx", "%rdx");
    /* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32bits of the TSC */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

inline uint64_t GetCyclesPerMillisecondImpl()
{
    uint64_t start_cyles = rdtsc();
    HRCounter counter;
    std::this_thread::sleep_for(std::chrono::seconds(3));
    uint64_t end_cyles = rdtsc();
    double elapsed_ms = counter.GetElapsedMs();
    return (end_cyles - start_cyles) / elapsed_ms;
}

inline uint64_t GetCyclesPerMillisecond()
{
    static uint64_t cycles_in_millisecond = GetCyclesPerMillisecondImpl();
    return cycles_in_millisecond;
}

class UltraHRCounter
{
private:
    uint64_t start_cyles;
public:
    UltraHRCounter(bool init = true)
    {
        GetCyclesPerMillisecond();
        if (init)
            SetStart();
    }

    void SetStart() { start_cyles = rdtsc(); }

    double GetElapsedMs()
    {
        uint64_t end_cyles = rdtsc();
        return (end_cyles - start_cyles) / GetCyclesPerMillisecond();
    }
};
int main()
{
    auto Run = [](std::string const& clock_name, clockid_t clock_id)
    {
        HRCounter counter(false);
        timespec spec;
        clock_getres(clock_id, &spec);
        printf("%s resolution: %ld sec %ld nano\n", clock_name.c_str(), spec.tv_sec, spec.tv_nsec);

        counter.SetStart();
        for (int i = 0; i < TESTRUNS; ++i)
        {
            clock_gettime(clock_id, &spec);
        }
        double fb = counter.GetElapsedMs();
        printf("clock_gettime %d iterations : %.6f msec %.3f microsec / call\n", TESTRUNS, fb, (fb * 1000) / TESTRUNS);
    };

    Run("CLOCK_PROCESS_CPUTIME_ID", CLOCK_PROCESS_CPUTIME_ID);
    Run("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
    Run("CLOCK_REALTIME", CLOCK_REALTIME);
    Run("CLOCK_THREAD_CPUTIME_ID", CLOCK_THREAD_CPUTIME_ID);

    {
        HRCounter counter(false);
        uint64_t dummy = 0;
        counter.SetStart();
        for (int i = 0; i < TESTRUNS; ++i)
        {
            dummy += rdtsc();
        }
        double fb = counter.GetElapsedMs();
        printf("rdtsc %d iterations : %.6f msec %.3f microsec / call\n", TESTRUNS, fb, (fb * 1000) / TESTRUNS);
        std::cout << "Dummy " << dummy << std::endl;
    }

    {
        double dummy = 0;
        UltraHRCounter ultra_hr_counter;
        HRCounter counter;
        for (int i = 0; i < TESTRUNS; ++i)
        {
            dummy += ultra_hr_counter.GetElapsedMs();
        }
        double fb = counter.GetElapsedMs();
        double final = ultra_hr_counter.GetElapsedMs();
        printf("rdtsc in milliseconds %d iterations : %.6f msec %.3f microsec / call\n", TESTRUNS, fb, (fb * 1000) / TESTRUNS);
        std::cout << "Dummy " << dummy << " UltraHRTimerMs " << final << " HRTimerMs " << fb << std::endl;
    }

    return 0;
}
I don't know if you'll find any timers that provide resolution to a single nanosecond -- it would depend on the resolution of the system clock -- but you might want to look at http://code.google.com/p/high-resolution-timer/. They indicate they can provide resolution to the microsecond level on most Linux systems and in the nanoseconds on Sun systems.
Making benchmarks on this scale is not a good idea. At the very least you have the overhead of getting the time, which can render your results unreliable if you work at the nanosecond level. You can either use your platform's system calls or boost::Date_Time on a larger scale [preferred].
Can you just run it 10^9 times and stopwatch it?
You can use standard system calls like gettimeofday(), if you are certain that your process gets 100% of the CPU time. I can think of many situations in which, while you are executing foo(), other threads and processes might steal CPU time.
You are asking for something that is not possible this way. You would need HW-level support to get to that level of precision, and even then you would have to control the variables very carefully. What happens if you get an interrupt while running your code? What if the OS decides to run some other piece of code?
And what does your code do? Does it use RAM? What if your code and/or data is or is not in the cache?
In some environments you can use HW-level counters for this job, provided you control those variables. But how do you prevent context switches in Linux?
For instance, in Texas Instruments' DSP tools (Code Composer Studio) you can profile the code very exactly because the whole debugging environment is set up such that the emulator (e.g. Blackhawk) receives info about every operation run. You can also set watchpoints which are coded directly into a HW block inside the chip in some processors. This works because the memory lanes are also routed to this debugging block.
They do offer functions in their CSLs (Chip Support Library) which are what you are asking for, with the timing overhead being a few cycles. But this is only available for their processors and is completely dependent on reading the timer values from the HW registers.
