Drain the instruction pipeline of Intel Core 2 Duo? - c

I'm writing some micro-benchmarking code for some very short operations in C. For example, one thing I'm measuring is how many cycles are needed to call an empty function depending on the number of arguments passed.
Currently, I'm timing using an RDTSC instruction before and after each operation to get the CPU's cycle count. However, I'm concerned that instructions issued before the first RDTSC may slow down the actual instructions I'm measuring. I'm also worried that the full operation may not be complete before the second RDTSC gets issued.
Does anyone know of an x86 instruction that forces all in-flight instructions to commit before any new instructions are issued? I've been told CPUID might do this, but I've been unable to find any documentation that says so.

To my knowledge, there is no instruction which specifically "drains" the pipeline. This can be easily accomplished though using a serialising instruction.
CPUID is a serializing instruction, which means exactly what you're looking for. Every instruction issues before it is guaranteed to execute before the CPUID instruction.
So doing the following should get the desired effect:
cpuid
rdtsc
# stuff
cpuid
rdtsc
But, as an aside, I don't recommend that you do this. Your "stuff" can still be effected by a lot of other things outside of your control (such as CPU caches, other processes running on the system, etc), and you'll never be able to eliminate them all. The best way to get accurate performance statistics is to perform the operation(s) you want to measure at least several million times and average out the execution time of the batch.
Edit: Most instruction references for CPUID will mention its serializing properties, such as the NASM manual appendix B .
Edit 2: Also might want to take a look at this related question.

Related

Which are the different variable cycle ARM instructions?

I was reading this book "ARM System Developers Guide" by Elsevier and I came across this:
The ARM instruction set differs from the pure RISC definition in several ways that make
the ARM instruction set suitable for embedded applications:
Variable cycle execution for certain instructions — Not every ARM instruction executes in a single cycle. For example, load-store-multiple instructions vary in the number of execution cycles depending upon the number of registers being transferred. The
transfer can occur on sequential memory addresses, which increases performance since
sequential memory accesses are often faster than random accesses. Code density is also
improved since multiple register transfers are common operations at the start and end
of functions.
Any other ARM instructions you guys can point out which take variable cycles to execute?
Cycle timings are micro architecture dependent, so you need to check particular implementation's technical reference manual (TRM). For example for Cortex-A9, it is described as being quite complicated.
The complexity of the Cortex-A9 processor makes it impossible to calculate precise timing information manually. The timing of an instruction is often affected by other concurrent instructions, memory system activity, and additional events outside the instruction flow.
However on the same document there are precise timings for data-processing, load and store, multiplication and some information about branch and serialization instructions.
For example from the same document you can see if shifting is involved AND instruction may take 1-2 cycles more depending on the shift source, which might be a constant embedded in instruction or read from a register.
Also next to book's note about load-store-multiple may vary on number of registers involved, they also vary if address is aligned or not.

C Program measure execution time for an instruction

I need to find the time taken to execute a single instruction or a few couple of instructions and print it out in terms of milli seconds. Can some one please share the small code snippet for this.
Thanks.. I need to use this measure the time taken to execute some instructions in my project.
#include<time.h>
main()
{
clock_t t1=clock();
printf("Dummy Statement\n");
clock_t t2=clock();
printf("The time taken is.. %g ", (t2-t1));
Please look at the below liks too.
What’s the correct way to use printf to print a clock_t?
http://www.velocityreviews.com/forums/t454464-c-get-time-in-milliseconds.html
One instruction will take a lot shorter than 1 millisecond to execute. And if you are trying to measure more than one instruction it will get complicated (what about the loop that calls the instruction multiple times).
Also, most timing functions that you can use are just that: functions. That means they will execute instructions also. If you want to time one instruction then the best bet is to look up the specifications of the processor that you are using and see how many cycles it takes.
Doing this programatically isn't possible.
Edit:
Since you've updated your question to now refer to some instructions. You can measure sub-millisecond time on some processors. It would be nice to know the environment. This will work on x86 and linux, other environments will be different.
Clock get time allows forr sub-nanosecond accuracy. Or you can call the rdstc instruction yourself (good luck with this on a multiprocessor or smp system - you could be measuring the wrong thing, eg by having the instruction run on different processors).
The time to actually complete an instruction depends on the clock cycle time, and the depth of the pipeline the instruction traverses through the processor. As dave said, you can't really find this out by making a program. You can use some kind of timing function provided to you by your OS to measure the cpu time it takes to complete some small set of instructions. If you do this, try not to use any kind of instructions that rely on memory, or branching. Ideally you might do some kind of logical or arithmetic operations (so perhaps using some inline assembly in C).

Why instrumented C program runs faster?

I am working on a (quite large) existing monothreaded C application. In this context I modified the application to perform some very few additional work consisting in incrementing a counter each time we call a special function (this function is called ~ 80.000 times). The application is compiled on an Ubuntu 12.04 running a 64 bits Linux kernel 3.2.0-31-generic with -O3 option.
Surprisingly the instrumented version of the code is running faster and I am investigating why.I measure execution time with clock_gettime(CLOCK_PROCESS_CPUTIME_ID) and to get representative results, I am reporting an average execution time value over 100 runs. Moreover, to avoid interference from outside world, I tried as much as possible to launch the application in a system without any other applications running (on a side note, because CLOCK_PROCESS_CPUTIME_ID returns process time and not wall clock time, other applications "should" in theory only affect cache and not directly the process execution time)
I was suspecting "instruction cache effects", maybe the instrumented code that is a little bit larger (few bytes) fits differently and better in the cache, is this hypothesis conceivable ? I tried to do some cache investigations with valegrind --tool=cachegrind but unfortunately, the instrumented version has (as it seems logical) more cache misses than the initial version.
Any hints on this subject and ideas that may help to find why instrumented code is running faster are welcomes (some GCC optimizations available in one case and not in the other, why ?, ...)
Since there are not many details in the question, I can only recommend some factors to consider while investigating the problem.
Very few additional work (such as incrementing a counter) might alter compiler's decision on whether to apply some optimizations or not. Compiler has not always enough information to make perfect choice. It may try to optimize for speed where bottleneck is code size. It may try to auto-vectorize computations when there is not too much data to process. Compiler may not know what kind of data is to be processed or what is the exact model of CPU, that will execute the code.
Incrementing a counter may increase size of some loop and prevent loop unrolling. This may decrease code size (and improve code locality, which is good for instruction or microcode caches or for loop buffer and allows CPU to fetch/decode instructions quickly).
Incrementing a counter may increase size of some function and prevent inlining. This also may decrease code size.
Incrementing a counter may prevent auto-vectorization, which again may decrease code size.
Even if this change does not affect compiler optimization, it may alter the way how the code is executed by CPU.
If you insert counter-incrementing code in place, full of branch targets, this may make branch targets less dense and improve branch prediction.
If you insert counter-incrementing code in front of some particular branch target, this may make branch target's address better aligned and make code fetch faster.
If you place counter-incrementing code after some data is written but before the same data is loaded again (and store-to-load forwarding did not work for some reason), the load operation may be completed earlier.
Insertion of counter-incrementing code may prevent two conflicting load attempts to the same bank in L1 data cache.
Insertion of counter-incrementing code may alter some CPU scheduler decision and make some execution port available just in time for some performance-critical instruction.
To investigate effects of compiler optimization, you can compare generated assembler code before and after addition of counter-incrementing code.
To investigate CPU effects, use a profiler allowing to inspect processor performance counters.
Just guessing from my experience with embedded compilers, Optimization tools in compilers look for recursive tasks. Perhaps the additional code forced the compiler to see something more recursive and it structured the machine code differently. Compilers do some weird things for optimization. In some languages (Perl I think?) a "not not" conditional is faster to execute than a "true" conditional. Does your debugging tool allow you to single step through a code/assembly comparison? This could add some insight as to what the compiler decided to do with the extra tasks.

how to count cycles?

I'm trying to find the find the relative merits of 2 small functions in C. One that adds by loop, one that adds by explicit variables. The functions are irrelevant themselves, but I'd like someone to teach me how to count cycles so as to compare the algorithms. So f1 will take 10 cycles, while f2 will take 8. That's the kind of reasoning I would like to do. No performance measurements (e.g. gprof experiments) at this point, just good old instruction counting.
Is there a good way to do this? Are there tools? Documentation? I'm writing C, compiling with gcc on an x86 architecture.
http://icl.cs.utk.edu/papi/
PAPI_get_real_cyc(3) - return the total number of cycles since some arbitrary starting point
Assembler instruction rdtsc (Read Time-Stamp Counter) retun in EDX:EAX registers the current CPU ticks count, started at CPU reset. If your CPU runing at 3GHz then one tick is 1/3GHz.
EDIT:
Under MS windows the API call QueryPerformanceFrequency return the number of ticks per second.
Unfortunately timing the code is as error prone as visually counting instructions and clock cycles. Be it a debugger or other tool or re-compiling the code with a re-run 10000000 times and time it kind of thing, you change where things land in the cache line, the frequency of the cache hits and misses, etc. You can mitigate some of this by adding or removing some code upstream from the module of code being tested, (to cause a few instructions added and removed changing the alignment of your program and sometimes of your data).
With experience you can develop an eye for performance by looking at the disassembly (as well as the high level code). There is no substitute for timing the code, problem is timing the code is error prone. The experience comes from many experiements and trying to fully understand why adding or removing one instruction made no or dramatic differences. Why code added or removed in a completely different unrelated area of the module under test made huge performance differences on the module under test.
As GJ has written in another answer I also recommend using the "rdtsc" instruction (rather than calling some operating system function which looks right).
I've written quite a few answers on this topic. Rdtsc allows you to calculate the elapsed clock cycles in the code's "natural" execution environment rather than having to resort to calling it ten million times which may not be feasible as not all functions are black boxes.
If you want to calculate elapsed time you might want to shut off energy-saving on the CPUs. If it's only a matter of clock cycles this is not necessary.
If you are trying to compare the performance, the easiest way is to put your algorithm in a loop and run it 1000 or 1000000 times.
Once you are running it enough times that the small differences can be seen, run time ./my_program which will give you the amount of processor time that it used.
Do this a few times to get a sampling and compare the results.
Trying to count instructions won't help you on x86 architecture. This is because different instructions can take significantly different amounts of time to execute.
I would recommend using simulators. Take a look at PTLsim it will give you the number of cycles, other than that maybe you would like to take a look at some tools to count the number of times each assembly line is executing.
Use gcc -S your_program.c. -S tells gcc to generate the assembly listing, that will be named your_program.s.
There are plenty of high performance clocks around. QueryPerformanceCounter is microsofts. The general trick is to run the function 10s of thousands of time and time how long it takes. Then divide the time taken by the number of loops. You'll find that each loop takes a slightly different length of time so this testing over multiple passes is the only way to truly find out how long it takes.
This is not really a trivial question. Let me try to explain:
There are several tools on different OS to do exactly what you want, but those tools are usually part of a bigger environment. Every instruction is translated into a certain number of cycles, depending on the CPU the compiler ran on, and the CPU the program was executed.
I can't give you a definitive answer, because I do not have enough data to pass my judgement on, but I work for IBM in the database area and we use tools to measure cycles and instructures for our code and those traces are only valid for the actual CPU the program was compiled and was running on.
Depending on the internal structure of your CPU's piplining and on the effeciency of your compiler, the resulting code will most likely still have cache misses and other areas you have to worry about. (In that case you may want to look into FDPR...)
If you want to know how many cycles your program needs to run on your CPU (which was compiled with your compiler), you have to understand how the CPU works and how the compiler generarated the code.
I'm sorry, if the answer was not sufficient enough to solve your problem at hand. You said you are using gcc on an x86 arch. I would work with getting the assembly code mapped to your CPU.
I'm sure you will find some areas, where gcc could have done a better job...

measure time to execute single instruction

Is there a way using C or assembler or maybe even C# to get an accurate measure of how long it takes to execute a ADD instruction?
Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.
On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.
On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:
.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
Add eax, ebx
; end of sequence under test
CPUID
RDTSC
Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a fe details, of course -- at minimum you need to:
set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract result from the second RDTSC from the first
Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.
Then execute the same loop again, this time with the code under test in the body. Time for this loop, minus the overhead (from the empty loop case) is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.
Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If its a significant chunk of code, a few 10's of thousands might suffice.
In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for how to get a measurement w/o the repetition, but the repetition technique is always available, for example, an embedded processor that doesn't have the nice timing instructions mentioned by others.
Note however, that on modern pipeline processors, instruction level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of an given instruction take N times as long as a single one.
Okay, the problem that you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS and all those others that there are lots of processes already running on your machine in the background which will impact performance. The only real way of calculating actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and these results are available to the public. All you need to do is ask them and they'll send you a free CD-ROM (it might be a DVD - nonsense pedantry) with the results contained. You can do it yourself, but be warned that especially Intel processors contain many redundant instructions that are no longer desirable, let alone necessary. This will take up a lot of your time, but I can absolutely see the fun in doing this. PS. If its purely to help push your own machine's hardware to its theoretical maximum in a personal project that you're doing the Just Jeff's answer above is excellent for generating tidy instruction-speed-averages under real-world conditions.
No, but you can calculate it based upon the number of clock cycles the add instruction requires multiplied by the clock rate of the CPU. Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
That said, why do you care?

Resources