I wrote some NEON code in assembly, aiming for maximum optimization. Although the latency from register conflicts and pipeline stalls is reduced, the counter shows only a 1-cycle difference: before, n.70-0; after, n.69-0. I don't understand why it shows that.
here is my sample code
before optimization http://pulsar.webshaker.net/ccc/sample-6b7ba7c2
after optimization http://pulsar.webshaker.net/ccc/sample-d59091b4
I also have several doubts about the pulsar cycle calculator.
1. n.16-0 1c d0:1
What does n stand for here?
2. a.23-0 2c q6l:1 VMLA.I16 q6, q9, D0[2]
What does a stand for? What does l:1 mean? Is 23 the cycle count?
3. Does Count Time mean the total time for execution of the code?
I hope somebody can kindly help me with these doubts.
This is what I can remember about this cycle counter:
"n" stands for Neon pipeline, "a" stands for ARM pipeline. In fact you are mixing ARM and NEON instructions.
Regarding "q6l:1": q6l is the register which cause the current instructions to wait, while 1 is the number of extra half-cycles needed for this register/result to became available to the instruction, therefore is the number of half-cycles the instructions have to wait for his input. I'm not sure but I suppose that "q6l" is the lower part of the q6 register.
The number "23" in your example is the number of cycle in which the instruction can start the execution.
Count Time has nothing to do with your code. Parse Time is the time the tool took to interpret the instructions you provided. Count Time is the time the tool took to analyze your instructions and produce the cycle information.
I'll explain the results in more detail, for example:
n.18-0 1c n0 q10:8
"n" stands for the execution unit (n = neon, a = arm, v = vfp).
"18" is the number of cycle in which the instruction can start the execution.
"0" is the number of the pipeline.
"1c" is the number of execution cycles for the instruction. Please NOTE that this is different from the number of cycles required until the result of the instruction is available for further instructions.
"n0" is the pipeline causing the current instruction to wait a result. n0 = neon pipeline number 0.
"q10" is the register causing the instruction to wait for the result.
"8" is related to the time the instruction have to wait for the results. It is the number of half-cycles if I remember correctly.
This counter does not take into account that a compiler can re-arrange instructions, i.e. postpone an instruction that is waiting for a result. But if you tell your compiler not to re-arrange the assembly instructions, then when an instruction has to wait for a result no other instruction can start executing, even instructions that do not have to wait for anything; this causes a stall in which the CPU cannot execute any instruction.
Moreover, I would not use this type of counter for code with loops. I suggest splitting your code into different parts and optimizing each loop separately.
Related
This question arose in the context of optimizing code to remove potential branch prediction failures; in fact, removing branches altogether.
For my example, a typical for-loop uses the following syntax:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>   /* needed for bool */

#define N 1000         /* example bound; any value works */

int main()
{
    bool *S = calloc(N + 1, sizeof(bool));
    int p = 2;
    for (int i = p * p; i <= N; i += p)
        S[i] = 1;
    free(S);
    return 0;
}
As I understand it, the assembly code produced would contain some sort of JMP instruction to check whether i <= N.
The only way I can conceive doing this in assembly would be to just repeat the same assembly instructions n times with i being incremented in each repetition. However, for large n, the binary produced would be huge.
So, I'm wondering, is there some loop construct that repeats n times without calculating some conditional?
Fully unrolling is the only practical way, and you're right that it's not appropriate for huge iteration counts (and usually not for loops where the iteration count isn't a compile-time constant). It's great for small loops with a known iteration count.
For your specific case, of setting memory to 1 over a given distance, x86 has rep stosb, which implements memset in microcode. It doesn't suffer from branch misprediction on current Intel CPUs, because microcode branches can't be predicted / speculatively executed and stall instead (or something), which means it has about 15 cycles of startup overhead before doing any stores. See What setup does REP do? For large aligned buffers, rep stosb/d/q is pretty close to an optimized AVX loop.
So you don't get a mispredict, but you do get a fixed startup overhead. (I think this stalls issue / execute of instructions after the rep stos, because the microcode sequencer takes over the front end until it's done issuing all the uops. And it can't know when it's done until it's executed some that look at rcx. Issue happens in-order, so later independent instructions can't even get into the out-of-order part of the core until after rep stos figures out which uops to issue. The stores don't have to execute, but the microcode branches do.)
Icelake is supposed to have "fast short REP MOV", which may finally solve the startup-overhead issue. Possibly by adding a dedicated hardware state-machine for rep movs and rep stos, like @KrazyGlew wishes he had when designing fast-strings in Intel's P6 uarch (PPro/PII/PIII), the ancestor of current Intel CPUs which still use very similar microcoded implementations. AFAIK, AMD is similar, but I haven't seen numbers or details for their rep stos startup overhead.
Most architectures don't have a single-instruction memset like that, so x86 is definitely a special case in computer architecture. But some old computers (like Atari ST) could have a "blitter" chip to offload copying memory, especially for graphics purposes. They'd use DMA to do the copying separately from the CPU altogether. This would be very similar to building a hardware state machine on-chip to run rep movs.
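If you want to see roughly what that looks like in practice, here's a minimal GNU C sketch (not from the original answer) that fills a buffer with rep stosb via inline asm; in real code you'd normally just call memset and let the compiler/libc pick the strategy.
#include <stddef.h>

/* Fill len bytes at dst with value using rep stosb.
   RDI = destination, RCX = count, AL = byte to store. */
static void fill_bytes(unsigned char *dst, unsigned char value, size_t len)
{
    __asm__ volatile ("rep stosb"
                      : "+D"(dst), "+c"(len)
                      : "a"(value)
                      : "memory");
}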
Misprediction of normal loop branches
Consider a normal asm loop, like
.looptop:                      # do {
    # loop body, includes a pointer increment
    # may be unrolled some to do multiple stores per conditional branch
    cmp     rdi, rdx
    jb      .looptop           # } while(dst < end_dst);
The branch ends up strongly predicted taken after the loop has run a few times.
For large iteration counts, one branch mispredict is amortized over all the loop iterations and is typically negligible. (The conditional branch at the bottom of the loop is predicted taken, so the loop branch mispredicts the one time it's not taken.)
Some branch predictors have special support for loop branches, with a pattern predictor that can count patterns like taken 30 times / not-taken, up to some fairly large limit on the iteration count they can correctly predict.
Or a modern TAGE predictor (like in Intel Sandybridge-family) uses branch history to index the entry, so it "naturally" gives you some pattern prediction. For example, Skylake correctly predicts the loop-exit branch for inner loops up to about 22 iterations in my testing. (When there's a simple outer loop that re-runs the inner loop with the same iteration count repeatedly.) I tested in asm, not C, so I had control of how much unrolling was done.
A medium-length loop that's too long for the exit to be predicted correctly is the worst case for this. It's short enough that a mispredict on loop exit happens frequently, if it's an inner loop that repeatedly runs ~30 iterations on a CPU that can't predict that, for example.
The cost of one mispredict on the last iteration can be quite low. If out-of-order execution can "see" a few iterations ahead for a simple loop counter, it can have the branch itself executed before spending much time on real work beyond that mispredicted branch. With fast recovery for branch misses (not flushing the whole out-of-order pipeline by using something like a Branch Order Buffer), you still lose the chance to get started on independent work after the loop for a few cycles. This is most likely if the loop bottlenecks on latency of a dependency chain, but the counter itself is a separate chain. This paper about the real cost of branch misses is also interesting, and mentions this point.
(Note that I'm assuming that branch prediction is already "primed", so the first iteration correctly predicts the loop branch as taken.)
Impractical ways:
@Hadi linked an amusing idea: instead of running code normally, compile it in a weird way where control flow and instructions are all data, for example x86 using only mov instructions with x86 addressing modes + registers (and an unconditional branch at the bottom of a big block). Is it possible to make decisions in assembly without using `jump` and `goto` at all? This is hilariously inefficient, but doesn't actually have any conditional branches: everything is a data dependency.
It uses different forms of MOV (between registers, and immediate to register, as well as load/store), so it's not a one-instruction-set computer.
A less-insane version of this is an interpreter: different instructions / operations in the code being interpreted turn into control dependencies in the interpreter (creating a hard-to-solve efficiency problem for the interpreter; see Darek Mihocka's "The Common CPU Interpreter Loop Revisited" article), but data and control flow in the guest code are both just data in the interpreter. The guest program counter is just another piece of data in the interpreter.
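To make that concrete, here's a tiny hypothetical bytecode interpreter (the opcodes and layout are made up for illustration): the guest's program counter and its "branches" are just data updates inside the host's dispatch loop.
enum { OP_ADD, OP_JMP, OP_HALT };
typedef struct { int op, arg; } Insn;

static int run(const Insn *prog)
{
    int acc = 0;
    int pc = 0;                        /* guest PC is ordinary data */
    for (;;) {
        Insn i = prog[pc];
        switch (i.op) {                /* host-side control dependency */
        case OP_ADD:  acc += i.arg; pc++; break;
        case OP_JMP:  pc = i.arg;   break;   /* guest "branch" = a data assignment */
        case OP_HALT: return acc;
        }
    }
}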
I am writing a C code for measuring the number of clock cycles needed to acquire a semaphore. I am using rdtsc, and before doing the measurement on the semaphore, I call rdtsc two consecutive times, to measure the overhead. I repeat this many times, in a for-loop, and then I use the average value as rdtsc overhead.
Is this correct, to use the average value, first of all?
Nonetheless, the big problem here is that sometimes I get negative values for the overhead (not necessarily the averaged one, but at least the partial ones inside the for loop).
This also affects the subsequent calculation of the number of CPU cycles needed for the sem_wait() operation, which sometimes also turns out to be negative. If what I wrote is not clear, here is part of the code I am working on.
Why am I getting such negative values?
(editor's note: see Get CPU cycle count? for a correct and portable way of getting the full 64-bit timestamp. An "=A" asm constraint will only get the low or high 32 bits when compiled for x86-64, depending on whether register allocation happens to pick RAX or RDX for the uint64_t output. It won't pick edx:eax.)
(editor's 2nd note: oops, that's the answer to why we're getting negative results. Still worth leaving a note here as a warning not to copy this rdtsc implementation.)
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
static inline uint64_t get_cycles()
{
uint64_t t;
// editor's note: "=A" is unsafe for this in x86-64
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
int num_measures = 10;
int main ()
{
int i, value, res1, res2;
uint64_t c1, c2;
int tsccost, tot, a;
tot=0;
for(i=0; i<num_measures; i++)
{
c1 = get_cycles();
c2 = get_cycles();
tsccost=(int)(c2-c1);
if(tsccost<0)
{
printf("#### ERROR!!! ");
printf("rdtsc took %d clock cycles\n", tsccost);
return 1;
}
tot = tot+tsccost;
}
tsccost=tot/num_measures;
printf("rdtsc takes on average: %d clock cycles\n", tsccost);
return EXIT_SUCCESS;
}
When Intel first invented the TSC it measured CPU cycles. Due to various power management features "cycles per second" is not constant; so TSC was originally good for measuring the performance of code (and bad for measuring time passed).
For better or worse; back then CPUs didn't really have too much power management, often CPUs ran at a fixed "cycles per second" anyway. Some programmers got the wrong idea and misused the TSC for measuring time and not cycles. Later (when the use of power management features became more common) these people misusing TSC to measure time whined about all the problems that their misuse caused. CPU manufacturers (starting with AMD) changed TSC so it measures time and not cycles (making it broken for measuring the performance of code, but correct for measuring time passed). This caused confusion (it was hard for software to determine what TSC actually measured), so a little later on AMD added the "TSC Invariant" flag to CPUID, so that if this flag is set programmers know that the TSC is broken (for measuring cycles) or fixed (for measuring time).
Intel followed AMD and changed the behaviour of their TSC to also measure time, and also adopted AMD's "TSC Invariant" flag.
This gives 4 different cases:
TSC measures both time and performance (cycles per second is constant)
TSC measures performance not time
TSC measures time and not performance but doesn't use the "TSC Invariant" flag to say so
TSC measures time and not performance and does use the "TSC Invariant" flag to say so (most modern CPUs)
For cases where TSC measures time, to measure performance/cycles properly you have to use performance monitoring counters. Sadly, performance monitoring counters are different for different CPUs (model specific) and require access to MSRs (privileged code). This makes it considerably impractical for applications to measure "cycles".
Also note that if the TSC does measure time, you can't know what time scale it returns (how many nanoseconds in a "pretend cycle") without using some other time source to determine a scaling factor.
The second problem is that for multi-CPU systems most operating systems suck. The correct way for an OS to handle the TSC is to prevent applications from using it directly (by setting the TSD flag in CR4; so that the RDTSC instruction causes an exception). This prevents various security vulnerabilities (timing side-channels). It also allows the OS to emulate the TSC and ensure it returns a correct result. For example, when an application uses the RDTSC instruction and causes an exception, the OS's exception handler can figure out a correct "global time stamp" to return.
Of course different CPUs have their own TSC. This means that if an application uses TSC directly they get different values on different CPUs. To help people work around the OS's failure to fix the problem (by emulating RDTSC like they should); AMD added the RDTSCP instruction, which returns the TSC and a "processor ID" (Intel ended up adopting the RDTSCP instruction too). An application running on a broken OS can use the "processor ID" to detect when they're running on a different CPU from last time; and in this way (using the RDTSCP instruction) they can know when "elapsed = TSC - previous_TSC" gives an invalid result. However; the "processor ID" returned by this instruction is just a value in an MSR, and the OS has to set this value on each CPU to something different - otherwise RDTSCP will say that the "processor ID" is zero on all CPUs.
Basically; if the CPU supports the RDTSCP instruction, and if the OS has correctly set the "processor ID" (using the MSR); then the RDTSCP instruction can help applications know when they've got a bad "elapsed time" result (but it doesn't provide any way of fixing or avoiding the bad result).
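As a rough sketch of that RDTSCP idea (assuming GNU C on x86, where __rdtscp() comes from <x86intrin.h>): read the TSC together with the "processor ID" on both sides of the timed region and treat the result as suspect if the ID changed.
#include <stdint.h>
#include <x86intrin.h>

/* Returns 1 if the interval is trustworthy (no CPU migration), 0 otherwise. */
static int timed_region(uint64_t *elapsed)
{
    unsigned aux0, aux1;
    uint64_t t0 = __rdtscp(&aux0);
    /* ... code being timed ... */
    uint64_t t1 = __rdtscp(&aux1);
    *elapsed = t1 - t0;
    return aux0 == aux1;   /* different IDs mean we ran on two CPUs */
}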
So; to cut a long story short, if you want an accurate performance measurement you're mostly screwed. The best you can realistically hope for is an accurate time measurement; but only in some cases (e.g. when running on a single-CPU machine or "pinned" to a specific CPU; or when using RDTSCP on OSs that set it up properly as long as you detect and discard invalid values).
Of course even then you'll get dodgy measurements because of things like IRQs. For this reason; it's best to run your code many times in a loop and discard any results that are too much higher than other results.
Finally, if you really want to do it properly you should measure the overhead of measuring. To do this you'd measure how long it takes to do nothing (just the RDTSC/RDTSCP instruction alone, while discarding dodgy measurements); then subtract the overhead of measuring from the "measuring something" results. This gives you a better estimate of the time "something" actually takes.
Note: If you can dig up a copy of Intel's System Programming Guide from when Pentium was first released (mid 1990s - not sure if it's available online anymore - I have archived copies since the 1980s) you'll find that Intel documented the time stamp counter as something that "can be used to monitor and identify the relative time of occurrence of processor events". They guaranteed that (excluding 64-bit wrap-around) it would monotonically increase (but not that it would increase at a fixed rate) and that it'd take a minimum of 10 years before it wrapped around. The latest revision of the manual documents the time stamp counter with more detail, stating that for older CPUs (P6, Pentium M, older Pentium 4) the time stamp counter "increments with every internal processor clock cycle" and that "Intel(r) SpeedStep(r) technology transitions may impact the processor clock"; and that newer CPUs (newer Pentium 4, Core Solo, Core Duo, Core 2, Atom) the TSC increments at a constant rate (and that this is the "architectural behaviour moving forward"). Essentially, from the very beginning it was a (variable) "internal cycle counter" to be used for a time-stamp (and not a time counter to be used to track "wall clock" time), and this behaviour changed soon after the year 2000 (based on Pentium 4 release date).
Do not use the average value
Use the smallest value, or the average of the smaller values (averaging helps because of caches), because the bigger ones have been interrupted by OS multitasking.
You could also record all values, find the OS scheduling-granularity boundary, and filter out all values beyond it (usually > 1 ms, which is easily detectable).
No need to measure the overhead of RDTSC
You just measure with some offset, and the same offset is present in both timestamps, so it cancels out after subtraction.
For a variable-clock-rate TSC source (as on laptops)
You should push the CPU to its maximum speed with a steady, intensive computation loop; usually a few seconds are enough. You should measure the CPU frequency continuously and only start measuring your code once it is stable enough.
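A minimal sketch of the "use the smallest value" idea, assuming x86 GNU C where __rdtsc() comes from <x86intrin.h>: take many samples and keep the minimum, so readings inflated by interrupts or task switches are discarded.
#include <stdint.h>
#include <x86intrin.h>

static uint64_t min_rdtsc_delta(int samples)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t t0 = __rdtsc();
        /* ... thing being measured ... */
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;   /* keep the least-disturbed sample */
    }
    return best;
}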
If your code starts off on one processor and then swaps to another, the timestamp difference may be negative, due to processors sleeping etc.
Try setting the processor affinity before you start measuring.
I can't see if you are running under Windows or Linux from the question, so I'll answer for both.
Windows:
DWORD_PTR affinityMask = 0x00000001;
SetProcessAffinityMask(GetCurrentProcess(), affinityMask);   // needs <windows.h>; takes a process handle, not a PID
Linux:
// needs #define _GNU_SOURCE before the includes, plus <sched.h> and <unistd.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
sched_setaffinity(getpid(), sizeof(cpuset), &cpuset);
The other answers are great (go read them), but they assume that rdtsc is being read correctly. This answer addresses the inline-asm bug that leads to totally bogus results, including negative ones.
The other possibility is that you were compiling this as 32-bit code, but with many more repeats, and got an occasional negative interval on CPU migration on a system that doesn't have invariant-TSC (synced TSCs across all cores). Either a multi-socket system, or an older multi-core. CPU TSC fetch operation especially in multicore-multi-processor environment.
If you were compiling for x86-64, your negative results are fully explained by your incorrect "=A" output constraint for asm. See Get CPU cycle count? for correct ways to use rdtsc that are portable to all compilers and 32 vs. 64-bit mode. Or use "=a" and "=d" outputs and simply ignore the high half output, for short intervals that won't overflow 32 bits.)
(I'm surprised you didn't mention them also being huge and wildly-varying, as well as overflowing tot to give a negative average even if no individual measurements were negative. I'm seeing averages like -63421899, or 69374170, or 115365476.)
Compiling it with gcc -O3 -m32 makes it work as expected, printing averages of 24 to 26 (if run in a loop so the CPU stays at top speed, otherwise like 125 reference cycles for the 24 core clock cycles between back-to-back rdtsc on Skylake). https://agner.org/optimize/ for instruction tables.
Asm details of what went wrong with the "=A" constraint
rdtsc (insn ref manual entry) always produces the two 32-bit hi:lo halves of its 64-bit result in edx:eax, even in 64-bit mode where we'd really rather have it in a single 64-bit register.
You were expecting the "=A" output constraint to pick edx:eax for uint64_t t. But that's not what happens. For a variable that fits in one register, the compiler picks either RAX or RDX and assumes the other is unmodified, just like a "=r" constraint picks one register and assumes the rest are unmodified. Or an "=Q" constraint picks one of a,b,c, or d. (See x86 constraints).
In x86-64, you'd normally only want "=A" for an unsigned __int128 operand, like a multiply result or div input. It's kind of a hack because using %0 in the asm template only expands to the low register, and there's no warning when "=A" doesn't use both the a and d registers.
To see exactly how this causes a problem, I added a comment inside the asm template:
__asm__ volatile ("rdtsc # compiler picked %0" : "=A"(t));. So we can see what the compiler expects, based on what we told it with operands.
The resulting loop (in Intel syntax) looks like this, from compiling a cleaned up version of your code on the Godbolt compiler explorer for 64-bit gcc and 32-bit clang:
# the main loop from gcc -O3 targeting x86-64, my comments added
.L6:
rdtsc # compiler picked rax # c1 = rax
rdtsc # compiler picked rdx # c2 = rdx, not realizing that rdtsc clobbers rax(c1)
# compiler thinks RAX=c1, RDX=c2
# actual situation: RAX=low half of c2, RDX=high half of c2
sub edx, eax # tsccost = edx-eax
js .L3 # jump if the sign-bit is set in tsccost
... rest of loop back to .L6
When the compiler is calculating c2-c1, it's actually calculating hi-lo from the 2nd rdtsc, because we lied to the compiler about what the asm statement does. The 2nd rdtsc clobbered c1.
We told it that it had a choice of which register to get the output in, so it picked one register the first time, and the other the 2nd time, so it wouldn't need any mov instructions.
The TSC counts reference cycles since the last reboot. But the code doesn't depend on hi<lo, it just depends on the sign of hi-lo. Since lo wraps around every second or two (2^32 Hz is close to 4.3GHz), running the program at any given time has approximately a 50% chance of seeing a negative result.
It doesn't depend on the current value of hi; there's maybe a 1 part in 2^32 bias in one direction or the other because hi changes by one when lo wraps around.
Since hi-lo is a nearly uniformly distributed 32-bit integer, overflow of the average is very common. Your code is ok if the average is normally small. (But see other answers for why you don't want the mean; you want the median or something else that excludes outliers.)
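For reference, a version along the lines this answer suggests (explicit "=a" and "=d" outputs, combined into 64 bits) looks something like the sketch below; or just use __rdtsc() from <x86intrin.h>.
#include <stdint.h>

static inline uint64_t get_cycles_fixed(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));   /* EDX:EAX = TSC */
    return ((uint64_t)hi << 32) | lo;
}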
The principal point of my question was not the accuracy of the result, but the fact that I am getting negative values every now and then (the first call to rdtsc gives a bigger value than the second call).
Doing more research (and reading other questions on this website), I found out that a way to get things working when using rdtsc is to put a cpuid instruction just before it. This instruction serializes the code. This is how I am doing things now:
static inline uint64_t get_cycles()
{
uint64_t t;
volatile int dont_remove __attribute__((unused));
unsigned tmp;
__asm volatile ("cpuid" : "=a"(tmp), "=b"(tmp), "=c"(tmp), "=d"(tmp)
: "a" (0));
dont_remove = tmp;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
I am still getting a NEGATIVE difference between the second and first calls of the get_cycles function. WHY? I am not 100% sure about the syntax of the cpuid inline assembly; it is what I found looking on the internet.
In the face of thermal and idle throttling, mouse-motion and network traffic interrupts, whatever it's doing with the GPU, and all the other overhead that a modern multicore system can absorb without anyone much caring, I think your only reasonable course for this is to accumulate a few thousand individual samples and just toss the outliers before taking the median or mean (not a statistician but I'll venture it won't make much difference here).
I'd think anything you do to eliminate the noise of a running system will skew the results much worse than just accepting that there's no way you'll ever be able to reliably predict how long it'll take anything to complete these days.
rdtsc can be used to get a reliable and very precise elapsed time. If using linux you can see if your processor supports a constant rate tsc by looking in /proc/cpuinfo to see if you have constant_tsc defined.
Make sure that you stay on the same core. Every core has its own tsc which has its own value. To use rdtsc make sure that you either taskset, or SetThreadAffinityMask (windows) or pthread_setaffinity_np to ensure that your process stays on the same core.
Then you divide the tick difference by your clock rate, which on Linux can be found in /proc/cpuinfo, or you can determine it at runtime:
rdtsc
clock_gettime
sleep for 1 second
clock_gettime
rdtsc
Then see how many ticks elapsed in that second, and you can divide any difference in ticks by that rate to find out how much time has elapsed.
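A rough sketch of that calibration sequence (rdtsc, clock_gettime, sleep, clock_gettime, rdtsc), assuming GNU C on Linux with __rdtsc() from <x86intrin.h>:
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <x86intrin.h>

int main(void)
{
    struct timespec ts0, ts1;
    uint64_t c0 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &ts0);
    sleep(1);
    clock_gettime(CLOCK_MONOTONIC, &ts1);
    uint64_t c1 = __rdtsc();

    double secs = (ts1.tv_sec - ts0.tv_sec) + (ts1.tv_nsec - ts0.tv_nsec) * 1e-9;
    printf("~%.0f TSC ticks per second\n", (double)(c1 - c0) / secs);
    return 0;
}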
If the thread that is running your code is moving between cores then it's possible that the rdtsc value returned is less than the value read on another core. The cores don't all set the counter to 0 at exactly the same time when the package powers up. So make sure you set thread affinity to a specific core when you run your test.
I tested your code on my machine and found that with this RDTSC function only the low 32 bits (a uint32_t) are usable.
I do the following in my code to correct for 32-bit wraparound:
if (after_t < before_t) { diff_t = after_t + 4294967296ULL - before_t; }  /* low half wrapped around */
I need to find the time taken to execute a single instruction or a few instructions and print it out in milliseconds. Can someone please share a small code snippet for this?
Thanks. I need this to measure the time taken to execute some instructions in my project.
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t t1 = clock();
    printf("Dummy Statement\n");
    clock_t t2 = clock();
    printf("The time taken is.. %g seconds\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
Please look at the links below too.
What’s the correct way to use printf to print a clock_t?
http://www.velocityreviews.com/forums/t454464-c-get-time-in-milliseconds.html
One instruction will take far less than 1 millisecond to execute. And if you are trying to measure more than one instruction it will get complicated (what about the loop that calls the instructions multiple times?).
Also, most timing functions that you can use are just that: functions. That means they will execute instructions also. If you want to time one instruction then the best bet is to look up the specifications of the processor that you are using and see how many cycles it takes.
Doing this programmatically isn't possible.
Edit:
Since you've updated your question to refer to several instructions: you can measure sub-millisecond time on some processors. It would be nice to know the environment. This will work on x86 and Linux; other environments will be different.
clock_gettime allows for nanosecond resolution. Or you can call the rdtsc instruction yourself (good luck with this on a multiprocessor or SMP system - you could be measuring the wrong thing, e.g. by having the instruction run on different processors).
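For example, a minimal sketch of timing a short piece of work with clock_gettime(CLOCK_MONOTONIC) on Linux; the loop body here is just a stand-in for the instructions you care about.
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    volatile long sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 1000; i++)        /* stand-in for the code under test */
        sink += i;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
    printf("elapsed: %ld ns\n", ns);
    return 0;
}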
The time to actually complete an instruction depends on the clock cycle time, and the depth of the pipeline the instruction traverses through the processor. As dave said, you can't really find this out by making a program. You can use some kind of timing function provided to you by your OS to measure the cpu time it takes to complete some small set of instructions. If you do this, try not to use any kind of instructions that rely on memory, or branching. Ideally you might do some kind of logical or arithmetic operations (so perhaps using some inline assembly in C).
Is there a way using C or assembler or maybe even C# to get an accurate measure of how long it takes to execute a ADD instruction?
Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.
On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.
On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:
.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
Add eax, ebx
; end of sequence under test
CPUID
RDTSC
Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a few details, of course -- at minimum you need to:
set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract result from the second RDTSC from the first
Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.
Then execute the same loop again, this time with the code under test in the body. Time for this loop, minus the overhead (from the empty loop case) is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.
Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If its a significant chunk of code, a few 10's of thousands might suffice.
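A sketch of that harness in C; the iteration count and the loop body are placeholders, and the volatiles are just there to keep the compiler from deleting the loops.
#include <stdio.h>
#include <time.h>

#define N_ITER 10000000L

int main(void)
{
    volatile long sink = 0;
    clock_t a = clock();
    for (volatile long i = 0; i < N_ITER; i++)
        ;                                   /* empty loop: looping overhead only */
    clock_t b = clock();
    for (volatile long i = 0; i < N_ITER; i++)
        sink += i;                          /* loop with the code under test */
    clock_t c = clock();

    double overhead = (double)(b - a) / CLOCKS_PER_SEC;
    double total    = (double)(c - b) / CLOCKS_PER_SEC;
    printf("per-iteration cost: %g ns\n", (total - overhead) / N_ITER * 1e9);
    return 0;
}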
In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for how to get a measurement w/o the repetition, but the repetition technique is always available, for example, an embedded processor that doesn't have the nice timing instructions mentioned by others.
Note however, that on modern pipeline processors, instruction level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of an given instruction take N times as long as a single one.
Okay, the problem you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS and all the others is that there are lots of processes already running on your machine in the background, which will impact performance. The only real way of calculating the actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and these results are available to the public. All you need to do is ask them and they'll send you a free CD-ROM (it might be a DVD - nonsense pedantry) with the results contained. You can do it yourself, but be warned that especially Intel processors contain many redundant instructions that are no longer desirable, let alone necessary. This will take up a lot of your time, but I can absolutely see the fun in doing this. PS. If it's purely to help push your own machine's hardware to its theoretical maximum in a personal project that you're doing, then Just Jeff's answer above is excellent for generating tidy instruction-speed averages under real-world conditions.
No, but you can calculate it based upon the number of clock cycles the add instruction requires multiplied by the clock rate of the CPU. Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
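For example (a rough back-of-the-envelope figure, not a measurement): if an ADD takes 1 clock cycle on a 3 GHz CPU, that works out to about 1 / (3 × 10^9) s ≈ 0.33 ns per instruction.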
That said, why do you care?
I'm writing some micro-benchmarking code for some very short operations in C. For example, one thing I'm measuring is how many cycles are needed to call an empty function depending on the number of arguments passed.
Currently, I'm timing using an RDTSC instruction before and after each operation to get the CPU's cycle count. However, I'm concerned that instructions issued before the first RDTSC may slow down the actual instructions I'm measuring. I'm also worried that the full operation may not be complete before the second RDTSC gets issued.
Does anyone know of an x86 instruction that forces all in-flight instructions to commit before any new instructions are issued? I've been told CPUID might do this, but I've been unable to find any documentation that says so.
To my knowledge, there is no instruction which specifically "drains" the pipeline. This can easily be accomplished, though, by using a serializing instruction.
CPUID is a serializing instruction, which is exactly what you're looking for. Every instruction issued before it is guaranteed to execute before the CPUID instruction.
So doing the following should get the desired effect:
cpuid
rdtsc
# stuff
cpuid
rdtsc
But, as an aside, I don't recommend that you do this. Your "stuff" can still be affected by a lot of other things outside of your control (such as CPU caches, other processes running on the system, etc.), and you'll never be able to eliminate them all. The best way to get accurate performance statistics is to perform the operation(s) you want to measure at least several million times and average the execution time of the batch.
Edit: Most instruction references for CPUID will mention its serializing properties, such as the NASM manual, appendix B.
Edit 2: Also might want to take a look at this related question.