Very fast memcpy for image processing?

Very fast memcpy for image processing? - c

I am doing image processing in C that requires copying large chunks of data around memory - the source and destination never overlap.
What is the absolute fastest way to do this on the x86 platform using GCC (where SSE, SSE2 but NOT SSE3 are available)?
I expect the solution will either be in assembly or using GCC intrinsics?
I found the following link but have no idea whether it's the best way to go about it (the author also says it has a few bugs): http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2006-02/msg00123.html
EDIT: note that a copy is necessary, I cannot get around having to copy the data (I could explain why but I'll spare you the explanation :))

Courtesy of William Chan and Google. 30-70% faster than memcpy in Microsoft Visual Studio 2005.
void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{
__asm
{
mov esi, src; //src pointer
mov edi, dest; //dest pointer
mov ebx, size; //ebx is our counter
shr ebx, 7; //divide by 128 (8 * 128bit registers)
loop_copy:
prefetchnta 128[ESI]; //SSE2 prefetch
prefetchnta 160[ESI];
prefetchnta 192[ESI];
prefetchnta 224[ESI];
movdqa xmm0, 0[ESI]; //move data from src to registers
movdqa xmm1, 16[ESI];
movdqa xmm2, 32[ESI];
movdqa xmm3, 48[ESI];
movdqa xmm4, 64[ESI];
movdqa xmm5, 80[ESI];
movdqa xmm6, 96[ESI];
movdqa xmm7, 112[ESI];
movntdq 0[EDI], xmm0; //move data from registers to dest
movntdq 16[EDI], xmm1;
movntdq 32[EDI], xmm2;
movntdq 48[EDI], xmm3;
movntdq 64[EDI], xmm4;
movntdq 80[EDI], xmm5;
movntdq 96[EDI], xmm6;
movntdq 112[EDI], xmm7;
add esi, 128;
add edi, 128;
dec ebx;
jnz loop_copy; //loop please
loop_copy_end:
}
}
You may be able to optimize it further depending on your exact situation and any assumptions you are able to make.
You may also want to check out the memcpy source (memcpy.asm) and strip out its special case handling. It may be possible to optimise further!

The SSE-Code posted by hapalibashi is the way to go.
If you need even more performance and don't shy away from the long and winding road of writing a device-driver: All important platforms nowadays have a DMA-controller that is capable of doing a copy-job faster and in parallel to CPU code could do.
That involves writing a driver though. No big OS that I'm aware of exposes this functionality to the user-side because of the security risks.
However, it may be worth it (if you need the performance) since no code on earth could outperform a piece of hardware that is designed to do such a job.

This question is four years old now and I'm a little surprised nobody has mentioned memory bandwidth yet. CPU-Z reports that my machine has PC3-10700 RAM. That the RAM has a peak bandwidth (aka transfer rate, throughput etc) of 10700 MBytes/sec. The CPU in my machine is an i5-2430M CPU, with peak turbo frequency of 3 GHz.
Theoretically, with an infinitely fast CPU and my RAM, memcpy could go at 5300 MBytes/sec, ie half of 10700 because memcpy has to read from and then write to RAM. (edit: As v.oddou pointed out, this is a simplistic approximation).
On the other hand, imagine we had infinitely fast RAM and a realistic CPU, what could we achieve? Let's use my 3 GHz CPU as an example. If it could do a 32-bit read and a 32-bit write each cycle, then it could transfer 3e9 * 4 = 12000 MBytes/sec. This seems easily within reach for a modern CPU. Already, we can see that the code running on the CPU isn't really the bottleneck. This is one of the reasons that modern machines have data caches.
We can measure what the CPU can really do by benchmarking memcpy when we know the data is cached. Doing this accurately is fiddly. I made a simple app that wrote random numbers into an array, memcpy'd them to another array, then checksumed the copied data. I stepped through the code in the debugger to make sure that the clever compiler had not removed the copy. Altering the size of the array alters the cache performance - small arrays fit in the cache, big ones less so. I got the following results:
40 KByte arrays: 16000 MBytes/sec
400 KByte arrays: 11000 MBytes/sec
4000 KByte arrays: 3100 MBytes/sec
Obviously, my CPU can read and write more than 32 bits per cycle, since 16000 is more than the 12000 I calculated theoretically above. This means the CPU is even less of a bottleneck than I already thought. I used Visual Studio 2005, and stepping into the standard memcpy implementation, I can see that it uses the movqda instruction on my machine. I guess this can read and write 64 bits per cycle.
The nice code hapalibashi posted achieves 4200 MBytes/sec on my machine - about 40% faster than the VS 2005 implementation. I guess it is faster because it uses the prefetch instruction to improve cache performance.
In summary, the code running on the CPU isn't the bottleneck and tuning that code will only make small improvements.

At any optimisation level of -O1 or above, GCC will use builtin definitions for functions like memcpy - with the right -march parameter (-march=pentium4 for the set of features you mention) it should generate pretty optimal architecture-specific inline code.
I'd benchmark it and see what comes out.

If specific to Intel processors, you might benefit from IPP. If you know it will run with an Nvidia GPU perhaps you could use CUDA - in both cases it may be better to look wider than optimising memcpy() - they provide opportunities for improving your algorithm at a higher level. They are both however reliant on specific hardware.

If you're on Windows, use the DirectX APIs, which has specific GPU-optimized routines for graphics handling (how fast could it be? Your CPU isn't loaded. Do something else while the GPU munches it).
If you want to be OS agnostic, try OpenGL.
Do not fiddle with assembler, because it is all too likely that you'll fail miserably to outperform 10 year+ proficient library-making software engineers.

If you have access to a DMA engine, nothing will be faster.

Related

Why in the C language loop, accessing array elements is so much slower than accessing variables？ [duplicate]

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.
What / describe the rules must be followed to achieve this?
I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)
At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.
As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
An instruction can be issued on a given cycle if:
No instructions that affect its operands are still being executed. And:
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. addps and subps use the same functional unit? How do I determine this?). And:
Fewer than the superscalar width (typically 4) number of instructions have already been issued this cycle.
If no instructions can be issued, the processor simply doesn't issue any—a condition called a "stall".
As an example, consider the following example code (which computes a cross-product):
shufps xmm3, xmm2, 210
shufps xmm0, xmm1, 201
shufps xmm2, xmm2, 201
mulps xmm0, xmm3
shufps xmm1, xmm1, 210
mulps xmm1, xmm2
subps xmm0, xmm1
My attempt to predict the latency for Haswell looks something like this:
; `mulps` Haswell latency=5, CPI=0.5
; `shufps` Haswell latency=1, CPI=1
; `subps` Haswell latency=3, CPI=1
shufps xmm3, xmm2, 210 ; cycle 1
shufps xmm0, xmm1, 201 ; cycle 2
shufps xmm2, xmm2, 201 ; cycle 3
mulps xmm0, xmm3 ; (superscalar execution)
shufps xmm1, xmm1, 210 ; cycle 4
mulps xmm1, xmm2 ; cycle 5
; cycle 6 (stall `xmm0` and `xmm1`)
; cycle 7 (stall `xmm1`)
; cycle 8 (stall `xmm1`)
subps xmm0, xmm1 ; cycle 9
; cycle 10 (stall `xmm0`)

TL:DR: look for dependency chains, especially loop-carried ones. For a long-running loop, see which latency, front-end throughput, or back-end port contention/throughput is the worst bottleneck. That's how many cycles your loop probably takes per iteration, on average, if there are no cache misses or branch mispredicts.
Latency bounds and throughput bounds for processors for operations that must occur in sequence is a good example of analyzing loop-carried dependency chains in a specific loop with two dep chains, one pulling values from the other.
Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and how what that means for sequences of multiple instructions. See also Assembly - How to score a CPU instruction by latency and throughput for how to measure a single instruction.
This is called static (performance) analysis. Wikipedia says (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) that AMD's AMD CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.
Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?
IACA is not bad, but has bugs (e.g. wrong data for shld on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).
Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.
Essential reading:
Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.
The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).
Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.
David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.
See also other performance links in the x86 tag wiki.
I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.
In Intel terminology:
"issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (With later uarches like Haswell and especially Skylake often actually coming very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, as well as back-end and cache bandwidth improvements.) This is fused-domain uops: micro-fusion lets you send 2 uops through the front-end and only take up one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.
"dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. How are x86 uops scheduled, exactly?. Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).
A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and the names of hardware performance counters like uops_issued.any or uops_dispatched_port.port_5.
exactly how long arbitrary arithmetical x86-64 assembly code will take
It depends on the surrounding code as well, because of OoO exec
Your final subps result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.
Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
There are three major dimensions to analyze for a short block
You can approximately characterize a short block of non-branching code in terms of these three factors. Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple different blocks, if they're not so long that OoO window size prevents finding all the ILP.
latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
Generally you can assuming best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen some. (How are x86 uops scheduled, exactly?)
Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run psadbw on port 0 (5c latency, 1c throughput, i.e. CPI=1) but it's a single uop so a mix of 1 psadbw + 3 add instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like imul eax, ecx, 10 (1 uop, 3c latency on Intel) vs. lea eax, [rcx + rcx*4] / add eax,eax (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.
See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
Every (?) functional unit is pipelined
The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one divps mixed in with dozens of mulps / addps, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. rcpps + a Newton iteration is worse throughput and about the same latency.
Everything else is fully pipelined on mainstream Intel CPUs; multi-cycle (reciprocal) throughput for a single uop. (variable-count integer shifts like shl eax, cl have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an add or something, you can get better throughput and latency.)
On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's imul ecx, edx is only 1 uop, but with 4c latency, 2c throughput.
Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)
No.
Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.
The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.
Even the silvermont-based Knight's Landing (in Xeon Phi) does OoO exec for SIMD.
x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)
My attempt to predict the latency for Haswell looks something like this:
Yup, that looks right. shufps runs on port 5, addps runs on p1, mulps runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).
This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015) which covers that array-of-structs vs. struct-of-arrays issues for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.

What is the difference between inline assembly call and direct call? [duplicate]

I saw this post on SO which contains C code to get the latest CPU Cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (windows and linux solutions welcome)? Although written in C (and C being a subset of C++) I am not too certain if this code would work in a C++ project and if not, how to translate it?
I am using x86-64
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for windows....?)
static inline uint64_t get_cycles()
{
uint64_t t;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
EDIT3:
From above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data
type'"
Could someone please help?

Starting from GCC 4.5 and later, the __rdtsc() intrinsic is now supported by both MSVC and GCC.
But the include that's needed is different:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer before GCC 4.5.
Pulled directly out of one of my projects:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#endif
This GNU C Extended asm tells the compiler:
volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
"=a"(lo) and "=d"(hi) : the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
((uint64_t)hi << 32) | lo - zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.
(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info

Your inline asm is broken for x86-64. "=A" in 64-bit mode lets the compiler pick either RAX or RDX, not EDX:EAX. See this Q&A for more
You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like #Mysticial's.
(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers could do that optimization for you with a uint32_t time_low = __rdtsc() intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.
Intel's intriniscs guide says _rdtsc (with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h> (MSVC) vs. <x86intrin.h> (everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.
Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc() as returning (signed) __int64.
// valid C99 and C++
#include <stdint.h> // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
uint64_t tsc = __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
return tsc;
}
// requires a Nehalem or newer CPU. Not Core2 or earlier. IDK when AMD added it.
inline
uint64_t readTSCp() {
unsigned dummy;
return __rdtscp(&dummy); // waits for earlier insns to retire, but allows later to start
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer, including a couple test callers.
These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.
The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc in immintrin.h, but doesn't have an x86intrin.h at all. More recent ICC have x86intrin.h, at least the way Godbolt installs them for Linux they do.
You might want to define them as signed long long, especially if you want to subtract them and convert to float. int64_t -> float/double is more efficient than uint64_t on x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
BTW, clang also has a portable __builtin_readcyclecounter() which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs
For more about using lfence (or cpuid) to improve repeatability of rdtsc and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see #HadiBrais' answer on clflush to invalidate cache line via C function and the comments for an example of the difference it makes.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid to serialize.) It's always been defined as partially-serializing on Intel.
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64
Instruction Set Architectures, an Intel white-paper from 2010.
rdtsc counts reference cycles, not CPU core clock cycles
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock).
The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
Microbenchmarking is hard: see Idiomatic way of performance evaluation? for other pitfalls.
Instead of TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdmsr in user-space, or simpler ways include tricks like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID.
You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)
Negative clock cycle measurements with back-to-back rdtsc? the history of RDTSC: originally CPUs didn't do power-saving, so the TSC was both real-time and core clocks. Then it evolved through various barely-useful steps into its current form of a useful low-overhead timesource decoupled from core clock cycles (constant_tsc), which doesn't stop when the clock halts (nonstop_tsc). Also some tips, e.g. don't take the mean time, take the median (there will be very high outliers).
std::chrono::clock, hardware clock and cycle count
Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
measuring code execution times in C using RDTSC instruction lists some gotchas, including SMI (system-management interrupts) which you can't avoid even in kernel mode with cli), and virtualization of rdtsc under a VM. And of course basic stuff like regular interrupts being possible, so repeat your timing many times and throw away outliers.
Determine TSC frequency on Linux. Programatically querying the TSC frequency is hard and maybe not possible, especially in user-space, or may give a worse result than calibrating it. Calibrating it using another known time-source takes time. See that question for more about how hard it is to convert TSC to nanoseconds (and that it would be nice if you could ask the OS what the conversion ratio is, because the OS already did it at bootup).
If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds. Otherwise, use a high-resolution library time function like std::chrono or clock_gettime. See faster equivalent of gettimeofday for some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoid rdtsc entirely if your precision requirement is low enough for a timer interrupt or thread to update it.
See also Calculate system time using rdtsc about finding the crystal frequency and multiplier.
CPU TSC fetch operation especially in multicore-multi-processor environment says that Nehalem and newer have the TSC synced and locked together for all cores in a package (along with the invariant = constant and nonstop TSC feature). See #amdn's answer there for some good info about multi-socket sync.
(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see #amdn's answer on the linked question, and more details below.)
CPUID features relevant to the TSC
Using the names that Linux /proc/cpuinfo uses for the CPU features, and other aliases for the same feature that you'll also find.
tsc - the TSC exists and rdtsc is supported. Baseline for x86-64.
rdtscp - rdtscp is supported.
tsc_deadline_timer CPUID.01H:ECX.TSC_Deadline[bit 24] = 1 - local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
constant_tsc: Support for the constant TSC feature is determined by checking the CPU family and model numbers. The TSC ticks at constant frequency regardless of changes in core clock speed. Without this, RDTSC does count core clock cycles.
nonstop_tsc: This feature is called the invariant TSC in the Intel SDM manual and is supported on processors with CPUID.80000007H:EDX[8]. The TSC keeps ticking even in deep sleep C-states. On all x86 processors, nonstop_tsc implies constant_tsc, but constant_tsc doesn't necessarily imply nonstop_tsc. No separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both constant_tsc and nonstop_tsc features. See Linux's x86/kernel/cpu/intel.c detection code, and amd.c was similar.
Some of the processors (but not all) that are based on the Saltwell/Silvermont/Airmont even keep TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3. This is called always-on TSC. (Although it seems the ones based on Airmont were never released.)
For more details on constant and invariant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.
tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1) The IA32_TSC_ADJUST MSR is available, allowing OSes to set an offset that's added to the TSC when rdtsc or rdtscp reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
constant_tsc and nonstop_tsc together make the TSC usable as a timesource for things like clock_gettime in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable
The comments in the Linux source code also indicate that constant_tsc / nonstop_tsc features (on Intel) implies "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"
The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets. Apparently platform vendors do in practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environment also agree that all sockets on a single motherboard should start out in sync.
On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel, by default performs boot-time and run-time checks to make sure that TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource' would tell you whether the kernel is using TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system. The kernel paramter tsc=reliable can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.
There are cases where cross-socket TSCs may NOT be in sync: (1) hotplugging a CPU, (2) when the sockets are spread out across different boards connected by extended node controllers, (3) a TSC may not be resynced after waking up from a C-state in which the TSC is powered-downed in some processors, and (4) different sockets have different CPU models installed.
An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)
If you're using rdtsc directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
How efficient is the asm from using the intrinsic?
It's about as good as you'd get from #Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.
A non-inline version of the readTSC function itself compiles with MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
However, we can maybe get the best of both worlds with a modified version of #Mysticial's code:
// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
// long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.
unsigned long lo,hi; // let the compiler know that zero-extension to 64 bits isn't required
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) + lo;
// + allows LEA or ADD instead of OR
}
On Godbolt, this does sometimes give better asm than __rdtsc() for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)
But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc() function itself with an actual add/adc with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with | instead of +, but definitely prefer the __rdtsc() intrinsic if you care about 32-bit code-gen from gcc).

VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.
In this case, that's probably just as well -- rdtsc has (at least) two major problem when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).
Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.
Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.
If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid
rdtsc
; save eax, edx
; code you're going to time goes here
xor eax, eax
cpuid
rdtsc
I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).
Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
Along with that, you want to use whatever means your OS supplies to force this all to run on one process/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution spee.
Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.

For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:
unsigned __int64 __rdtsc(void);

Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross architecture wrapper for performance events.
This answer similar: Quick way to count number of instructions executed in a C program but with PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS. This answer will focus on PERF_COUNT_HW_CPU_CYCLES specifics, see that other answer for more generic information.
Here is an example based on the one provided at the end of the man page.
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}
The results seem reasonable, e.g. if I print cycles then recompile for instruction counts, we get about 1 cycle per iteration (2 instructions done in a single cycle) possibly due to effects such as superscalar execution, with slightly different results for each run presumably due to random memory access latencies.
You might also be interested in PERF_COUNT_HW_REF_CPU_CYCLES, which as the manpage documents:
Total cycles; not affected by CPU frequency scaling.
so this will give something closer to the real wall time if your frequency scaling is on. These were 2/3x larger than PERF_COUNT_HW_INSTRUCTIONS on my quick experiments, presumably because my non-stressed machine is frequency scaled now.

Why does Knuth use this clunky decrement?

I'm looking at some of Prof. Don Knuth's code, written in CWEB that is converted to C. A specific example is dlx1.w, available from Knuth's website
At one stage, the .len value of a struct nd[cc] is decremented, and it is done in a clunky way:
o,t=nd[cc].len-1;
o,nd[cc].len=t;
(This is a Knuth-specific question, so maybe you already know that "o," is a preprocessor macro for incrementing "mems", which is a running total of effort expended, as measured by accesses to 64-bit words.) The value remaining in "t" is definitely not used for anything else. (The example here is on line 665 of dlx1.w, or line 193 of dlx1.c after ctangle.)
My question is: why does Knuth write it this way, rather than
nd[cc].len--;
which he does actually use elsewhere (line 551 of dlx1.w):
oo,nd[k].len--,nd[k].aux=i-1;
(And "oo" is a similar macro for incrementing "mems" twice -- but there is some subtlety here, because .len and .aux are stored in the same 64-bit word. To assign values to S.len and S.aux, only one increment to mems would normally be counted.)
My only theory is that a decrement consists of two memory accesses: first to look up, then to assign. (Is that correct?) And this way of writing it is a reminder of the two steps. This would be unusually verbose of Knuth, but maybe it is an instinctive aide-memoire rather than didacticism.
For what it's worth, I've searched in CWEB documentation without finding an answer. My question probably relates more to Knuth's standard practices, which I am picking up bit by bit. I'd be interested in any resources where these practices are laid out (and maybe critiqued) as a block -- but for now, let's focus on why Knuth writes it this way.

A preliminary remark: with Knuth-style literate programming (i.e. when reading WEB or CWEB programs) the “real” program, as conceived by Knuth, is neither the “source” .w file nor the generated (tangled) .c file, but the typeset (woven) output. The source .w file is best thought of as a means to produce it (and of course also the .c source that's fed to the compiler). (If you don't have cweave and TeX handy; I've typeset some of these programs here; this program DLX1 is here.)
So in this case, I'd describe the location in the code as module 25 of DLX1, or subroutine "cover":
Anyway, to return to the actual question: note that this (DLX1) is one of the programs written for The Art of Computer Programming. Because reporting the time taken by a program “seconds” or “minutes” becomes meaningless from year to year, he reports how long a program took in number of “mems” plus “oops”, that's dominated by the “mems”, i.e. the number of memory accesses to 64-bit words (usually). So the book contains statements like “this program finds the answer to this problem in 3.5 gigamems of running time”. Further, the statements are intended to be fundamentally about the program/algorithm itself, not the specific code generated by a specific version of a compiler for certain hardware. (Ideally when the details are very important he writes the program in MMIX or MMIXAL and analyses its operations on the MMIX hardware, but this is rare.) Counting the mems (to be reported as above) is the purpose of inserting o and oo instructions into the program. Note that it's more important to get this right for the “inner loop” instructions that are executed a lot of times, such as everything in the subroutine cover in this case.
This is elaborated in Section 1.3.1′ (part of Fascicle 1):
Timing. […] The running time of a program depends not only on the clock rate but also on the number of functional units that can be active simultaneously and the degree to which they are pipelined; it depends on the techniques used to prefetch instructions before they are executed; it depends on the size of the random-access memory that is used to give the illusion of 264 virtual bytes; and it depends on the sizes and allocation strategies of caches and other buffers, etc., etc.
For practical purposes, the running time of an MMIX program can often be estimated satisfactorily by assigning a fixed cost to each operation, based on the approximate running time that would be obtained on a high-performance machine with lots of main memory; so that’s what we will do. Each operation will be assumed to take an integer number of υ, where υ (pronounced “oops”) is a unit that represents the clock cycle time in a pipelined implementation. Although the value of υ decreases as technology improves, we always keep up with the latest advances because we measure time in units of υ, not in nanoseconds. The running time in our estimates will also be assumed to depend on the number of memory references or mems that a program uses; this is the number of load and store instructions. For example, we will assume that each LDO (load octa) instruction costs µ + υ, where µ is the average cost of a memory reference. The total running time of a program might be reported as, say, 35µ+ 1000υ, meaning “35 mems plus 1000 oops.” The ratio µ/υ has been increasing steadily for many years; nobody knows for sure whether this trend will continue, but experience has shown that µ and υ deserve to be considered independently.
And he does of course understand the difference from reality:
Even though we will often use the assumptions of Table 1 for seat-of-the-pants estimates of running time, we must remember that the actual running time might be quite sensitive to the ordering of instructions. For example, integer division might cost only one cycle if we can find 60 other things to do between the time we issue the command and the time we need the result. Several LDB (load byte) instructions might need to reference memory only once, if they refer to the same octabyte. Yet the result of a load command is usually not ready for use in the immediately following instruction. Experience has shown that some algorithms work well with cache memory, and others do not; therefore µ is not really constant. Even the location of instructions in memory can have a significant effect on performance, because some instructions can be fetched together with others. […] Only the meta-simulator can be trusted to give reliable information about a program’s actual behavior in practice; but such results can be difficult to interpret, because infinitely many configurations are possible. That’s why we often resort to the much simpler estimates of Table 1.
Finally, we can use Godbolt's Compiler Explorer to look at the code generated by a typical compiler for this code. (Ideally we'd look at MMIX instructions but as we can't do that, let's settle for the default there, which seems to be x68-64 gcc 8.2.) I removed all the os and oos.
For the version of the code with:
/*o*/ t = nd[cc].len - 1;
/*o*/ nd[cc].len = t;
the generated code for the first line is:
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov eax, DWORD PTR [rax]
lea r14d, [rax-1]
and for the second line is:
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov DWORD PTR [rax], r14d
For the version of the code with:
/*o ?*/ nd[cc].len --;
the generated code is:
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov eax, DWORD PTR [rax]
lea edx, [rax-1]
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov DWORD PTR [rax], edx
which as you can see (even without knowing much about x86-64 assembly) is simply the concatenation of the code generated in the former case (except for using register edx instead of r14d), so it's not as if writing the decrement in one line has saved you any mems. In particular, it would not be correct to count it as a single one, especially in something like cover that is called a huge number of times in this algorithm (dancing links for exact cover).
So the version as written by Knuth is correct, for its goal of counting the number of mems. He could also write oo,nd[cc].len--; (counting two mems) as you observed, but perhaps it might look like a bug at first glance in that case. (BTW, in your example in the question of oo,nd[k].len--,nd[k].aux=i-1; the two mems come from the load and the store in --; not two stores.)

This whole practice seems to be based on a mistaken idea/model of how C works, that there's some correspondence between the work performed by the abstract machine and by the actual program as executed (i.e. the "C is portable assembler" fallacy). I don't think we can answer a lot more about why that exact code fragment appears, except that it happens to be an unusual idiom for counting loads and stores on the abstract machine as separate.

Which one is faster?

I am using SSE2 in gcc 4.4.3. In my program, I need to use say least (0 - 7) 8-bits of a 128-bit SIMD register. Please suggest a way in which I can retrieve the 8-bits quickly.
I tried with _mm_movepi64_pi64 or _mm_extract_epi16, both of which gives similar performance in my program. I was trying with union approach also. union{__m128i a1, int a2[4]}. Though, in the test case, it gave good results, in my program, this approach was not very good.
Any ideas.. (which of the above mentioned three ways I should use?)

_mm_movepi64_pi64 moves from XMM to MMX registers. There's no way it's the right choice, unless you want to do some more SIMD in MMX registers, and your code runs out of XMM regs.
If you want the bits as an array index or something, they have to be in a GP register, in which case you want SSE4.1 _mm_extract_epi8.
If you need to stick to SSE2, this should be the fastests way to get byte 5 of xmm0:
pextrw eax, xmm0, 2
movzx eax, ah
So this should hopefully get the compiler to be efficient like that:
(uint8_t)(_mm_extract_epi16(var, n/2) >> ((n%2) * 8))
Less efficient would be a shift-by-bytes _mm_bsrli_si128 (psrldq) to put the byte you want into the low byte of an xmm reg, then movd (_mm_extract_epi16(var, 0) emits movd, not pextrw r32, xmm, 0, fortunately). This way you don't have to do anything extra if the byte you want is an odd-numbered byte that pextw would leave in the high 8 of a result. Still no easy way to use this with an index that isn't a compile-time constant.
Storing 16B to memory and loading the element you want should be fairly good. (What you'll probably get with the union approach, unless the compiler optimizes it to a pextract instruction). The compiler will use a 16B-aligned location on the stack. Thus store->load forwarding should work fine in this case, so the latency will be low. If you need two separate elements into two separate integer variables, this is probably the best choice, maybe beating multiple pextrw

Efficiency of C vs Assembler

How much faster is the following assembler code:
shl ax, 1
Versus the following C code:
num = num * 2;
How can I even find out?

Your assembly variant might be faster, might be slower. What made you think that it is necessarily faster?
On the x86 platform, there are quite a few ways to multiply something by 2. I would expect a compiler to do add ax, ax, which is intuitively more efficient than your shl because it doesn't involve a potentially stored constant ('1' in your case).
Also, for quite a long time, on a x86 platform the preferred way of multiplying things by constants was not a shift, but rather a lea operation (when possible). In the above example that would be lea eax, [eax*2]. (Multiplication by 3 would be done through lea eax, [eax*2+eax])
The belief in shift operations being somehow "faster" is a nice old story for newbies, which has virtually no relevance today. And, as usual, most of the time your compiler (if it is up-to-date) has much better knowledge about the underlying hardware platform than people with naive love for shift operations.

Is this, by any chance, an academic question? I assume you understand it is in the general category of "getting a haircut to lose weight".

If you are using GCC, ask to see the generated assembly with option -S. You may find it's the same as your assembler instruction.
To answer the original question, on Out-Of-Order processors instruction speed is measured by throughput and latency, and you would measure both using the rdtsc assembly instruction. But someone else did it for you for a lot of processors, so you don't need to bother. PDF

In most circumstances, it won't make a difference. Multiplication is fast on nearly all modern hardware. In particular, it is usually fast enough that unless you have meticulously hand-optimized code, the pipeline will hide the entirety of the latency and you will see no speed difference at all between the two cases.
You may be able to measure a performance difference on multiplies and shifts when you execute them in isolation, but there will typically not be any difference in the context of the rest of your compiled code. (As I noted, this may not hold true if the code is meticulously optimized).
Now, that said, shifts are still generally faster than multiplies, and almost any reasonable compiler will map a fixed power-of-two multiply into a shift, anyway (assuming that the semantics are actually equivalent on the target architecture).
Edit: one more thing you may want to try if you really care about this is x+x. I know of at least one architecture on which this can actually be faster than shifting, depending on the surrounding context.

If you have a decent compiler it will produce the same or similar code. The best way is to disassemble and checked the created code.

The answer depends, as you've been able to see here, upon many things. What the compiler will do with your C code depends on a lot of things. If we're talking x86-32 the following should be generally applicable.
At the basic level your C code indicates a memory variable which would require at least one instruction to multiply by two: "shl mem,1" and in such a simple case the C code will be slower.
If num is a local variable the compiler may decide to put it in a register (if it is used often enough and/or the function is small enough) and then you will have your "shl reg,1" instruction - maybe.
What instruction is fastest has everything to do with how they are implemented in the processor. Shl may not be the best choice since it affects the C and Z flags which slows it down. A few years ago the recommendation was "lea reg,[reg+reg]" (all reg are the same) because lea didn't affect any flags and there were variants such as (using the eax register on x86-32 platform as an example):
lea eax,[eax+eax] ; *2
lea eax,[eax+eax*2] ; *3
lea eax,[eax+eax*4] ; *5
lea eax,[eax+eax*8] ; *9
I don't know what is the norm today but your compiler probably will.
As for measuring search for information here on the rdtsc instruction which is the hands-down best alternative as it counts actual clock cycles.

Put them in a loop with a counter that goes so high that it runs for at least a second in the fastest case. Use your favorite timing mechanism to see how long each takes.
The assembly case should be done with inline assembly in the same C program as you use for the pure C test. Otherwise, you're not comparing apples to apples.
By the way, I think you should add a third test:
num <<= 1;
The question then is whether that does the same thing as the assembly version.

If, for your target platform, shifting left is the quickest way to multiply a number by two, then the chances are your compiler will do that when compiling the code. Look at the disassembly to check
So, for that one line, it's probably exactly the same speed. However, as you're unlikely to have a function containing just that one line, you might well find the compiler would defer the shift until the value is used, or otherwise mix it up with surrounding code, making it less clear cut. A good optimizing compiler will generally do a good job of beating poor to average hand written assembly.

If the compiler up to date now ( vc9 ) was really doing a good job it would outperform vc6 by a wide margin and this dont occur, this is why I even prefer to use VC6 for some code that run faster than code compiled in mingw with -O3 and VC9 with /Ox

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight