SIMD versions of SHLD/SHRD instructions

SIMD versions of SHLD/SHRD instructions - c

SHLD/SHRD instructions are assembly instructions to implement multiprecisions shifts.
Consider the following problem:
uint64_t array[4] = {/*something*/};
left_shift(array, 172);
right_shift(array, 172);
What is the most efficient way to implement left_shift and right_shift, two functions that operates a shift on an array of four 64-bit unsigned integer as if it was a big 256 bits unsigned integer?
Is the most efficient way of doing that is by using SHLD/SHRD instructions, or is there better (like SIMD versions) instructions on modern architecture?

In this answer I'm only going to talk about x64.
x86 has been outdated for 15 years now if you're coding in 2016 it hardly makes sense to be stuck in 2000.
All times are according to Agner Fog's instruction tables.
Intel Skylake example timings*
The shld/shrd instructions are rather slow on x64.
Even on Intel skylake they have a latency of 4 cycles and uses 4 uops meaning it uses up a lot of execution units, on older processors they're even slower.
I'm going to assume you want to shift by a variable amount, which means a
SHLD RAX,RDX,cl 4 uops, 4 cycle latency. -> 1/16 per bit
Using 2 shifts + add you can do this faster slower.
#Init:
MOV R15,-1
SHR R15,cl //mask for later use.
#Work:
SHL RAX,cl 3 uops, 2 cycle latency
ROL RDX,cl 3 uops, 2 cycle latency
AND RDX,R15 1 uops, 0.25 latency
OR RAX,RDX 1 uops, 0.25 latency
//Still needs unrolling to achieve least amount of slowness.
Note that this only shifts 64 bits because RDX is not affected.
So you're trying to beat 4 cycles per 64 bits.
//4*64 bits parallel shift.
//Shifts in zeros.
VPSLLVQ YMM2, YMM2, YMM3 1uop, 0.5 cycle latency.
However if you want it to do exactly what SHLD does you'll need to use an extra VPSLRVQ and an OR to combine the two results.
VPSLLVQ YMM1, YMM2, YMM3 1uop, 0.5 cycle latency.
VPSRLVQ YMM5, YMM2, YMM4 1uop, 0.5 cycle latency.
VPOR YMM1, YMM1, YMM5 1uop, 0.33 cycle latency.
You'll need to interleave 4 sets of these costing you (3*4)+2=14 YMM registers.
Doing so I doubt you'll profit from the low .33 latency of VPADDQ so I'll assume a 0.5 latency instead.
That makes 3uops, 1.5 cycle latency for 256 bits = 1/171 per bit = 0.37 cycle per QWord = 10x faster, not bad.
If you are able to get 1.33 cycle per 256 bits = 1/192 per bit = 0.33 cycle per QWord = 12x faster.
'It’s the Memory, Stupid!'
Obviously I've not added in loop overhead and load/stores to/from memory.
The loop overhead is tiny given proper alignment of jump targets, but the memory
access will easily be the biggest slowdown.
A single cache miss to main memory on Skylake can cost you more than 250 cycles1.
It is in clever management of memory that the major gains will be made.
The 12 times possible speed-up using AVX256 is small potatoes in comparison.
I'm not counting the set up of the shift counter in CL/(YMM3/YMM4) because I'm assuming you'll reuse that value over many iterations.
You're not going to beat that with AVX512 instructions, because consumer grade CPU's with AVX512 instructions are not yet available.
The only current processor that supports currently is Knights Landing.
*) All these timings are best case values, and should be taken as indications, not as hard values.
1) Cost of cache miss in Skylake: 42 cycles + 52ns = 42 + (52*4.6Ghz) = 281 cycles.

Related

Why in the C language loop, accessing array elements is so much slower than accessing variables？ [duplicate]

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.
What / describe the rules must be followed to achieve this?
I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)
At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.
As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
An instruction can be issued on a given cycle if:
No instructions that affect its operands are still being executed. And:
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. addps and subps use the same functional unit? How do I determine this?). And:
Fewer than the superscalar width (typically 4) number of instructions have already been issued this cycle.
If no instructions can be issued, the processor simply doesn't issue any—a condition called a "stall".
As an example, consider the following example code (which computes a cross-product):
shufps xmm3, xmm2, 210
shufps xmm0, xmm1, 201
shufps xmm2, xmm2, 201
mulps xmm0, xmm3
shufps xmm1, xmm1, 210
mulps xmm1, xmm2
subps xmm0, xmm1
My attempt to predict the latency for Haswell looks something like this:
; `mulps` Haswell latency=5, CPI=0.5
; `shufps` Haswell latency=1, CPI=1
; `subps` Haswell latency=3, CPI=1
shufps xmm3, xmm2, 210 ; cycle 1
shufps xmm0, xmm1, 201 ; cycle 2
shufps xmm2, xmm2, 201 ; cycle 3
mulps xmm0, xmm3 ; (superscalar execution)
shufps xmm1, xmm1, 210 ; cycle 4
mulps xmm1, xmm2 ; cycle 5
; cycle 6 (stall `xmm0` and `xmm1`)
; cycle 7 (stall `xmm1`)
; cycle 8 (stall `xmm1`)
subps xmm0, xmm1 ; cycle 9
; cycle 10 (stall `xmm0`)

TL:DR: look for dependency chains, especially loop-carried ones. For a long-running loop, see which latency, front-end throughput, or back-end port contention/throughput is the worst bottleneck. That's how many cycles your loop probably takes per iteration, on average, if there are no cache misses or branch mispredicts.
Latency bounds and throughput bounds for processors for operations that must occur in sequence is a good example of analyzing loop-carried dependency chains in a specific loop with two dep chains, one pulling values from the other.
Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and how what that means for sequences of multiple instructions. See also Assembly - How to score a CPU instruction by latency and throughput for how to measure a single instruction.
This is called static (performance) analysis. Wikipedia says (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) that AMD's AMD CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.
Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?
IACA is not bad, but has bugs (e.g. wrong data for shld on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).
Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.
Essential reading:
Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.
The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).
Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.
David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.
See also other performance links in the x86 tag wiki.
I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.
In Intel terminology:
"issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (With later uarches like Haswell and especially Skylake often actually coming very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, as well as back-end and cache bandwidth improvements.) This is fused-domain uops: micro-fusion lets you send 2 uops through the front-end and only take up one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.
"dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. How are x86 uops scheduled, exactly?. Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).
A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and the names of hardware performance counters like uops_issued.any or uops_dispatched_port.port_5.
exactly how long arbitrary arithmetical x86-64 assembly code will take
It depends on the surrounding code as well, because of OoO exec
Your final subps result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.
Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
There are three major dimensions to analyze for a short block
You can approximately characterize a short block of non-branching code in terms of these three factors. Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple different blocks, if they're not so long that OoO window size prevents finding all the ILP.
latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
Generally you can assuming best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen some. (How are x86 uops scheduled, exactly?)
Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run psadbw on port 0 (5c latency, 1c throughput, i.e. CPI=1) but it's a single uop so a mix of 1 psadbw + 3 add instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like imul eax, ecx, 10 (1 uop, 3c latency on Intel) vs. lea eax, [rcx + rcx*4] / add eax,eax (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.
See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
Every (?) functional unit is pipelined
The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one divps mixed in with dozens of mulps / addps, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. rcpps + a Newton iteration is worse throughput and about the same latency.
Everything else is fully pipelined on mainstream Intel CPUs; multi-cycle (reciprocal) throughput for a single uop. (variable-count integer shifts like shl eax, cl have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an add or something, you can get better throughput and latency.)
On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's imul ecx, edx is only 1 uop, but with 4c latency, 2c throughput.
Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)
No.
Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.
The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.
Even the silvermont-based Knight's Landing (in Xeon Phi) does OoO exec for SIMD.
x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)
My attempt to predict the latency for Haswell looks something like this:
Yup, that looks right. shufps runs on port 5, addps runs on p1, mulps runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).
This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015) which covers that array-of-structs vs. struct-of-arrays issues for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.

how will unrolling affect the cycles per element count CPE

how do I calculate CPE (cycles per element) with these code snippets?
what is the difference in the CPE between the 2 given code snippets?
I have this piece of code
void randomFunction(float a[],float Tb[],float c[],long int n){
int i,j,k;
for(i=0;i<n;++i)
for(j=0;j<n;++j)
for(k=0;k<n-1;k++){
temp+= a[i*n+k] * Tb[j*n+k];
}
}
This is the assembly for the inner-most loop, from GCC10.3 -O2 (https://godbolt.org/z/cWE16rW1r), for a version of the function with a local float temp=0; that it returns so the loop won't be optimized away:
.L4:
movss (%rcx,%rax), %xmm0
mulss (%rdx,%rax), %xmm0
addq $4, %rax
addss %xmm0, %xmm1
cmpq %rax, %rsi
jne .L4
Now I am trying to 'optimise it' using unrolling.
void randomUnrollingFunction(float a[],float Tb[],float c[],long int n){
int i,j,k;
for(i=0;i<n;++i)
for(j=0;j<n;++j)
for(k=0;k<n-1;k+=2){//this is the unrolled portion by 2
temp+= a[i*n+k] * Tb[j*n+k];
temp+= a[i*n+k+1]* Tb[j*n+k+1];
}
}
I am wondering what is the the estimated CPE that will be achieved with this unrolling by factor 2.
CPE is number of cycles/ number of instructions
This is the information for the latency:
Thank you for any help in advance!

Your loop entirely bottlenecks on addss latency (float add), at 3 cycles per 1 element. Out-of-order exec lets the other work run in the "shadow" of this. Assuming your Haswell-like CPU has the memory bandwidth to keep up1
This way of unrolling doesn't help at all, not changing the serial dependency chain unless you compiled with -ffast-math or something to let a compiler turn it into
temp1 += a[...+0]* Tb[...+0];
temp2 += a[...+1]* Tb[...+1];
Or add pairs before feeding them into that serial dependency, like
temp += a[]*Tb[] + a[+1]*Tb[+1];
One long serial dependency chain is the worst, and also numerically not great: pairwise summation (or especially just a step in that direction using multiple accumulators) would be numerically better and also perform better. (Simd matmul program gives different numerical results).
(If your multiple accumulators are 4 elements of on SIMD vector, you can do 4x the amount of work with the same pattern of asm instructions. But you then need to unroll with multiple vectors, because addps has the same performance characteristics as addss on modern x86 CPUs.)
Footnote 1: Two sequential read streams of 4 bytes each per 3 cycles; certainly a desktop Haswell could keep up, and probably even a Xeon competing with many other cores for memory bandwidth. But the reads from a[] probably hit in cache, because a[i*n+k] is the same row repeatedly until we move on to the next outer-loop iteration. So only 1 row of a[] has to stay hot in cache (to get hits next middle iteration) while we scan through a row of Tb. So a[] has to come in from DRAM once total, if n isn't huge, but we loop over the whole Tb[] in order n times.
More detailed version
See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - look for the dependency chains (in this case the addss into %xmm1). Also A whirlwind introduction to dataflow graphs.
Then look for throughput bottlenecks in the back-end and front-end. In your case, latency dominates. (Assuming the front-end matches this Haswell back-end, although it wouldn't take much to keep up with this back-end latency bottleneck. Also, I hate that they number their "functional units" from 1, instead of following Intel's numbering of ports 0,1,5,6 having ALUs. Haswell's FP adder is on port 1, and ports 2/3 are load/store-AGU units, etc.)
ADDSS has 3 cycle latency, so temp += ... can only execute once per 3 cycles. The load / MULSS operations just independently prepare inputs for ADDSS, and the loop overhead is also independent.
Note that if not for the latency bottleneck, your loop would bottleneck on the front-end on a real Haswell (4 fused-domain uops per cycle), not on back-end functional units. The loop is 5 fused-domain uops, assuming macro-fusion of the cmp/jne, and that Haswell can keep the memory-source addss micro-fused despite the indexed addressing mode. (Sandybridge would un-laminate it.)
In the general case, knowing the back-end functional units is not sufficient. The front-end is also a common bottleneck, especially in loops that have to store something, like an actual matmul.
But thanks to the serial dependency on ADDSS (which actually carries across outer loops), the only thing that matters is that dependency chain.
Even a branch mispredict on the last iteration of the inner loop (when the branch is not-taken instead of normally taken), that will just give the back end more time to chew through those pending ADDSS operations while the front-end sorts itself out and starts in on the next inner loop.
Since you unrolled in a way that doesn't change the serial dependency, it makes zero difference to performance except for tiny n. (For tiny n, the whole thing can potentially overlap some of this work with independent work in the caller before/after the call. In that case, saving instructions could be helpful, also allowing out-of-order exec to "see farther". Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths is a case where OoO exec can (partially) overlap two independent imul dep chains that in program-order are one after the other.
Of course at that point, you're considering code outside what you're showing. And even for n=10, that's 10^3 = 1000 inner iterations, and Haswell's ROB is only 192 uops large, with an RS of 60 entries. (https://www.realworldtech.com/haswell-cpu/3/).
Unrolling in a useful way
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) re: unrolling in ways that do create more instruction-level parallelism, hiding FP dependency chains.
Unrolling differently, only summing once into temp per loop iteration, will keep the same cycles per iteration while still doubling the number of elements you process.
for(k=0;k<n-1;k+=2){//this is the unrolled portion by 2
temp += (a[i*n+k] * Tb[j*n+k] +
a[i*n+k+1]* Tb[j*n+k+1]);
}
Obviously you can keep doing this until you run into front-end or back-end throughput limits, like one add per clock. The above version does two adds per 3 clocks.
Your "functional unit" table doesn't list FMA (fused multiply-add), but real Haswell has it, with identical performance to FP mul. Iit wouldn't help much if any because your current loop structure does 2 loads per mul+add, so reducing that to 2 loads and one FMA would still bottleneck on loads. Might help with front-end throughput?
What might help reduce loads is unrolling over one of the outer loops, using both a[i*n+k] and a[(i+1)*n+k] with one Tb[j*n+k]. That of course changes the order of the calculation, so isn't legal for a compiler without -ffast-math because FP math isn't strictly associative.
This is a reduction of a matmul, allowing much better optimization
(err wait, your code didn't show where temp was re-initialized, or what the c[] arg was for. I just assumed it was global or something, but probably you actually butchered a normal matmul function that stores a separate temp after every inner iteration to c[] by taking that out. In that case, OoO exec between separate middle loop iteration is relevant for medium-sized n. But you don't show the scheduler / ROB sizes, and that's not something you can easily model; you need to actually benchmark. So this section is probably only applicable to the question I invented, not what you meant to ask!)
Your loops appear to be summing the elements of a matmul result, but still structured like a matmul. i.e. do a row x column dot product, but instead of storing that into an N x N result[...] matrix, you're just summing the result.
That amounts to summing every pairwise product of elements between two matrices. Since we don't need to keep separate row x column dot products separate anymore, that enables a lot of optimizations! (There's nothing special or ideal about this order of summation; other orders will have different but likely not worse rounding error.)
For example, you don't need to transpose b into Tb, just use b (unless it was naturally already transposed, in which case that's fine). Your matrices are square so it doesn't matter at all.
Furthermore, you can simply load one or a couple elements from a[] and loop over Tb[], doing those products with FMA operations, one load per FMA, into 10 or 12 vector accumulators. (Or of course cache-block this to loop over a contiguous part of Tb that can stay hot in L1d cache.)
That could approach Haswell's max FLOPS throughput of 2x 256-bit FMA per clock = 8 (float elements per YMM vector) x 2 FMAs / clock x 2 FLOP/FMA = 32 FLOP / clock.

Why does the latency of the sqrtsd instruction change based on the input? Intel processors

Well on the Intel intrinsic guide it is stated that the instruction called "sqrtsd" has a latency of 18 cycles.
I tested it with my own program and it is correct if, for example, we take 0.15 as input. But when we take 256 (or any 2^x) number then the latency is only 13. Why is that?
One theory I had is that since 13 is the latency of "sqrtss" which is the same as "sqrtsd" but done on 32bits floating points then maybe the processor was smart enough to understand taht 256 can fit in 32 bit and hence use that version while 0.15 needs the full 64 bit since it isn't representable in a finite way.
I am doing it using inline assembly, here is the relveant part compiled with gcc -O3 and -fno-tree-vectorize.
static double sqrtsd (double x) {
double r;
__asm__ ("sqrtsd %1, %0" : "=x" (r) : "x" (x));
return r;
}

SQRT* and DIV* are the only two "simple" ALU instructions (single uop, not microcoded branching / looping) that have data-dependent throughput or latency on modern Intel/AMD CPUs. (Not counting microcode assists for denormal aka subnormal FP values in add/multiply/fma). Everything else is pretty much fixed so the out-of-order uop scheduling machinery doesn't need to wait for confirmation that a result was ready some cycle, it just knows it will be.
As usual, Intel's intrinsics guide gives an over-simplified picture of performance. The actual latency isn't a fixed 18 cycles for double-precision on Skylake. (Based on the numbers you chose to quote, I assume you have a Skylake.)
div/sqrt are hard to implement; even in hardware the best we can do is an iterative refinement process. Refining more bits at once (radix-1024 divider since Broadwell) speeds it up (see this Q&A about the hardware). But it's still slow enough that an early-out is used to speed up simple cases (Or maybe the speedup mechanism is just skipping a setup step for all-zero mantissas on modern CPUs with partially-pipelined div/sqrt units. Older CPUs had throughput=latency for FP div/sqrt; that execution unit is harder to pipeline.)
https://www.uops.info/html-instr/VSQRTSD_XMM_XMM_XMM.html shows Skylake SQRTSD can vary from 13 to 19 cycle latency. The SKL (client) numbers only show 13 cycle latency, but we can see from the detailed SKL vsqrtsd page that they only tested with input = 0. SKX (server) numbers show 13-19 cycle latency. (This page has the detailed breakdown of the test code they used, including the binary bit-patterns for the tests.) Similar testing (with only 0 for client cores) was done on the non-VEX sqrtsd xmm, xmm page. :/
InstLatx64 results show best / worst case latencies of 13 to 18 cycles on Skylake-X (which uses the same core as Skylake-client, but with AVX512 enabled).
Agner Fog's instruction tables show 15-16 cycle latency on Skylake. (Agner does normally test with a range of different input values.) His tests are less automated and sometimes don't exactly match other results.
What makes some cases fast?
Note that most ISAs (including x86) use binary floating point:
the bits represent values as a linear significand (aka mantissa) times 2exp, and a sign bit.
It seems that there may only be 2 speeds on modern Intel (since Haswell at least) (See discussion with #harold in comments.) e.g. even powers of 2 are all fast, like 0.25, 1, 4, and 16. These have trivial mantissa=0x0 representing 1.0. https://www.h-schmidt.net/FloatConverter/IEEE754.html has a nice interactive decimal <-> bit-pattern converter for single-precision, with checkboxes for the set bits and annotations of what the mantissa and exponent represent.
On Skylake the only fast cases I've found in a quick check are even powers of 2 like 4.0 but not 2.0. These numbers have an exact sqrt result with both input and output having a 1.0 mantissa (only the implicit 1 bit set). 9.0 is not fast, even though it's exactly representable and so is the 3.0 result. 3.0 has mantissa = 1.5 with just the most significant bit of the mantissa set in the binary representation. 9.0's mantissa is 1.125 (0b00100...). So the non-zero bits are very close to the top, but apparently that's enough to disqualify it.
(+-Inf and NaN are fast, too. So are ordinary negative numbers: result = -NaN. I measure 13 cycle latency for these on i7-6700k, same as for 4.0. vs. 18 cycle latency for the slow case.)
x = sqrt(x) is definitely fast with x = 1.0 (all-zero mantissa except for the implicit leading 1 bit). It has a simple input and simple output.
With 2.0 the input is also simple (all-zero mantissa and exponent 1 higher) but the output is not a round number. sqrt(2) is irrational and thus has infinite non-zero bits in any base. This apparently makes it slow on Skylake.
Agner Fog's instruction tables say that AMD K10's integer div instruction performance depends on the number of significant bits in the dividend (input), not the quotient, but searching Agner's microarch pdf and instruction tables didn't find any footnotes or info about how sqrt specifically is data-dependent.
On older CPUs with even slower FP sqrt, there might be more room for a range of speeds. I think number of significant bits in the mantissa of the input will probably be relevant. Fewer significant bits (more trailing zeros in the significand) makes it faster, if this is correct. But again, on Haswell/Skylake the only fast cases seem to be even powers of 2.
You can test this with something that couples the output back to the input without breaking the data dependency, e.g. andps xmm0, xmm1 / orps xmm0, xmm2 to set a fixed value in xmm0 that's dependent on the sqrtsd output.
Or a simpler way to test latency is to take "advantage" of the false output dependency of sqrtsd xmm0, xmm1 - it and sqrtss leave the upper 64 / 32 bits (respectively) of the destination unmodified, thus the output register is also an input for that merging. I assume this is how your naive inline-asm attempt ended up bottlenecking on latency instead of throughput with the compiler picking a different register for the output so it could just re-read the same input in a loop. The inline asm you added to your question is totally broken and won't even compile, but perhaps your real code used "x" (xmm register) input and output constraints instead of "i" (immediate)?
This NASM source for a static executable test loop (to run under perf stat) uses that false dependency with the non-VEX encoding of sqrtsd.
This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency). AVX finally lets us avoid this with vsqrtsd dst, src,src at least for register sources, and similarly vcvtsi2sd dst, cold_reg, eax for the similarly near-sightedly designed scalar int->fp conversion instructions. (GCC missed-optimization reports: 80586, 89071, 80571.)
On many earlier CPUs even throughput was variable, but Skylake beefed up the dividers enough that the scheduler always knows it can start a new div/sqrt uop 3 cycles after the last single-precision input.
Even Skylake double-precision throughput is variable, though: 4 to 6 cycles after the last double-precision input uop, if Agner Fog's instruction tables are right.
https://uops.info/ shows a flat 6c reciprocal throughput. (Or twice that long for 256-bit vectors; 128-bit and scalar can use separate halves of the wide SIMD dividers for more throughput but the same latency.) See also Floating point division vs floating point multiplication for some throughput/latency numbers extracted from Agner Fog's instruction tables.

What's the difference between logical SSE intrinsics?

Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands. My questions:
Is there any difference between using one or another intrinsic (with appropriate type casting). Won't there be any hidden costs like longer execution in some specific situation?
These intrinsics maps to three different x86 instructions (por, orps, orpd). Does anyone have any ideas why Intel is wasting precious opcode space for several instructions which do the same thing?

Is there any difference between using one or another intrinsic (with appropriate type casting). Won't there be any hidden costs like longer execution in some specific situation?
Yes, there can be performance reasons to choose one vs. the other.
1: Sometimes there is an extra cycle or two of latency (forwarding delay) if the output of an integer execution unit needs to be routed to the input of an FP execution unit, or vice versa. It takes a LOT of wires to move 128b of data to any of many possible destinations, so CPU designers have to make tradeoffs, like only having a direct path from every FP output to every FP input, not to ALL possible inputs.
See this answer, or Agner Fog's microarchitecture doc for bypass-delays. Search for "Data bypass delays on Nehalem" in Agner's doc; it has some good practical examples and discussion. He has a section on it for every microarch he has analysed.
However, the delays for passing data between the
different domains or different types of registers are smaller on the
Sandy Bridge and Ivy Bridge than on the Nehalem, and often zero. --
Agner Fog's micro arch doc
Remember that latency doesn't matter if it isn't on the critical path of your code (except sometimes on Haswell/Skylake where it infects later use of the produced value, long after actual bypass :/). Using pshufd instead of movaps + shufps can be a win if uop throughput is your bottleneck, rather than latency of your critical path.
2: The ...ps version takes 1 fewer byte of code than the other two for legacy-SSE encoding. (Not AVX). This will align the following instructions differently, which can matter for the decoders and/or uop cache lines. Generally smaller is better for better code density in I-cache and fetching code from RAM, and packing into the uop cache.
3: Recent Intel CPUs can only run the FP versions on port5.
Merom (Core2) and Penryn: orps can run on p0/p1/p5, but integer-domain only. Presumably all 3 versions decoded into the exact same uop. So the cross-domain forwarding delay happens. (AMD CPUs do this too: FP bitwise instructions run in the ivec domain.)
Nehalem / Sandybridge / IvB / Haswell / Broadwell: por can run on p0/p1/p5, but orps can run only on port5. p5 is also needed by shuffles, but the FMA, FP add, and FP mul units are on ports 0/1.
Skylake: por and orps both have 3-per-cycle throughput. Intel's optimization manual has some info about bypass forwarding delays: to/from FP instructions it depends on which port the uop ran on. (Usually still port 5 because the FP add/mul/fma units are on ports 0 and 1.) See also Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says - "bypass" latency can affect every use of the register until it's overwritten.
Note that on SnB/IvB (AVX but not AVX2), only p5 needs to handle 256b logical ops, as vpor ymm, ymm requires AVX2. This was probably not the reason for the change, since Nehalem did this.
How to choose wisely:
Keep in mind that compilers can use por for _mm_or_pd if they want, so some of this applies mostly to hand-written asm. But some compilers are somewhat faithful to the intrinsics you choose.
If logical op throughput on port5 could be a bottleneck, then use the integer versions, even on FP data. This is especially true if you want to use integer shuffles or other data-movement instructions.
AMD CPUs always use the integer domain for logicals, so if you have multiple integer-domain things to do, do them all at once to minimize round-trips between domains. Shorter latencies will get things cleared out of the reorder buffer faster, even if a dep chain isn't the bottleneck for your code.
If you just want to set/clear/flip a bit in FP vectors between FP add and mul instructions, use the ...ps logicals, even on double-precision data, because single and double FP are the same domain on every CPU in existence, and the ...ps versions are one byte shorter (without AVX).
There are practical / human-factor reasons for using the ...pd versions, though, with intrinsics. Readability of your code by other humans is a factor: They'll wonder why you're treating your data as singles when it's actually doubles. For C/C++ intrinsics, littering your code with casts between __m128 and __m128d is not worth it. (And hopefully a compiler will use orps for _mm_or_pd anyway, if compiling without AVX where it will actually save a byte.)
If tuning on the level of insn alignment matters, write in asm directly, not intrinsics! (Having the instruction one byte longer might align things better for uop cache line density and/or decoders, but with prefixes and addressing modes you can extend instructions in general)
For integer data, use the integer versions. Saving one instruction byte isn't worth the bypass-delay between paddd or whatever, and integer code often keeps port5 fully occupied with shuffles. For Haswell, many shuffle / insert / extract / pack / unpack instructions became p5 only, instead of p1/p5 for SnB/IvB. (Ice Lake finally added a shuffle unit on another port for some more common shuffles.)
These intrinsics maps to three different x86 instructions (por, orps,
orpd). Does anyone have any ideas why Intel is wasting precious opcode
space for several instructions which do the same thing?
If you look at the history of these instruction sets, you can kind of see how we got here.
por (MMX): 0F EB /r
orps (SSE): 0F 56 /r
orpd (SSE2): 66 0F 56 /r
por (SSE2): 66 0F EB /r
MMX existed before SSE, so it looks like opcodes for SSE (...ps) instructions were chosen out of the same 0F xx space. Then for SSE2, the ...pd version added a 66 operand-size prefix to the ...ps opcode, and the integer version added a 66 prefix to the MMX version.
They could have left out orpd and/or por, but they didn't. Perhaps they thought that future CPU designs might have longer forwarding paths between different domains, and so using the matching instruction for your data would be a bigger deal. Even though there are separate opcodes, AMD and early Intel treated them all the same, as int-vector.
Related / near duplicate:
What is the point of SSE2 instructions such as orpd? also summarizes the history. (But I wrote it 5 years later.)
Difference between the AVX instructions vxorpd and vpxor
Does using mix of pxor and xorps affect performance?

According to Intel and AMD optimization guidelines mixing op types with data types produces a performance hit as the CPU internally tags 64 bit halves of the register for a particular data type. This seems to mostly effect pipe-lining as the instruction is decoded and the uops are scheduled. Functionally they produce the same result. The newer versions for the integer data types have larger encoding and take up more space in the code segment. So if code size is a problem use the old ops as these have smaller encoding.

I think all three are effectively the same, i.e. 128 bit bitwise operations. The reason different forms exist is probably historical, but I'm not certain. I guess it's possible that there may be some additional behaviour in the floating point versions, e.g. when there are NaNs, but this is pure guesswork. For normal inputs the instructions seem to be interchangeable, e.g.
#include <stdio.h>
#include <emmintrin.h>
#include <pmmintrin.h>
#include <xmmintrin.h>
int main(void)
{
__m128i a = _mm_set1_epi32(1);
__m128i b = _mm_set1_epi32(2);
__m128i c = _mm_or_si128(a, b);
__m128 x = _mm_set1_ps(1.25f);
__m128 y = _mm_set1_ps(1.5f);
__m128 z = _mm_or_ps(x, y);
printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
c = (__m128i)_mm_or_ps((__m128)a, (__m128)b);
z = (__m128)_mm_or_si128((__m128i)x, (__m128i)y);
printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
return 0;
}
Terminal:
$ gcc -Wall -msse3 por.c -o por
$ ./por
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000

Very fast memcpy for image processing?

I am doing image processing in C that requires copying large chunks of data around memory - the source and destination never overlap.
What is the absolute fastest way to do this on the x86 platform using GCC (where SSE, SSE2 but NOT SSE3 are available)?
I expect the solution will either be in assembly or using GCC intrinsics?
I found the following link but have no idea whether it's the best way to go about it (the author also says it has a few bugs): http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2006-02/msg00123.html
EDIT: note that a copy is necessary, I cannot get around having to copy the data (I could explain why but I'll spare you the explanation :))

Courtesy of William Chan and Google. 30-70% faster than memcpy in Microsoft Visual Studio 2005.
void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{
__asm
{
mov esi, src; //src pointer
mov edi, dest; //dest pointer
mov ebx, size; //ebx is our counter
shr ebx, 7; //divide by 128 (8 * 128bit registers)
loop_copy:
prefetchnta 128[ESI]; //SSE2 prefetch
prefetchnta 160[ESI];
prefetchnta 192[ESI];
prefetchnta 224[ESI];
movdqa xmm0, 0[ESI]; //move data from src to registers
movdqa xmm1, 16[ESI];
movdqa xmm2, 32[ESI];
movdqa xmm3, 48[ESI];
movdqa xmm4, 64[ESI];
movdqa xmm5, 80[ESI];
movdqa xmm6, 96[ESI];
movdqa xmm7, 112[ESI];
movntdq 0[EDI], xmm0; //move data from registers to dest
movntdq 16[EDI], xmm1;
movntdq 32[EDI], xmm2;
movntdq 48[EDI], xmm3;
movntdq 64[EDI], xmm4;
movntdq 80[EDI], xmm5;
movntdq 96[EDI], xmm6;
movntdq 112[EDI], xmm7;
add esi, 128;
add edi, 128;
dec ebx;
jnz loop_copy; //loop please
loop_copy_end:
}
}
You may be able to optimize it further depending on your exact situation and any assumptions you are able to make.
You may also want to check out the memcpy source (memcpy.asm) and strip out its special case handling. It may be possible to optimise further!

The SSE-Code posted by hapalibashi is the way to go.
If you need even more performance and don't shy away from the long and winding road of writing a device-driver: All important platforms nowadays have a DMA-controller that is capable of doing a copy-job faster and in parallel to CPU code could do.
That involves writing a driver though. No big OS that I'm aware of exposes this functionality to the user-side because of the security risks.
However, it may be worth it (if you need the performance) since no code on earth could outperform a piece of hardware that is designed to do such a job.

This question is four years old now and I'm a little surprised nobody has mentioned memory bandwidth yet. CPU-Z reports that my machine has PC3-10700 RAM. That the RAM has a peak bandwidth (aka transfer rate, throughput etc) of 10700 MBytes/sec. The CPU in my machine is an i5-2430M CPU, with peak turbo frequency of 3 GHz.
Theoretically, with an infinitely fast CPU and my RAM, memcpy could go at 5300 MBytes/sec, ie half of 10700 because memcpy has to read from and then write to RAM. (edit: As v.oddou pointed out, this is a simplistic approximation).
On the other hand, imagine we had infinitely fast RAM and a realistic CPU, what could we achieve? Let's use my 3 GHz CPU as an example. If it could do a 32-bit read and a 32-bit write each cycle, then it could transfer 3e9 * 4 = 12000 MBytes/sec. This seems easily within reach for a modern CPU. Already, we can see that the code running on the CPU isn't really the bottleneck. This is one of the reasons that modern machines have data caches.
We can measure what the CPU can really do by benchmarking memcpy when we know the data is cached. Doing this accurately is fiddly. I made a simple app that wrote random numbers into an array, memcpy'd them to another array, then checksumed the copied data. I stepped through the code in the debugger to make sure that the clever compiler had not removed the copy. Altering the size of the array alters the cache performance - small arrays fit in the cache, big ones less so. I got the following results:
40 KByte arrays: 16000 MBytes/sec
400 KByte arrays: 11000 MBytes/sec
4000 KByte arrays: 3100 MBytes/sec
Obviously, my CPU can read and write more than 32 bits per cycle, since 16000 is more than the 12000 I calculated theoretically above. This means the CPU is even less of a bottleneck than I already thought. I used Visual Studio 2005, and stepping into the standard memcpy implementation, I can see that it uses the movqda instruction on my machine. I guess this can read and write 64 bits per cycle.
The nice code hapalibashi posted achieves 4200 MBytes/sec on my machine - about 40% faster than the VS 2005 implementation. I guess it is faster because it uses the prefetch instruction to improve cache performance.
In summary, the code running on the CPU isn't the bottleneck and tuning that code will only make small improvements.

At any optimisation level of -O1 or above, GCC will use builtin definitions for functions like memcpy - with the right -march parameter (-march=pentium4 for the set of features you mention) it should generate pretty optimal architecture-specific inline code.
I'd benchmark it and see what comes out.

If specific to Intel processors, you might benefit from IPP. If you know it will run with an Nvidia GPU perhaps you could use CUDA - in both cases it may be better to look wider than optimising memcpy() - they provide opportunities for improving your algorithm at a higher level. They are both however reliant on specific hardware.

If you're on Windows, use the DirectX APIs, which has specific GPU-optimized routines for graphics handling (how fast could it be? Your CPU isn't loaded. Do something else while the GPU munches it).
If you want to be OS agnostic, try OpenGL.
Do not fiddle with assembler, because it is all too likely that you'll fail miserably to outperform 10 year+ proficient library-making software engineers.

If you have access to a DMA engine, nothing will be faster.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight