What's the difference between logical SSE intrinsics?

What's the difference between logical SSE intrinsics? - c

Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands. My questions:
Is there any difference between using one or another intrinsic (with appropriate type casting). Won't there be any hidden costs like longer execution in some specific situation?
These intrinsics maps to three different x86 instructions (por, orps, orpd). Does anyone have any ideas why Intel is wasting precious opcode space for several instructions which do the same thing?

Is there any difference between using one or another intrinsic (with appropriate type casting). Won't there be any hidden costs like longer execution in some specific situation?
Yes, there can be performance reasons to choose one vs. the other.
1: Sometimes there is an extra cycle or two of latency (forwarding delay) if the output of an integer execution unit needs to be routed to the input of an FP execution unit, or vice versa. It takes a LOT of wires to move 128b of data to any of many possible destinations, so CPU designers have to make tradeoffs, like only having a direct path from every FP output to every FP input, not to ALL possible inputs.
See this answer, or Agner Fog's microarchitecture doc for bypass-delays. Search for "Data bypass delays on Nehalem" in Agner's doc; it has some good practical examples and discussion. He has a section on it for every microarch he has analysed.
However, the delays for passing data between the
different domains or different types of registers are smaller on the
Sandy Bridge and Ivy Bridge than on the Nehalem, and often zero. --
Agner Fog's micro arch doc
Remember that latency doesn't matter if it isn't on the critical path of your code (except sometimes on Haswell/Skylake where it infects later use of the produced value, long after actual bypass :/). Using pshufd instead of movaps + shufps can be a win if uop throughput is your bottleneck, rather than latency of your critical path.
2: The ...ps version takes 1 fewer byte of code than the other two for legacy-SSE encoding. (Not AVX). This will align the following instructions differently, which can matter for the decoders and/or uop cache lines. Generally smaller is better for better code density in I-cache and fetching code from RAM, and packing into the uop cache.
3: Recent Intel CPUs can only run the FP versions on port5.
Merom (Core2) and Penryn: orps can run on p0/p1/p5, but integer-domain only. Presumably all 3 versions decoded into the exact same uop. So the cross-domain forwarding delay happens. (AMD CPUs do this too: FP bitwise instructions run in the ivec domain.)
Nehalem / Sandybridge / IvB / Haswell / Broadwell: por can run on p0/p1/p5, but orps can run only on port5. p5 is also needed by shuffles, but the FMA, FP add, and FP mul units are on ports 0/1.
Skylake: por and orps both have 3-per-cycle throughput. Intel's optimization manual has some info about bypass forwarding delays: to/from FP instructions it depends on which port the uop ran on. (Usually still port 5 because the FP add/mul/fma units are on ports 0 and 1.) See also Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says - "bypass" latency can affect every use of the register until it's overwritten.
Note that on SnB/IvB (AVX but not AVX2), only p5 needs to handle 256b logical ops, as vpor ymm, ymm requires AVX2. This was probably not the reason for the change, since Nehalem did this.
How to choose wisely:
Keep in mind that compilers can use por for _mm_or_pd if they want, so some of this applies mostly to hand-written asm. But some compilers are somewhat faithful to the intrinsics you choose.
If logical op throughput on port5 could be a bottleneck, then use the integer versions, even on FP data. This is especially true if you want to use integer shuffles or other data-movement instructions.
AMD CPUs always use the integer domain for logicals, so if you have multiple integer-domain things to do, do them all at once to minimize round-trips between domains. Shorter latencies will get things cleared out of the reorder buffer faster, even if a dep chain isn't the bottleneck for your code.
If you just want to set/clear/flip a bit in FP vectors between FP add and mul instructions, use the ...ps logicals, even on double-precision data, because single and double FP are the same domain on every CPU in existence, and the ...ps versions are one byte shorter (without AVX).
There are practical / human-factor reasons for using the ...pd versions, though, with intrinsics. Readability of your code by other humans is a factor: They'll wonder why you're treating your data as singles when it's actually doubles. For C/C++ intrinsics, littering your code with casts between __m128 and __m128d is not worth it. (And hopefully a compiler will use orps for _mm_or_pd anyway, if compiling without AVX where it will actually save a byte.)
If tuning on the level of insn alignment matters, write in asm directly, not intrinsics! (Having the instruction one byte longer might align things better for uop cache line density and/or decoders, but with prefixes and addressing modes you can extend instructions in general)
For integer data, use the integer versions. Saving one instruction byte isn't worth the bypass-delay between paddd or whatever, and integer code often keeps port5 fully occupied with shuffles. For Haswell, many shuffle / insert / extract / pack / unpack instructions became p5 only, instead of p1/p5 for SnB/IvB. (Ice Lake finally added a shuffle unit on another port for some more common shuffles.)
These intrinsics maps to three different x86 instructions (por, orps,
orpd). Does anyone have any ideas why Intel is wasting precious opcode
space for several instructions which do the same thing?
If you look at the history of these instruction sets, you can kind of see how we got here.
por (MMX): 0F EB /r
orps (SSE): 0F 56 /r
orpd (SSE2): 66 0F 56 /r
por (SSE2): 66 0F EB /r
MMX existed before SSE, so it looks like opcodes for SSE (...ps) instructions were chosen out of the same 0F xx space. Then for SSE2, the ...pd version added a 66 operand-size prefix to the ...ps opcode, and the integer version added a 66 prefix to the MMX version.
They could have left out orpd and/or por, but they didn't. Perhaps they thought that future CPU designs might have longer forwarding paths between different domains, and so using the matching instruction for your data would be a bigger deal. Even though there are separate opcodes, AMD and early Intel treated them all the same, as int-vector.
Related / near duplicate:
What is the point of SSE2 instructions such as orpd? also summarizes the history. (But I wrote it 5 years later.)
Difference between the AVX instructions vxorpd and vpxor
Does using mix of pxor and xorps affect performance?

According to Intel and AMD optimization guidelines mixing op types with data types produces a performance hit as the CPU internally tags 64 bit halves of the register for a particular data type. This seems to mostly effect pipe-lining as the instruction is decoded and the uops are scheduled. Functionally they produce the same result. The newer versions for the integer data types have larger encoding and take up more space in the code segment. So if code size is a problem use the old ops as these have smaller encoding.

I think all three are effectively the same, i.e. 128 bit bitwise operations. The reason different forms exist is probably historical, but I'm not certain. I guess it's possible that there may be some additional behaviour in the floating point versions, e.g. when there are NaNs, but this is pure guesswork. For normal inputs the instructions seem to be interchangeable, e.g.
#include <stdio.h>
#include <emmintrin.h>
#include <pmmintrin.h>
#include <xmmintrin.h>
int main(void)
{
__m128i a = _mm_set1_epi32(1);
__m128i b = _mm_set1_epi32(2);
__m128i c = _mm_or_si128(a, b);
__m128 x = _mm_set1_ps(1.25f);
__m128 y = _mm_set1_ps(1.5f);
__m128 z = _mm_or_ps(x, y);
printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
c = (__m128i)_mm_or_ps((__m128)a, (__m128)b);
z = (__m128)_mm_or_si128((__m128i)x, (__m128i)y);
printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
return 0;
}
Terminal:
$ gcc -Wall -msse3 por.c -o por
$ ./por
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000

Related

How to best emulate the logical meaning of _mm_slli_si128 (128-bit bit-shift), not _mm_bslli_si128

Looking through the intel intrinsics guide, I saw this instruction. Looking through the naming pattern, the meaning should be clear: "Shift 128-bit register left by a fixed number of bits", but it is not. In actuality it shifts by a fixed number of bytes, which makes it exactly the same as _mm_bslli_si128.
Is this an oversight? Shouldn't it be shifting by bits like _mm_slli_epi32 or _mm_slli_epi64?
If not, in which situation should I use this over _mm_bslli_si128?
Is there an assembly instruction which does this correctly?
What is the best way of emulating this with smaller shifts?

1 that’s not an oversight. That instruction indeed shifts by bytes, i.e. multiples of 8 bits.
2 doesn’t matter, _mm_slli_si128 and _mm_bslli_si128 are equivalents, both compile into pslldq SSE2 instruction.
As for the emulation, I’d do it like that, assuming you have C++/17. If you’re writing C++/14, replace if constexpr with normal if, also add a message to the static_assert.
template<int i>
inline __m128i shiftLeftBits( __m128i vec )
{
static_assert( i >= 0 && i < 128 );
// Handle couple trivial cases
if constexpr( 0 == i )
return vec;
if constexpr( 0 == ( i % 8 ) )
return _mm_slli_si128( vec, i / 8 );
if constexpr( i > 64 )
{
// Shifting by more than 8 bytes, the lowest half will be all zeros
vec = _mm_slli_si128( vec, 8 );
return _mm_slli_epi64( vec, i - 64 );
}
else
{
// Shifting by less than 8 bytes.
// Need to propagate a few bits across 64-bit lanes.
__m128i low = _mm_slli_si128( vec, 8 );
__m128i high = _mm_slli_epi64( vec, i );
low = _mm_srli_epi64( low, 64 - i );
return _mm_or_si128( low, high );
}
}

TL:DR: They're synonyms; the bslli name is newer, introduced around the same time as new AVX-512 intrinsics (sometime before 2015, long after SSE2 _mm_slli_si128 was in widespread usage). I find it clearer and would recommend it for new development.
SSE/AVX2/AVX-512 do not have bit-shifts with element sizes wider than 64. (Or any other bit-granularity operation like add, except pure-vertical bitwise boolean stuff that's really 128 fully separate operations, not one big wide one. Or for AVX-512 masking and broadcast-load purposes, can be in dword or qword chunks like _mm512_xor_epi32 / vpxord)
You have to emulate it somehow, which can be fairly efficient for compile-time-constant counts so you can pick between strategies according to c >= 64, with special cases for c%8 reducing to a byte-shift. Existing SO Q&As cover that, or see #Soonts' answer on this Q.
Runtime-variable counts would suck; you'd have to branch or do both ways and blend, unlike for element bit-shifts where _mm_sll_epi64(v, _mm_cvtsi32_si128(i)) can compile to movd / psllq xmm, xmm. Unfortunately, hardware variable-count versions of byte-shuffle/shift instructions don't exist, only for the bit-shift versions.
bslli / bsrli are new, clearer intrinsic names for the same asm instructions
The b names are supported in current version of all 4 major compilers for x86 (Godbolt), and I'd recommend them for new development unless you need backwards compat with crusty old compilers, or for some reason you like the old name that doesn't both to distinguish it from different operations. (e.g. familiarity; if you don't want people to have to look up this newfangled name in the manual.)
gcc since 4.8
clang since 3.7
ICC since ICC13 or earlier, Godbolt doesn't have any older
MSVC since 19.14 or earlier, Godbolt doesn't have any older
If you check the intrinsics guide, _mm_slli_si128 is listed as an intrinsic for PSLLDQ, which is a byte shift. This is not a bug, just Intel's idea of a joke, or whatever process they used to choose names for intrinsics back in the SSE2 days. (There are only 2 hard problems in computer science: cache invalidation and naming things).
Asm mnemonics also use the same pattern of not making the byte-shuffle one look different from the bit-shifts. psllw xmm, 1 / pslld / psllq / pslldq. Again, you just have to know that 128-bit size is special, and must be a byte shuffle not a bit-shift, because x86 never has that. (Or you have to check the manual.)
The asm manual entry for pslldq in turn lists intrinsics for forms of it, interestingly only using the b name for the __m512i AVX-512BW version. When SSE2 and AVX2 were new, _mm_slli_si128 and _mm256_slli_si256 were the only names available, I think. Certainly it post-dates SSE2 intrinsics.
(Note that the si256 and si512 versions are just 2 or 4 copies of the 16-byte operation, not shifting bytes across 128-bit lanes; something a few other Q&As have asked for. This often makes AVX2 versions of shuffles like this and palignr a lot less useful than they'd otherwise be: either not worth using at all, or needing extra shuffles on top of it.)
I think this new bslli name was introduced when AVX-512 was new. Intel invented some new names for other intrinsics around that time, and the AVX-512 load/store intrinsics take void* instead of __m512i*, which is a major improvement to amount of noise in code, especially for C where implicit conversion to void* is allowed. (Creating a misaligned __m512i* is not actually a problem in C terms, but you couldn't deref it normally so it's a weird-looking thing to do.) So there was cleanup work happening on intrinsic naming then, and I think this was part of it.
(AVX-512 also gave Intel the chance to introduce some fairly bad names, like _mm_loadu_epi32(const void*) - you'd guess that's a strict-aliasing-safe way to do a 32-bit movd load, right? No, unfortunately, it's an intrinsic for vmovdqu32 xmm, [mem] with no masking. It's just _mm_loadu_si128 with a different C type for the pointer arg. It's there for consistency with the naming pattern for _mm_maskz_loadu_epi32. It would be nice to have void* load / store intrinsics for __m128i and __m256i, but if they have misleading names like that (esp. when you aren't using the mask/maskz versions in nearby code), I'll just stick to those cumbersome _mm256_loadu_si256( (const __m256i*)(arr + i) ) casts for the old intrinsic, because I love typing 256 three times. >.<
I wish asm was more maintainable (or that intrinsics just used asm mnemonics) because it's much more concise; Intel generally does a good job naming their mnemonics.
It somewhat but not entirely helps to note the difference between epi16/32/64 and si128: EPI = Extended (SSE instead of MMX) Packed Integer. (Packed implying multiple SIMD elements). si128 means a whole 128-bit integer vector.
There's no way to infer from the name that you aren't just doing the same thing to a single 128-bit integer, instead of packed elements. You just have to know that there are no bit-granularity things that ever cross 64-bit boundaries, only SIMD shuffles (which work in terms of bytes). This avoids the combinatorial explosion of building a really wide barrel shifter, or of carry propagation at such a long distance for a 128-bit add, or whatever.

find nan in array of doubles using simd

This question is very similar to:
SIMD instructions for floating point equality comparison (with NaN == NaN)
Although that question focused on 128 bit vectors and had requirements about identifying +0 and -0.
I had a feeling I might be able to get this one myself but the intel intrinsics guide page seems to be down :/
My goal is to take an array of doubles and to return whether a NaN is present in the array. I am expecting that the majority of the time that there won't be one, and would like that route to have the best performance.
Initially I was going to do a comparison of 4 doubles to themselves, mirroring the non-SIMD approach for NaN detection (i.e. NaN only value where a != a is true). Something like:
data *double = ...
__m256d a, b;
int temp = 0;
//This bit would be in a loop over the array
//I'd probably put a sentinel in and loop over while !temp
a = _mm256_loadu_pd(data);
b = _mm256_cmp_pd(a, a, _CMP_NEQ_UQ);
temp = temp | _mm256_movemask_pd(b);
However, in some of the examples of comparison it looks like there is some sort of NaN detection already going on in addition to the comparison itself. I briefly thought, well if something like _CMP_EQ_UQ will detect NaNs, I can just use that and then I can compare 4 doubles to 4 doubles and magically look at 8 doubles at once at the same time.
__m256d a, b, c;
a = _mm256_loadu_pd(data);
b = _mm256_loadu_pd(data+4);
c = _mm256_cmp_pd(a, b, _CMP_EQ_UQ);
At this point I realized I wasn't quite thinking straight because I might happen to compare a number to itself that is not a NaN (i.e. 3 == 3) and get a hit that way.
So my question is, is comparing 4 doubles to themselves (as done above) the best I can do or is there some other better approach to finding out whether my array has a NaN?

You might be able to avoid this entirely by checking fenv status, or if not then cache block it and/or fold it into another pass over the same data, because it's very low computational intensity (work per byte loaded/stored), so it easily bottlenecks on memory bandwidth. See below.
The comparison predicate you're looking for is _CMP_UNORD_Q or _CMP_ORD_Q to tell you that the comparison is unordered or ordered, i.e. that at least one of the operands is a NaN, or that both operands are non-NaN, respectively. What does ordered / unordered comparison mean?
The asm docs for cmppd list the predicates and have equal or better details than the intrinsics guide.
So yes, if you expect NaN to be rare and want to quickly scan through lots of non-NaN values, you can vcmppd two different vectors against each other. If you cared about where the NaN was, you could do extra work to sort that out once you know that there is at least one in either of two input vectors. (Like _mm256_cmp_pd(a,a, _CMP_UNORD_Q) to feed movemask + bitscan for lowest set bit.)
OR or AND multiple compares per movemask
Like with other SSE/AVX search loops, you can also amortize the movemask cost by combining a few compare results with _mm256_or_pd (find any unordered) or _mm256_and_pd (check for all ordered). E.g. check a couple cache lines (4x _mm256d with 2x _mm256_cmp_pd) per movemask / test/branch. (glibc's asm memchr and strlen use this trick.) Again, this optimizes for your common case where you expect no early-outs and have to scan the whole array.
Also remember that it's totally fine to check the same element twice, so your cleanup can be simple: a vector that loads up to the end of the array, potentially overlapping with elements you already checked.
// checks 4 vectors = 16 doubles
// non-zero means there was a NaN somewhere in p[0..15]
static inline
int any_nan_block(double *p) {
__m256d a = _mm256_loadu_pd(p+0);
__m256d abnan = _mm256_cmp_pd(a, _mm256_loadu_pd(p+ 4), _CMP_UNORD_Q);
__m256d c = _mm256_loadu_pd(p+8);
__m256d cdnan = _mm256_cmp_pd(c, _mm256_loadu_pd(p+12), _CMP_UNORD_Q);
__m256d abcdnan = _mm256_or_pd(abnan, cdnan);
return _mm256_movemask_pd(abcdnan);
}
// more aggressive ORing is possible but probably not needed
// especially if you expect any memory bottlenecks.
I wrote the C as if it were assembly, one instruction per source line. (load / memory-source cmppd). These 6 instructions are all single-uop in the fused-domain on modern CPUs, if using non-indexed addressing modes on Intel. test/jnz as a break condition would bring it up to 7 uops.
In a loop, an add reg, 16*8 pointer increment is another 1 uop, and cmp / jne as a loop condition is one more, bringing it up to 9 uops. So unfortunately on Skylake this bottlenecks on the front-end at 4 uops / clock, taking at least 9/4 cycles to issue 1 iteration, not quite saturating the load ports. Zen 2 or Ice Lake could sustain 2 loads per clock without any more unrolling or another level of vorpd combining.
Another trick that might be possible is to use vptest or vtestpd on two vectors to check that they're both non-zero. But I'm not sure it's possible to correctly check that every element of both vectors is non-zero. Can PTEST be used to test if two registers are both zero or some other condition? shows that the other way (that _CMP_UNORD_Q inputs are both all-zero) is not possible.
But this wouldn't really help: vtestpd / jcc is 3 uops total, vs. vorpd / vmovmskpd / test+jcc also being 3 fused-domain uops on existing Intel/AMD CPUs with AVX, so it's not even a win for throughput when you're branching on the result. So even if it's possible, it's probably break even, although it might save a bit of code size. And wouldn't be worth considering if it takes more than one branch to sort out the all-zeros or mix_zeros_and_ones cases from the all-ones case.
Avoiding work: check fenv flags instead
If your array was the result of computation in this thread, just check the FP exception sticky flags (in MXCSR manually, or via fenv.h fegetexcept) to see if an FP "invalid" exception has happened since you last cleared FP exceptions. If not, I think that means the FPU hasn't produced any NaN outputs and thus there are none in arrays written since then by this thread.
If it is set, you'll have to check; the invalid exception might have been raised for a temporary result that didn't propagate into this array.
Cache blocking:
If/when fenv flags don't let you avoid the work entirely, or aren't a good strategy for your program, try to fold this check into whatever produced the array, or into the next pass that reads it. So you're reusing data while it's already loaded into vector registers, increasing computational intensity. (ALU work per load/store.)
Even if data is already hot in L1d, it will still bottleneck on load port bandwidth: 2 loads per cmppd still bottlenecks on 2/clock load port bandwidth, on CPUs with 2/clock vcmppd ymm (Skylake but not Haswell).
Also worthwhile to align your pointers to make sure you're getting full load throughput from L1d cache, especially if data is sometimes already hot in L1d.
Or at least cache-block it so you check a 128kiB block before running another loop on that same block while it's hot in cache. That's half the size of 256k L2 so your data should still be hot from the previous pass, and/or hot for the next pass.
Definitely avoid running this over a whole multi-megabyte array and paying the cost of getting it into the CPU core from DRAM or L3 cache, then evicting again before another loop reads it. That's worst case computational intensity, paying the cost of getting it into a CPU core's private cache more than once.

Why does the latency of the sqrtsd instruction change based on the input? Intel processors

Well on the Intel intrinsic guide it is stated that the instruction called "sqrtsd" has a latency of 18 cycles.
I tested it with my own program and it is correct if, for example, we take 0.15 as input. But when we take 256 (or any 2^x) number then the latency is only 13. Why is that?
One theory I had is that since 13 is the latency of "sqrtss" which is the same as "sqrtsd" but done on 32bits floating points then maybe the processor was smart enough to understand taht 256 can fit in 32 bit and hence use that version while 0.15 needs the full 64 bit since it isn't representable in a finite way.
I am doing it using inline assembly, here is the relveant part compiled with gcc -O3 and -fno-tree-vectorize.
static double sqrtsd (double x) {
double r;
__asm__ ("sqrtsd %1, %0" : "=x" (r) : "x" (x));
return r;
}

SQRT* and DIV* are the only two "simple" ALU instructions (single uop, not microcoded branching / looping) that have data-dependent throughput or latency on modern Intel/AMD CPUs. (Not counting microcode assists for denormal aka subnormal FP values in add/multiply/fma). Everything else is pretty much fixed so the out-of-order uop scheduling machinery doesn't need to wait for confirmation that a result was ready some cycle, it just knows it will be.
As usual, Intel's intrinsics guide gives an over-simplified picture of performance. The actual latency isn't a fixed 18 cycles for double-precision on Skylake. (Based on the numbers you chose to quote, I assume you have a Skylake.)
div/sqrt are hard to implement; even in hardware the best we can do is an iterative refinement process. Refining more bits at once (radix-1024 divider since Broadwell) speeds it up (see this Q&A about the hardware). But it's still slow enough that an early-out is used to speed up simple cases (Or maybe the speedup mechanism is just skipping a setup step for all-zero mantissas on modern CPUs with partially-pipelined div/sqrt units. Older CPUs had throughput=latency for FP div/sqrt; that execution unit is harder to pipeline.)
https://www.uops.info/html-instr/VSQRTSD_XMM_XMM_XMM.html shows Skylake SQRTSD can vary from 13 to 19 cycle latency. The SKL (client) numbers only show 13 cycle latency, but we can see from the detailed SKL vsqrtsd page that they only tested with input = 0. SKX (server) numbers show 13-19 cycle latency. (This page has the detailed breakdown of the test code they used, including the binary bit-patterns for the tests.) Similar testing (with only 0 for client cores) was done on the non-VEX sqrtsd xmm, xmm page. :/
InstLatx64 results show best / worst case latencies of 13 to 18 cycles on Skylake-X (which uses the same core as Skylake-client, but with AVX512 enabled).
Agner Fog's instruction tables show 15-16 cycle latency on Skylake. (Agner does normally test with a range of different input values.) His tests are less automated and sometimes don't exactly match other results.
What makes some cases fast?
Note that most ISAs (including x86) use binary floating point:
the bits represent values as a linear significand (aka mantissa) times 2exp, and a sign bit.
It seems that there may only be 2 speeds on modern Intel (since Haswell at least) (See discussion with #harold in comments.) e.g. even powers of 2 are all fast, like 0.25, 1, 4, and 16. These have trivial mantissa=0x0 representing 1.0. https://www.h-schmidt.net/FloatConverter/IEEE754.html has a nice interactive decimal <-> bit-pattern converter for single-precision, with checkboxes for the set bits and annotations of what the mantissa and exponent represent.
On Skylake the only fast cases I've found in a quick check are even powers of 2 like 4.0 but not 2.0. These numbers have an exact sqrt result with both input and output having a 1.0 mantissa (only the implicit 1 bit set). 9.0 is not fast, even though it's exactly representable and so is the 3.0 result. 3.0 has mantissa = 1.5 with just the most significant bit of the mantissa set in the binary representation. 9.0's mantissa is 1.125 (0b00100...). So the non-zero bits are very close to the top, but apparently that's enough to disqualify it.
(+-Inf and NaN are fast, too. So are ordinary negative numbers: result = -NaN. I measure 13 cycle latency for these on i7-6700k, same as for 4.0. vs. 18 cycle latency for the slow case.)
x = sqrt(x) is definitely fast with x = 1.0 (all-zero mantissa except for the implicit leading 1 bit). It has a simple input and simple output.
With 2.0 the input is also simple (all-zero mantissa and exponent 1 higher) but the output is not a round number. sqrt(2) is irrational and thus has infinite non-zero bits in any base. This apparently makes it slow on Skylake.
Agner Fog's instruction tables say that AMD K10's integer div instruction performance depends on the number of significant bits in the dividend (input), not the quotient, but searching Agner's microarch pdf and instruction tables didn't find any footnotes or info about how sqrt specifically is data-dependent.
On older CPUs with even slower FP sqrt, there might be more room for a range of speeds. I think number of significant bits in the mantissa of the input will probably be relevant. Fewer significant bits (more trailing zeros in the significand) makes it faster, if this is correct. But again, on Haswell/Skylake the only fast cases seem to be even powers of 2.
You can test this with something that couples the output back to the input without breaking the data dependency, e.g. andps xmm0, xmm1 / orps xmm0, xmm2 to set a fixed value in xmm0 that's dependent on the sqrtsd output.
Or a simpler way to test latency is to take "advantage" of the false output dependency of sqrtsd xmm0, xmm1 - it and sqrtss leave the upper 64 / 32 bits (respectively) of the destination unmodified, thus the output register is also an input for that merging. I assume this is how your naive inline-asm attempt ended up bottlenecking on latency instead of throughput with the compiler picking a different register for the output so it could just re-read the same input in a loop. The inline asm you added to your question is totally broken and won't even compile, but perhaps your real code used "x" (xmm register) input and output constraints instead of "i" (immediate)?
This NASM source for a static executable test loop (to run under perf stat) uses that false dependency with the non-VEX encoding of sqrtsd.
This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency). AVX finally lets us avoid this with vsqrtsd dst, src,src at least for register sources, and similarly vcvtsi2sd dst, cold_reg, eax for the similarly near-sightedly designed scalar int->fp conversion instructions. (GCC missed-optimization reports: 80586, 89071, 80571.)
On many earlier CPUs even throughput was variable, but Skylake beefed up the dividers enough that the scheduler always knows it can start a new div/sqrt uop 3 cycles after the last single-precision input.
Even Skylake double-precision throughput is variable, though: 4 to 6 cycles after the last double-precision input uop, if Agner Fog's instruction tables are right.
https://uops.info/ shows a flat 6c reciprocal throughput. (Or twice that long for 256-bit vectors; 128-bit and scalar can use separate halves of the wide SIMD dividers for more throughput but the same latency.) See also Floating point division vs floating point multiplication for some throughput/latency numbers extracted from Agner Fog's instruction tables.

Any preference to SHUFPD or PSHUFD for reversing two packed double in an XMM?

Question today is fairly short. Consider the following toy C program shuffle.c for reversing two packed double in register xmm0:
#include <stdio.h>
void main () {
double x[2] = {0.0, 1.0};
asm volatile (
"movupd (%[x]), %%xmm0\n\t"
"shufpd $1, %%xmm0, %%xmm0\n\t" /* method 1 */
//"pshufd $78, %%xmm0, %%xmm0\n\t" /* method 2 */
"movupd %%xmm0, (%[x])\n\t"
:
: [x] "r" (x)
: "xmm0", "memory");
printf("x[0] = %.2f, x[1] = %.2f\n", x[0], x[1]);
}
After a dry run: gcc -msse3 -o shuffle shuffle.c | ./test, both methods/instructions will return the correct result x[0] = 1.00, x[1] = 0.00. This page says that shufpd has a latency of 6 cycles, while the intel intrinsic guide says that pshufd only has a latency of 1 cycles. This sounds like great preference to pshufd. However, This instruction is truly for packed integers. When using it for packed doubles, will there be any penalty associated with "wrong type"?
As a similar question, I also heard that instruction movaps is 1-byte smaller than movapd, and they do the same thing by reading 128bits from a 16-bit aligned address. So can we always use the former for move (between XMMs) / load (from memory) / store (to memory)? This seems crazy. I think there must be some reason to reject this. Can someone give me an explanation? Thank you.

You'll always get correct results, but it can matter for performance.
Prefer FP shuffles for FP data that will be an input to FP math instructions (like addps or vfma..., as opposed to insns like xorps).
This avoids any extra bypass-delay latency on some microarchitectures, including potentially current Intel chips. See Agner Fog's microarchitecture guide. AMD Bulldozer-family does all shuffles in the vector-integer domain, so there's a bypass delay whichever shuffle you use.
If it saves instructions, it can be worth it to use an integer shuffle anyway. (But usually it's the other way around, where you want to use shufps to combine data from two integer vectors. That's fine in even more cases, and mostly a problem only on Nehalem, IIRC.)
http://x86.renejeschke.de/html/file_module_x86_id_293.html lists the latency for CPUID 0F3n/0F2n CPUs, i.e. Pentium4 (family 0xF model 2 (Northwood) / model 3 (Prescott)). Those numbers are obviously totally irrelevant, and don't even match Agner Fog's P4 table for shufpd.
Intel's intrinsics guide sometimes has numbers that don't match experimental testing, either. See Agner Fog's instruction tables for good latency/throughput numbers, and microarch guides to understand the details.
movaps vs. movapd: No existing microarchitectures care which you use. It would be possible for someone in the future to design an x86 CPU that kept double vectors separate from float vectors internally, but for now the only distinction has been int vs. FP.
Always prefer the ps instruction when the behaviour is identical (xorps over xorpd, movhps over movhpd).
Some compilers (maybe both gcc and clang, I forget) will compile a _mm_store_si128 integer vector store to movaps, because there's no performance downside on any existing hardware, and it's one byte shorter.
IIRC, there's also no perf downside to loading integer vector data with movaps / movups, but I'm less sure about that.
There is a perf downside to using the wrong mov instruction for a reg-reg move, though. movdqa xmm1, xmm2 between two FP instructions is bad on Nehalem.
re: your inline asm:
It doesn't need to be volatile, and you could drop the "memory" clobber if you used a 16 byte struct or something as a "+m" input/output operand. Or a "+x" vector-register operand for an __m128d variable.
You'll probably get better results from intrinsics than from inline asm, unless you write whole loops in inline asm or stand-alone functions.
See the x86 tag wiki for a link to my inline asm guide.

Super-fast rounding function (PBC)

I really need very fast round() function in C -
it is necessary for Monte Carlo particle modeling:
at every step you need to wrap coordinates into periodic box to compute volume interactions : for example
for(int i=0; i < 3; i++)
{
coor.x[i] = a.XReal.x[i]-b.XReal.x[i];
coor.x[i] = coor.x[i] - SIZE[i]*round(coor.x[i]/SIZE[i]); //PBC
}
I've come across some asm hacking with it, but i don't understand asm at all:)
something like this
inline int float2int2(float flt)
{
int intgr;
__asm__ __volatile__ ("fld %1; fistp %0;" : "=m" (intgr) : "m" (flt));
return intgr;
}
With fixed boundaries, without round() it works faster.
So, maybe someone knows a better way?..

First of all, you can get some gains by using the right compiler options. With GCC and a modern Intel CPU for example, you should try:
-march=nehalem -fno-trapping-math
Then the problem with round is that it uses a specific rounding mode, which is slow on most platforms. nearbyint (or rint) should always be faster:
coor.x[i] = coor.x[i] - SIZE[i] * nearbyint(coor.x[i] / SIZE[i])
Have a look a the generated assembly.
You should also think about vectorizing your code.

Instead of looking for just fast rounding, ideally you want the whole process of range-reduction into the periodic box to be fast. As #EOF accurately pointed out in a comment, you could use a C99 standard function like remainderf() or fmodf().
coor.x[i] -= SIZE[i]*round(coor.x[i]/SIZE[i]);
// same as
coor.x[i] = remainderf(coor.x[i], SIZE[i]);
fmodf(3) rounds towards zero, remainderf(3) rounds towards nearest.
The remainder() function computes the remainder of dividing x by y. The return value is x-n*y, where n is the value x / y, rounded
to the nearest integer. If the absolute value of x-n*y is 0.5, n is chosen to be even.
Compilers / libraries have several different strategies for implementing these. With -ffast-math, gcc 5.3 for x86-64 inlines a remainder(x,y) implementation that transfers the values from SSE registers to x87 registers, and runs FPREM1 (partial remainder) in a loop until it sets a flag indicating that the result is correct. (One execution of FPREM1 can reduce the exponent by at most 63).
clang always emits a call to the library function, either the normal remainder entry point, or __remainder_finite with -ffast-math.
The GNU libm definition uses mostly integer operations, AFAICT from the disassembly and the C source. On a recent Intel CPU with fast hardware divide, it might be slower than your div, round, mul version.
So you have three options:
div, round, mul, sub, with fast rounding (use nearbyint(), it apparently has the least ugly semantics so it can inline to roundsd / roundss most easily). This way can vectorize, and do all three coordinates at once. May need to do it manually, to find something that won't fault for the 4th element. On Intel Haswell with 128b vectors: 5 uops. single-precision: divps(10-13c latency, one per 7c throughput), roundps(2 uops, 6c latency, one per 2c throughput), mulps(5c latency, one per 0.5c throughput), subps(3c latency, one per 1c throughput). Some of these compete with each other for execution ports. Total latency: 27c. Probable throughput, maybe something like one per 7c (totally bottlenecked by divps)
gcc's inlined x87 FPREM1. (probably only needs to run one iteration, so on Haswell: 41 uops, 27c latency, one per 17c throughput, plus some overhead for getting data between xmm and x87 regs. Can't vectorize.
glibc's mostly-integer implementation: no idea, probably worse than either of the other two, on modern x86 CPUs. But, probably significantly higher accuracy than the manual div/round/mul/sub.
Bottom line, if this is a speed issue, you should definitely look into vectorizing with SSE/AVX to do all three coordinates of a point in one vector. Or, a coordinate of four points at once, or whatever is convenient. Ideally you can make use of all 4 (or 8 with AVX) single-precision elements of the vector ALUs. (or 2 / 4 for double-precision).
Even scalar, I think your current code with nearbyint() is going to be the fastest choice, but you can easily go three times faster than that with vectors.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight