Super-fast rounding function (PBC) - C

I really need a very fast round() function in C -
it is necessary for Monte Carlo particle modeling:
at every step you need to wrap coordinates into the periodic box to compute volume interactions, for example:

for (int i = 0; i < 3; i++)
{
    coor.x[i] = a.XReal.x[i] - b.XReal.x[i];
    coor.x[i] = coor.x[i] - SIZE[i]*round(coor.x[i]/SIZE[i]); // PBC
}
I've come across some asm hacking with it, but I don't understand asm at all :)
Something like this:

inline int float2int2(float flt)
{
    int intgr;
    /* x87 fld/fistp converts using the current FPU rounding mode
       (round-to-nearest-even by default), not round()'s halfway-away-from-zero rule */
    __asm__ __volatile__ ("fld %1; fistp %0;" : "=m" (intgr) : "m" (flt));
    return intgr;
}
With fixed boundaries, without round() it works faster.
So, maybe someone knows a better way?..

First of all, you can get some gains by using the right compiler options. With GCC and a modern Intel CPU for example, you should try:
-march=nehalem -fno-trapping-math
Then the problem with round is that it uses a specific rounding mode, which is slow on most platforms. nearbyint (or rint) should always be faster:
coor.x[i] = coor.x[i] - SIZE[i] * nearbyint(coor.x[i] / SIZE[i])
Have a look at the generated assembly.
You should also think about vectorizing your code.
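For example, here is a minimal sketch of the question's loop using nearbyintf() (the helper name and float coordinate type are assumptions, not from the question) that GCC can inline to roundss/roundps and auto-vectorize when built with the flags above plus -O3:

#include <math.h>

/* Hypothetical helper: wrap one displacement vector into the periodic box.
   With SSE4.1 enabled (e.g. -march=nehalem), nearbyintf() can inline to a
   single round instruction instead of a library call. */
static inline void pbc_wrap(float coor[3], const float SIZE[3])
{
    for (int i = 0; i < 3; i++)
        coor[i] -= SIZE[i] * nearbyintf(coor[i] / SIZE[i]);
}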

Instead of looking for just fast rounding, ideally you want the whole process of range-reduction into the periodic box to be fast. As @EOF accurately pointed out in a comment, you could use a C99 standard function like remainderf() or fmodf().
coor.x[i] -= SIZE[i]*round(coor.x[i]/SIZE[i]);
// same as
coor.x[i] = remainderf(coor.x[i], SIZE[i]);
fmodf(3) rounds towards zero, remainderf(3) rounds towards nearest.
The remainder() function computes the remainder of dividing x by y. The return value is x-n*y, where n is the value x/y, rounded to the nearest integer. If the absolute value of x-n*y is 0.5, n is chosen to be even.
Compilers / libraries have several different strategies for implementing these. With -ffast-math, gcc 5.3 for x86-64 inlines a remainder(x,y) implementation that transfers the values from SSE registers to x87 registers, and runs FPREM1 (partial remainder) in a loop until it sets a flag indicating that the result is correct. (One execution of FPREM1 can reduce the exponent by at most 63).
clang always emits a call to the library function, either the normal remainder entry point, or __remainder_finite with -ffast-math.
The GNU libm definition uses mostly integer operations, AFAICT from the disassembly and the C source. On a recent Intel CPU with fast hardware divide, it might be slower than your div, round, mul version.
So you have three options:

1. div, round, mul, sub, with fast rounding: use nearbyint(), which apparently has the least ugly semantics and so can inline to roundsd / roundss most easily. This way can vectorize and do all three coordinates at once; you may need to do it manually, to find something that won't fault for the 4th element. On Intel Haswell with 128b vectors, single-precision: divps (10-13c latency, one per 7c throughput), roundps (2 uops, 6c latency, one per 2c throughput), mulps (5c latency, one per 0.5c throughput), subps (3c latency, one per 1c throughput). Some of these compete with each other for execution ports. Total: 5 uops, 27c latency, and probable throughput of maybe one per 7c (totally bottlenecked by divps).
2. gcc's inlined x87 FPREM1: probably only needs to run one iteration, so on Haswell roughly 41 uops, 27c latency, one per 17c throughput, plus some overhead for getting data between xmm and x87 regs. Can't vectorize.
3. glibc's mostly-integer implementation: no idea, probably worse than either of the other two on modern x86 CPUs, but probably significantly higher accuracy than the manual div/round/mul/sub.
Bottom line, if this is a speed issue, you should definitely look into vectorizing with SSE/AVX to do all three coordinates of a point in one vector. Or, a coordinate of four points at once, or whatever is convenient. Ideally you can make use of all 4 (or 8 with AVX) single-precision elements of the vector ALUs. (or 2 / 4 for double-precision).
Even scalar, I think your current code with nearbyint() is going to be the fastest choice, but you can easily go three times faster than that with vectors.
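As a hand-vectorized sketch of option 1 above (div, round, mul, sub), assuming the three coordinates are padded out to a 4-lane vector - the helper name is hypothetical, and the unused 4th lane of both vectors should hold a harmless value such as 1.0f:

#include <smmintrin.h>  /* SSE4.1 for _mm_round_ps */

/* Wrap 4 float displacements at once: d -= size * round_to_nearest(d / size) */
static inline __m128 pbc_wrap4(__m128 d, __m128 size)
{
    __m128 q = _mm_div_ps(d, size);
    /* round to nearest (ties to even), suppressing FP exceptions */
    q = _mm_round_ps(q, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm_sub_ps(d, _mm_mul_ps(q, size));
}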

Related

Why does the latency of the sqrtsd instruction change based on the input? Intel processors

Well on the Intel intrinsic guide it is stated that the instruction called "sqrtsd" has a latency of 18 cycles.
I tested it with my own program and it is correct if, for example, we take 0.15 as input. But when we take 256 (or any 2^x) number then the latency is only 13. Why is that?
One theory I had is this: 13 is the latency of "sqrtss", which is the same as "sqrtsd" but done on 32-bit floating point, so maybe the processor was smart enough to understand that 256 fits in 32 bits and use that version, while 0.15 needs the full 64 bits since it isn't representable in a finite way.
I am doing it using inline assembly; here is the relevant part, compiled with gcc -O3 and -fno-tree-vectorize.
static double sqrtsd (double x) {
    double r;
    __asm__ ("sqrtsd %1, %0" : "=x" (r) : "x" (x));
    return r;
}
SQRT* and DIV* are the only two "simple" ALU instructions (single uop, not microcoded branching / looping) that have data-dependent throughput or latency on modern Intel/AMD CPUs. (Not counting microcode assists for denormal aka subnormal FP values in add/multiply/fma). Everything else is pretty much fixed so the out-of-order uop scheduling machinery doesn't need to wait for confirmation that a result was ready some cycle, it just knows it will be.
As usual, Intel's intrinsics guide gives an over-simplified picture of performance. The actual latency isn't a fixed 18 cycles for double-precision on Skylake. (Based on the numbers you chose to quote, I assume you have a Skylake.)
div/sqrt are hard to implement; even in hardware the best we can do is an iterative refinement process. Refining more bits at once (radix-1024 divider since Broadwell) speeds it up (see this Q&A about the hardware). But it's still slow enough that an early-out is used to speed up simple cases (Or maybe the speedup mechanism is just skipping a setup step for all-zero mantissas on modern CPUs with partially-pipelined div/sqrt units. Older CPUs had throughput=latency for FP div/sqrt; that execution unit is harder to pipeline.)
https://www.uops.info/html-instr/VSQRTSD_XMM_XMM_XMM.html shows Skylake SQRTSD can vary from 13 to 19 cycle latency. The SKL (client) numbers only show 13 cycle latency, but we can see from the detailed SKL vsqrtsd page that they only tested with input = 0. SKX (server) numbers show 13-19 cycle latency. (This page has the detailed breakdown of the test code they used, including the binary bit-patterns for the tests.) Similar testing (with only 0 for client cores) was done on the non-VEX sqrtsd xmm, xmm page. :/
InstLatx64 results show best / worst case latencies of 13 to 18 cycles on Skylake-X (which uses the same core as Skylake-client, but with AVX512 enabled).
Agner Fog's instruction tables show 15-16 cycle latency on Skylake. (Agner does normally test with a range of different input values.) His tests are less automated and sometimes don't exactly match other results.
What makes some cases fast?
Note that most ISAs (including x86) use binary floating point:
the bits represent values as a linear significand (aka mantissa) times 2^exp, and a sign bit.
It seems that there may only be 2 speeds on modern Intel (since Haswell at least). (See discussion with @harold in comments.) e.g. even powers of 2 are all fast, like 0.25, 1, 4, and 16. These have a trivial mantissa = 0x0, representing 1.0. https://www.h-schmidt.net/FloatConverter/IEEE754.html has a nice interactive decimal <-> bit-pattern converter for single-precision, with checkboxes for the set bits and annotations of what the mantissa and exponent represent.
On Skylake the only fast cases I've found in a quick check are even powers of 2 like 4.0 but not 2.0. These numbers have an exact sqrt result with both input and output having a 1.0 mantissa (only the implicit 1 bit set). 9.0 is not fast, even though it's exactly representable and so is the 3.0 result. 3.0 has mantissa = 1.5 with just the most significant bit of the mantissa set in the binary representation. 9.0's mantissa is 1.125 (0b00100...). So the non-zero bits are very close to the top, but apparently that's enough to disqualify it.
(+-Inf and NaN are fast, too. So are ordinary negative numbers: result = -NaN. I measure 13 cycle latency for these on i7-6700k, same as for 4.0. vs. 18 cycle latency for the slow case.)
x = sqrt(x) is definitely fast with x = 1.0 (all-zero mantissa except for the implicit leading 1 bit). It has a simple input and simple output.
With 2.0 the input is also simple (all-zero mantissa and exponent 1 higher) but the output is not a round number. sqrt(2) is irrational and thus has infinite non-zero bits in any base. This apparently makes it slow on Skylake.
Agner Fog's instruction tables say that AMD K10's integer div instruction performance depends on the number of significant bits in the dividend (input), not the quotient, but searching Agner's microarch pdf and instruction tables didn't find any footnotes or info about how sqrt specifically is data-dependent.
On older CPUs with even slower FP sqrt, there might be more room for a range of speeds. I think number of significant bits in the mantissa of the input will probably be relevant. Fewer significant bits (more trailing zeros in the significand) makes it faster, if this is correct. But again, on Haswell/Skylake the only fast cases seem to be even powers of 2.
You can test this with something that couples the output back to the input without breaking the data dependency, e.g. andps xmm0, xmm1 / orps xmm0, xmm2 to set a fixed value in xmm0 that's dependent on the sqrtsd output.
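A minimal sketch of that coupling trick using GNU inline asm (x86-64 only; the helper name is hypothetical, and this uses the pd flavors of the logical ops, which are bitwise-equivalent): AND the sqrt result down to zero, then OR the fixed test input back in, so every sqrtsd sees the same value but still depends on the previous result. Time the loop (e.g. under perf stat or with rdtsc) and divide by iters; the per-iteration cost is the sqrtsd latency plus a couple of cycles for the andpd/orpd pair.

static double sqrt_latency_chain(double test_input, long iters)
{
    double x = test_input;
    const double zero = 0.0;            /* all-zero bit pattern for ANDPD */
    const double in = test_input;
    for (long i = 0; i < iters; i++) {
        __asm__ volatile (
            "sqrtsd %0, %0\n\t"          /* operation under test */
            "andpd  %1, %0\n\t"          /* result -> 0.0, dependency kept */
            "orpd   %2, %0"              /* re-create the fixed test input */
            : "+x" (x)
            : "x" (zero), "x" (in));
    }
    return x;
}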
Or a simpler way to test latency is to take "advantage" of the false output dependency of sqrtsd xmm0, xmm1 - it and sqrtss leave the upper 64 / 32 bits (respectively) of the destination unmodified, thus the output register is also an input for that merging. I assume this is how your naive inline-asm attempt ended up bottlenecking on latency instead of throughput, with the compiler picking a different register for the output so it could just re-read the same input in a loop. (An earlier version of the inline asm in your question was totally broken and wouldn't even compile; presumably your real code used "x" (xmm register) input and output constraints instead of "i" (immediate), as the version shown above does.)
This NASM source for a static executable test loop (to run under perf stat) uses that false dependency with the non-VEX encoding of sqrtsd.
This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency). AVX finally lets us avoid this with vsqrtsd dst, src,src at least for register sources, and similarly vcvtsi2sd dst, cold_reg, eax for the similarly near-sightedly designed scalar int->fp conversion instructions. (GCC missed-optimization reports: 80586, 89071, 80571.)
On many earlier CPUs even throughput was variable, but Skylake beefed up the dividers enough that the scheduler always knows it can start a new div/sqrt uop 3 cycles after the last single-precision input.
Even Skylake double-precision throughput is variable, though: 4 to 6 cycles after the last double-precision input uop, if Agner Fog's instruction tables are right.
https://uops.info/ shows a flat 6c reciprocal throughput. (Or twice that long for 256-bit vectors; 128-bit and scalar can use separate halves of the wide SIMD dividers for more throughput but the same latency.) See also Floating point division vs floating point multiplication for some throughput/latency numbers extracted from Agner Fog's instruction tables.

How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?

Just curious about how the standard sqrt() from math.h works when compiled with GCC. I coded my own sqrt() using Newton-Raphson to do it!
Yeah, I know about fsqrt. But how does the CPU do it? I can't debug hardware.
Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.
Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.
Maybe also see
The integer division algorithm of Intel's x86 processors - Merom's Radix-2 and Radix-4 dividers were replaced by Penryn's Radix-16. (Core2 65nm vs. 45nm)
https://electronics.stackexchange.com/questions/280673/why-does-hardware-division-take-much-longer-than-multiplication
https://scicomp.stackexchange.com/questions/187/why-is-division-so-much-more-complex-than-other-arithmetic-operations
But however the hardware works, IEEE requires sqrt (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special, other functions like log and sin do not have this requirement, and real library implementations usually aren't that accurate. (And x87 fsin is definitely not that accurate for inputs near Pi/2 where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)
See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD sqrtsd / sqrtss and their wider versions. I collected up the results in Floating point division vs floating point multiplication
For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.
Unlike most instructions, sqrt performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer).
sqrt is defined by C, so most likely you have to look in glibc.
You did not specify which architecture you are asking for, so I think it's safe to assume x86-64. If that's the case, they are defined in:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrt.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrtf.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrtl.c
tl;dr they are simply implemented by calling the x86-64 square root instructions sqrtss / sqrtsd:
https://www.felixcloutier.com/x86/sqrtss
https://www.felixcloutier.com/x86/sqrtsd
Furthermore, and just for the sake of discussion, if you enable fast-math (something you probably should not do if you care about result precision), you will see that most compilers will actually inline the call and directly emit the sqrtss / sqrtsd instructions:
https://godbolt.org/z/Wb4unC
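For instance, a hedged sketch (the exact codegen depends on compiler version and flags):

#include <math.h>

/* Built with e.g. `gcc -O2 -fno-math-errno` (implied by -ffast-math), this
   typically compiles to a single sqrtsd instruction. Without -fno-math-errno,
   GCC keeps a compare-and-branch that calls the library sqrt() for negative
   inputs so that errno can still be set. */
double root(double x)
{
    return sqrt(x);
}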

What is 'differential timing' technique for doing benchmarks?

At around the 39-minute mark of "Writing Fast Code I" by Andrei Alexandrescu (link here to youtube)
there is a slide of how to use differential timing... can someone show me some basic code with this approach? It was only mentioned for a second, but I think that's an interesting idea.
Run baseline 2n times (t_2a)
vs. baseline n times + contender n times (t_a+b).
Relative improvement = t_2a / (2*t_a+b - t_2a)
"some overhead noise canceled"
Alexandrescu's slide is rather trivial to pour into code:

using clock = std::chrono::steady_clock;

auto start = clock::now();
for (int i = 0; i < 2*n; i++)
    baseline();
std::chrono::duration<double> t2a = clock::now() - start;

start = clock::now();
for (int i = 0; i < n; i++)
    baseline();
// *
for (int i = 0; i < n; i++)
    contender();
std::chrono::duration<double> taplusb = clock::now() - start;

double r = t2a / (2 * taplusb - t2a); // relative speedup

* Synchronization point which prevents optimization across the last two loops.
I would be more interested in the mathematical reasoning behind measuring the relative speed-up this way, as opposed to simply tBaseline / tContender as I've been doing forever. He only vaguely hints at '...overhead noise (being) cancelled (out)', but doesn't explain it in detail.
If you keep watching until 41:40 or so, he mentions it again when warning about the pitfall of first run vs. subsequent (allocators warmed up, etc.)
The best solution for that is doing warm-up runs before the first timed region.
I think he's picturing the 2n-baseline run vs. the n-baseline + n-contender run as separate invocations of the benchmark program.
So instead of doing some warmup runs before the timed region, he's using the baseline as a controlled warmup inside the timed region. This might make it possible to just time the whole program, e.g. perf stat, instead of calling a time function inside the program. Depending on how much process startup overhead your OS has vs. how long you make your repeat loop.
Microbenchmarking is hard and there are many pitfalls. Notably benchmarking optimized code while still making sure there isn't optimization between iterations of your repeat loop. (Often it's useful to use inline asm "escape" macros to force the compiler to materialize a value in an integer register, and/or to forget about the value of a variable to defeat CSE. Sometimes it's sufficient to just add the final result of each iteration to a sum that you print at the end.)
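The usual shape of those escape/clobber helpers looks roughly like this (the names are just a common convention, not a library API): empty asm statements the optimizer must assume read or write memory, so the value under test can't be optimized away or hoisted out of the repeat loop.

static inline void escape(void *p)
{
    __asm__ volatile ("" : : "g"(p) : "memory");  /* pretend to use *p */
}

static inline void clobber(void)
{
    __asm__ volatile ("" : : : "memory");         /* pretend to touch all memory */
}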
This is the first I've heard of this differential idea. It doesn't sound more useful than normal warm-ups.
If anything it will make the contender look slightly worse than using the function under test for some warm-up runs before the timed region. Using the same function as the timed region will warm up branch-prediction for it. Or not because after inlining the warm-up vs. main versions will be at different addresses. The same pattern at different addresses may possibly still help a modern TAGE predictor but IDK.
Or if contender has any lookup tables, those will become hot in cache from the warmup.
In any case, warmups are essential, unless you make the repeat count long enough to dwarf the time it takes for the CPU to switch to max turbo and so on. And to page-fault in all the memory you touch.
If your calculated time/iteration doesn't stay constant with your repeat count, your microbenchmark is broken.
Take the rest of his advice with a grain of salt, too. Most of it is useful (e.g. prefer 32-bit integers even for local temporaries, not just for arrays, for cache-footprint reasons), but the reasoning is wrong for some of it.
His explanation that an ALU can do 2x 32-bit adds or 1x 64-bit add only applies to SIMD: 4x 32-bit int in a vector for paddd or 2x 64-bit int in a vector for paddq. But x86 scalar add r32, r32 has the same throughput as add r64,r64. I don't think it was true even on Pentium 4 (Nocona) despite P4 having funky double-pumped ALUs with 0.5 cycle latency for add. At least before Prescott/Nocona which introduced 64-bit support.
Using 32-bit unsigned integers on x86-64 can stop the compiler from optimizing to pointer increments if it wants to. It has to maintain correctness in case of 32-bit wraparound of a variable before array indexing.
Using 16-bit or 8-bit locals to match the data in an array can sometimes help auto-vectorization, IIRC. Gcc/clang sometimes make really braindead code that unpacks to 32-bit and then re-packs down to 8-bit elements when processing an array of int8_t or uint8_t. I forget if I've ever actually worked around that by using narrow locals, though; C's default integer promotions bring most expressions back up to 32-bit.
Also, at https://youtu.be/vrfYLlR8X8k?t=3498, he claims that FP->int is expensive. That's never been true on x86-64: FP math uses SSE/SSE2 which has an instruction that does truncating conversion. FP->int used to be slow in the bad old days of x87 math, where you had to change the FP rounding mode, fistp, then change it back, to get C truncation semantics. But SSE includes cvttsd2si exactly for that common case.
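For example (function name hypothetical), the plain C cast is all it takes:

/* (int)x on x86-64 compiles to a single cvttsd2si (truncating convert),
   so C's truncation semantics no longer need the x87 rounding-mode dance. */
int trunc_to_int(double x)
{
    return (int)x;
}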
He also says float is no faster than double. That's true for scalar (other than div/sqrt), but if your code can auto-vectorize then you get twice as much work done per instruction and the instructions have the same throughput. (Twice as many elements fit in a SIMD vector.)
How the math works:
It just cancels out the n * baseline time from both parts, effectively doing (2 * baseline) / (2*contender) = baseline/contender.
It assumes that the times add normally (not overlapping computation). t_2a = 2 * baseline, and 2 * t_ab = 2 * baseline + 2 * contender. Subtracting cancels the 2*baseline parts, leaving you with 2*contender.
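A quick sanity check with made-up numbers: say one baseline call takes 10 ms, one contender call takes 5 ms, and n = 100. Then t_2a = 200 * 10 ms = 2000 ms and t_a+b = 100 * 10 ms + 100 * 5 ms = 1500 ms, so t_2a / (2*t_a+b - t_2a) = 2000 / (3000 - 2000) = 2, matching baseline/contender = 10/5 = 2.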
The trick isn't in the math, if anything this is more mathematically dangerous because subtracting two larger numbers accumulates error. i.e. if the n*baseline actually takes different amounts of time in the two runs (because you didn't control that perfectly), then it doesn't cancel and contributes error to your estimate.

Determine FLOPS of our ASM program

We had to implement an ASM program for multiplying sparse matrices in the coordinate scheme format (COOS) as well as in the compressed row format (CSR). Now that we have implemented all these algorithms we want to know how much more performant they are in contrast to the usual matrix multiplication. We already implemented code to measure the running time of all these algorithms but now we decided that we also want to know how many floating points operations per seconds (FLOPS) we can perform.
Any suggestion of how to measure/count this?
Here some background information on the used system:
processor : 0
model name : ARMv7 Processor rev 2 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part : 0xc08
CPU revision : 2
Our first idea was now to implement a kind of FPO counter which we increment after each floating point operation (Arithmetic operations as well as comparison and move operations), but that would mean that we have to insert increment operations all over our code which also slows down the application ...
Does anyone know if there is some kind of hardware counter which counts the number of floating point operations or maybe if there exist some kind of performance tool which can be used to monitor our program and measures the number of FPOs.
Any suggestions or pointers would be appreciated.
Here is the evaluation of the FLOPs for a matrix multiplication using the counting approach. We first measured the running time, then inserted counters for each instruction we were interested in, and after that we calculated the number of floating point operations per second.
It looks like the closest you can get with the performance events supported by Cortex-A8 is a count of total instructions executed, which isn't very helpful given that "an instruction" performs anything from 0 to (I think) 8 FP operations. Taking a step back, it becomes apparent that trying to measure FLOPS for the algorithm in hardware wouldn't really work anyway - e.g. you could write an implementation using vector ops but not always put real data in all lanes of each vector, then the CPU needs to be psychic to know how many of the FP operations it's executing actually count.
Fortunately, given a formal definition of an algorithm, calculating the number of operations involved should be fairly straightforward (although not necessarily easy, depending on the complexity). For instance, running through it in my head, the standard naïve multiplication of an m x n matrix with an n x m matrix comes out to m * m * (n + n - 1) operations (n multiplications and (n - 1) additions per output element). Once on-paper analysis has come up with an appropriately parameterised op-counting formula, you can plumb that into your benchmarking tool to calculate numbers for the data on test.
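As a hedged sketch of plumbing such a formula into a benchmark harness (the function name is hypothetical):

/* FLOPS estimate for a dense m x n times n x m multiplication:
   m*m output elements, each costing n multiplies and (n - 1) additions. */
double dense_matmul_flops(long m, long n, double seconds)
{
    double ops = (double)m * (double)m * (2.0 * (double)n - 1.0);
    return ops / seconds;
}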
Once you've done all that, you'll probably then start regretting spending all the time to do it, because what you'll have is (arbitrary number) / (execution time) which is little more meaningful than (execution time) alone, and mostly just complicates comparison between cases where (arbitrary number) differs. NEON performance in particular is dominated by pipeline latency and memory bandwidth, and as such the low-level implementation details could easily outweigh any inherent difference the algorithms might have.
Think of it this way: say on some given 100MHz CPU a + a + b + b takes 5 cycles total, while (a + b) * 2 takes 4 cycles total* - the former scores 60 MFLOPS, the latter only 50 MFLOPS. Are you going to say that more FLOPS means better performance, in which case the routine which takes 25% longer to give the same result is somehow "better"? Are you going to say fewer FLOPS means better performance, which is clearly untrue for any reasonable interpretation? Or are you going to conclude that FLOPS is pretty much meaningless for anything other than synthetic benchmarks to compare the theoretical maximum bandwidth of one CPU with another?
* numbers pulled out of thin air for the sake of argument; however they're actually not far off something like Cortex-M4F - a single-precision FPU where both add and multiply are single-cycle, plus one or two for register hazards.
Peak FLOPS = number of cores x average frequency x FP operations per cycle.

What's the difference between logical SSE intrinsics?

Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands. My questions:
Is there any difference between using one or another intrinsic (with appropriate type casting). Won't there be any hidden costs like longer execution in some specific situation?
These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any ideas why Intel is wasting precious opcode space for several instructions which do the same thing?
Is there any difference between using one or another intrinsic (with appropriate type casting). Won't there be any hidden costs like longer execution in some specific situation?
Yes, there can be performance reasons to choose one vs. the other.
1: Sometimes there is an extra cycle or two of latency (forwarding delay) if the output of an integer execution unit needs to be routed to the input of an FP execution unit, or vice versa. It takes a LOT of wires to move 128b of data to any of many possible destinations, so CPU designers have to make tradeoffs, like only having a direct path from every FP output to every FP input, not to ALL possible inputs.
See this answer, or Agner Fog's microarchitecture doc for bypass-delays. Search for "Data bypass delays on Nehalem" in Agner's doc; it has some good practical examples and discussion. He has a section on it for every microarch he has analysed.
However, the delays for passing data between the different domains or different types of registers are smaller on the Sandy Bridge and Ivy Bridge than on the Nehalem, and often zero. -- Agner Fog's microarch doc
Remember that latency doesn't matter if it isn't on the critical path of your code (except sometimes on Haswell/Skylake where it infects later use of the produced value, long after actual bypass :/). Using pshufd instead of movaps + shufps can be a win if uop throughput is your bottleneck, rather than latency of your critical path.
2: The ...ps version takes 1 fewer byte of code than the other two for legacy-SSE encoding. (Not AVX). This will align the following instructions differently, which can matter for the decoders and/or uop cache lines. Generally smaller is better for better code density in I-cache and fetching code from RAM, and packing into the uop cache.
3: Recent Intel CPUs can only run the FP versions on port5.
Merom (Core2) and Penryn: orps can run on p0/p1/p5, but integer-domain only. Presumably all 3 versions decoded into the exact same uop. So the cross-domain forwarding delay happens. (AMD CPUs do this too: FP bitwise instructions run in the ivec domain.)
Nehalem / Sandybridge / IvB / Haswell / Broadwell: por can run on p0/p1/p5, but orps can run only on port5. p5 is also needed by shuffles, but the FMA, FP add, and FP mul units are on ports 0/1.
Skylake: por and orps both have 3-per-cycle throughput. Intel's optimization manual has some info about bypass forwarding delays: to/from FP instructions it depends on which port the uop ran on. (Usually still port 5 because the FP add/mul/fma units are on ports 0 and 1.) See also Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says - "bypass" latency can affect every use of the register until it's overwritten.
Note that on SnB/IvB (AVX but not AVX2), only p5 needs to handle 256b logical ops, as vpor ymm, ymm requires AVX2. This was probably not the reason for the change, since Nehalem did this.
How to choose wisely:
Keep in mind that compilers can use por for _mm_or_pd if they want, so some of this applies mostly to hand-written asm. But some compilers are somewhat faithful to the intrinsics you choose.
If logical op throughput on port5 could be a bottleneck, then use the integer versions, even on FP data. This is especially true if you want to use integer shuffles or other data-movement instructions.
AMD CPUs always use the integer domain for logicals, so if you have multiple integer-domain things to do, do them all at once to minimize round-trips between domains. Shorter latencies will get things cleared out of the reorder buffer faster, even if a dep chain isn't the bottleneck for your code.
If you just want to set/clear/flip a bit in FP vectors between FP add and mul instructions, use the ...ps logicals, even on double-precision data, because single and double FP are the same domain on every CPU in existence, and the ...ps versions are one byte shorter (without AVX).
There are practical / human-factor reasons for using the ...pd versions, though, with intrinsics. Readability of your code by other humans is a factor: they'll wonder why you're treating your data as singles when it's actually doubles. For C/C++ intrinsics, littering your code with casts between __m128 and __m128d is not worth it (see the sketch after this list). (And hopefully a compiler will use orps for _mm_or_pd anyway, if compiling without AVX where it will actually save a byte.)
If tuning on the level of insn alignment matters, write in asm directly, not intrinsics! (Having the instruction one byte longer might align things better for uop cache line density and/or decoders, but with prefixes and addressing modes you can extend instructions in general)
For integer data, use the integer versions. Saving one instruction byte isn't worth the bypass-delay between paddd or whatever, and integer code often keeps port5 fully occupied with shuffles. For Haswell, many shuffle / insert / extract / pack / unpack instructions became p5 only, instead of p1/p5 for SnB/IvB. (Ice Lake finally added a shuffle unit on another port for some more common shuffles.)
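As an illustration of that cast boilerplate, here's a hedged sketch (the helper name is hypothetical) that flips the sign bits of a double vector with the single-precision logical op:

#include <emmintrin.h>

/* Negate both lanes of a double vector using XORPS on the sign bits.
   The _mm_cast* intrinsics compile to no instructions at all, but they
   do clutter the source. */
static inline __m128d negate_pd_with_ps(__m128d v)
{
    const __m128 signbits = _mm_castpd_ps(_mm_set1_pd(-0.0)); /* 0x8000000000000000 per lane */
    return _mm_castps_pd(_mm_xor_ps(_mm_castpd_ps(v), signbits));
}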
These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any ideas why Intel is wasting precious opcode space for several instructions which do the same thing?
If you look at the history of these instruction sets, you can kind of see how we got here.
por (MMX): 0F EB /r
orps (SSE): 0F 56 /r
orpd (SSE2): 66 0F 56 /r
por (SSE2): 66 0F EB /r
MMX existed before SSE, so it looks like opcodes for SSE (...ps) instructions were chosen out of the same 0F xx space. Then for SSE2, the ...pd version added a 66 operand-size prefix to the ...ps opcode, and the integer version added a 66 prefix to the MMX version.
They could have left out orpd and/or por, but they didn't. Perhaps they thought that future CPU designs might have longer forwarding paths between different domains, and so using the matching instruction for your data would be a bigger deal. Even though there are separate opcodes, AMD and early Intel treated them all the same, as int-vector.
Related / near duplicate:
What is the point of SSE2 instructions such as orpd? also summarizes the history. (But I wrote it 5 years later.)
Difference between the AVX instructions vxorpd and vpxor
Does using mix of pxor and xorps affect performance?
According to Intel and AMD optimization guidelines, mixing op types with data types produces a performance hit as the CPU internally tags 64-bit halves of the register for a particular data type. This seems to mostly affect pipelining as the instruction is decoded and the uops are scheduled. Functionally they produce the same result. The newer versions for the integer data types have larger encodings and take up more space in the code segment. So if code size is a problem, use the old ops as these have smaller encodings.
I think all three are effectively the same, i.e. 128 bit bitwise operations. The reason different forms exist is probably historical, but I'm not certain. I guess it's possible that there may be some additional behaviour in the floating point versions, e.g. when there are NaNs, but this is pure guesswork. For normal inputs the instructions seem to be interchangeable, e.g.
#include <stdio.h>
#include <emmintrin.h>
#include <pmmintrin.h>
#include <xmmintrin.h>

int main(void)
{
    __m128i a = _mm_set1_epi32(1);
    __m128i b = _mm_set1_epi32(2);
    __m128i c = _mm_or_si128(a, b);

    __m128 x = _mm_set1_ps(1.25f);
    __m128 y = _mm_set1_ps(1.5f);
    __m128 z = _mm_or_ps(x, y);

    /* %vld / %vf are Apple GCC vector-printf extensions, not standard C */
    printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
    printf("x = %vf, y = %vf, z = %vf\n", x, y, z);

    /* same bitwise ops with the "wrong" type, via the no-op cast intrinsics */
    c = _mm_castps_si128(_mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
    z = _mm_castsi128_ps(_mm_or_si128(_mm_castps_si128(x), _mm_castps_si128(y)));

    printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
    printf("x = %vf, y = %vf, z = %vf\n", x, y, z);

    return 0;
}
Terminal:
$ gcc -Wall -msse3 por.c -o por
$ ./por
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
