In C code it is common to write
a = b*b;
instead of
a = pow(b, 2.0);
for double variables. I get that since pow is a generic function capable of handling non-integer exponents, one would naively expect the first version to be faster. I wonder, however, whether the compiler (gcc) transforms calls to pow with integer exponents into direct multiplication as part of any of the optional optimizations.
Assuming that this optimization does not take place, what is the largest integer exponent for which it is faster to write out the multiplication manually, as in b*b* ... *b?
I know that I could run performance tests on a given machine to figure out whether I should even care, but I would like to gain some deeper understanding of what is "the right thing" to do.
What you want is -ffinite-math-only -ffast-math and possibly #include <tgmath.h>. This is the same as -Ofast without mandating the -O3 optimizations.
Not only does it help these kinds of optimizations when -ffinite-math-only and -ffast-math are enabled, the type-generic math also helps compensate for when you forget to append the proper suffix to a (non-double) math function.
For example:
#include <tgmath.h>
float pow4(float f){return pow(f,4.0f);}
//compiles to
pow4:
vmulss xmm0, xmm0, xmm0
vmulss xmm0, xmm0, xmm0
ret
For clang this works for powers up to 32, while gcc does this for powers up to at least 2,147,483,647 (that's as far as I checked) unless -Os is enabled (because a jmp to the pow function is technically smaller) - with -Os, it will only do a power of 2.
WARNING: -ffast-math is just a convenience alias for several other optimizations, many of which break all kinds of standards. If you'd rather use only the minimal flags to get this desired behavior, then you can use -fno-math-errno -funsafe-math-optimizations -ffinite-math-only.
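The two vmulss instructions in the pow4 listing above amount to squaring twice. As a rough illustration of the same idea for any non-negative integer exponent, here is a minimal sketch of repeated squaring. powi_sketch is a hypothetical helper name, not a library function, and its result can differ from pow in the last bits because every multiplication rounds.

```c
/* Hypothetical helper: b raised to a non-negative integer power n using
   repeated squaring, so roughly log2(n) squarings are needed. */
static double powi_sketch(double b, unsigned n)
{
    double result = 1.0;
    while (n != 0) {
        if (n & 1u)
            result *= b;   /* fold in the current power of b for a set bit */
        b *= b;            /* b, b^2, b^4, b^8, ... */
        n >>= 1;
    }
    return result;
}
```

For n = 4 only the two squarings contribute anything beyond a multiply by 1.0, which matches the two multiplies in the compiler output.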
In terms of the right thing to do: consider your maintainer, not just performance. I have a hunch you are looking for a general rule. If you are doing a simple and consistent square or cube of a number, I would not use pow for these. pow will most likely make some form of subroutine call rather than performing register operations (which is why Martin pointed out the architecture dependency).
I'm a beginner working on an AVX2 architecture, and I would like to use an intrinsic that provides the same functionality as _mm_min_round_ss in AVX-512. Is there a similar intrinsic?
Rounding-mode override and FP-exception suppression (with per-instruction overrides) are unique to AVX-512. (These are the ..._round_... versions of scalar and 512-bit intrinsics; packed 128-bit and 256-bit vector instructions don't have room to encode the SAE stuff in the EVEX prefix, they need some of those bits to signal the narrower vector length.)
Does the rounding mode ever make a difference for vminps? I think no, since it's a compare, not actually rounding a new result. I guess suppressing exceptions can, in case you're going to check fenv later to see if anything set the denormal or invalid flags or something? The Intrinsics guide only mentions _MM_FROUND_NO_EXC as relevant, not overrides to floor/ceil/trunc rounding.
If you don't need exception suppression, just use the normal scalar or packed ..._min_ps / ss intrinsic, e.g. _mm256_min_ps (8 floats in a __m256 vector) or _mm_min_ss (scalar, just the low element of a __m128 vector, leaving others unmodified).
See What is the instruction that gives branchless FP min and max on x86? for details on exact FP semantics (not symmetric wrt. NaN), and the fact that until quite recently, GCC treated the intrinsic as commutative even though the instruction isn't. (Other compilers, and current GCC, only do that with -ffast-math)
I'm just curious about how the standard sqrt() from math.h on GCC works. I coded my own sqrt() using Newton-Raphson to do it!
Yeah, I know fsqrt. But how does the CPU do it? I can't debug hardware.
Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.
Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.
Maybe also see
The integer division algorithm of Intel's x86 processors - Merom's Radix-2 and Radix-4 dividers were replaced by Penryn's Radix-16. (Core2 65nm vs. 45nm)
https://electronics.stackexchange.com/questions/280673/why-does-hardware-division-take-much-longer-than-multiplication
https://scicomp.stackexchange.com/questions/187/why-is-division-so-much-more-complex-than-other-arithmetic-operations
But however the hardware works, IEEE requires sqrt (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special, other functions like log and sin do not have this requirement, and real library implementations usually aren't that accurate. (And x87 fsin is definitely not that accurate for inputs near Pi/2 where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)
See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD sqrtsd / sqrtss and their wider versions. I collected up the results in Floating point division vs floating point multiplication
For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.
Unlike most instructions, sqrt performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer).
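For comparison with the hardware approach, the Newton-Raphson method mentioned in the question can be sketched in a few lines. This is only an illustration: newton_sqrt is a hypothetical name, the stopping rule is one common choice, and unlike the hardware instruction the result is not guaranteed to be correctly rounded.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: Newton-Raphson iteration x <- (x + a/x) / 2.
   Starting at or above sqrt(a), the iterates decrease until rounding
   stops them, which is used as the termination test. */
static double newton_sqrt(double a)
{
    assert(a >= 0.0 && a <= DBL_MAX);   /* finite, non-negative input only */
    if (a == 0.0)
        return 0.0;
    double x = (a > 1.0) ? a : 1.0;     /* initial guess >= sqrt(a) */
    for (;;) {
        double next = 0.5 * (x + a / x);
        if (next >= x)                  /* no further progress */
            break;
        x = next;
    }
    return x;
}
```

Each iteration needs a division, so a software loop like this is mostly useful for understanding, not for beating sqrtsd.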
sqrt is defined by C, so most likely you have to look in glibc.
You did not specify which architecture you are asking for, so I think it's safe to assume x86-64. If that's the case, they are defined in:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrt.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrtf.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrtl.c
tl;dr they are simply implemented by calling the x86-64 square root instructions sqrts{sd}:
https://www.felixcloutier.com/x86/sqrtss
https://www.felixcloutier.com/x86/sqrtsd
Furthermore, and just for the sake of discussion, if you enable fast-math (something you probably should not do if you care about result precision), you will see that most compilers will actually inline the call and directly emit the sqrts{sd} instructions:
https://godbolt.org/z/Wb4unC
To quote (thanks to the author for developing and sharing the algorithm!):
https://tavianator.com/fast-branchless-raybounding-box-intersections/
Since modern floating-point instruction sets can compute min and max without branches
Corresponding code by the author is just
static inline double
dmnsn_min(double a, double b)
{
  return a < b ? a : b;
}
I'm familiar with e.g. _mm_max_ps, but that's a vector instruction. The code above obviously is meant to be used in a scalar form.
Question:
What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
Is it safe to assume it's going to be applied, or how do I call it?
Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?
Just in case: I'm familiar with Use of min and max functions in C++, believe it's related but not quite my question.
Warning: Beware of compilers treating _mm_min_ps / _mm_max_ps (and _pd) intrinsics as commutative even in strict FP (not fast-math) mode; even though the asm instruction isn't. GCC specifically seems to have this bug: PR72867 which was fixed in GCC7 but may be back or never fixed for _mm_min_ss etc. scalar intrinsics (_mm_max_ss has different behavior between clang and gcc, GCC bugzilla PR99497).
GCC knows how the asm instructions themselves work, and doesn't have this problem when using them to implement strict FP semantics in plain scalar code, only with the C/C++ intrinsics.
Unfortunately there isn't a single instruction that implements fmin(a,b) (with guaranteed NaN propagation), so you have to choose between easy detection of problems vs. higher performance.
Most vector FP instructions have scalar equivalents. MINSS / MAXSS / MINSD / MAXSD are what you want. They handle +/-Infinity the way you'd expect.
MINSS a,b exactly implements (a<b) ? a : b according to IEEE rules, with everything that implies about signed-zero, NaN, and Infinities. (i.e. it keeps the source operand, b, on unordered.) This means C++ compilers can use them for std::min(b,a) and std::max(b,a), because those functions are based on the same expression. Note the b,a operand order for the std:: functions, opposite Intel-syntax for x86 asm, but matching AT&T syntax.
MAXSS a,b exactly implements (b<a) ? a : b, again keeping the source operand (b) on unordered. Like std::max(b,a).
Looping over an array with x = std::min(arr[i], x); (i.e. minss or maxss xmm0, [rsi]) will take a NaN from memory if one is present, and then take whatever non-NaN element is next because that compare will be unordered. So you'll get the min or max of the elements following the last NaN. You normally don't want this, so it's only good for arrays that don't contain NaN. But it means you can start with float v = NAN; outside a loop, instead of the first element or FLT_MAX or +Infinity, and might simplify handling possibly-empty lists. It's also convenient in asm, allowing init with pcmpeqd xmm0,xmm0 to generate an all-ones bit-pattern (a negative QNAN), but unfortunately GCC's NAN uses a different bit-pattern.
Demo/proof on the Godbolt compiler explorer, including showing that v = std::min(v, arr[i]); (or max) ignores NaNs in the array, at the cost of having to load into a register and then minss into that register.
(Note that min of an array should use vectors, not scalar; preferably with multiple accumulators to hide FP latency. At the end, reduce to one vector then do horizontal min of it, just like summing an array or doing a dot product.)
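The effect of operand order in the array loop can be reproduced in plain C. The sketch below spells out the std::min expression by hand; std_min_like, min_after_last_nan and min_ignoring_nan are hypothetical names used only for this illustration, and the code assumes strict FP semantics (no -ffast-math).

```c
#include <stddef.h>
#include <math.h>   /* NAN, INFINITY */

/* Same expression as std::min(a, b): returns a when the compare is unordered. */
static inline float std_min_like(float a, float b) { return (b < a) ? b : a; }

/* x = std::min(arr[i], x): a NaN element replaces x, and the next element
   replaces the NaN, so the result covers only elements after the last NaN. */
static float min_after_last_nan(const float *arr, size_t n)
{
    float x = NAN;                      /* NaN works as the initial value */
    for (size_t i = 0; i < n; i++)
        x = std_min_like(arr[i], x);
    return x;
}

/* v = std::min(v, arr[i]): NaN elements never win the compare and are skipped. */
static float min_ignoring_nan(const float *arr, size_t n)
{
    float v = INFINITY;                 /* a NaN start value would get stuck */
    for (size_t i = 0; i < n; i++)
        v = std_min_like(v, arr[i]);
    return v;
}
```

With the array {1, 5, NaN, 7, 3}, the first loop only sees the elements after the NaN, while the second one considers all the non-NaN elements.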
Don't try to use _mm_min_ss on scalar floats; the intrinsic is only available with __m128 operands, and Intel's intrinsics don't provide any way to get a scalar float into the low element of a __m128 without zeroing the high elements or somehow doing extra work. Most compilers will actually emit the useless instructions to do that even if the final result doesn't depend on anything in the upper elements. (Clang can often avoid it, though, applying the as-if rule to the contents of dead vector elements.) There's nothing like __m256 _mm256_castps128_ps256 (__m128 a) to just cast a float to a __m128 with garbage in the upper elements. I consider this a design flaw. :/
But fortunately you don't need to do this manually, compilers know how to use SSE/SSE2 min/max for you. Just write your C such that they can. The function in your question is ideal: as shown below (Godbolt link):
// can and does inline to a single MINSD instruction, and can auto-vectorize easily
static inline double
dmnsn_min(double a, double b) {
return a < b ? a : b;
}
Note their asymmetric behaviour with NaN: if the operands are unordered, dest=src (i.e. it takes the second operand if either operand is NaN). This can be useful for SIMD conditional updates, see below.
(a and b are unordered if either of them is NaN. That means a<b, a==b, and a>b are all false. See Bruce Dawson's series of articles on floating point for lots of FP gotchas.)
The corresponding _mm_min_ss / _mm_min_ps intrinsics may or may not have this behaviour, depending on the compiler.
I think the intrinsics are supposed to have the same operand-order semantics as the asm instructions, but gcc has treated the operands to _mm_min_ps as commutative even without -ffast-math for a long time, gcc4.4 or maybe earlier. GCC 7 finally changed it to match ICC and clang.
Intel's online intrinsics finder doesn't document that behaviour for the function, but it's maybe not supposed to be exhaustive. The asm insn ref manual doesn't say the intrinsic doesn't have that property; it just lists _mm_min_ss as the intrinsic for MINSS.
When I googled on "_mm_min_ps" NaN, I found this real code and some other discussion of using the intrinsic to handle NaNs, so clearly many people expect the intrinsic to behave like the asm instruction. (This came up for some code I was writing yesterday, and I was already thinking of writing this up as a self-answered Q&A.)
Given the existence of this longstanding gcc bug, portable code that wants to take advantage of MINPS's NaN handling needs to take precautions. The standard gcc version on many existing Linux distros will mis-compile your code if it depends on the order of operands to _mm_min_ps. So you probably need an #ifdef to detect actual gcc (not clang etc), and an alternative. Or just do it differently in the first place :/ Perhaps with a _mm_cmplt_ps and boolean AND/ANDNOT/OR.
Enabling -ffast-math also makes _mm_min_ps commutative on all compilers.
As usual, compilers know how to use the instruction set to implement C semantics correctly. MINSS and MAXSS are faster than anything you could do with a branch anyway, so just write code that can compile to one of those.
The commutative-_mm_min_ps issue applies to only the intrinsic: gcc knows exactly how MINSS/MINPS work, and uses them to correctly implement strict FP semantics (when you don't use -ffast-math).
You don't usually need to do anything special to get decent scalar code out of a compiler. But if you are going to spend time caring about what instructions the compiler uses, you should probably start by manually vectorizing your code if the compiler isn't doing that.
(There may be rare cases where a branch is best, if the condition almost always goes one way and latency is more important than throughput. MINPS latency is ~3 cycles, but a perfectly predicted branch adds 0 cycles to the dependency chain of the critical path.)
In C++, use std::min and std::max, which are defined in terms of > or <, and don't have the same requirements on NaN behaviour that fmin and fmax do. Avoid fmin and fmax for performance unless you need their NaN behaviour.
In C, I think just write your own min and max functions (or macros if you do it safely).
C & asm on the Godbolt compiler explorer
float minfloat(float a, float b) {
return (a<b) ? a : b;
}
# any decent compiler (gcc, clang, icc), without any -ffast-math or anything:
minss xmm0, xmm1
ret
// C++
float minfloat_std(float a, float b) { return std::min(a,b); }
# This implementation of std::min uses (b<a) ? b : a;
# So it can produce the result only in the register that b was in
# This isn't worse (when inlined), just opposite
minss xmm1, xmm0
movaps xmm0, xmm1
ret
float minfloat_fmin(float a, float b) { return fminf(a, b); }
# clang inlines fmin; other compilers just tailcall it.
minfloat_fmin(float, float):
movaps xmm2, xmm0
cmpunordss xmm2, xmm2
movaps xmm3, xmm2
andps xmm3, xmm1
minss xmm1, xmm0
andnps xmm2, xmm1
orps xmm2, xmm3
movaps xmm0, xmm2
ret
# Obviously you don't want this if you don't need it.
If you want to use _mm_min_ss / _mm_min_ps yourself, write code that lets the compiler make good asm even without -ffast-math.
If you don't expect NaNs, or want to handle them specially, write stuff like
lowest = _mm_min_ps(lowest, some_loop_variable);
so the register holding lowest can be updated in-place (even without AVX).
Taking advantage of MINPS's NaN behaviour:
Say your scalar code is something like
if(some condition)
lowest = min(lowest, x);
Assume the condition can be vectorized with CMPPS, so you have a vector of elements with the bits all set or all clear. (Or maybe you can get away with ANDPS/ORPS/XORPS on floats directly, if you just care about their sign and don't care about negative zero. This creates a truth value in the sign bit, with garbage elsewhere. BLENDVPS looks at only the sign bit, so this can be super useful. Or you can broadcast the sign bit with PSRAD xmm, 31.)
The straight-forward way to implement this would be to blend x with +Inf based on the condition mask. Or do newval = min(lowest, x); and blend newval into lowest. (either BLENDVPS or AND/ANDNOT/OR).
But the trick is that all-one-bits is a NaN, and a bitwise OR will propagate it. So:
__m128 inverse_condition = _mm_cmplt_ps(foo, bar);
__m128 x = whatever;
x = _mm_or_ps(x, inverse_condition); // turn elements into NaN where the mask is all-ones
lowest = _mm_min_ps(x, lowest); // NaN elements in x mean no change in lowest
// REQUIRES NON-COMMUTATIVE _mm_min_ps: no -ffast-math
// AND DOESN'T WORK AT ALL WITH MOST GCC VERSIONS.
So with only SSE2, we've done a conditional MINPS in two extra instructions (ORPS and MOVAPS, unless loop unrolling allows the MOVAPS to disappear).
The alternative without SSE4.1 BLENDVPS is ANDPS/ANDNPS/ORPS to blend, plus an extra MOVAPS. ORPS is more efficient than BLENDVPS anyway (BLENDVPS is 2 uops on most CPUs).
Peter Cordes's answer is great, I just figured I'd jump in with some shorter point-by-point answers:
What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
I was referring to minss/minsd. And even other architectures without such instructions should be able to do this branchlessly with conditional moves.
Is it safe to assume it's going to be applied, or how do I call it?
gcc and clang will both optimize (a < b) ? a : b to minss/minsd, so I don't bother using intrinsics. Can't speak to other compilers though.
Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
The individual a < b tests are pretty much completely unpredictable, so it is very important to avoid branching for those. Tests like if (ray.dir.x != 0.0) are very predictable, so avoiding those branches is less important, but it does shrink the code size and make it easier to vectorize. The most important part is probably removing the divisions though.
Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?
Yes, minss/minsd behave exactly like (a < b) ? a : b, including their treatment of infinities and NaNs.
Also, I wrote a followup post to the one you referenced that talks about NaNs and min/max in more detail.
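To make the points above concrete, here is a minimal sketch of a slab-style ray/box test built only from (a < b) ? a : b style min/max and precomputed inverse directions. It is written for this answer as an illustration, not copied from the post; the function and parameter names are hypothetical, and it deliberately ignores the NaN case (an origin exactly on a slab plane with a zero direction component gives 0 * Infinity).

```c
#include <math.h>   /* INFINITY */

static inline double min_d(double a, double b) { return a < b ? a : b; }
static inline double max_d(double a, double b) { return a > b ? a : b; }

/* Hypothetical slab test. inv_dir[i] is the precomputed 1.0 / dir[i], which is
   +/-Infinity for a direction component of zero. Returns nonzero on a hit. */
static int ray_hits_box(const double origin[3], const double inv_dir[3],
                        const double lo[3], const double hi[3])
{
    double tmin = 0.0;          /* the ray starts at its origin */
    double tmax = INFINITY;
    for (int i = 0; i < 3; i++) {
        double t1 = (lo[i] - origin[i]) * inv_dir[i];
        double t2 = (hi[i] - origin[i]) * inv_dir[i];
        tmin = max_d(tmin, min_d(t1, t2));   /* latest entry so far */
        tmax = min_d(tmax, max_d(t1, t2));   /* earliest exit so far */
    }
    return tmin <= tmax;
}
```

There are no divisions and no data-dependent branches in the loop body, and an axis with a zero direction component just contributes -Infinity and +Infinity (or two infinities of the same sign, which forces a miss).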
Question today is fairly short. Consider the following toy C program shuffle.c for reversing two packed double in register xmm0:
#include <stdio.h>
int main(void) {
double x[2] = {0.0, 1.0};
asm volatile (
"movupd (%[x]), %%xmm0\n\t"
"shufpd $1, %%xmm0, %%xmm0\n\t" /* method 1 */
//"pshufd $78, %%xmm0, %%xmm0\n\t" /* method 2 */
"movupd %%xmm0, (%[x])\n\t"
:
: [x] "r" (x)
: "xmm0", "memory");
printf("x[0] = %.2f, x[1] = %.2f\n", x[0], x[1]);
}
After a test run with gcc -msse3 -o shuffle shuffle.c && ./shuffle, both methods/instructions return the correct result x[0] = 1.00, x[1] = 0.00. This page says that shufpd has a latency of 6 cycles, while the Intel intrinsics guide says that pshufd has a latency of only 1 cycle. That sounds like a strong reason to prefer pshufd. However, this instruction is really meant for packed integers. When using it for packed doubles, will there be any penalty associated with the "wrong type"?
As a similar question, I also heard that the instruction movaps is 1 byte smaller than movapd, and that they do the same thing by reading 128 bits from a 16-byte-aligned address. So can we always use the former for move (between XMMs) / load (from memory) / store (to memory)? This seems crazy. I think there must be some reason to reject this. Can someone give me an explanation? Thank you.
You'll always get correct results, but it can matter for performance.
Prefer FP shuffles for FP data that will be an input to FP math instructions (like addps or vfma..., as opposed to insns like xorps).
This avoids any extra bypass-delay latency on some microarchitectures, including potentially current Intel chips. See Agner Fog's microarchitecture guide. AMD Bulldozer-family does all shuffles in the vector-integer domain, so there's a bypass delay whichever shuffle you use.
If it saves instructions, it can be worth it to use an integer shuffle anyway. (But usually it's the other way around, where you want to use shufps to combine data from two integer vectors. That's fine in even more cases, and mostly a problem only on Nehalem, IIRC.)
http://x86.renejeschke.de/html/file_module_x86_id_293.html lists the latency for CPUID 0F3n/0F2n CPUs, i.e. Pentium4 (family 0xF model 2 (Northwood) / model 3 (Prescott)). Those numbers are obviously totally irrelevant, and don't even match Agner Fog's P4 table for shufpd.
Intel's intrinsics guide sometimes has numbers that don't match experimental testing, either. See Agner Fog's instruction tables for good latency/throughput numbers, and microarch guides to understand the details.
movaps vs. movapd: No existing microarchitectures care which you use. It would be possible for someone in the future to design an x86 CPU that kept double vectors separate from float vectors internally, but for now the only distinction has been int vs. FP.
Always prefer the ps instruction when the behaviour is identical (xorps over xorpd, movhps over movhpd).
Some compilers (maybe both gcc and clang, I forget) will compile a _mm_store_si128 integer vector store to movaps, because there's no performance downside on any existing hardware, and it's one byte shorter.
IIRC, there's also no perf downside to loading integer vector data with movaps / movups, but I'm less sure about that.
There is a perf downside to using the wrong mov instruction for a reg-reg move, though. movdqa xmm1, xmm2 between two FP instructions is bad on Nehalem.
re: your inline asm:
It doesn't need to be volatile, and you could drop the "memory" clobber if you used a 16 byte struct or something as a "+m" input/output operand. Or a "+x" vector-register operand for an __m128d variable.
You'll probably get better results from intrinsics than from inline asm, unless you write whole loops in inline asm or stand-alone functions.
See the x86 tag wiki for a link to my inline asm guide.
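As a sketch of the intrinsics route for the swap in the question, something like the following should let the compiler pick registers and loads/stores itself. swap_pair is a hypothetical helper name, and the code assumes an x86 target with SSE2.

```c
#include <emmintrin.h>   /* SSE2 intrinsics; assumes an x86 target with SSE2 */

/* Hypothetical helper: swap the two doubles in x[0..1] using the FP shuffle. */
static inline void swap_pair(double x[2])
{
    __m128d v = _mm_loadu_pd(x);      /* unaligned 16-byte load */
    v = _mm_shuffle_pd(v, v, 1);      /* shufpd $1: low <- old high, high <- old low */
    _mm_storeu_pd(x, v);
}
```

Because _mm_shuffle_pd is the FP shuffle, this stays in the FP domain when the result feeds FP math.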
I have an application that was developed for Linux x86 32 bits. There are lots of floating-point operations and a lot of tests depending on the results. Now we are porting it to x86_64, but the test results are different in this architecture. We don't want to keep a separate set of results for each architecture.
According to the article An Introduction to GCC - for the GNU compilers gcc and g++, the problem is that GCC on x86_64 assumes fpmath=sse while on x86 it assumes fpmath=387. The 387 FPU uses 80-bit internal precision for all operations and only converts the result to a given floating-point type (float, double or long double), while SSE uses the type of the operands to determine its internal precision.
I can force -mfpmath=387 when compiling my own code and all my operations work correctly, but whenever I call some library function (sin, cos, atan2, etc.) the results are wrong again. I assume it's because libm was compiled without the fpmath override.
I tried to build libm myself (glibc) using 387 emulation, but it caused a lot of crashes all around (don't know if I did something wrong).
Is there a way to force all code in a process to use the 387 emulation in x86_64? Or maybe some library that returns the same values as libm does on both architectures? Any suggestions?
Regarding the question of "Do you need the 80 bit precision", I have to say that this is not a problem for an individual operation. In this simple case the difference is really small and makes no difference. When compounding a lot of operations, though, the error propagates and the difference in the final result is not so small any more and makes a difference. So I guess I need the 80 bit precision.
I'd say you need to fix your tests. You're generally setting yourself up for disappointment if you assume floating point math to be accurate. Instead of testing for exact equality, test whether it's close enough to the expected result. What you've found isn't a bug, after all, so if your tests report errors, the tests are wrong. ;)
As you've found out, every library you rely on is going to assume SSE floating point, so unless you plan to compile everything manually, now and forever, just so you can set the FP mode to x87, you're better off dealing with the problem now, and just accepting that FP math is not 100% accurate, and will not in general yield the same result on two different platforms. (I believe AMD CPUs yield slightly different results in x87 math as well.)
Do you absolutely need 80-bit precision? (If so, there obviously aren't many alternatives, other than to compile everything yourself to use 80-bit FP.)
Otherwise, adjust your tests to perform comparisons and equality tests within some small epsilon. If the difference is smaller than that epsilon, the values are considered equal.
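A minimal sketch of such a comparison, combining an absolute and a relative tolerance. nearly_equal and its parameters are hypothetical names, and suitable tolerance values depend entirely on your computations.

```c
static inline double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper: equal within abs_tol, or within rel_tol relative to
   the larger magnitude. Any NaN input compares as not equal. */
static int nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    if (a == b)                       /* also covers equal infinities */
        return 1;
    double diff = abs_d(a - b);
    if (diff <= abs_tol)
        return 1;
    double largest = abs_d(a) > abs_d(b) ? abs_d(a) : abs_d(b);
    return diff <= rel_tol * largest;
}
```

A test would then check nearly_equal(result, expected, 1e-12, 0.0) instead of result == expected, with the tolerance chosen to cover the x87 vs. SSE differences you actually observe.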
80 bit precision is actually dangerous. The problem is that it is actually preserved as long as the variable is stored in the CPU register. Whenever it is forced out to RAM, it is truncated to the type precision. So you can have a variable actually change its value even though nothing happened to it in the code.
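Whether a given build may evaluate float or double expressions in a wider format can be checked at compile time through the standard C99 macro FLT_EVAL_METHOD from <float.h>. A small sketch, with may_use_excess_precision as a hypothetical helper name:

```c
#include <float.h>   /* FLT_EVAL_METHOD (C99) */

/* Nonzero when float/double expressions may be evaluated in a wider format
   than their type (FLT_EVAL_METHOD is 0 when they are not). */
static inline int may_use_excess_precision(void)
{
    return FLT_EVAL_METHOD != 0;
}
```

The standard defines the value 2 as "evaluate everything as long double", which is what I'd expect to see for x87 code, and 0 for SSE code.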
If you want long double precision, use long double for all of your floating point variables, rather than expecting float or double to have extra magic precision. This is really a no-brainer.
SSE floating point and 387 floating point use entirely different instructions, so there's no way to convince SSE FP instructions to use the 387. Probably the best way to deal with this is to resign your test suite to getting slightly different results, and not depend on results being the same to the last bit.