Since the gcc option -ffast-math effectively disables NaN and +/-inf, I'm looking for the next best option for representing NaN in my performance-critical math code. Ideally the sentinel value, if operated on (add, mul, div, sub, etc.), would yield the sentinel value, as NaN does, but I doubt this is possible, since I think NaN is the only value that accomplishes this. -0.0 might not be a good fit, as it's also disabled by -ffast-math and could prevent certain optimizations like (x+0.0), etc.
Perhaps my question should rather be, is there any way to use NaN or some other "special double" while being able to enable a lot of the math optimizations without breaking down?
System is Linux/x64, gcc 4.8.1.
If you are looking for a value which would be propagated by arithmetic operations, NaN is still available with the option -ffast-math. The problem lies somewhere else: with -ffast-math some operations can be removed from the computation due to optimization, and then there is no way to guarantee that NaN or any other value would be propagated.
For example, the following, with -ffast-math set, will simply write 0.0 into n at compile time, and there is no special value for n which would protect against it.
float n = NAN;
n *= 0.0;
One thing you can do is to use -fno-finite-math-only -ftrapping-math with -ffast-math, as Shafik Yaghmour said. The other: if there are only a few places where you expect a bad value, you can check for it yourself by putting tests exactly at those points.
The last option I can think of, if you really badly need the optimization, is to manually inject NaN (and maybe inf) values into the computation and check how far they propagate. Then, in those places where the propagation stops, test for NaN (inf) occurrence. This is an unsafe method, as I am not one hundred percent sure whether -ffast-math can involve conditional flow of operations; if it can, there is a significant chance this solution will be invalid. So it is risky, and if chosen it needs very heavy testing covering all branches of the computation.
Normally I would rather be against the last solution, but actually there is a chance that NaN (inf) values will be propagated through the whole computation, or almost all of it, so it may give the performance you are seeking. You may want to take the risk.
With -ffast-math, you can check for NaN, as Shafik Yaghmour said, with
#include <stdint.h>

inline int isnan(float f)
{
    union { float f; uint32_t x; } u = { f };
    return (u.x << 1) > 0xff000000u;
}
and for double with
inline int isnan(double d)
{
    union { double d; uint64_t x; } u = { d };
    return (u.x << 1) > 0xffe0000000000000ull;
}
Checking for inf would be
inline int isinf(float f)
{
    union { float f; uint32_t x; } u = { f };
    return (u.x << 1) == 0xff000000u;
}

inline int isinf(double d)
{
    union { double d; uint64_t x; } u = { d };
    return (u.x << 1) == 0xffe0000000000000ull;
}
You can also merge isnan and isinf.
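For instance, merging them amounts to a single exponent-field test (a sketch for double; the function name is mine, and >= catches both NaN and infinity where == alone would be isinf):

#include <stdint.h>

inline int isnan_or_inf(double d)
{
    union { double d; uint64_t x; } u = { d };
    return (u.x << 1) >= 0xffe0000000000000ull; /* exponent bits all ones */
}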
Related
Problem
While running automated CI tests, I found some code that breaks if gcc's optimization is set to -O2. The code should increment a counter if a double value crosses a threshold in either direction.
Workaround
Going down to -O1 or using the -ffloat-store option works around the problem.
Example
Here is a small example which shows the same problem.
The update() function should return true whenever the sequence of values *pNextState * 1e-6 crosses a threshold of 0.03.
I used call by reference because the values are part of a large struct in the full code.
The idea behind using < and >= is that if the sequence hits the threshold value exactly, the function should return 1 in that cycle and 0 in the next.
test.h:
extern int update(double * pState, double * pNextState);
test.c:
#include "test.h"
int update(double * pState, double * pNextState_scaled) {
    static double threshold = 0.03;
    double oldState = *pState;
    *pState = *pNextState_scaled * 1e-6;
    return oldState < threshold && *pState >= threshold;
}
main.c:
#include <stdio.h>
#include <stdlib.h>
#include "test.h"
int main(void) {
    double state = 0.01;
    double nextState1 = 20000.0;
    double nextState2 = 30000.0;
    double nextState3 = 40000.0;
    printf("%d\n", update(&state, &nextState1));
    printf("%d\n", update(&state, &nextState2));
    printf("%d\n", update(&state, &nextState3));
    return EXIT_SUCCESS;
}
Using gcc with at least -O2 the output is:
0
0
0
Using gcc with -O1, -O0 or -ffloat-store produces the desired output:
0
1
0
As I understand the problem from debugging, it arises when the compiler optimizes out the local variable oldState on the stack and instead compares against an intermediate result held in a floating-point register with higher precision (80 bit), while the stored value *pState is a tiny bit smaller than the threshold.
If the value used for the comparison were stored in 64-bit precision, the logic couldn't miss the threshold crossing. Because of the multiplication by 1e-6, the result is probably kept in a floating-point register.
Would you consider this a gcc bug?
clang does not show the problem.
I am using gcc version 9.2.0 on an Intel Core i5, Windows and msys2.
Update
It is clear to me that floating-point comparison is not exact, and I would consider the following result as valid:
0
0
1
The idea was that if (*pState >= threshold) == false in one cycle, then comparing the same value (oldState = *pState) against the same threshold in a subsequent call has to make (oldState < threshold) true.
[Disclaimer: This is a generic, shoot-from-the-hip answer. Floating-point issues can be subtle, and I have not analyzed this one carefully. Once in a while, suspicious-looking code like this can be made to work portably and reliably after all, and per the accepted answer, that appears to be the case here. Nevertheless, the general-purpose answer stands in the, well, general case.]
I would consider this a bug in the test case, not in gcc. This sounds like a classic example of code that's unnecessarily brittle with respect to exact floating-point equality.
I would recommend either:
rewriting the test case, or perhaps
removing the test case.
I would not recommend:
working around it by switching compilers
working around it by using different optimization levels or other compiler options
submitting compiler bug reports [although in this case there does appear to be a compiler bug, it needn't be submitted, as it has been submitted already]
I've analysed your code and my take is that it is sound according to the standard, but you are being hit by gcc bug 323, about which you may find more accessible information in the GCC FAQ.
A way to modify your function and render it robust in the presence of the gcc bug is to store the fact that the previous state was below the threshold, instead of (or in addition to) storing that state. Something like this:
int update(int* pWasBelow, double* pNextState_scaled) {
    static double const threshold = 0.03;
    double const nextState = *pNextState_scaled * 1e-6;
    int const wasBelow = *pWasBelow;
    *pWasBelow = nextState < threshold;
    return wasBelow && !*pWasBelow;
}
Note that this does not guarantee reproducibility. You may get 0 1 0 in one set-up and 0 0 1 in another, but you'll detect the transition sooner or later.
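For illustration, the example main from the question would drive this version like so (a sketch; the initial flag encodes that the starting state 0.01 is below the threshold):

#include <stdio.h>

int update(int *pWasBelow, double *pNextState_scaled); /* as defined above */

int main(void) {
    int wasBelow = 1; /* 0.01 < 0.03 */
    double nextState1 = 20000.0, nextState2 = 30000.0, nextState3 = 40000.0;
    printf("%d\n", update(&wasBelow, &nextState1));
    printf("%d\n", update(&wasBelow, &nextState2));
    printf("%d\n", update(&wasBelow, &nextState3));
    return 0;
}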
I am putting this as an answer because I don't think I can do real code in a comment, but @SteveSummit should get the credit; I would likely not have found this without their comment above.
The general advice is: don't do exact comparisons with floating-point values, and that seems to be what this is doing. If a computed value is almost exactly 0.03 but, due to internal representations or optimizations, is ever so slightly off, then it's going to look like a threshold crossing.
So one can resolve this by adding an epsilon for how close a value can be to the threshold without being considered to have crossed it yet.
#include <math.h>

int update(double * pState, double * pNextState_scaled) {
    static const double threshold = 0.03;
    static const double close_enough = 0.0000001; /* or whatever */
    double oldState = *pState;
    *pState = *pNextState_scaled * 1e-6;
    /* if either value is too close to the threshold, it's not a crossing */
    if (fabs(oldState - threshold) < close_enough) return 0;
    if (fabs(*pState - threshold) < close_enough) return 0;
    return oldState < threshold && *pState >= threshold;
}
I imagine you'd have to know your application in order to know how to tune this value appropriately, but a couple of orders of magnitude smaller than the value you're comparing with seems in the neighborhood.
Intervals of floating-point bounds can be used to over-approximate sets of reals, as long as the upper bound of any result interval is computed rounding upwards and the lower bound rounding downwards.
One recommended trick is to actually compute the negation of the lower bound. This allows keeping the FPU in round-upwards mode at all times (see, for instance, “Handbook of Floating-Point Arithmetic”, 2.9.2).
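As a minimal sketch of the trick for addition (assuming the FPU is already set to round upwards; the struct and names are mine):

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* An interval [lo, hi] is stored as (neg_lo, hi) with neg_lo == -lo, so
   both bounds can be computed while rounding upwards: RU(-a - b) == -RD(a + b). */
typedef struct { double neg_lo, hi; } interval;

interval interval_add(interval a, interval b)
{
    interval r;
    r.neg_lo = a.neg_lo + b.neg_lo; /* upward rounding yields the downward-rounded lower bound, negated */
    r.hi     = a.hi + b.hi;         /* upper bound, rounded upwards */
    return r;
}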
This works well for addition and for multiplication. The square root operation, on the other hand, is not symmetrical in the ways addition and multiplication are.
It occurs to me that, in order to compute sqrt_RD for the lower bound, the following idiom, despite its complications, might be faster on an ordinary platform with IEEE 754 double precision and FLT_EVAL_METHOD defined to 0 than changing the rounding mode twice:
#include <fenv.h>
#include <math.h>

#pragma STDC FENV_ACCESS ON
…

/* assumes round-upwards */
double sqrt_rd(double l) {
    feclearexcept(FE_INEXACT);
    double candidate = sqrt(l);
    if (fetestexcept(FE_INEXACT))
        return nextafter(candidate, 0);
    return candidate;
}
I am wondering whether this is better, and whether it is the fastest possible. As one alternative, but still not necessarily the fastest, it seems to me that FMA_RU(candidate, candidate, -l) might not always be exact (because of the directed rounding) but might be accurate enough around 0 for the following to work:
/* assumes round-upwards */
double sqrt_rd(double l) {
    double candidate = sqrt(l);
    if (fma(candidate, candidate, -l) != 0.0)
        return nextafter(candidate, 0);
    return candidate;
}
What other inexpensive ways are there to detect that sqrt was inexact?
What combination of floating-point operations leads to the fastest computation of sqrt_rd on a modern FPU set to round upwards?
I think you should be able to use:
/* assumes round-upwards */
double sqrt_rd(double l) {
    double u = sqrt(l);
    double w = u*u;
    if (w != l)
        return nextafter(u, 0);
    return u;
}
The justification here being that if u is inexact, then it will be strictly greater than √l, which in turn implies that w ≥ u² > l (since w is also calculated in RU mode). And if u is exact, then so is w (since we know u² = l must be representable as a double).
fma calculates the result with infinite precision, then applies the rounding mode.
If your candidate is too large, then the infinitely precise result is greater than 0, and since you are rounding up, it will be rounded up, even if it is only a tiny little bit larger than zero. To verify this, first try l = 1 + 2·eps, where 1 + eps = sqrt(1 + 2·eps + eps²) is just a tiny bit too large; then scale l down by a negative power of 4 so that the eps² term is way beyond the resolution of denormalised numbers, and check that as well.
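To make that concrete, here is a sketch of the suggested verification (run with the FPU rounding upwards; link with -lm, and ideally compile with -frounding-math):

#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    fesetround(FE_UPWARD);
    double l = 1.0 + 2.0 * DBL_EPSILON;   /* sqrt(l) rounds up to 1 + eps, a tiny bit too large */
    double candidate = sqrt(l);
    printf("%a\n", fma(candidate, candidate, -l)); /* nonzero residual: inexact sqrt */
    l *= 0x1p-960;                        /* scale by a negative power of 4 (exact) */
    candidate = sqrt(l);
    printf("%a\n", fma(candidate, candidate, -l)); /* should still be nonzero */
    return 0;
}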
How can I check if a float variable contains an integer value? So far, I've been using:
float f = 4.5886;
if (f - (int)f == 0)
    printf("yes\n");
else
    printf("no\n");
But I wonder if there is a better solution, or if this one has any (or many) drawbacks.
Apart from the fine answers already given, you can also use ceilf(f) == f or floorf(f) == f. Both expressions return true if f is an integer. They also return false for NaNs (NaNs always compare unequal) and true for ±infinity, and don't have the problem of overflowing the integer type used to hold the truncated result, because floorf()/ceilf() return floats.
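As a tiny sketch (the function name is mine):

#include <math.h>

/* true if f holds a whole number; false for NaN, true for +/- infinity */
int is_integral(float f)
{
    return floorf(f) == f;
}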
Keep in mind that most of the techniques here are valid presuming that round-off error due to prior calculations is not a factor. E.g. you could use roundf, like this:
float z = 1.0f;

if (roundf(z) == z) {
    printf("integer\n");
} else {
    printf("fraction\n");
}
The problem with this and other similar techniques (such as ceilf, casting to long, etc.) is that, while they work great for whole number constants, they will fail if the number is a result of a calculation that was subject to floating-point round-off error. For example:
float z = powf(powf(3.0f, 0.05f), 20.0f);

if (roundf(z) == z) {
    printf("integer\n");
} else {
    printf("fraction\n");
}
Prints "fraction", even though (3^(1/20))^20 should equal 3, because the actual calculation result ended up being 2.9999992847442626953125.
Any similar method, be it fmodf or whatever, is subject to this. In applications that perform complex or rounding-prone calculations, what you usually want to do is define some "tolerance" value for what constitutes a "whole number" (this goes for floating-point equality comparisons in general). We often call this tolerance epsilon. For example, let's say we'll forgive the computer for up to +/- 0.00001 of rounding error. Then, if we are testing z, we can choose an epsilon of 0.00001 and do:
if (fabsf(roundf(z) - z) <= 0.00001f) {
    printf("integer\n");
} else {
    printf("fraction\n");
}
You don't really want to use ceilf here because e.g. ceilf(1.0000001) is 2 not 1, and ceilf(-1.99999999) is -1 not -2.
You could use rintf in place of roundf if you prefer.
Choose a tolerance value that is appropriate for your application (and yes, sometimes zero tolerance is appropriate). For more information, check out this article on comparing floating-point numbers.
if (fmod(f, 1) == 0.0) {
    ...
}
Don't forget math.h (and linking with libm).
The standard library function modff(float x, float *ipart) splits the value into integral and fractional parts; check whether the return value (the fractional part) == 0.
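A sketch of that approach (modff is the float variant of modf; the function name is mine):

#include <math.h>

int is_integer_modf(float f)
{
    float ipart;
    return modff(f, &ipart) == 0.0f; /* the return value is the fractional part */
}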
if (f <= LONG_MIN || f >= LONG_MAX || f == (long)f) /* it's an integer */
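Spelled out as a function with the reasoning (a sketch; the function name is mine):

#include <limits.h>

int is_integer(float f)
{
    /* Floats at or beyond the long range are so widely spaced (spacing >= 1)
       that they are necessarily whole numbers. */
    if (f <= LONG_MIN || f >= LONG_MAX)
        return 1;
    /* Otherwise the cast to long is well-defined, and an exact
       round-trip means there is no fractional part. */
    return f == (long)f;
}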
This deals with computational round-off. You set the epsilon as desired:
bool IsInteger(float value)
{
    return fabs(ceilf(value) - value) < EPSILON;
}
I'm not 100% sure, but when you cast f to an int and subtract it from f, I believe the int is getting converted back to a float before the subtraction. This probably won't matter in this case, but it could present problems down the line if you are expecting an int there for some reason.
I don't know if it's a better solution per se, but you could use modulus math instead, for example:
#include <math.h>
#include <stdbool.h>

float f = 4.5886;
bool isInt;
isInt = (fmod(f, 1.0) == 0.0);

Note that C's % operator is not defined for floating-point operands, so fmod() from math.h does the modulus here. isInt will be true if everything to the right of the decimal point is zero, and false otherwise.
#include <math.h>

/* For float (24-bit significand): any |x| >= 2^23 is necessarily an integer,
   and for smaller |x|, adding and then subtracting 2^23 rounds x to an integer
   (assuming round-to-nearest and FLT_EVAL_METHOD == 0), which leaves it
   unchanged only if it already was one. */
#define twop23 (0x1.0p+23f)
#define ABS(x) (fabsf(x))
#define isFloatInteger(x) ((ABS(x) >= twop23) || (((ABS(x) + twop23) - twop23) == ABS(x)))
I have the following C function, used to determine whether one number is a multiple of another to an arbitrary tolerance:
#include <math.h>

#define TOLERANCE 0.0001

int IsMultipleOf(double x, double mod)
{
    return (fabs(fmod(x, mod)) < TOLERANCE);
}
It works fine, but profiling shows it to be very slow, to the extent that it has become a candidate for optimization. About 75% of the time is spent in the modulo and the rest in fabs. I'm trying to figure out a way of speeding things up, using something like a look-up table. The parameter x changes regularly, whereas mod changes infrequently. The number of possible values of x is small enough that the space for a look-up table would not be an issue; typically it will be one of a few hundred possible values. I can get rid of the fabs easily enough, but can't figure out a reasonable alternative to the modulo. Any ideas on how to optimize the above?
Edit: The code will be running on a wide range of Windows desktop and mobile devices, hence the processors could include Intel and AMD on desktop, and ARM or SH4 on mobile devices. Visual Studio 2008 is the compiler.
Do you really have to use modulo for this?
Wouldn't it be possible to just compute result = x / mod and then check whether the fractional part of result is close to 0? For instance:
11 / 5.4999 = 2.0000364 ==> 0.0000364 < TOLERANCE
Or something like that.
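A sketch of that idea (the function name is mine; note the tolerance now applies to the quotient rather than to the remainder, so it scales differently for large x/mod, and round() is C99, so older compilers may need floor(q + 0.5) instead):

#include <math.h>

#define TOLERANCE 0.0001

int IsMultipleOfDiv(double x, double mod)
{
    double q = x / mod;
    return fabs(q - round(q)) < TOLERANCE; /* distance to the nearest whole quotient */
}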
Division (floating point or not, fmod in your case) is often an operation where the execution time varies a lot depending on the CPU and compiler:

gcc has a builtin replacement for it if you give it the right compile flags or if you use __builtin_fmod explicitly. This might then map the operation onto a small number of assembler instructions.

there may be special units like SSE on Intel processors where this operation is implemented more efficiently.
With such tricks, depending on your environment (you didn't say which), the time may vary from a few clock cycles to a few hundred. I think the best approach is to look into the documentation of your compiler and CPU for that particular operation.
The following is probably overkill, and sub-optimal. But for what it is worth here is one way on how to do it.
We know the format of the double ...
1 bit for the sign
11 bits for the biased exponent
52 fraction bits
Let ...
value = x / mod;
exp = exponent bits of value - BIAS;
lsb = least sig bit of value's fraction bits;
Once you have that ...
/*
 * If applying the exponent would eliminate the fraction bits,
 * then for double-precision resolution it is a multiple.
 * Note: lsb may require some massaging.
 */
if (exp > lsb)
    return (true);
if (exp < 0)
    return (false);
The only case remaining is the tolerance case. Build your double so that you get rid of all the digits to the left of the decimal point.
sign bit is zero (positive)
exponent is the BIAS (1023 for double)
shift the fraction bits as appropriate
Now compare it against your tolerance.
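For reference, a minimal sketch of pulling out the exponent field described above (assumes IEEE 754 doubles; the memcpy avoids strict-aliasing trouble, and the function name is mine):

#include <stdint.h>
#include <string.h>

int biased_exponent(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (int)((bits >> 52) & 0x7ff); /* 11 exponent bits; subtract the BIAS of 1023 to get exp */
}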
I think you need to inspect the bowels of your C RTL's fmod() function: x86 FPUs have FPREM/FPREM1 instructions, which compute remainders by repeated subtraction.
While floating-point division is a single instruction, it seems you may need to call FPREM repeatedly to get the right answer for modulus, so your RTL may not use it.
I have not tested this at all, but from the way I understand fmod, the following should be an equivalent inlined version, which might let the compiler optimize it better (though I would have thought that the compiler's math library, or builtins, would work just as well; also, I don't even know for sure whether it is correct).
#include <math.h>

int IsMultipleOf(double x, double mod) {
    long n = x / mod;       /* you should probably test for /0 or a NAN result here */
    double new_x = mod * n;
    double delta = x - new_x;
    return fabs(delta) < TOLERANCE;   /* and for a NAN result from fabs */
}
Maybe you can get away with long long instead of double if your data has a comparable scale. For example, long long at micrometer resolution is enough to cover over 60 astronomical units.
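A sketch of that fixed-point idea (the names and the micrometer scale are mine; the remainder becomes exact integer arithmetic):

typedef long long um_t; /* 1 unit = 1 micrometer */

int IsMultipleOfFixed(um_t x_um, um_t mod_um, um_t tol_um)
{
    um_t r = x_um % mod_um;             /* exact integer remainder */
    if (r < 0) r += mod_um;             /* normalize for negative x */
    if (r > mod_um - r) r = mod_um - r; /* distance to the nearest multiple */
    return r < tol_um;
}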
Does it need to be double precision? Depending on how good your math library is, this ought to be faster:
#include <math.h>
#include <stdbool.h>

#define TOLERANCE 0.0001f

bool IsMultipleOf(float x, float mod)
{
    return (fabsf(fmodf(x, mod)) < TOLERANCE);
}
I presume modulo looks a little like this on the inside:
mod(x, m) {
    while (x > m) {
        x = x - m
    }
    return x
}
I think it could be optimized through some sort of search, e.g.:
fastmod(x, m) {
    q = 1
    while (m * q < x) {
        q = q * 2
    }
    return mod((x - (q / 2) * m), m)
}
You might even choose to replace the final call to mod with another call to fastmod, adding the condition that if x < m, return x.
I wrote some code recently (ISO/ANSI C), and was surprised at the poor performance it achieved. Long story short, it turned out that the culprit was the floor() function. Not only was it slow, but it did not vectorize (with the Intel compiler, aka ICL).
Here are some benchmarks for performing floor for all cells in a 2D matrix:
VC: 0.10
ICL: 0.20
Compare that to a simple cast:
VC: 0.04
ICL: 0.04
How can floor() be that much slower than a simple cast?! It does essentially the same thing (apart from negative numbers).
2nd question: Does someone know of a super-fast floor() implementation?
PS: Here is the loop that I was benchmarking:
void Floor(float *matA, int *intA, const int height, const int width, const int width_aligned)
{
    float *rowA = NULL;
    int *intRowA = NULL;
    int row, col;

    for (row = 0; row < height; ++row) {
        rowA = matA + row * width_aligned;
        intRowA = intA + row * width_aligned;
#pragma ivdep
        for (col = 0; col < width; ++col) {
            /* intRowA[col] = floor(rowA[col]); */
            intRowA[col] = (int)(rowA[col]);
        }
    }
}
A couple of things make floor slower than a cast and prevent vectorization.
The most important one:
floor can modify the global state. If you pass a value that is too huge to be represented as an integer in float format, the errno variable gets set to EDOM. Special handling for NaNs is done as well. All this behavior is for applications that want to detect the overflow case and handle the situation somehow (don't ask me how).
Detecting these problematic conditions is not simple and makes up more than 90% of the execution time of floor. The actual rounding is cheap and could be inlined/vectorized. Also, it's a lot of code, so inlining the whole floor function would make your program run slower.
Some compilers have special flags that allow the compiler to optimize away some of the rarely used C-standard rules. For example, GCC can be told that you're not interested in errno at all. To do so, pass -fno-math-errno or -ffast-math. ICC and VC may have similar flags.
By the way, you can roll your own floor function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.
If you are going to convert the result of the floor() operation to an int, and if you aren't worried about overflow, then the following code is much faster than (int)floor(x):
inline int int_floor(double x)
{
    int i = (int)x;          /* truncate */
    return i - (i > x);      /* convert trunc to floor */
}
Branchless floor and ceiling (better utilizes the pipeline), with no error checks:
int f(double x)
{
    return (int)x - (x < (int)x); // as dgobbi above, needs "less than" for floor
}

int c(double x)
{
    return (int)x + (x > (int)x);
}

or, expressing ceiling via floor:

int c(double x)
{
    return -(f(-x));
}
The actual fastest implementation for a large array on modern x86 CPUs would be
change the MXCSR FP rounding mode to round towards -Infinity (aka floor). In C, this should be possible with fenv stuff, or _mm_getcsr / _mm_setcsr.
loop over the array doing _mm_cvtps_epi32 on SIMD vectors, converting 4 floats to 32-bit integer using the current rounding mode. (And storing the result vectors to the destination.)
cvtps2dq xmm0, [rdi] is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2. (https://agner.org/optimize/) Same for the 256-bit AVX version, with YMM vectors.
restore the current rounding mode to the normal IEEE default mode, using the original value of the MXCSR. (round-to-nearest, with even as a tiebreak)
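Putting those steps together, a sketch with SSE2 intrinsics (the function name is mine; no error or tail handling, n assumed a multiple of 4 and the pointers 16-byte aligned):

#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>

void floor_to_int_array(const float *in, int32_t *out, size_t n)
{
    unsigned int csr = _mm_getcsr();
    _mm_setcsr((csr & ~0x6000u) | 0x2000u);      /* MXCSR rounding control = round down */
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(in + i);          /* load 4 floats */
        __m128i r = _mm_cvtps_epi32(v);          /* convert using the current rounding mode */
        _mm_store_si128((__m128i *)(out + i), r);
    }
    _mm_setcsr(csr);                             /* restore the original mode */
}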
This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation, exactly because it's very commonly needed by C compilers; in the bad old days with x87, even (int)x required changing the x87 rounding mode to truncation and then back. That's cvttps2dq for packed float->int with truncation (note the extra t in the mnemonic), or, for scalar, going from XMM to integer registers, cvttss2si or cvttsd2si for scalar float or double to scalar integer.)
With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, just 1-per-clock store throughput assuming no cache-miss bottlenecks. (And on Intel before Skylake, also bottlenecked on 1-per-clock packed-conversion throughput.) i.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512.
Without changing the current rounding mode, you need SSE4.1 roundps to round a float to the nearest integer float using your choice of rounding mode. Or you could use one of the tricks shown in other answers, which work for floats with small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.
(With the right compiler options, like -fno-math-errno, and the right -march or -msse4 options, compilers can inline floor using roundps, or the scalar and/or double-precision equivalent, e.g. roundsd xmm1, xmm0, 1. But this costs 2 uops and has 1-per-2-clock throughput on Haswell for scalar or vectors. Actually, gcc 8.2 will inline roundsd for floor even without any fast-math options, as you can see on the Godbolt compiler explorer, but that's with -march=haswell. It's unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)
Yes, floor() is extremely slow on all platforms since it has to implement a lot of behaviour from the IEEE fp spec. You can't really use it in inner loops.
I sometimes use a macro to approximate floor():
#define PSEUDO_FLOOR( V ) ((V) >= 0 ? (int)(V) : (int)((V) - 1))
It does not behave exactly as floor(): for example, floor(-1) == -1 but PSEUDO_FLOOR(-1) == -2, but it's close enough for most uses.
An actually branchless version that requires a single conversion between the floating-point and integer domains would shift the value x into an all-positive or all-negative range, then cast/truncate and shift it back.
#include <limits.h>

long fast_floor(double x)
{
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x + offset) - offset);
}

long fast_ceil(double x)
{
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x - offset) + offset);
}
As pointed in the comments, this implementation relies on the temporary value x +- offset not overflowing.
On 64-bit platforms, the original code using an int64_t intermediate value results in a three-instruction kernel; the same is available for the int32_t reduced-range floor/ceil, where |x| < 0x40000000:
inline int floor_x64(double x) {
    return (int)((int64_t)(x + 0x80000000UL) - 0x80000000LL);
}

inline int floor_x86_reduced_range(double x) {
    return (int)(x + 0x40000000) - 0x40000000;
}
They do not do the same thing. floor() is a function, so using it incurs a function call: allocating a stack frame, copying parameters, and retrieving the result.
Casting is not a function call, so it uses faster mechanisms (I believe it may use registers to process the values).
Probably floor() is already optimized.
Can you squeeze more performance out of your algorithm? Maybe switching rows and columns would help? Can you cache common values? Are all your compiler's optimizations on? Can you switch operating system? Or compiler?
Jon Bentley's Programming Pearls has a great review of possible optimizations.
Fast double round
// rounds non-negative x to the nearest integer (half rounds up)
double round(double x)
{
    return double((x - double(int(x)) >= 0.5) ? (int(x) + 1) : int(x));
}
Terminal log
test custom_1 8.3837
test native_1 18.4989
test custom_2 8.36333
test native_2 18.5001
test custom_3 8.37316
test native_3 18.5012
Test
#include <cmath>
#include <ctime>
#include <iostream>
#include <limits>

void test(const char* name, double (*f)(double))
{
    int it = std::numeric_limits<int>::max();
    clock_t begin = clock();
    for (int i = 0; i < it; i++)
    {
        f(double(i) / 1000.0);
    }
    clock_t end = clock();
    std::cout << "test " << name << " " << double(end - begin) / CLOCKS_PER_SEC << std::endl;
}

int main(int argc, char **argv)
{
    test("custom_1", round);
    test("native_1", std::round);
    test("custom_2", round);
    test("native_2", std::round);
    test("custom_3", round);
    test("native_3", std::round);
    return 0;
}
Result
Type casting and using your brain is ~2.2 times faster than using the native functions.