Problem
While running automated CI tests I found some code that breaks when gcc's optimization level is set to -O2. The code should increment a counter whenever a double value crosses a threshold in either direction.
Workaround
Going down to -O1, or using the -ffloat-store option, works around the problem.
Example
Here is a small example which shows the same problem.
The update() function should return true whenever a sequence of *pNextState * 1e-6 crosses a threshold of 0.03.
I used call by reference because the values are part of a large struct in the full code.
The idea behind using < and >= is that if a sequence hits the value exactly, the function should return 1 this time and return 0 the next cycle.
test.h:
extern int update(double * pState, double * pNextState);
test.c:
#include "test.h"
int update(double * pState, double * pNextState_scaled) {
    static double threshold = 0.03;
    double oldState = *pState;
    *pState = *pNextState_scaled * 1e-6;
    return oldState < threshold && *pState >= threshold;
}
main.c:
#include <stdio.h>
#include <stdlib.h>
#include "test.h"
int main(void) {
    double state = 0.01;
    double nextState1 = 20000.0;
    double nextState2 = 30000.0;
    double nextState3 = 40000.0;
    printf("%d\n", update(&state, &nextState1));
    printf("%d\n", update(&state, &nextState2));
    printf("%d\n", update(&state, &nextState3));
    return EXIT_SUCCESS;
}
Using gcc with at least -O2 the output is:
0
0
0
Using gcc with -O1, -O0 or -ffloat-store produces the desired output:
0
1
0
As I understand it from debugging, the problem arises when the compiler optimizes away the local variable oldState on the stack and instead compares against an intermediate result held in a floating-point register with higher precision (80 bit), while the stored value *pState is a tiny bit smaller than the threshold.
If the value used for the comparison were stored with 64-bit precision, the logic could not miss crossing the threshold. Because of the multiplication by 1e-6, the result is probably kept in a floating-point register.
Would you consider this a gcc bug?
clang does not show the problem.
I am using gcc version 9.2.0 on an Intel Core i5, Windows and msys2.
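For illustration, one way to force the intermediate result to be rounded to 64-bit precision (just a sketch based on my understanding above, not the code actually used) would be a volatile temporary:
int update(double * pState, double * pNextState_scaled) {
    static double threshold = 0.03;
    double oldState = *pState;
    volatile double newState = *pNextState_scaled * 1e-6; /* volatile store/load rounds to 64 bits */
    *pState = newState;
    return oldState < threshold && *pState >= threshold;
}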
Update
It is clear to me that floating-point comparison is not exact, and I would consider the following result valid:
0
0
1
The idea was that if (*pState >= threshold) == false in one cycle, then comparing the same value (oldState = *pState) against the same threshold in a subsequent call (oldState < threshold) has to be true.
[Disclaimer: This is a generic, shoot-from-the-hip answer. Floating-point issues can be subtle, and I have not analyzed this one carefully. Once in a while, suspicious-looking code like this can be made to work portably and reliably after all, and per the accepted answer, that appears to be the case here. Nevertheless, the general-purpose answer stands in the, well, general case.]
I would consider this a bug in the test case, not in gcc. This sounds like a classic example of code that's unnecessarily brittle with respect to exact floating-point equality.
I would recommend either:
rewriting the test case, or perhaps
removing the test case.
I would not recommend:
working around it by switching compilers
working around it by using different optimization levels or other compiler options
submitting compiler bug reports [although in this case it appears there was a compiler bug, it needn't be submitted, as it has already been submitted]
I've analysed your code, and my take is that it is sound according to the standard, but you are hit by gcc bug 323, about which you may find more accessible information in the gcc FAQ.
A way to modify your function and make it robust in the presence of the gcc bug is to store the fact that the previous state was below the threshold, instead of (or in addition to) storing that state. Something like this:
int update(int* pWasBelow, double* pNextState_scaled) {
    static double const threshold = 0.03;
    double const nextState = *pNextState_scaled * 1e-6;
    int const wasBelow = *pWasBelow;
    *pWasBelow = nextState < threshold;
    return wasBelow && !*pWasBelow;
}
Note that this does not guarantee reproducibility. You may get 0 1 0 in one set-up and 0 0 1 in another, but you'll detect the transition sooner or later.
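For illustration, a minimal sketch of how the calling code might be adapted to this variant (the initial flag just reflects whether the starting state, 0.01 here, is below the threshold):
#include <stdio.h>

int update(int* pWasBelow, double* pNextState_scaled);

int main(void) {
    int wasBelow = 1; /* the initial state 0.01 is below the 0.03 threshold */
    double nextState1 = 20000.0;
    double nextState2 = 30000.0;
    double nextState3 = 40000.0;
    printf("%d\n", update(&wasBelow, &nextState1));
    printf("%d\n", update(&wasBelow, &nextState2));
    printf("%d\n", update(&wasBelow, &nextState3));
    return 0;
}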
I am putting this as an answer because I don't think I can do real code in a comment, but @SteveSummit should get the credit - I likely would not have found this without their comment above.
The general advice is: don't do exact comparisons with floating point values, and it seems like that's what this is doing. If a computed value is almost exactly 0.03 but, due to internal representations or optimizations, is ever so slightly off, then it's going to look like a threshold crossing.
So one can resolve this by adding an epsilon defining how close a value can be to the threshold without yet being considered to have crossed it.
#include <math.h> /* for fabs */

int update(double * pState, double * pNextState_scaled) {
    static const double threshold = 0.03;
    static const double close_enough = 0.0000001; // or whatever
    double oldState = *pState;
    *pState = *pNextState_scaled * 1e-6;
    // if either value is too close to the threshold, it's not a crossing
    if (fabs(oldState - threshold) < close_enough) return 0;
    if (fabs(*pState - threshold) < close_enough) return 0;
    return oldState < threshold && *pState >= threshold;
}
I imagine you'd have to know your application in order to know how to tune this value appropriately, but a couple of orders of magnitude smaller than the value you're comparing with seems in the neighborhood.
The following 3 lines give imprecise results with "gcc -Ofast -march=skylake":
int32_t i = -5;
const double sqr_N_min_1 = (double)i * i;
1. - ((double)i * i) / sqr_N_min_1
Obviously, sqr_N_min_1 becomes 25.0, and in the 3rd line (-5 * -5) / 25 should become 1.0, so that the overall result of the 3rd line is exactly 0.0. Indeed, this is true with the compiler options "gcc -O3 -march=skylake".
But with "-Ofast" the last line yields -2.081668e-17 instead of 0.0, and with values of i other than -5 (e.g. 6 or 7) it produces other very small positive or negative random deviations from 0.0.
My question is: Where exactly is the source of this imprecision?
To investigate this, I wrote a small test program in C:
#include <stdint.h> /* int32_t */
#include <stdio.h>
#define MAX_SIZE 10
double W[MAX_SIZE];
int main( int argc, char *argv[] )
{
    volatile int32_t n = 6; /* try 6 7 or argv[1][0]-'0' */
    double *w = W;
    int32_t i = 1 - n;
    const int32_t end = n - 1;
    const double sqr_N_min_1 = (double)i * i;
    /* Here is the crucial part. The loop avoids the compiler replacing it with constants: */
    do {
        *w++ = 1. - ((double)i * i) / sqr_N_min_1;
    } while ( (i+=2) <= end );
    /* Then, show the results (only the 1st and last output line matters): */
    w = W;
    i = 1 - n;
    do {
        fprintf( stderr, "%e\n", *w++ );
    } while ( (i+=2) <= end );
    return( 0 );
}
Godbolt shows me the assembly produced by "x86-64 gcc 9.3" with the options "-Ofast -march=skylake" vs. "-O3 -march=skylake". Please inspect the five columns of the website (1. source code, 2. assembly with "-Ofast", 3. assembly with "-O3", 4. output of the 1st assembly, 5. output of the 2nd assembly):
Godbolt site with five columns
As you can see the differences in the assemblies are obvious, but I can't figure out where exactly the imprecision comes from. So, the question is, which assembler instruction(s) are responsible for this?
A follow-up question is: Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?
Comments and another answer have pointed out the specific transformation that's happening in your case, with a reciprocal and an FMA instead of a division.
Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?
Not in general.
-Ofast is (currently) a synonym for -O3 -ffast-math.
See https://gcc.gnu.org/wiki/FloatingPointMath
Part of -ffast-math is -funsafe-math-optimizations, which as the name implies, can change numerical results. (With the goal of allowing more optimizations, like treating FP math as associative to allow auto-vectorizing the sum of an array with SIMD, and/or unrolling with multiple accumulators, or even just rearranging a sequence of operations within one expression to combine two separate constants.)
This is exactly the kind of speed-over-accuracy optimization you're asking for by using that option. If you don't want that, don't enable all of the -ffast-math sub-options, only the safe ones like -fno-math-errno / -fno-trapping-math. (See How to force GCC to assume that a floating-point expression is non-negative?)
There's no way of formulating your source to avoid all possible problems.
Possibly you could use volatile tmp vars all over the place to defeat optimization between statements, but that would make your code slower than regular -O3 with the default -fno-fast-math. And even then, calls to library functions like sin or log may resolve to versions that assume the args are finite, not NaN or infinity, because of -ffinite-math-only.
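To make the volatile-temporary idea concrete, a sketch (not the original code, and expect it to be slower than plain -O3) might look like:
#include <stdint.h>

double term(int32_t i, double sqr_N_min_1)
{
    volatile double sq = (double)i * i;    /* each volatile store/load pins the value to memory, */
    volatile double q  = sq / sqr_N_min_1; /* so the real division cannot be fused or replaced   */
    return 1. - q;
}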
GCC issue with -Ofast? points out another effect: isnan() is optimized into a compile-time 0.
From the comments, it seems that, for -O3, the compiler computes 1. - ((double)i * i) / sqr_N_min_1:
Convert i to double and square it.
Divide that by sqr_N_min_1.
Subtract that from 1.
and, for -Ofast, computes it:
Prior to the loop, calculate the reciprocal of sqr_N_min_1.
Convert i to double and square it.
Compute the fused multiply-subtract of 1 minus the square times the reciprocal.
The latter improves speed because it calculates the division only once, and multiplication is much faster than division in the target processors. On top of that, the fused operation is faster than a separate multiplication and subtraction.
The error occurs because the reciprocal operation introduces a rounding error that is not present in the original expression (1/25 is not exactly representable in a binary format, while 25/25 of course is). This is why the compiler does not make this optimization when it is attempting to provide strict floating-point semantics.
Additionally, simply multiplying the reciprocal by 25 would erase the error. (This is somewhat by “chance,” as rounding errors vary in complicated ways. 1./25*25 produces 1, but 1./49*49 does not.) But the fused operation produces a more accurate result (it produces the result as if the product were computed exactly, with rounding occurring only after the subtraction), so it preserves the error.
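A tiny program to see the 1./25*25 versus 1./49*49 point for yourself (a sketch; the exact digits printed depend on the platform, but with IEEE-754 doubles the second line differs from 1):
#include <stdio.h>

int main(void)
{
    printf("%.17g\n", 1.0 / 25.0 * 25.0); /* round-trips exactly to 1 */
    printf("%.17g\n", 1.0 / 49.0 * 49.0); /* slightly different from 1 */
    return 0;
}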
Since the gcc option -ffast-math effectively disables NaN and ±inf, I'm looking for maybe the next best option for representing NaN in my performance-critical math code. Ideally, the sentinel value, if operated on (add, mul, div, sub, etc.), would yield the sentinel value, as NaN does, but I doubt this is possible since I think NaN is the only value that accomplishes this. -0.0 might not be a good fit, as it's also disabled by -ffast-math and could prevent certain optimizations like (x+0.0), etc.
Perhaps my question should rather be, is there any way to use NaN or some other "special double" while being able to enable a lot of the math optimizations without breaking down?
System is Linux/x64, gcc 4.8.1.
If you are looking for a value which would be propagated by arithmetic operations, NaN is still available with the option -ffast-math. The problem lies elsewhere. With -ffast-math, some operations can be removed from the computation due to optimization, and then there is no way to guarantee that NaN or any other value would be propagated.
For example, with -ffast-math set, the following will cause 0.0 to be written straight into n, and there is no special value of n that would protect against it.
#include <math.h> /* NAN */

float n = NAN;
n *= 0.0;
One thing you can do is use -fno-finite-math-only and -ftrapping-math together with -ffast-math, as Shafik Yaghmour said. The other is, if there are only a few places where you expect a bad value, to check for it yourself by putting tests at exactly those points.
The last option I can think of -- if you really badly need the optimization -- is to manually inject NaN (and maybe inf) values into the computation and check how far they propagate. Then, at the places where the propagation stops, test for a NaN (inf) occurrence. This is an unsafe method, as I am not one hundred percent sure whether -ffast-math can introduce conditional flow of operations. If it can, there is a significant chance this solution will be invalid. So it is risky, and if chosen it needs very heavy testing covering all branches of the computation.
Normally I would rather be against the last solution, but there is actually a chance that NaN (inf) values will propagate through the whole computation, or almost all of it, so it can give the performance you are seeking. You may want to take the risk.
Checking for NaN with -ffast-math can be done, as Shafik Yaghmour said, with
#include <stdint.h> /* uint32_t, uint64_t */

inline int isnan(float f)
{
    union { float f; uint32_t x; } u = { f };
    return (u.x << 1) > 0xff000000u;
}
and for double with
inline int isnan(double d)
{
    union { double d; uint64_t x; } u = { d };
    return (u.x << 1) > 0xffe0000000000000ull; /* all-ones exponent and a non-zero mantissa */
}
Checking for inf would be
inline int isinf(float f)
{
    union { float f; uint32_t x; } u = { f };
    return (u.x << 1) == 0xff000000u;
}

inline int isinf(double d)
{
    union { double d; uint64_t x; } u = { d };
    return (u.x << 1) == 0xffe0000000000000ull; /* all-ones exponent and a zero mantissa */
}
You can also merge isnan and isinf.
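A merged check can use the same bit trick with >= instead of > or ==; a sketch for double (the float version is analogous with 0xff000000u):
#include <stdint.h>

inline int is_nan_or_inf(double d)
{
    union { double d; uint64_t x; } u = { d };
    return (u.x << 1) >= 0xffe0000000000000ull; /* all-ones exponent: inf (==) or NaN (>) */
}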
I'm looking for a way to truncate a float into an int in a fast and portable (IEEE 754) way. The reason is that in this function 50% of the time is spent in the cast:
float fm_sinf(float x) {
    const float a = 0.00735246819687011731341356165096815f;
    const float b = -0.16528911397014738207016302002888890f;
    const float c = 0.99969198629596757779830113868360584f;
    float r, x2;
    int k;
    /* bring x in range */
    k = (int) (F_1_PI * x + copysignf(0.5f, x)); /* <-- 50% of time is spent in cast */
    x -= k * F_PI;
    /* if x is in an odd pi count we must flip */
    r = 1 - 2 * (k & 1); /* trick for r = (k % 2) == 0 ? 1 : -1; */
    x2 = x * x;
    return r * x*(c + x2*(b + a*x2));
}
The slowness of float->int casts mainly occurs when using x87 FPU instructions on x86. To do the truncation, the rounding mode in the FPU control word needs to be changed to round-to-zero and back, which tends to be very slow.
When using SSE instead of x87 instructions, a truncation is available without control word changes. You can do this using compiler options (like -mfpmath=sse -msse -msse2 in GCC) or by compiling the code as 64-bit.
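For example, a sketch using an SSE intrinsic directly (x86 with SSE only, via <xmmintrin.h>) to get the truncating conversion without touching the x87 control word:
#include <xmmintrin.h>

static inline int truncate_to_int(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f)); /* CVTTSS2SI: truncate float to int */
}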
The SSE3 instruction set has the FISTTP instruction to convert to integer with truncation without changing the control word. A compiler may generate this instruction if instructed to assume SSE3.
Alternatively, the C99 lrint() function will convert to integer with the current rounding mode (round-to-nearest unless you changed it). You can use this if you remove the copysignf term. Unfortunately, this function is still not ubiquitous after more than ten years.
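As a sketch (assuming the default round-to-nearest mode, and with F_1_PI defined as 1/pi as in the question), the range reduction could then drop the copysignf term:
#include <math.h>

#define F_1_PI 0.318309886183790671538f /* 1/pi, assumed to match the question's constant */

static inline int nearest_pi_count(float x)
{
    return (int) lrintf(F_1_PI * x); /* rounds to nearest instead of truncating */
}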
I found a fast truncate method by Sree Kotay which provides exactly the optimization that I needed.
To be portable you would have to add some directives and learn a couple of assembler dialects, but you could theoretically use some inline assembly to move portions of the floating-point register into eax/rax or ebx/rbx and convert what you need by hand. The floating-point specification is a pain to work with, but I am pretty certain that if you do it in assembly you will be much faster, as your needs are very specific and the system method is probably more generic and less efficient for your purpose.
You could skip the conversion to int altogether by using frexpf to get the mantissa and exponent, and inspect the raw mantissa (use a union) at the appropriate bit position (calculated using the exponent) to determine (the quadrant dependent) r.
I have the following C function, used to determine whether one number is a multiple of another to an arbitrary tolerance:
#include <math.h>
#define TOLERANCE 0.0001
int IsMultipleOf(double x,double mod)
{
return(fabs(fmod(x, mod)) < TOLERANCE);
}
It works fine, but profiling shows it to be very slow, to the extent that it has become a candidate for optimization. About 75% of the time is spent in the modulo and the rest in fabs. I'm trying to figure out a way of speeding things up, using something like a look-up table. The parameter x changes regularly, whereas mod changes infrequently. The number of possible values of x is small enough that the space for a look-up table would not be an issue; typically it will be one of a few hundred possible values. I can get rid of the fabs easily enough, but can't figure out a reasonable alternative to the modulo. Any ideas on how to optimize the above?
Edit: The code will be running on a wide range of Windows desktop and mobile devices, hence the processors could include Intel and AMD on desktop, and ARM or SH4 on mobile devices. Visual Studio 2008 is the compiler.
Do you really have to use modulo for this?
Wouldn't it be possible to just compute result = x / mod and then check whether the decimal part of result is close to 0? For instance:
11 / 5.4999 ≈ 2.000036 ==> 0.000036 < TOLERANCE
Or something like that.
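A sketch of that idea in C (note the tolerance now applies to the quotient rather than the remainder, and rounding to the nearest integer also catches values just below a multiple):
#include <math.h>

#define TOLERANCE 0.0001

int IsMultipleOfDiv(double x, double mod)
{
    double q = x / mod;
    return fabs(q - floor(q + 0.5)) < TOLERANCE; /* distance from q to the nearest integer */
}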
Division (floating point or not, fmod in your case) is often an operation where the execution time varies a lot depending on the cpu and compiler:
gcc has a builtin replacement for that if you give it the right compile flags or if you use __builtin_fmod explicitly. This then might map the operation on a small number of assembler instructions.
there may be special units like SSE on Intel processors where this operation is implemented more efficiently
With such tricks, depending on your environment (you didn't say which), the time may vary from a few clock cycles to a few hundred. I think it is best to look into the documentation of your compiler and CPU for that particular operation.
The following is probably overkill, and sub-optimal. But for what it is worth, here is one way to do it.
We know the format of the double ...
1 bit for the sign
11 bits for the biased exponent
52 fraction bits
Let ...
value = x / mod;
exp = exponent bits of value - BIAS;
lsb = least sig bit of value's fraction bits;
Once you have that ...
/*
 * If applying the exponent would eliminate the fraction bits
 * then for double precision resolution it is a multiple.
 * Note: lsb may require some massaging.
 */
if (exp > lsb)
    return (true);
if (exp < 0)
    return (false);
The only case remaining is the tolerance case. Build your double so that you are getting rid of all the digits to the left of the decimal.
sign bit is zero (positive)
exponent is the BIAS (1023 I think ... look it up to be sure)
shift the fraction bits as appropriate
Now compare it against your tolerance.
I think you need to inspect the bowels of your C RTL fmod() function: x86 FPUs have FPREM/FPREM1 instructions which compute remainders by repeated subtraction.
While floating-point division is a single instruction, it seems you may need to call FPREM repeatedly to get the right answer for modulus, so your RTL may not use it.
I have not tested this at all, but from the way I understand fmod, the following should be an equivalent inlined version, which might let the compiler optimize it better, though I would have thought that the compiler's math library (or builtins) would work just as well. (Also, I don't even know for sure whether this is correct.)
#include <math.h>

#define TOLERANCE 0.0001

int IsMultipleOf(double x, double mod) {
    long n = x / mod; // You should probably test for /0 or NAN result here
    double new_x = mod * n;
    double delta = x - new_x;
    return fabs(delta) < TOLERANCE; // and for NAN result from fabs
}
Maybe you can get away with long long instead of double if you have a comparable scale of data. For example, a long long would be enough for over 60 astronomical units at micrometer resolution.
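A sketch of that integer variant, assuming the inputs have already been scaled to micrometers and mod is positive (names and scaling are illustrative only):
#include <stdlib.h> /* llabs */

int IsMultipleOfLL(long long x_um, long long mod_um, long long tol_um)
{
    long long r = llabs(x_um % mod_um); /* remainder folded to a non-negative value */
    if (r > mod_um - r)
        r = mod_um - r;                 /* distance to the nearer multiple */
    return r <= tol_um;
}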
Does it need to be double precision? Depending on how good your math library is, this ought to be faster:
#include <math.h>
#include <stdbool.h>
#define TOLERANCE 0.0001f

bool IsMultipleOf(float x, float mod)
{
    return(fabsf(fmodf(x, mod)) < TOLERANCE);
}
I presume modulo looks a little like this on the inside:
mod(x, m) {
    while (x > m) {
        x = x - m
    }
    return x
}
I think that through some sort of search it could be optimised, e.g.:
fastmod(x, m) {
    q = 1
    while (m * q < x) {
        q = q * 2
    }
    return mod((x - (q / 2) * m), m)
}
You might even choose to replace the final call to mod with another call to fastmod, adding the condition that if x < m it simply returns x.
Right, I think I really am living a dream. I have the following piece of code which I compile and run on an AIX machine:
AIX 3 5
PowerPC_POWER5 processor type
IBM XL C/C++ for AIX, V10.1
Version: 10.01.0000.0003
#include <stdio.h>
#include <math.h>
#define RADIAN(x) ((x) * acos(0.0) / 90.0)
double nearest_distance(double radius, double lon1, double lat1, double lon2, double lat2) {
    double rlat1 = RADIAN(lat1);
    double rlat2 = RADIAN(lat2);
    double rlon1 = lon1;
    double rlon2 = lon2;
    double a = 0, b = 0, c = 0;
    a = sin(rlat1)*sin(rlat2) + cos(rlat1)*cos(rlat2)*cos(rlon2-rlon1);
    printf("%lf\n", a);
    if (a > 1) {
        printf("aaaaaaaaaaaaaaaa\n");
    }
    b = acos(a);
    c = radius * b;
    return radius*(acos(sin(rlat1)*sin(rlat2)+
                        cos(rlat1)*cos(rlat2)*cos(rlon2-rlon1)));
}

int main(int argc, char** argv) {
    nearest_distance(6367.47, 10, 64, 10, 64);
    return 0;
}
Now, the value of 'a' after the calculation is reported as being '1'. And, on this AIX machine, it looks like 1 > 1 is true, as my 'if' is entered!!! And my acos of what I think is '1' returns NaNQ since 1 is bigger than 1. May I ask how that is even possible? I do not know what to think anymore!
The code works just fine on other architectures where 'a' really takes the value of what I think is 1 and acos(a) is 0.
If you do a comparison where result and expectedResult are float types:
if (result == expectedResult)
Then it is unlikely that the comparison will be true. If the comparison is true, then it is probably unstable – tiny changes in the input values, compiler, or CPU may change the result and make the comparison false.
Comparing with epsilon – absolute error
if (fabs(result - expectedResult) < 0.00001)
From Comparing floating point numbers
What Every Computer Scientist Should Know About Floating-Point Arithmetic
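When the values being compared are not near 1, a relative comparison is usually more appropriate; a minimal sketch (the epsilon and the guard for values near zero are assumptions to tune for your data):
#include <math.h>
#include <float.h>

int nearly_equal(double a, double b, double rel_eps)
{
    double diff = fabs(a - b);
    double scale = fmax(fabs(a), fabs(b));
    return diff <= rel_eps * scale || diff < DBL_MIN; /* second test handles values near zero */
}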
Print out the bits. You might just be getting fooled by some rounding error in the display of the floats as decimal real numbers.
The printf function, without a specified precision, will only show you the first 6 digits. So, try printing with a higher degree of precision... it is possible that a is slightly larger than 1, but only by a little. If you want to make things more robust, instead of (a>1), you can use (a-1)>epsilon for some value of epsilon.
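For instance, a small sketch of printing with more precision (the value here is only a stand-in for the computed a):
#include <stdio.h>

int main(void)
{
    double a = 1.0000000000000002; /* stand-in: prints as 1.000000 with the default precision */
    printf("%lf\n", a);            /* shows only 6 fractional digits: 1.000000 */
    printf("%.17g\n", a);          /* enough digits to distinguish it from 1 */
    printf("%a\n", a);             /* exact hexadecimal floating-point representation */
    return 0;
}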
1.000000000000000000001 is greater than 1. Are you sure you just aren't seeing enough decimal places? If that check is passing, I'd wager that's your issue.
The usual solution is to use some form of epsilon to stop you worrying about rounding errors, i.e. if the double you have ought to be 1, then try doing
if ( a > 1.00001f )
It's probably close enough to one so as not to cause you problems :)