Clang reciprocal to 1 optimisations - c

After a discussion with colleagues, I ended up testing whether clang would optimize two divisions, one of them a reciprocal (1 / x), into a single division.
const float x = a / b; //x not used elsewhere
const float y = 1 / x;
Theoretically clang could optimize to const float y = b / a if x is used only as a temporary step value, no?
Here's the input and output of a simple test case: https://gist.github.com/Jiboo/d6e839084841d39e5ab6 (in both outputs you can see that it performs the two divisions instead of optimizing one away)
This related question is beyond my comprehension and seems to focus only on why a specific instruction isn't used, whereas in my case it's the optimisation that isn't done: Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math
Thanks,
JB.

No, clang can not do that.
But first, why are you using float? float has six digits precision, double has 15. Unless you have a good reason, that you can explain, use double.
1 / (a / b) in floating-point arithmetic is not the same as b / a. In the first case, the compiler has to:
Divide a by b
Round the result to the nearest floating-point number
Divide 1 by the result
Round the result to the nearest floating-point number.
In the second case:
Divide b by a.
Round the result to the nearest floating-point number.
The compiler can only change the code if the result is guaranteed to be the same, and if the compiler writer cannot produce a mathematical proof that the result is the same, the compiler cannot change the code. There are two rounding operations in the first case, rounding different numbers, so it is unlikely that the result can be guaranteed to be the same.
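To make the two-rounding argument concrete, here is a small sketch (the helper names are mine, not from the question). It assumes IEEE-754 single precision, a compiler that evaluates float expressions in float (as with SSE code on x86-64), and no -ffast-math. Under those assumptions, a = 10 and b = 17 is one pair for which the two forms give different floats:

```c
#include <stdio.h>

/* Two divisions: a/b is rounded to float, then 1/x is rounded again. */
float reciprocal_of_quotient(float a, float b)
{
    const float x = a / b;
    return 1.0f / x;
}

/* One division, one rounding. */
float swapped_quotient(float a, float b)
{
    return b / a;
}

/* Prints both results with enough digits to tell neighbouring floats apart. */
void show_difference(float a, float b)
{
    printf("1/(a/b) = %.9g\n", reciprocal_of_quotient(a, b));
    printf("b/a     = %.9g\n", swapped_quotient(a, b));
}
```

Calling show_difference(10.0f, 17.0f) should print two values that differ in the last place, while a pair such as (8.0f, 2.0f), where every intermediate result is exact, prints the same value twice.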

The compiler doesn't think like a mathematician. Where you think simplifying the expression is trivial mathematically, the compiler has a lot of other things to consider. It is actually quite likely that the compiler is much smarter than the programmer and also knows far more about the C standard.
Something like this is probably what goes through the optimizing compiler's "mind":
Ah they wrote a / b but only use x at one place, so we don't have to allocate that variable on the stack. I'll remove it and use a CPU register.
Hmm, integer literal 1 divided by a float variable. Okay, we have to invoke balancing (the usual arithmetic conversions) here before anything else and turn that literal into a float 1.0f.
The programmer is counting on me to generate code that contains the potential floating point inaccuracy involved in dividing 1.0f by another float variable! So I can't just swap this expression with b / a because then that floating point inaccuracy that the programmer seems to want here would be lost.
And so on. There are a lot of considerations. What machine code you end up with is hard to predict in advance. Just know that the compiler follows your instructions to the letter.

Related

Compiler does not recognise matching float values [duplicate]

I know UIKit uses CGFloat because of the resolution independent coordinate system.
But every time I want to check if for example frame.origin.x is 0 it makes me feel sick:
if (theView.frame.origin.x == 0) {
// do important operation
}
Isn't CGFloat vulnerable to false positives when comparing with ==, <=, >=, <, >?
It is a floating point type, and those have imprecision problems: 0.0000000000041 for example.
Is Objective-C handling this internally when comparing, or can it happen that an origin.x which reads as zero does not compare equal to 0?
First of all, floating point values are not "random" in their behavior. Exact comparison can and does make sense in plenty of real-world usages. But if you're going to use floating point you need to be aware of how it works. Erring on the side of assuming floating point works like real numbers will get you code that quickly breaks. Erring on the side of assuming floating point results have large random fuzz associated with them (like most of the answers here suggest) will get you code that appears to work at first but ends up having large-magnitude errors and broken corner cases.
First of all, if you want to program with floating point, you should read this:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Yes, read all of it. If that's too much of a burden, you should use integers/fixed point for your calculations until you have time to read it. :-)
Now, with that said, the biggest issues with exact floating point comparisons come down to:
The fact that lots of values you may write in the source, or read in with scanf or strtod, do not exist as floating point values and get silently converted to the nearest approximation. This is what demon9733's answer was talking about.
The fact that many results get rounded due to not having enough precision to represent the actual result. An easy example where you can see this is adding x = 0x1fffffe and y = 1 as floats. Here, x has 24 bits of precision in the mantissa (ok) and y has just 1 bit, but when you add them, their bits are not in overlapping places, and the result would need 25 bits of precision. Instead, it gets rounded (to 0x2000000 in the default rounding mode).
The fact that many results get rounded due to needing infinitely many places for the correct value. This includes both rational results like 1/3 (which you're familiar with from decimal where it takes infinitely many places) but also 1/10 (which also takes infinitely many places in binary, since 5 is not a power of 2), as well as irrational results like the square root of anything that's not a perfect square.
Double rounding. On some systems (particularly x86), floating point expressions are evaluated in higher precision than their nominal types. This means that when one of the above types of rounding happens, you'll get two rounding steps, first a rounding of the result to the higher-precision type, then a rounding to the final type. As an example, consider what happens in decimal if you round 1.49 to an integer (1), versus what happens if you first round it to one decimal place (1.5) then round that result to an integer (2). This is actually one of the nastiest areas to deal with in floating point, since the behaviour of the compiler (especially for buggy, non-conforming compilers like GCC) is unpredictable.
Transcendental functions (trig, exp, log, etc.) are not specified to have correctly rounded results; the result is just specified to be correct within one unit in the last place of precision (usually referred to as 1ulp).
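The second item in the list above can be checked directly. This is only a sketch (the function names are mine), and it assumes float arithmetic is carried out in IEEE-754 single precision with the default round-to-nearest mode:

```c
#include <stdio.h>

/* Adds two floats. The exact sum is rounded to the nearest float;
   a tie goes to the neighbour whose significand is even. */
float add_floats(float x, float y)
{
    return x + y;
}

/* 0x1fffffe (33554430) fits in 24 bits of significand, but
   0x1fffffe + 1 would need 25, so the sum has to be rounded. */
void show_rounded_sum(void)
{
    float x = 0x1fffffe;
    printf("%.1f + 1 = %.1f\n", x, add_floats(x, 1.0f));
}
```

The exact sum 33554431 lies halfway between two floats, so the result is the even neighbour 33554432 (0x2000000), as described above.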
When you're writing floating point code, you need to keep in mind what you're doing with the numbers that could cause the results to be inexact, and make comparisons accordingly. Often times it will make sense to compare with an "epsilon", but that epsilon should be based on the magnitude of the numbers you are comparing, not an absolute constant. (In cases where an absolute constant epsilon would work, that's strongly indicative that fixed point, not floating point, is the right tool for the job!)
Edit: In particular, a magnitude-relative epsilon check should look something like:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y))
Where FLT_EPSILON is the constant from float.h (replace it with DBL_EPSILON for doubles or LDBL_EPSILON for long doubles) and K is a constant you choose such that the accumulated error of your computations is definitely bounded by K units in the last place (and if you're not sure you got the error bound calculation right, make K a few times bigger than what your calculations say it should be).
Finally, note that if you use this, some special care may be needed near zero, since FLT_EPSILON does not make sense for denormals. A quick fix would be to make it:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y) || fabs(x-y) < FLT_MIN)
and likewise substitute DBL_MIN if using doubles.
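Wrapped up as a function, the check might look like the sketch below. The name nearly_equal and the choice to pass K as a parameter are my assumptions; the comparison itself is the one given above, written for float.

```c
#include <float.h>
#include <math.h>

/* Returns 1 when x and y differ by less than k * FLT_EPSILON relative to
   their combined magnitude, or when their difference is below FLT_MIN
   (the fallback for values near zero). Returns 0 otherwise. */
int nearly_equal(float x, float y, float k)
{
    float diff = fabsf(x - y);
    return diff < k * FLT_EPSILON * fabsf(x + y) || diff < FLT_MIN;
}
```

Note that with this test a small nonzero value such as 1e-13f does not compare as nearly equal to 0.0f for any modest k, because the tolerance scales with the magnitude of the operands.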
Since 0 is exactly representable as an IEEE754 floating-point number (or using any other implementation of f-p numbers I've ever worked with) comparison with 0 is probably safe. You might get bitten, however, if your program computes a value (such as theView.frame.origin.x) which you have reason to believe ought to be 0 but which your computation cannot guarantee to be 0.
To clarify a little, a computation such as :
areal = 0.0
will (unless your language or system is broken) create a value such that (areal==0.0) returns true but another computation such as
areal = 1.386 - 2.1*(0.66)
may not.
If you can assure yourself that your computations produce values which are 0 (and not just that they produce values which ought to be 0) then you can go ahead and compare f-p values with 0. If you can't assure yourself to the required degree, best stick to the usual approach of 'toleranced equality'.
In the worst cases the careless comparison of f-p values can be extremely dangerous: think avionics, weapons-guidance, power-plant operations, vehicle navigation, almost any application in which computation meets the real world.
For Angry Birds, not so dangerous.
I want to give a bit of a different answer than the others. They are great for answering your question as stated but probably not for what you need to know or what your real problem is.
Floating point in graphics is fine! But there is almost no need to ever compare floats directly. Why would you need to do that? Graphics uses floats to define intervals. And comparing if a float is within an interval also defined by floats is always well defined and merely needs to be consistent, not accurate or precise! As long as a pixel (which is also an interval!) can be assigned that's all graphics needs.
So if you want to test if your point is outside a [0..width[ range this is just fine. Just make sure you define inclusion consistently. For example always define inside is (x>=0 && x < width). The same goes for intersection or hit tests.
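As a sketch of what "define inclusion consistently" can mean in code (both function names are hypothetical), a single half-open test reused everywhere guarantees that adjacent intervals never both claim the same coordinate:

```c
/* Half-open interval test: lo is inside, hi is outside. */
int in_interval(float x, float lo, float hi)
{
    return x >= lo && x < hi;
}

/* Counts how many of n adjacent tiles of the given width contain x.
   Tile i covers [i * width, (i + 1) * width). */
int tiles_containing(float x, float width, int n)
{
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (in_interval(x, i * width, (i + 1) * width))
            count++;
    return count;
}
```

Because the upper bound of one tile and the lower bound of the next come from the same expression, a point on a shared edge belongs to exactly one tile, with no tolerance involved.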
However, if you are abusing a graphics coordinate as some kind of flag, like for example to see if a window is docked or not, you should not do this. Use a boolean flag that is separate from the graphics presentation layer instead.
Comparing to zero can be a safe operation, as long as the zero wasn't a calculated value (as noted in an above answer). The reason for this is that zero is a perfectly representable number in floating point.
Speaking of perfectly representable values: in single precision you get a 24-bit significand scaled by a power of two. So 1, 2, 4 are perfectly representable, as are .5, .25, and .125. As long as all your important bits fit in 24 bits, you are golden. So 10.625 can be represented precisely.
This is great, but will quickly fall apart under pressure. Two scenarios spring to mind:
1) When a calculation is involved. Don't trust that sqrt(3)*sqrt(3) == 3. It just won't be that way. And it probably won't be within an epsilon, as some of the other answers suggest.
2) When any non-power-of-2 (NPOT) is involved. So it may sound odd, but 0.1 is an infinite series in binary and therefore any calculation involving a number like this will be imprecise from the start.
(Oh and the original question mentioned comparisons to zero. Don't forget that -0.0 is also a perfectly valid floating-point value.)
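A quick way to see which values are exact is to round-trip them through float. This is a sketch with helper names of my own, assuming IEEE-754 float and double:

```c
#include <math.h>

/* Returns 1 if the double v is exactly representable as a float,
   i.e. converting to float and back loses nothing. */
int is_exact_in_float(double v)
{
    return (double)(float)v == v;
}

/* Returns 1 only for negative zero. It compares equal to 0.0,
   so the sign bit has to be inspected separately. */
int is_negative_zero(double v)
{
    return v == 0.0 && signbit(v) != 0;
}
```

Here 10.625 and 0.125 pass the round trip while 0.1 does not, and -0.0 still compares equal to 0.0 even though its sign bit is set.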
[The 'right answer' glosses over selecting K. Selecting K ends up being just as ad-hoc as selecting VISIBLE_SHIFT but selecting K is less obvious because unlike VISIBLE_SHIFT it is not grounded on any display property. Thus pick your poison - select K or select VISIBLE_SHIFT. This answer advocates selecting VISIBLE_SHIFT and then demonstrates the difficulty in selecting K]
Precisely because of rounding errors, you should not use comparison of 'exact' values for logical operations. In your specific case of a position on a visual display, it can't possibly matter if the position is 0.0 or 0.0000000003 - the difference is invisible to the eye. So your logic should be something like:
#define VISIBLE_SHIFT 0.0001 // for example
if (fabs(theView.frame.origin.x) < VISIBLE_SHIFT) { /* ... */ }
However, in the end, 'invisible to the eye' will depend on your display properties. If you can upper bound the display (you should be able to); then choose VISIBLE_SHIFT to be a fraction of that upper bound.
Now, the 'right answer' rests upon K so let's explore picking K. The 'right answer' above says:
K is a constant you choose such that the accumulated error of your
computations is definitely bounded by K units in the last place (and
if you're not sure you got the error bound calculation right, make K a
few times bigger than what your calculations say it should be)
So we need K. If getting K is more difficult, less intuitive than selecting my VISIBLE_SHIFT then you'll decide what works for you. To find K we are going to write a test program that looks at a bunch of K values so we can see how it behaves. Ought to be obvious how to choose K, if the 'right answer' is usable. No?
We are going to use, as the 'right answer' details:
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
Let's just try all values of K:
#include <math.h>
#include <float.h>
#include <stdio.h>
int main (void)
{
double x = 1e-13;
double y = 0.0;
double K = 1e22;
int i = 0;
for (; i < 32; i++, K = K/10.0)
{
printf ("K:%40.16lf -> ", K);
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
printf ("YES\n");
else
printf ("NO\n");
}
}
ebg#ebg$ gcc -o test test.c
ebg#ebg$ ./test
K:10000000000000000000000.0000000000000000 -> YES
K: 1000000000000000000000.0000000000000000 -> YES
K: 100000000000000000000.0000000000000000 -> YES
K: 10000000000000000000.0000000000000000 -> YES
K: 1000000000000000000.0000000000000000 -> YES
K: 100000000000000000.0000000000000000 -> YES
K: 10000000000000000.0000000000000000 -> YES
K: 1000000000000000.0000000000000000 -> NO
K: 100000000000000.0000000000000000 -> NO
K: 10000000000000.0000000000000000 -> NO
K: 1000000000000.0000000000000000 -> NO
K: 100000000000.0000000000000000 -> NO
K: 10000000000.0000000000000000 -> NO
K: 1000000000.0000000000000000 -> NO
K: 100000000.0000000000000000 -> NO
K: 10000000.0000000000000000 -> NO
K: 1000000.0000000000000000 -> NO
K: 100000.0000000000000000 -> NO
K: 10000.0000000000000000 -> NO
K: 1000.0000000000000000 -> NO
K: 100.0000000000000000 -> NO
K: 10.0000000000000000 -> NO
K: 1.0000000000000000 -> NO
K: 0.1000000000000000 -> NO
K: 0.0100000000000000 -> NO
K: 0.0010000000000000 -> NO
K: 0.0001000000000000 -> NO
K: 0.0000100000000000 -> NO
K: 0.0000010000000000 -> NO
K: 0.0000001000000000 -> NO
K: 0.0000000100000000 -> NO
K: 0.0000000010000000 -> NO
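Ah, so K should be 1e16 or larger if I want 1e-13 to be 'zero'. (With y = 0, fabs(x-y) equals fabs(x+y), so the first test reduces to K * DBL_EPSILON > 1, that is K greater than about 4.5e15, no matter how small x is.)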
So, I'd say you have two options:
Do a simple epsilon computation using your engineering judgement for the value of 'epsilon', as I've suggested. If you are doing graphics and 'zero' is meant to be a 'visible change' then examine your visual assets (images, etc) and judge what epsilon can be.
Don't attempt any floating point computations until you've read the non-cargo-cult answer's reference (and gotten your Ph.D in the process) and then use your non-intuitive judgement to select K.
The correct question: how does one compare points in Cocoa Touch?
The correct answer: CGPointEqualToPoint().
A different question: Are two calculated values the same?
The answer posted here: They are not.
How to check if they are close? If you want to check if they are close, then don't use CGPointEqualToPoint(). But, don't check to see if they are close. Do something that makes sense in the real world, like checking to see if a point is beyond a line or if a point is inside a sphere.
The last time I checked the C standard, there was no requirement for floating point operations on doubles (64 bits total, 53 bit mantissa) to be accurate to more than that precision. However, some hardware might do the operations in registers of greater precision, and the requirement was interpreted to mean no requirement to clear lower order bits (beyond the precision of the numbers being loaded into the registers). So you could get unexpected results of comparisons like this depending on what was left over in the registers from whoever slept there last.
That said, and despite my efforts to expunge it whenever I see it, the outfit where I work has lots of C code that is compiled using gcc and run on linux, and we have not noticed any of these unexpected results in a very long time. I have no idea whether this is because gcc is clearing the low-order bits for us, the 80-bit registers are not used for these operations on modern computers, the standard has been changed, or what. I'd like to know if anyone can quote chapter and verse.
You can use code like this to compare a float with zero:
if ((int)(theView.frame.origin.x * 100) == 0) {
// do important operation
}
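This compares with 0.01 accuracy: the cast truncates toward zero, so the test is true whenever the absolute value of origin.x is below 0.01. That is enough for a CGFloat in this case.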
Another issue that may need to be kept in mind is that different implementations do things differently. One example of this that I am very familiar with is the FP units on the Sony Playstation 2. They have significant discrepancies when compared to the IEEE FP hardware in any X86 device. The cited article mentions the complete lack of support for inf and NaN, and it gets worse.
Less well known is what I came to know as the "one bit multiply" error. For certain values of float x:
y = x * 1.0;
assert(y == x);
would fail the assert. In the general case, sometimes, but not always, the result of a FP multiply on the Playstation 2 had a mantissa that was a single bit less than the equivalent IEEE mantissa.
My point being that you should not assume that porting FP code from one platform to another will produce the same results. Any given platform is internally consistent, in that results don't change on that platform; they just may not agree with a different platform. E.g. CPython on x86 uses 64-bit doubles to represent floats, while CircuitPython on a Cortex M0 has to use software FP and only uses 32-bit floats. Needless to say, that will introduce discrepancies.
A quote I learned over 40 years ago is as true today as the day I learned it. "Doing floating point maths on a computer is like moving a pile of sand. Every time you do anything, you leave a little sand behind and pick up a little dirt."
Playstation is a registered trademark of Sony Corporation.
-(BOOL)isFloatEqual:(CGFloat)firstValue secondValue:(CGFloat)secondValue{
BOOL isEqual = NO;
NSNumber *firstValueNumber = [NSNumber numberWithDouble:firstValue];
NSNumber *secondValueNumber = [NSNumber numberWithDouble:secondValue];
isEqual = [firstValueNumber isEqualToNumber:secondValueNumber];
return isEqual;
}
I am using the following comparison function to compare two values to a given number of decimal places:
bool compare(const double value1, const double value2, const int precision)
{
int64_t magnitude = static_cast<int64_t>(std::pow(10, precision));
int64_t intValue1 = static_cast<int64_t>(value1 * magnitude);
int64_t intValue2 = static_cast<int64_t>(value2 * magnitude);
return intValue1 == intValue2;
}
// Compare 9 decimal places:
if (compare(theView.frame.origin.x, 0, 9)) {
// do important operation
}
I'd say the right thing is to declare each number as an object, and then define three things in that object: 1) an equality operator, 2) a setAcceptableDifference method, 3) the value itself. The equality operator returns true if the absolute difference of two values is less than the value set as acceptable.
You can subclass the object to suit the problem. For example, round bars of metal between 1 and 2 inches might be considered of equal diameter if their diameters differed by less than 0.0001 inches. So you'd call setAcceptableDifference with parameter 0.0001, and then use the equality operator with confidence.

How to avoid floating point round off error in unit tests?

I'm trying to write unit tests for some simple vector math functions that operate on arrays of single precision floating point numbers. The functions use SSE intrinsics and I'm getting false positives (at least I think) when running the tests on a 32-bit system (the tests pass on 64-bit). As the operation runs through the array, I accumulate more and more round off error. Here is a snippet of unit test code and output (my actual question(s) follow):
Test Setup:
static const int N = 1024;
static const float MSCALAR = 42.42f;
static void setup(void) {
input = _mm_malloc(sizeof(*input) * N, 16);
ainput = _mm_malloc(sizeof(*ainput) * N, 16);
output = _mm_malloc(sizeof(*output) * N, 16);
expected = _mm_malloc(sizeof(*expected) * N, 16);
memset(output, 0, sizeof(*output) * N);
for (int i = 0; i < N; i++) {
input[i] = i * 0.4f;
ainput[i] = i * 2.1f;
expected[i] = (input[i] * MSCALAR) + ainput[i];
}
}
My main test code then calls the function to be tested (which does the same calculation used to generate the expected array) and checks its output against the expected array generated above. The check is for closeness (within 0.0001) not equality.
Sample output:
0.000000 0.000000 delta: 0.000000
44.419998 44.419998 delta: 0.000000
...snip 100 or so lines...
2043.319946 2043.319946 delta: 0.000000
2087.739746 2087.739990 delta: 0.000244
...snip 100 or so lines...
4086.639893 4086.639893 delta: 0.000000
4131.059570 4131.060059 delta: 0.000488
4175.479492 4175.479980 delta: 0.000488
...etc, etc...
I know I have two problems:
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
So my question is, what is the proper way to write meaningful and portable unit tests for math operations on floating point data?
*By portable I mean should pass on both 32 and 64 bit architectures.
Per a comment, we see that the function being tested is essentially:
for (int i = 0; i < N; ++i)
D[i] = A[i] * b + C[i];
where A[i], b, C[i], and D[i] all have type float. When referring to the data of a single iteration, I will use a, c, and d for A[i], C[i], and D[i].
Below is an analysis of what we could use for an error tolerance when testing this function. First, though, I want to point out that we can design the test so that there is no error. We can choose the values of A[i], b, C[i], and D[i] so that all the results, both final and intermediate results, are exactly representable and there is no rounding error. Obviously, this will not test the floating-point arithmetic, but that is not the goal. The goal is to test the code of the function: Does it execute instructions that compute the desired function? Simply choosing values that would reveal any failures to use the right data, to add, to multiply, or to store to the right location will suffice to reveal bugs in the function. We trust that the hardware performs floating-point correctly and are not testing that; we just want to test that the function was written correctly. To accomplish this, we could, for example, set b to a power of two, A[i] to various small integers, and C[i] to various small integers multiplied by b. I could detail limits on these values more precisely if desired. Then all results would be exact, and any need to allow for a tolerance in comparison would vanish.
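Here is a sketch of such an error-free test. The function scale_add is only a plain C stand-in for the SSE routine under test, and all names and the specific values (b = 0.25, N = 64) are my assumptions. Every product and sum is exactly representable, so the comparison can use == with no tolerance:

```c
#include <stddef.h>

/* Stand-in for the function under test (the real one uses SSE intrinsics). */
void scale_add(float *d, const float *a, float b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        d[i] = a[i] * b + c[i];
}

/* Runs the function on exactly representable data and returns the
   number of elements that differ from the exact expected value. */
size_t count_exact_mismatches(void)
{
    enum { N = 64 };
    const float b = 0.25f;              /* a power of two */
    float a[N], c[N], d[N];
    size_t mismatches = 0;

    for (size_t i = 0; i < N; ++i) {
        a[i] = (float)i;                /* small integer */
        c[i] = (float)(i % 7) * b;      /* small integer times b */
    }
    scale_add(d, a, b, c, N);

    for (size_t i = 0; i < N; ++i) {
        /* i*b + (i%7)*b == (i + i%7) / 4, computed here from an
           integer so that no rounding can occur. */
        float expected = (float)(i + i % 7) / 4.0f;
        if (d[i] != expected)
            mismatches++;
    }
    return mismatches;
}
```

A correct implementation yields zero mismatches on any conforming platform, whether it uses x87, SSE, or a fused multiply-add, because no step has anything to round.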
That aside, let us proceed to error analysis.
The goal is to find bugs in the implementation of the function. To do this, we can ignore small errors in the floating-point arithmetic, because the kinds of bugs we are seeking almost always cause large errors: The wrong operation is used, the wrong data is used, or the result is not stored in the desired location, so the actual result is almost always very different from the expected result.
Now the question is how much error should we tolerate? Because bugs will generally cause large errors, we can set the tolerance quite high. However, in floating-point, “high” is still relative; an error of one million is small compared to values in the trillions, but it is too high to discover errors when the input values are in the ones. So we ought to do at least some analysis to decide the level.
The function being tested will use SSE intrinsics. This means it will, for each i in the loop above, either perform a floating-point multiply and a floating-point add or will perform a fused floating-point multiply-add. The potential errors in the latter are a subset of the former, so I will use the former. The floating-point operations for a*b+c do some rounding so that they calculate a result that is approximately a•b+c (interpreted as an exact mathematical expression, not floating-point). We can write the exact value calculated as (a•b•(1+e0)+c)•(1+e1) for some errors e0 and e1 with magnitudes at most 2^-24, provided all the values are in the normal range of the floating-point format. (2^-24 is the maximum relative error that can occur in any correctly rounded elementary floating-point operation in round-to-nearest mode in the IEEE-754 32-bit binary floating-point format. Rounding in round-to-nearest mode changes the mathematical value by at most half the value of the least significant bit in the significand, which is 23 bits below the most significant bit.)
Next, we consider what value the test program produces for its expected value. It uses the C code d = a*b + c;. (I have converted the long names in the question to shorter names.) Ideally, this would also calculate a multiply and an add in IEEE-754 32-bit binary floating-point. If it did, then the result would be identical to the function being tested, and there would be no need to allow for any tolerance in comparison. However, the C standard allows implementations some flexibility in performing floating-point arithmetic, and there are non-conforming implementations that take more liberties than the standard allows.
A common behavior is for an expression to be computed with more precision than its nominal type. Some compilers may calculate a*b + c using double or long double arithmetic. The C standard requires that results be converted to the nominal type in casts or assignments; extra precision must be discarded. If the C implementation is using extra precision, then the calculation proceeds: a*b is calculated with extra precision, yielding exactly a•b, because double and long double have enough precision to exactly represent the product of any two float values. A C implementation might then round this result to float. This is unlikely, but I allow for it anyway. However, I also dismiss it because it moves the expected result to be closer to the result of the function being tested, and we just need to know the maximum error that can occur. So I will continue, with the worse (more distant) case, that the result so far is a•b. Then c is added, yielding (a•b+c)•(1+e2) for some e2 with magnitude at most 2^-53 (the maximum relative error of normal numbers in the 64-bit binary format). Finally, this value is converted to float for assignment to d, yielding (a•b+c)•(1+e2)•(1+e3) for some e3 with magnitude at most 2^-24.
Now we have expressions for the exact result computed by a correctly operating function, (a•b•(1+e0)+c)•(1+e1), and for the exact result computed by the test code, (a•b+c)•(1+e2)•(1+e3), and we can calculate a bound on how much they can differ. Simple algebra tells us the exact difference is a•b•(e0+e1+e0•e1-e2-e3-e2•e3)+c•(e1-e2-e3-e2•e3). This is a simple function of e0, e1, e2, and e3, and we can see its extremes occur at endpoints of the potential values for e0, e1, e2, and e3. There are some complications due to interactions between possibilities for the signs of the values, but we can simply allow some extra error for the worst case. A bound on the maximum magnitude of the difference is |a•b|•(3•2^-24+2^-53+2^-48)+|c|•(2•2^-24+2^-53+2^-77).
Because we have plenty of room, we can simplify that, as long as we do it in the direction of making the values larger. E.g., it might be convenient to use |a•b|•3.001•2^-24+|c|•2.001•2^-24. This expression should suffice to allow for rounding in floating-point calculations while detecting nearly all implementation errors.
Note that the expression is not proportional to the final value, a*b+c, as calculated either by the function being tested or by the test program. This means that, in general, tests using a tolerance relative to the final values calculated by the function being tested or by the test program are wrong. The proper form of a test should be something like this:
double tolerance = fabs(input[i] * MSCALAR) * 0x3.001p-24 + fabs(ainput[i]) * 0x2.001p-24;
double difference = fabs(output[i] - expected[i]);
if (! (difference < tolerance))
// Report error here.
In summary, this gives us a tolerance that is larger than any possible differences due to floating-point rounding, so it should never give us a false positive (report the test function is broken when it is not). However, it is very small compared to the errors caused by the bugs we want to detect, so it should rarely give us a false negative (fail to report an actual bug).
(Note that there are also rounding errors computing the tolerance, but they are smaller than the slop I have allowed for in using .001 in the coefficients, so we can ignore them.)
(Also note that ! (difference < tolerance) is not equivalent to difference >= tolerance. If the function produces a NaN, due to a bug, any comparison yields false: both difference < tolerance and difference >= tolerance yield false, but ! (difference < tolerance) yields true.)
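Packaged as a helper, the check could look like the following sketch. The function name and signature are mine. I also use <= instead of <, which is a deviation from the snippet above: when a•b and c are both zero (as for i = 0 in the question's data) the tolerance is zero, and a strict comparison would reject an exact match.

```c
#include <math.h>

/* Returns 1 if `actual` lies within the rounding-error bound of
   `expected` for d = a*b + c, and 0 otherwise. A NaN in either value
   makes the comparison false, so it is reported as a failure. */
int within_tolerance(float a, float b, float c, float actual, float expected)
{
    double tolerance = fabs((double)a * b) * 0x3.001p-24
                     + fabs((double)c) * 0x2.001p-24;
    double difference = fabs((double)actual - expected);

    /* <= accepts the exact-zero case; NaN still fails because every
       ordered comparison involving NaN yields false. */
    return difference <= tolerance;
}
```

A test loop would then report an error for element i whenever within_tolerance(input[i], MSCALAR, ainput[i], output[i], expected[i]) returns 0.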
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
If you are using GCC as 32-bit compiler, you can tell it to generate SSE2 code still with options -msse2 -mfpmath=sse. Clang can be told to do the same thing with one of the two options and ignores the other one (I forget which). In both cases the binary program should implement strict IEEE 754 semantics, and compute the same result as a 64-bit program that also uses SSE2 instructions to implement strict IEEE 754 semantics.
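One way to see which behaviour a given build has is the standard C99 macro FLT_EVAL_METHOD from float.h. The sketch below (helper names are mine) simply reports it; I would expect 0 for SSE code generation and 2 for x87, but that is an assumption about your toolchain that is worth checking rather than trusting:

```c
#include <float.h>
#include <stdio.h>

/* FLT_EVAL_METHOD (C99): 0 = operations are evaluated in their nominal
   type, 1 = float and double are evaluated as double, 2 = everything is
   evaluated as long double, negative = indeterminable. */
int float_eval_method(void)
{
    return FLT_EVAL_METHOD;
}

/* Human-readable description of an evaluation method value. */
const char *describe_eval_method(int method)
{
    switch (method) {
    case 0:  return "each operation is rounded to its nominal type";
    case 1:  return "float and double are evaluated as double";
    case 2:  return "all operations are evaluated as long double";
    default: return "evaluation precision is indeterminable or implementation-defined";
    }
}

void print_eval_method(void)
{
    printf("FLT_EVAL_METHOD = %d: %s\n", float_eval_method(),
           describe_eval_method(float_eval_method()));
}
```

Running print_eval_method() in both the 32-bit and the 64-bit build shows whether the two are evaluating float expressions the same way.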
Non-exact representation of my 42.42 value that I'm using to generate expected values.
The C standard says that a literal such as 42.42f must be converted to either the floating-point number immediately above or immediately below the number represented in decimal. Moreover, if the literal is representable exactly as a floating-point number of the intended format, then this value must be used. However, a quality compiler (such as GCC) will give you(*) the nearest representable floating-point number, of which there is only one, so again, this is not a real portability issue as long as you are using a quality compiler (or at the very least, the same compiler).
Should this turn out to be a problem, a solution is to write an exact representation of the constants you intend. Such an exact representation can be very long in decimal format (up to 750 decimal digits for the exact representation of a double) but is always quite compact in C99's hexadecimal format: 0x1.535c28p+5 for the exact representation of the float nearest to 42.42. A recent version of the static analysis platform for C programs Frama-C can provide the hexadecimal representation of all inexact decimal floating-point constants with option -warn-decimal-float:all.
(*) barring a few conversion bugs in older GCC versions. See Rick Regan's blog for details.
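As a sketch of the hexadecimal approach (the constant name MSCALAR_EXACT and the helper are hypothetical), the scalar from the question can be written exactly, and a correctly rounding compiler should produce the same float for the decimal spelling:

```c
#include <stdio.h>

/* The float nearest to decimal 42.42, written exactly in C99
   hexadecimal floating-point notation. */
static const float MSCALAR_EXACT = 0x1.535c28p+5f;

/* Returns 1 if the compiler converts the decimal literal 42.42f to the
   same float as the exact hexadecimal literal. */
int decimal_matches_hex(void)
{
    return 42.42f == MSCALAR_EXACT;
}

/* Prints the constant in hexadecimal and in decimal for inspection. */
void print_scalar(void)
{
    printf("%a = %.9g\n", (double)MSCALAR_EXACT, (double)MSCALAR_EXACT);
}
```

The float itself is still not equal to the real number 42.42; the hexadecimal form only removes any doubt about which float you get.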

Optimize C code

I have the following code
void Fun2()
{
if(X<=A)
X=ceil(M*1.0/A*X);
else
X=M*1.0/(M-A)*(M-X);
}
I want to make it fast using C99, taking the following points into account.
X and A are 32-bit values that I declare as uint64_t, while M is a static const uint64_t.
This function is called by another function, and the value of A is changed to a new value after every n calls.
The optimization needed is in execution time. The CPU is a Core i3 and the OS is Windows 7.
The math model I want to implement it is
F=ceil(Max/A*X) if x<=A
F=floor(M/(M-A)*(M-X)) if x>A
For clarity and to avoid confusion, my previous post was:
I have the following code
void Fun2()
{
if(X0<=A)
X0=ceil(Max1*X0);
else
X0=Max2*(Max-X0);
}
I want to make it run fast using C99, taking the following into account:
X0, A, Max1, and Max2 hold 32-bit values, but I declare them as uint64_t; Max is a static const uint64_t.
This function is called by another function, and the values of Max1, A, and Max2 change to random values every n calls.
I work on Windows 7 with the Code::Blocks IDE.
Thanks
It is completely pointless and impossible to optimize code like this without a specific target in mind. In order to do so, you need the following knowledge:
Which CPU is used.
Which OS is used (if any).
In-depth knowledge of the above, to the point where you know more, or about as much of the system as the people who wrote the optimizer for the given compiler port.
Which kind of optimization is most important: execution speed, RAM usage or program size.
The only kind of optimization you can do without knowing the above is on the algorithm level. There are no such algorithms in the code posted.
Thus your question cannot be answered by anyone until more information is provided.
If "fast manner" means fast execution, your first change is to declare this function as an inline one, a feature of C99.
inline void Fun2()
{
...
...
}
I recall that GNU CC has an interesting extension that may help optimize this code as well: the __builtin_expect built-in, usually wrapped in likely()/unlikely() macros. It is not C99 compliant, but it is worth noting. Your function has an if statement; if you know in advance how probable each branch is, you can write:
if (likely(X0<=A)).....
if X0 is probably less than or equal to A, or:
if (unlikely(X0<=A)).....
if it is probably not.
With that information, the compiler will lay out the comparison and jump so that the most probable branch is executed with no taken jumps, which makes it faster on architectures with no branch prediction.
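A minimal sketch of how such macros are commonly defined, assuming GCC or Clang (other compilers get a plain fallback). The function fun2_step and its parameters are hypothetical stand-ins for the globals in the question, and whether the hint helps at all depends on the target and on how predictable the branch really is.

```c
#include <stdint.h>

/* likely()/unlikely() are not built in; they are conventionally defined
 * on top of the GCC/Clang extension __builtin_expect. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hypothetical pure-function version of Fun2(): returns the new X0.
 * The hint says that x0 <= a is the common case. */
uint64_t fun2_step(uint64_t x0, uint64_t a,
                   uint64_t max1, uint64_t max2, uint64_t max)
{
    if (likely(x0 <= a))
        return max1 * x0;
    return max2 * (max - x0);
}
```

The hint only affects code layout; the result is the same with or without it.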
Another thing that may improve speed is to use the ?: ternary operator, as both branches assign a value to the same variable, something like this:
inline void Fun2()
{
X0 = (X0<=A)? Max1*X0 : Max2*(Max-X0);
}
BTW: why use ceil()? ceil() takes a double and rounds it up to the smallest integer not less than it. If X0 and Max1 are integers, the product has no fractional part, so ceil() won't have any effect.
I think one thing that can be improved is not to use floating point. Your code mostly deals with integers, so you want to stick to integer arithmetic.
The only floating point number is Max1. If it's always whole, it can be an integer. If not, you may be able to replace it with two integers: Max1*X0 -> X0 * Max1_nom / Max1_denom. If you calculate the numerator/denominator once and use them many times, this can speed things up.
I'd transform the math model to
Ceil (M*(X-0) / (A-0)) when X<=A
Floor (M*(X-M) / (A-M)) when X>A
with, for positive integers P and Q,
Ceil (P / Q) = Floor((P + (Q-1)) / Q)
Which, substituted into the first, gives:
((M * (X - m0) + c ) / ( A - m0))
where
c = A-1; m0 = 0, when X <= A
c = 0; m0 = M, when X > A
Everything will be performed in integer arithmetic, but it'll be quite tough to calculate the reciprocals in advance.
It may still be possible to use some form of DDA to avoid calculating the division between iterations.
The temporary constants c and m0 simply unify the pipeline for both branches, as the next step is in pursuit of parallelism.
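The integer-only form above could be sketched as follows. The function name and signature are hypothetical; the sketch assumes 0 < A < M, X <= M, and that all values fit in 32 bits, so the 64-bit products cannot overflow. For the second branch it uses M-X and M-A instead of X-M and A-M, which gives the same quotient but keeps unsigned arithmetic valid.

```c
#include <stdint.h>

/* Hypothetical helper: one step of the model in pure integer arithmetic.
 * Assumes 0 < a < m, x <= m, and all values below 2^32. */
uint64_t fun2_int(uint64_t x, uint64_t a, uint64_t m)
{
    if (x <= a) {
        /* ceil(m*x / a) == floor((m*x + (a - 1)) / a) */
        return (m * x + (a - 1)) / a;
    }
    /* floor(m*(m - x) / (m - a)); integer division already truncates */
    return (m * (m - x)) / (m - a);
}
```

For example, with M = 100 and A = 30, X = 10 gives ceil(1000/30) = 34 and X = 50 gives floor(5000/70) = 71.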

Fast inverse square of double in C/C++

Recently I was profiling a program in which the hotspot is definitely this
double d = somevalue();
double d2=d*d;
double c = 1.0/d2; // HOT SPOT
The value d2 is not used afterwards because I only need the value c. Some time ago I read about the Carmack method for the fast inverse square root. That is obviously not my case, but I'm wondering whether a similar algorithm can help me compute 1/x^2.
I need quite accurate precision, I've checked that my program doesn't give correct results with gcc -ffast-math option. (g++-4.5)
The tricks for doing fast square roots and the like get their performance by sacrificing precision. (Well, most of them.)
Are you sure you need double precision? You can sacrifice precision easily enough:
double d = somevalue();
float c = 1.0f / ((float) d * (float) d);
The 1.0f is absolutely mandatory in this case, if you use 1.0 instead you will get double precision.
Have you tried enabling "sloppy" math on your compiler? On GCC you can use -ffast-math, there are similar options for other compilers. The sloppy math may be more than good enough for your application. (Edit: I did not see any difference in the resulting assembly.)
If you are using GCC, have you considered using -mrecip? There is a "reciprocal estimate" function which only has about 12 bits of precision, but it is much faster. You can use the Newton-Raphson method to increase the precision of the result. The -mrecip option will cause the compiler to automatically generate the reciprocal estimate and Newton-Raphson steps for you, although you can always write the assembly yourself if you want to fine tune the performance-precision trade-off. (Newton-Raphson converges very quickly.) (Edit: I was unable to get GCC to generate RCPSS. See below.)
I found a blog post (source) discussing the exact problem you are going through, and the author's conclusion is that the techniques like the Carmack method are not competitive with the RCPSS instruction (which the -mrecip flag on GCC uses).
The reason why division can be so slow is because processors generally only have one division unit and it's often not pipelined. So, you can have a few multiplications in the pipe all executing simultaneously, but no division can be issued until the previous division finishes.
Tricks that don't work
Carmack's method: It is obsolete on modern processors, which have reciprocal estimation opcodes. For reciprocals, the best version I've seen only gives one bit of precision -- nothing compared to the 12 bits of RCPSS. I think it is a coincidence that the trick works so well for reciprocal square roots; a coincidence that is unlikely to be repeated.
Relabeling variables. As far as the compiler is concerned, there is very little difference between 1.0/(x*x) and double x2 = x*x; 1.0/x2. I would be surprised if you found a compiler that generates different code for the two versions with optimizations turned on even to the lowest level.
Using pow. The pow library function is a total monster. With GCC's -ffast-math turned off, the library call is fairly expensive. With GCC's -ffast-math turned on, you get the exact same assembly code for pow(x, -2) as you do for 1.0/(x*x), so there is no benefit.
Update
Here is an example of a Newton-Raphson approximation for the inverse square of a double-precision floating-point value.
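/* RECIP_ITER is a compile-time constant: the number of refinement steps. */
static double invsq(double x)
{
    double y;
    int i;
    /* y = rough (about 12-bit) estimate of 1/x, computed in single precision */
    __asm__ (
        "cvtpd2ps %1, %0\n\t"
        "rcpss %0, %0\n\t"
        "cvtps2pd %0, %0"
        : "=x"(y)
        : "x"(x));
    /* each Newton-Raphson step roughly doubles the number of correct bits */
    for (i = 0; i < RECIP_ITER; ++i)
        y *= 2 - x * y;
    return y * y; /* (1/x)^2 */
}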
Unfortunately, with RECIP_ITER=1 benchmarks on my computer put it slightly slower (~5%) than the simple version 1.0/(x*x). It's faster (2x as fast) with zero iterations, but then you only get 12 bits of precision. I don't know if 12 bits is enough for you.
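For readers who want to experiment with the refinement step without inline assembly, here is a portable sketch. It is not the benchmarked code: the function name is hypothetical, and the initial estimate comes from a single-precision division rather than RCPSS, so it starts with about 24 correct bits instead of 12 and is not expected to be faster than 1.0/(x*x).

```c
/* Hypothetical portable variant of invsq(). The initial estimate uses a
 * single-precision division instead of RCPSS (an assumption made purely
 * for portability), so x must be within float range. */
double invsq_portable(double x, int iterations)
{
    /* About 24 correct bits of 1/x. */
    double y = (double)(1.0f / (float)x);
    int i;

    /* Newton-Raphson for f(y) = 1/y - x:  y <- y * (2 - x*y).
     * Each step roughly doubles the number of correct bits. */
    for (i = 0; i < iterations; ++i)
        y *= 2.0 - x * y;

    return y * y;   /* (1/x)^2 */
}
```

With this starting point, two iterations are enough to reach roughly double precision; starting from a 12-bit estimate would take one more.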
I think one of the problems here is that this is too small of a micro-optimization; at this scale the compiler writers are on nearly equal footing with the assembly hackers. Maybe if we had the bigger picture we could see a way to make it faster.
For example, you said that -ffast-math caused an undesirable loss of precision; this may indicate a numerical stability problem in the algorithm you are using. With the right choice of algorithm, many problems can be solved with float instead of double. (Of course, you may just need more than 24 bits. I don't know.)
I suspect the RCPSS method shines if you want to compute several of these in parallel.
Yes, you can certainly try and work something out. Let me just give you some general ideas, you can fill in the details.
First, let's see why Carmack's root works:
We write x = M × 2^E in the usual way. Now recall that the IEEE float stores the exponent offset by a bias: if e denotes the exponent field, we have e = Bias + E ≥ 0. Rearranging, we get E = e − Bias.
Now for the inverse square root: x^(−1/2) = M^(−1/2) × 2^(−E/2). The new exponent field is:
e' = Bias − E/2 = 3/2 Bias − e/2
With bit fiddling, we can get the value e/2 from e by shifting, and 3/2 Bias is just a constant.
Moreover, the mantissa M is stored as 1.0 + m with m < 1, and we can approximate M^(−1/2) as 1 − m/2. Again, the fact that only m is stored in binary means that we get the division by two by simple bit shifting.
Now we look at x^(−2): this is equal to M^(−2) × 2^(−2E), and we are looking for an exponent field:
e' = Bias − 2 E = 3 Bias − 2 e
Again, 3 Bias is just a constant, and you can get 2 e from e by bit shifting. As for the mantissa, you can approximate (1 + m)^(−2) by 1 − 2 m, and so the problem reduces to obtaining 2 m from m.
Note that Carmack's magic floating-point fiddling doesn't actually compute the result right away. Rather, it produces a remarkably accurate estimate, which is used as the starting point for a traditional, iterative computation. But because the estimate is so good, you only need very few rounds of subsequent iteration to get an acceptable result.
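To make the exponent relation e' = 3 Bias − 2 e concrete, here is a rough sketch for IEEE 754 doubles (Bias = 1023, 52 mantissa bits). The function names and the untuned constant are assumptions of this example. Doubling the whole bit pattern doubles both the exponent field and the stored mantissa bits, which is the "1 − 2 × mantissa" approximation described above. The raw estimate can be off by well over 10%, so several Newton-Raphson steps are needed, and this is unlikely to beat a plain division.

```c
#include <stdint.h>
#include <string.h>

/* Rough estimate of 1/x^2 via bits' = (3*Bias << 52) - 2*bits.
 * Assumes IEEE 754 binary64 and a positive, normal x of moderate
 * magnitude (roughly 2^-500 < x < 2^500). */
double inv_square_estimate(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                  /* safe type punning */
    bits = UINT64_C(0xBFD0000000000000) - 2 * bits;  /* 3*1023 << 52 */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Refines the estimate with Newton-Raphson on f(y) = 1/y - x^2:
 * y <- y * (2 - x^2 * y). Each step roughly squares the relative error. */
double inv_square(double x, int iterations)
{
    double x2 = x * x;
    double y = inv_square_estimate(x);
    int i;
    for (i = 0; i < iterations; ++i)
        y *= 2.0 - x2 * y;
    return y;
}
```

Powers of two come out exact from the estimate alone; for other inputs around five iterations are needed to approach double precision, which is why the estimate would need a tuned constant or a better starting point to be competitive.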
For your current program you have identified the hotspot - good. As an alternative to speeding up 1/d^2, you have the option of changing the program so that it does not compute 1/d^2 so often. Can you hoist it out of an inner loop? For how many different values of d do you compute 1/d^2? Could you pre-compute all the values you need and then look up the results? This is a bit cumbersome for 1/d^2, but if 1/d^2 is part of some larger chunk of code, it might be worthwhile applying this trick to that. You say that if you lower the precision, you don't get good enough answers. Is there any way you can rephrase the code, that might provide better behaviour? Numerical analysis is subtle enough that it might be worth trying a few things and seeing what happened.
Ideally, of course, you would find some optimised routine that draws on years of research - is there anything in lapack or linpack that you could link to?

How can I account for round-off errors in floating-point arithmetic for inverse trig (and sqrt) functions (in C)?

I have a fairly complicated function that takes several double values that represent two vectors in 3-space of the form (magnitude, latitude, longitude) where latitude and longitude are in radians, and an angle. The purpose of the function is to rotate the first vector around the second by the angle specified and return the resultant vector. I have already verified that the code is logically correct and works.
The expected purpose of the function is for graphics, so double precision is not necessary; however, on the target platform, trig (and sqrt) functions that take floats (sinf, cosf, atan2f, asinf, acosf and sqrtf specifically) work faster on doubles than on floats (probably because the instruction to calculate such values may actually require a double; if a float is passed, the value must be cast to a double, which requires copying it to an area with more memory -- i.e. overhead). As a result, all of the variables involved in the function are double precision.
Here is the issue: I am trying to optimize my function so that it can be called more times per second. I have therefore replaced the calls to sin, cos, sqrt, et cetera with calls to the floating point versions of those functions, as they result in a 3-4 times speed increase overall. This works for almost all inputs; however, if the input vectors are close to parallel with the standard unit vectors (i, j, or k), round-off errors for the various functions build up enough to cause later calls to sqrtf or inverse trig functions (asinf, acosf, atan2f) to pass arguments that are just barely outside of the domain of those functions.
So, I am left with this dilemma: either I can only call double precision functions and avoid the problem (and end up with a limit of about 1,300,000 vector operations per second), or I can try to come up with something else. Ultimately, I would like a way to sanitize the input to the inverse trig functions to take care of edge cases (it is trivial to do so for sqrt: just use abs). Branching is not an option, as even a single conditional statement adds so much overhead that any performance gains are lost.
So, any ideas?
Edit: someone expressed confusion over my using doubles versus floating point operations. The function is much faster if I actually store all my values in double-size containers (I.E. double-type variables) than if I store them in float-size containers. However, floating point precision trig operations are faster than double precision trig operations for obvious reasons.
Basically, you need to find a numerically stable algorithm that solves your problem. There are no generic solutions to this kind of thing; it needs to be done for your specific case, using concepts such as the condition number of the individual steps. And it may in fact be impossible if the underlying problem is itself ill-conditioned.
Single precision floating point inherently introduces error. So, you need to build your math so that all comparisons have a certain degree of "slop" by using an epsilon factor, and you need to sanitize inputs to functions with limited domains.
The former is easy enough when branching, eg
bool IsAlmostEqual( float a, float b ) { return fabsf(a-b) < 0.001f; } // or
bool IsAlmostEqual( float a, float b ) { return fabsf(a-b) < (fabsf(a) * 0.0001f); } // for relative error
but that's messy. Clamping domain inputs is a little trickier, but better. The key is to use conditional move operators, which in general do something like
float ExampleOfConditionalMoveIntrinsic( float comparand, float a, float b )
{ return comparand >= 0.0f ? a : b ; }
in a single op, without incurring a branch.
These vary depending on architecture. On the x87 floating point unit you can do it with the FCMOV conditional-move op, but that is clumsy because it depends on condition flags being set previously, so it's slow. Also, there isn't a consistent compiler intrinsic for cmov. This is one of the reasons why we avoid x87 floating point in favor of SSE2 scalar math where possible.
Conditional move is much better supported in SSE by pairing a comparison operator with a bitwise AND. This is preferable even for scalar math:
// assuming you've already used _mm_load_ss to load your floats onto registers
#include <xmmintrin.h>
__m128 fsel( __m128 comparand, __m128 a, __m128 b )
{
    __m128 zero = _mm_setzero_ps();
    // set low word of mask to all 1s if comparand >= 0
    __m128 mask = _mm_cmpge_ss( comparand, zero );
    a = _mm_and_ps( a, mask ); // a = a & mask
    b = _mm_andnot_ps( mask, b ); // b = ~mask & b
    return _mm_or_ps( a, b ); // return a | b (only the low lane is meaningful)
}
Compilers are better, but not great, about emitting this sort of pattern for ternaries when SSE2 scalar math is enabled. You can do that with the compiler flag /arch:sse2 on MSVC or -mfpmath=sse on GCC.
On the PowerPC and many other RISC architectures, fsel() is a hardware opcode and thus usually a compiler intrinsic as well.
Have you looked at the Graphics Programming Black Book or perhaps handing the calculations off to your GPU?

Resources