Rules-of-thumb for minimising floating-point errors in C?

Regarding minimising the error in floating-point operations, if I have an operation such as the following in C:
float a = 123.456;
float b = 456.789;
float r = 0.12345;
a = a - (r * b);
Will the result of the calculation change if I split the multiplication and subtraction steps out, i.e.:
float c = r * b;
a = a - c;
I am wondering whether a CPU would then treat these calculations differently and thereby the error may be smaller in one case?
If not, which I presume anyway, are there any good rules-of-thumb to mitigate against floating-point error? Can I massage data in a way that will help?
Please don't just say "use higher precision" - that's not what I'm after.
EDIT
For information about the data, in the general sense errors seem to be worse when the operation results in a very large number like 123456789. Small numbers, such as 1.23456789, seem to yield more accurate results after operations. Am I imagining this, or would scaling larger numbers help accuracy?

Note: this answer starts with a lengthy discussion of the distinction between a = a - (r * b); and float c = r * b; a = a - c; with a C99-compliant compiler. The part of the question about improving accuracy while avoiding extended precision is covered at the end.
Extended floating-point precision for intermediate results
If your C99 compiler defines FLT_EVAL_METHOD as 0, then the two computations can be expected to produce exactly the same result. If the compiler defines FLT_EVAL_METHOD to 1 or 2, then a = a - (r * b); will be more precise for some values of a, r and b, because all intermediate computations will be done at an extended precision (double for the value 1 and long double for the value 2).
The program cannot set FLT_EVAL_METHOD, but you can use command-line options to change the way your compiler computes with floating-point, and the compiler will then change its definition of FLT_EVAL_METHOD accordingly.
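As a minimal sketch, the value your compiler picked can be inspected at run time or compile time. The function names below are made up for illustration; only the FLT_EVAL_METHOD macro from <float.h> is standard.

```c
#include <float.h>

/* Describe a FLT_EVAL_METHOD value using the meanings assigned by C99.
   (Helper name is hypothetical, not part of any library.) */
const char *eval_method_description(int method)
{
    switch (method) {
    case 0:  return "each operation is evaluated in the precision of its own type";
    case 1:  return "float and double operations are evaluated as double";
    case 2:  return "all operations are evaluated as long double";
    default: return "indeterminable or implementation-defined";
    }
}

/* Description for the compiler and options actually in use. */
const char *current_eval_method(void)
{
    return eval_method_description(FLT_EVAL_METHOD);
}
```

Printing current_eval_method() once at start-up tells you which of the cases discussed above applies to your build.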
Contraction of some intermediate results
Depending on whether you use #pragma STDC FP_CONTRACT in your program and on your compiler's default setting for this pragma, some compound floating-point expressions can be contracted into single instructions that behave as if the intermediate result was computed with infinite precision. This happens to be a possibility for your example when targeting a modern processor, as the fused-multiply-add instruction will compute a directly and as accurately as allowed by the floating-point type.
However, you should bear in mind that the contraction only takes place at the compiler's option, without any guarantees. The compiler uses the FMA instruction to optimize speed, not accuracy, so the transformation may not take place at lower optimization levels. Sometimes several transformations are possible (e.g. a * b + c * d can be computed either as fmaf(c, d, a*b) or as fmaf(a, b, c*d)) and the compiler may choose one or the other.
In short, the contraction of floating-point computations is not intended to help you achieve accuracy. You might as well make sure it is disabled if you like reproducible results.
However, in the particular case of the fused-multiply-add compound operation, you can use the C99 standard function fmaf() to tell the compiler to compute the multiplication and addition in a single step with a single rounding. If you do this, then the compiler will not be allowed to produce anything other than the best result for a.
float fmaf(float x, float y, float z);
DESCRIPTION
The fma() functions compute (x*y)+z, rounded as one ternary operation:
they compute the value (as if) to infinite precision and round once to
the result format, according to the current rounding mode.
Note that if the FMA instruction is not available, your compiler's implementation of the function fmaf() will at best just use higher precision, and if this happens on your compilation platform, you might just as well use the type double for the accumulator: it will be faster and more accurate than using fmaf(). In the worst case, a flawed implementation of fmaf() will be provided.
Improving accuracy while only using single-precision
Use Kahan summation if your computation involves a long chain of additions. Some accuracy can be gained by simply summing the r*b terms computed as single-precision products, assuming there are many of them. If you wish to gain more accuracy, you might want to compute r*b itself exactly as the sum of two single-precision numbers, but if you do this you might as well switch to double-single arithmetic entirely. Double-single arithmetic would be the same as the double-double technique succinctly described here, but with single-precision numbers instead.
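A minimal sketch of Kahan summation in single precision, next to a plain loop for comparison. The function names are illustrative. It assumes the code is compiled without value-changing options such as -ffast-math, which would allow the compiler to simplify the compensation away.

```c
#include <stddef.h>

/* Plain left-to-right summation, for comparison. */
float naive_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Kahan (compensated) summation using only float arithmetic. */
float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;              /* rounding error carried over from earlier additions */
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;      /* apply the carried-over correction to the next term */
        float t = sum + y;       /* low-order bits of y may be lost here */
        c = (t - sum) - y;       /* recover what was lost, with opposite sign */
        sum = t;
    }
    return sum;
}
```

Summing 1.0f followed by four copies of 2^-25 shows the effect: each small term is below half an ulp of 1.0f, so the naive loop returns exactly 1.0f, while the compensated loop returns 1 + 2^-23, which is the exact total.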

Related

Should I use pow and sqrt or just pow for half integers?

In C, I was wondering whether there's an 'optimal' way to compute half-integer powers. In a nutshell, the problem is computing x^(n/2) (assuming n is odd and decently small, and x is a some float). Is there a major difference in performance/accuracy between sqrt(pow(x, n)) and pow(x, 0.5 * n)? Or even reversed: pow(sqrt(x), n).
Is there some other implementation for handling this specific case of half-integers?
My first thought is that you would just use pow and compute the whole thing in one call, but I feel like with floating point roundoff and things I'm losing some of the precision of the question that comes from the fact that this is explicitly a half-integer. I thought then maybe there's better error performance if you use pow for raising to an integer power and let sqrt handle the (1/2) part.
I also noticed that GSL has functions for computing small integer powers; would combining those functions with sqrt be better than just using pow?
I'm pretty new to scientific programming with C so I'm not sure where I would even go to look for implementations of something like this, and Google hasn't really turned anything up.

Floating-point multiplication is a fairly cheap operation in typical modern processors, and multiplication of an integer by .5 introduces no rounding error when a binary-based floating-point format is used. (If the expression is written as n/2, where n is a floating-point type, I would expect a decent compiler to implement it as multiplication by .5. However, to be sure, it can be written as n*.5.)
pow is a complicated routine but its execution time [1] is unlikely to be affected much by a difference between pow(x, n) and either pow(sqrt(x), n) or pow(x, n*.5). We can generally expect pow(x, n*.5) to be a good way of computing x^(n/2).
sqrt typically takes more execution time than floating-point multiplication and may introduce rounding error. We can expect each of sqrt(pow(x, n)) and pow(sqrt(x), n) to take at least as much time as pow(x, n*.5), and probably more, with no benefit in accuracy.
Thus, pow(x, n*.5) is preferred.
I also noticed that GSL has functions for computing small integer powers; would combining those functions with sqrt be better than just using pow?
Maybe. pow is an expensive routine, so customized routines for specific powers could out-perform it, even with sqrt added. This would be situation-dependent, and you would likely have to measure it to know.
Footnote
[1] Execution time is not actually a single thing. Performing an operation not only consumes time for that operation but may affect other operations being performed in parallel in modern processors and may affect the start times of later operations, and its own start time may be affected by its relationship with prior operations.

Fastest way to get mod 10 in C

My program uses the operation n % 10 very frequently. I know that the modulo operation can be done much faster for n % m when m is a power of 2, since it can be replaced by n & (m - 1). However, is there any faster way to calculate the modulus if the operand is 10?
In my case n is a uint8_t in some cases and a uint32_t in others.

Because most modern processors can do multiplication much, much faster than division, it is often possible to speed up division and modulus operations where the divisor is a known small constant by replacing the division with one or two multiplications and a few other fast operations (such as shift and addition).
To do so requires computing at compile-time some magic numbers dependent on the divisor; fortunately most modern compilers know how to do this so you don't need to do anything to take advantage. Just let your compiler do the heavy lifting for you, as @chux suggests in an excellent answer.
You can help the compiler by using unsigned types; for some divisors, signed division and modulus are harder to replace.
The basic outline of the optimisation of modulus looks like this:
If you had exact arithmetic, you could replace x % p with p * ((x * (1/p)) % 1). For constant p, 1/p can be precomputed at compile time. The % 1 operation simply consists of discarding the integer part and keeping the fraction, which in fixed-point arithmetic is just a mask (or a truncation of the product); the integer part of the final multiplication by p is then extracted with a right-shift. So that replaces a division with two multiplies, and if p only has a few bits set, the multiply by p might be further optimised into a few left-shifts.
We can do that computation with fixed-point arithmetic, taking advantage of the fact that most processors produce a double-sized result for integer multiplication. Since we don't care about the integer part of the inner multiplication and we know that the result of the outer multiplication must be less than p, we only need to reserve ceil(log2 p) bits for the integer part of the computation, leaving the rest of the bits for the fraction. And that might give us enough precision to correctly handle the possible range of values of x, particularly if x has a limited range (eg. uint8_t or even uint16_t). The key is finding a position of the fixed-point which minimises the error in representation of 1/p.
For many small values of p, that works. For others, there is an alternative (but slower) solution which involves estimating q = x/p using multiplication by the inverse, and then computing x - q * p. If the estimate of q can be guaranteed to be either correct or off by one in a known direction, we only need to correct the final computation by conditionally adding or subtracting p; that can be accomplished without a branch on many modern CPUs. (The direction of the error is known because it will depend only on whether the approximation we chose for the inverse of the divisor was too small or too big.)
In the very specific case of x % 10 where x is a uint8_t, you might be able to do better than the above using a 256-byte lookup table. That would only be worthwhile if you were doing the modulus operation in a tight loop over a large number of values, and even then you'd want to profile carefully to verify that it is an improvement.
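The quotient-estimate variant can be sketched for p = 10 as follows. The function names are made up here, and this is roughly what an optimising compiler emits on its own for x % 10, so it is meant as an illustration rather than something to hand-write. For these particular constants the estimate of q is always exact, so no correction step is needed.

```c
#include <stdint.h>

/* x / 10 via multiplication by a fixed-point reciprocal.
   0xCCCCCCCD is ceil(2^35 / 10); the 64-bit product shifted right by 35
   equals floor(x / 10) for every 32-bit x. */
static inline uint32_t div10_u32(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* x % 10 computed as x - q * 10, with q obtained without a division. */
static inline uint32_t mod10_u32(uint32_t x)
{
    return x - div10_u32(x) * 10u;
}

/* Same idea for 8-bit inputs: 205 is ceil(2^11 / 10), exact for x <= 255. */
static inline uint8_t mod10_u8(uint8_t x)
{
    unsigned q = ((unsigned)x * 205u) >> 11;
    return (uint8_t)(x - q * 10u);
}
```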
I doubt whether that's the best expenditure of your time; there are probably much more fruitful optimisation opportunities in your application.

however Is there any faster way to calculate modulus if the operand is 10?
With a good compiler, no. The compiler would have already emitted good code. You can explore different optimization settings with the compiler.
OTOH, if you know of some restrictions that the compiler cannot assume with n % 10, like values are always positive or of a sub-range, you might be able to out-optimize the compiler.
Such micro-optimisation is usually not an efficient use of a programmer's time.

How unreliable are floating point values, operators and functions?

I don't want to introduce floating point when an inexact value would be a disaster, so I have a couple of questions about when you actually can use them safely.
Are they exact for integers as long as you don't overflow the number of significant digits? Are these two tests always true:
double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...
What math functions can you rely on? Are these tests always true:
#include <math.h>
double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...
How about this:
int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
I guess you can summarize this question as: 1) Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h? 2) Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?

I too find incorrect results distasteful.
On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
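For the question's "more than 16 bits" check, one way to sidestep log2 entirely is to count bits with integer operations. This is a sketch with a made-up function name, and it assumes v is non-negative.

```c
/* Number of bits needed to represent v (0 for v == 0), using only integer operations. */
unsigned bits_needed(unsigned long v)
{
    unsigned bits = 0;
    while (v != 0) {
        v >>= 1;     /* drop the lowest bit */
        bits++;
    }
    return bits;
}
```

The test then becomes if (bits_needed(v) > 16), which cannot be affected by how well the math library rounds.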
sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.
I should also mention that popular compilers targeting 32-bit x86 play by a different, broken, set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)
I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.

Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?
Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
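The boundary can be checked directly with a round trip through double. The sketch below uses a hypothetical function name and assumes double is the IEEE 754 64-bit format; the volatile qualifier is only there to force the value to be stored as a real double on targets that keep intermediates in wider registers.

```c
#include <stdint.h>

/* Returns 1 if v converts to double and back without change.
   Precondition assumed by this sketch: |v| <= 2^62, so that converting
   back to int64_t cannot overflow. */
int survives_double_round_trip(int64_t v)
{
    volatile double d = (double)v;
    return (int64_t)d == v;
}
```

2^53 (9007199254740992) survives the round trip, 2^53 + 1 does not because it rounds to 2^53, and 2^53 + 2 survives again because it is even.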
Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
They at least guarantee that the result is immediately on either side of the actual mathematical result. That is, you won't get a result which has a valid floating point value between itself and the "actual" result. But beware, because repeated operations may accumulate an error which seems counter to this, while it is not (because all intermediate values are subject to the same constraints, not just the inputs and output of a compound expression).

Can you replace floating points divisions by integer operations?

The Cortex MCU I'm using doesn't have support for floating point divisions in hardware. The GCC compiler solves this by doing them software based, but warns that it can be very slow.
Now I was wondering how I could avoid them altogether. For example, I could scale the value up by a factor of 10000 (integer multiplication), then divide by another large factor (integer division), and get the exact same result.
But would these two operations be actually faster in general than a single floating point operation? For example, does it make sense to replace:
int result = 100 * 0.95f;
by
int result = (100 * 9500) / 10000;
to get 95% ?

It's better to get rid of division altogether if you can. This is relatively easy if the divisor is a compile-time constant. Typically you arrange things so that any division operation can be replaced by a bitwise shift. So for your example:
unsigned int x = 100;
unsigned int y = (x * (unsigned int)(0.95 * 1024)) >> 10; // y ~= x * 0.95: the factor is truncated to 972/1024 and the shift rounds down
Obviously you need to be very aware of the range of x so that you can avoid overflow in the intermediate result.
And as always, remember that premature optimisation is evil - only use fixed point optimisations such as this if you have identified a performance bottleneck.
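Because both the scaled factor and the final shift truncate, the snippet above yields 94 rather than 95 for x = 100. A rounding variant might look like the following sketch; the macro and function names are invented here, and the 64-bit intermediate is one assumed way of avoiding overflow for any 32-bit x.

```c
#include <stdint.h>

#define Q10_SHIFT 10

/* Convert a constant factor to Q10 fixed point, rounding to nearest.
   With a literal argument this is folded at compile time, so no
   floating-point code runs on the target. */
#define TO_Q10(f) ((uint32_t)((f) * (1u << Q10_SHIFT) + 0.5))

/* Multiply x by a Q10 factor, rounding the result to nearest. */
uint32_t mul_q10(uint32_t x, uint32_t factor_q10)
{
    uint64_t wide = (uint64_t)x * factor_q10;   /* cannot overflow in 64 bits */
    return (uint32_t)((wide + (1u << (Q10_SHIFT - 1))) >> Q10_SHIFT);
}
```

For example, mul_q10(100, TO_Q10(0.95)) evaluates to 95.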

Yes, the integer expression will be faster: the integer divide and multiply instructions are single machine instructions, whereas the floating-point operations will be either function calls (for direct software floating point) or exception handlers (for FPU instruction emulation). Either way, each operation will comprise multiple instructions.
However, while integer expressions and ad-hoc fixed-point (scaled integer) expressions may be adequate for simple operations, for math-intensive applications involving trigonometric functions, logarithms and so on, it can become complex. For that you might employ a common fixed-point representation and library. This is easiest in C++ rather than C, as exemplified by Anthony Williams' fixed point library, where due to extensive operator and function overloading, in most cases you can simply replace the float or double keywords with fixed, and existing expressions and algorithms will work with performance comparable to an FPU-equipped ARM for many operations. If you are not comfortable using C++, the rest of your code need not use any C++ specific features, and can essentially be C code compiled as C++.

Is there a preferred way to order floating-point operands?

Suppose I have a very small float a (for instance a=0.5) that enters the following expression:
6000.f * a * a;
Does the order of the operands make any difference? Is it better to write
6000.f * (a*a);
Or even
float result = a*a;
result *= 6000.f;
I've checked the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic but couldn't find anything.
Is there an optimal way to order operands in a floating point operation?

It really depends on the values and your goals. For instance if a is very small, a*a might be zero, whereas 6000.0*a*a (which means (6000.0*a)*a) could still be nonzero. For avoiding overflow and underflow, the general rule is to apply the associative law to first perform multiplications where the operands' logs have opposite sign, which means squaring first is generally the worst strategy. On the other hand, for performance reasons, squaring first might be a very good strategy if you can reuse the value of the square. You may encounter yet another issue, which could matter more for correctness than overflow/underflow issues if your numbers will never be very close to zero or infinity: certain multiplications may be guaranteed to have exact answers, while others involve rounding. In general you'll get the most accurate results by minimizing the number of rounding steps that happen.
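The underflow case can be reproduced with a short sketch. The function names are invented for this example; it assumes IEEE 754 single precision with subnormals enabled (no flush-to-zero mode), and the volatile stores force each intermediate result to be rounded to float.

```c
/* Squares first: a*a may underflow to zero before the scale factor is applied. */
float square_first(float a)
{
    volatile float sq = a * a;
    return 6000.0f * sq;
}

/* Scales first, which is how C parses 6000.f * a * a. */
float scale_first(float a)
{
    volatile float scaled = 6000.0f * a;
    return scaled * a;
}
```

With a = 1e-23f, a*a is about 1e-46, which is below half of the smallest subnormal float (about 1.4e-45) and therefore rounds to zero, so square_first returns 0. scale_first returns a nonzero subnormal close to 6e-43.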

The optimal way depends on the purpose, really.
First of all, multiplication is faster than division.
So if you have to write a = a / 2;, it is better to write a = a * 0.5f;.
Your compiler is usually smart enough to replace division with multiplication on constants if the result is the same, but it will not do that with variables of course.
Sometimes, you can optimize a bit by replacing divisions with multiplications, but there may be problems with precision.
Some other operations may be faster but less precise.
Let's take an example.
float f = (a * 100000) / (b * 10);
float g = (a / b) * (100000 / 10);
These are mathematically equivalent but the result can be a little different.
The first uses two multiplications and one division; the second uses one division and one multiplication. In both cases there may be a loss of precision, depending on the size of a and b: if they are small values the first works better, and if they are large values the second works better.
Then, if you have several constants and you want speed, group the constants together. Instead of
a = 6.3f * a * 2.0f * 3.1f;
just write
a = a * (6.3f * 2.0f * 3.1f);
Some compilers optimize well, others less so, but in both cases there is no risk in keeping all constants together.
Having said this, we could talk for hours about how processors work.
Even processors in the same family, such as Intel's, behave differently between generations!
Some compilers use SSE instructions, others don't.
Some processors support SSE2, some SSE, some only MMX... and some systems don't even have an FPU!
Each system does some calculations better than others, so finding a common rule is hard.
You should just write readable code, clean and simple, without worrying too much about these unpredictable, very low-level optimizations.
If your expression looks complicated, do some algebra and/or go to WolframAlpha and ask it to simplify the expression for you :)
That said, you don't really need to declare one variable and replace its content over and over; the compiler can usually optimize less in this situation.
a = 5 + b;
a /= 2 * c;
a += 2 - c;
a *= 7;
Just write your expression in one go and avoid this mess :)
a = ((5 + b) / (2 * c) + 2 - c) * 7;
About your specific example, 6000.f * a * a, just write it as you write it, no need to change it; it is fine as it is.

Not typically, no.
That being said, if you're doing multiple operations with large values, it may make sense to order them in a way that avoids overflows or reduces precision errors, based on their precedence and associativity, if the algorithm provides a way to make that obvious. This would, however, require advance knowledge of the values involved, and not just be based on the syntax.

There are indeed algorithms to minimize cumulative error in a sequence of floating-point operations. One such is http://en.wikipedia.org/wiki/Kahan_summation_algorithm. Others exist for other operations: http://www.cs.cmu.edu/~quake-papers/related/Priest.ps.
