Make C floating point literals float (rather than double)

It is well known that in C, floating point literals (e.g. 1.23) have type double. As a consequence, any calculation that involves them is promoted to double.
I'm working on an embedded real-time system that has a floating point unit that supports only single precision (float) numbers. All my variables are float, and this precision is sufficient. I don't need (nor can afford) double at all. But every time something like
if (x < 2.5) ...
is written, disaster happens: the slowdown can be up to two orders of magnitude. Of course, the direct answer is to write
if (x < 2.5f) ...
but this is so easy to miss (and difficult to detect until too late), especially when a 'configuration' value is #define'd in a separate file by a less disciplined (or just new) developer.
So, is there a way to force the compiler to treat all (floating point) literals as float, as if with suffix f? Even if it's against the specs, I don't care. Or any other solutions? The compiler is gcc, by the way.

The -fsingle-precision-constant flag can be used. It causes floating-point constants to be loaded in single precision even when this is not exact.
Note: this will also cause single-precision constants to be used in operations on double-precision variables.
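A minimal sketch (file name hypothetical) of what the flag changes:
/* literal_demo.c -- hypothetical sketch */
float x;

int in_range(void)
{
    /* Compiled normally, 2.5 has type double, so x is promoted to double
       for the comparison. Compiled with
         gcc -fsingle-precision-constant -c literal_demo.c
       the constant is instead treated as a float and the comparison
       stays in single precision. */
    return x < 2.5;
}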

Use warnings instead: -Wdouble-promotion warns about implicit float to double promotion, as in your example. -Wfloat-conversion will warn about cases where you may still be assigning doubles to floats.
This is a better solution than simply forcing double values to the nearest float value. Your floating-point code is still compliant, and you won't get any nasty surprises if a double value holds a positive value, say, less than FLT_DENORM_MIN (assuming IEEE-754) or greater than FLT_MAX.
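A small sketch (names made up) of the kind of code these warnings flag:
/* warn_demo.c -- hypothetical sketch */
float scale(float x)
{
    /* gcc -Wdouble-promotion warns here: x is implicitly promoted to
       double because the constant 2.5 has type double. */
    float y = x * 2.5;   /* -Wfloat-conversion warns too: the double
                            result is converted back to float */
    return y;
}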

You can cast the defined constants to (float) wherever they are used; the optimizer should do its job. This is a portable solution.
#define LIMIT 2.5
if (x < (float)LIMIT) ...

The -Wunsuffixed-float-constants flag could be used too, maybe combined with some of the other options in the accepted answer above. However, this probably won't catch unsuffixed constants in system headers; you would need to add -Wsystem-headers to catch those too, which could generate a lot of warnings...

Related

C fundamentals: double variable not equal to double expression?

I am working with an array of doubles called indata (in the heap, allocated with malloc), and a local double called sum.
I wrote two different functions to compare values in indata, and obtained different results. Eventually I determined that the discrepancy was due to one function using an expression in a conditional test, and the other function using a local variable in the same conditional test. I expected these to be equivalent.
My function A uses:
if (indata[i]+indata[j] > max) hi++;
and my function B uses:
sum = indata[i]+indata[j];
if (sum>max) hi++;
After going through the same data set and max, I end up with different values of hi depending on which function I use. I believe function B is correct, and function A is misleading. Similarly when I try the snippet below
sum = indata[i]+indata[j];
if ((indata[i]+indata[j]) != sum) etc.
that conditional will evaluate to true.
While I understand that floating point numbers do not necessarily provide an exact representation, why does that inexact representation change when evaluated as an expression vs stored in a variable? Is it recommended best practice to always store a double expression in a variable like this before using it in a conditional? Thanks!
I suspect you're using 32-bit x86, the only common architecture subject to excess precision. In C, expressions of type float and double are actually evaluated as float_t or double_t, whose relationships to float and double are reflected in the FLT_EVAL_METHOD macro. In the case of x86, both are defined as long double because the FPU is not actually capable of performing arithmetic at single or double precision. (It has mode bits intended to allow that, but the behavior is slightly wrong and thus can't be used.)
Assigning to an object of type float or double is one way to force rounding and get rid of the excess precision, but you can also just add a gratuitous cast to (double) if you prefer to leave it as an expression without assignments.
Note that forcing rounding to the desired precision is not equivalent to performing the arithmetic at the desired precision; instead of one rounding step (during the arithmetic) you now have two (during the arithmetic, and again to drop unwanted precision), and in cases where the first rounding gives you an exact-midpoint, the second rounding can go in the 'wrong' direction. This issue is generally called double rounding, and it makes excess precision significantly worse than nominal precision for certain types of calculations.
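A small sketch of the excess-precision effect described above; the outcome depends on the target (it typically only shows up when FLT_EVAL_METHOD is 2, i.e. x87), so treat it as an illustration rather than a guaranteed output:
/* excess_demo.c -- illustrative sketch, behaviour is target-dependent */
#include <float.h>
#include <stdio.h>

int main(void)
{
    double a = 0.1, b = 0.2;
    double sum = a + b;     /* assignment forces rounding to double */

    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    /* On an x87 target the expression a + b may be held in long double,
       so it can compare unequal to the value rounded into sum. */
    printf("%d\n", a + b != sum);
    /* A cast is also supposed to drop the excess precision
       (gcc honours this with -std=c99 / -fexcess-precision=standard). */
    printf("%d\n", (double)(a + b) != sum);
    return 0;
}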

How unreliable are floating point values, operators and functions?

I don't want to introduce floating point when an inexact value would be a disaster, so I have a couple of questions about when you actually can use them safely.
Are they exact for integers as long as you don't overflow the number of significant digits? Are these two tests always true:
double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...
What math functions can you rely on? Are these tests always true:
#include <math.h>
double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...
How about this:
int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
I guess you can summarize this question as: 1) Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h? 2) Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
I too find incorrect results distasteful.
On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.
I should also mention that popular compilers targeting 32-bit x86 play by a different, broken, set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)
I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.
Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?
Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
They at least guarantee that the result is immediately on either side of the actual mathematical result. That is, you won't get a result which has a valid floating point value between itself and the "actual" result. But beware, because repeated operations may accumulate an error which seems counter to this, while it is not (because all intermediate values are subject to the same constraints, not just the inputs and output of a compound expression).
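A small sketch of the 2^53 limit mentioned above (assuming IEEE-754 double):
/* exact_int_demo.c -- sketch, assumes IEEE-754 double */
#include <stdio.h>

int main(void)
{
    double big = 9007199254740992.0;   /* 2^53 is exactly representable */
    /* 2^53 + 1 is not representable; it rounds back down to 2^53. */
    printf("%d\n", big + 1.0 == big);  /* prints 1 */
    /* Below that limit, integer-valued arithmetic in double is exact. */
    printf("%d\n", 2.0 + 3.0 == 5.0 && 2.0 * 3.0 == 6.0);  /* prints 1 */
    return 0;
}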

Rules-of-thumb for minimising floating-point errors in C?

Regarding minimising the error in floating-point operations, if I have an operation such as the following in C:
float a = 123.456;
float b = 456.789;
float r = 0.12345;
a = a - (r * b);
Will the result of the calculation change if I split the multiplication and subtraction steps out, i.e.:
float c = r * b;
a = a - c;
I am wondering whether a CPU would then treat these calculations differently and thereby the error may be smaller in one case?
If not, which I presume anyway, are there any good rules-of-thumb to mitigate against floating-point error? Can I massage data in a way that will help?
Please don't just say "use higher precision" - that's not what I'm after.
EDIT
For information about the data, in the general sense errors seem to be worse when the operation results in a very large number like 123456789. Small numbers, such as 1.23456789, seem to yield more accurate results after operations. Am I imagining this, or would scaling larger numbers help accuracy?
Note: this answer starts with a lengthy discussion of the distinction between a = a - (r * b); and float c = r * b; a = a - c; with a C99-compliant compiler. The part of the question about the goal of improving accuracy while avoiding extended precision is covered at the end.
Extended floating-point precision for intermediate results
If your C99 compiler defines FLT_EVAL_METHOD as 0, then the two computations can be expected to produce exactly the same result. If the compiler defines FLT_EVAL_METHOD to 1 or 2, then a = a - (r * b); will be more precise for some values of a, r and b, because all intermediate computations will be done at an extended precision (double for the value 1 and long double for the value 2).
The program cannot set FLT_EVAL_METHOD, but you can use commandline options to change the way your compiler computes with floating-point, and that will make it change its definition accordingly.
Contraction of some intermediate results
Depending on whether you use #pragma STDC FP_CONTRACT in your program and on your compiler's default value for this pragma, some compound floating-point expressions can be contracted into single instructions that behave as if the intermediate result was computed with infinite precision. This happens to be a possibility for your example when targeting a modern processor, as the fused multiply-add instruction will compute a directly and as accurately as allowed by the floating-point type.
However, you should bear in mind that the contraction only takes place at the compiler's option, without any guarantees. The compiler uses the FMA instruction to optimize speed, not accuracy, so the transformation may not take place at lower optimization levels. Sometimes several transformations are possible (e.g. a * b + c * d can be computed either as fmaf(c, d, a*b) or as fmaf(a, b, c*d)) and the compiler may choose one or the other.
In short, the contraction of floating-point computations is not intended to help you achieve accuracy. You might as well make sure it is disabled if you like reproducible results.
However, in the particular case of the fused-multiply-add compound operation, you can use the C99 standard function fmaf() to tell the compiler to compute the multiplication and addition in a single step with a single rounding. If you do this, then the compiler will not be allowed to produce anything else than the best result for a.
float fmaf(float x, float y, float z);
DESCRIPTION
The fma() functions compute (x*y)+z, rounded as one ternary operation:
they compute the value (as if) to infinite precision and round once to
the result format, according to the current rounding mode.
Note that if the FMA instruction is not available, your compiler's implementation of the function fmaf() will at best just use higher precision, and if this happens on your compilation platform, you might just as well use the type double for the accumulator: it will be faster and more accurate than using fmaf(). In the worst case, a flawed implementation of fmaf() will be provided.
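For the expression from the question, a minimal sketch of the fmaf() form (the wrapper function name is made up):
/* update() is a hypothetical wrapper around the a = a - (r * b) step */
#include <math.h>

float update(float a, float r, float b)
{
    /* fmaf computes (-r)*b + a with a single rounding, i.e. the best
       single-precision result possible for a - r*b. */
    return fmaf(-r, b, a);
}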
Improving accuracy while only using single-precision
Use Kahan summation if your computation involves a long chain of additions. Some accuracy can be gained by simply summing the r*b terms computed as single-precision products, assuming there are many of them. If you wish to gain more accuracy, you might want to compute r*b itself exactly as the sum of two single-precision numbers, but if you do this you might as well switch to double-single arithmetic entirely. Double-single arithmetic would be the same as the double-double technique succinctly described here, but with single-precision numbers instead.
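A minimal sketch of Kahan (compensated) summation of the r*b products in single precision, as suggested above; the array layout and function name are assumptions:
/* kahan_dot -- hypothetical helper; compile without -ffast-math,
   which would let the compiler simplify the compensation away */
float kahan_dot(const float *r, const float *b, int n)
{
    float sum = 0.0f;
    float c = 0.0f;               /* running compensation for lost low bits */
    for (int i = 0; i < n; i++) {
        float term = r[i] * b[i] - c;
        float t = sum + term;     /* low-order bits of term are lost here... */
        c = (t - sum) - term;     /* ...and recovered into c */
        sum = t;
    }
    return sum;
}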

Is there a difference between writing/passing value with or without float "f" suffix?

I believe I should type less if possible. Any unnecessary keystroke takes a bit of time.
In the context of objective-C, my question is:
Can I type this
[UIColor colorWithRed:0 green:0 blue:0 alpha:0.15];
[UIColor colorWithRed:0.45 green:0.45 blue:0.45 alpha:0.15];
or do I have to use the f suffix
[UIColor colorWithRed:0.0f green:0.0f blue:0.0f alpha:0.15f];
[UIColor colorWithRed:0.45f green:0.45f blue:0.45f alpha:0.15f];
1) Why does it work even without "f"?
2) If I need to write f, is there still an exception for "0"? That is, if the value is zero, is it still OK without "f"?
What you are really asking is about type literals and implicit casting.
I haven't written C in ages, but this reference leads me to believe it's not dissimilar to C# (I can't speak for Objective-C).
The problem boils down to this:
0.0 is a literal notation for the value 0 as a double
0.0f is a literal notation for the value 0 as a float
0 is a literal notation for the value 0 as an int
Supplying an int when a float is expected is fine, as there exists an implicit cast from int to float that the compiler can use.
However, if you specify a double when a float is expected, there is no implicit cast. This is because there is a loss of precision going from double to float and the compiler wants you to explicitly say you're aware of that.
So, if you write 0.0 when a float is expected, expect your compiler to moan at you about loss of precision :)
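The same point in plain C, as a small sketch:
float f1 = 0;      /* int literal, implicitly converted to float: fine */
float f2 = 0.0f;   /* float literal: exact match */
float f3 = 0.15;   /* double literal, converted to float; gcc and clang
                      can warn about this with -Wfloat-conversion */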
P.S.
I believe I should type less if possible. Any unnecessary keystroke takes a bit of time.
I wouldn't worry too much about the number of keystrokes. You'll waste far more time on unclear code in your life than you will from typing. If you're time conscious then your best bet is to write clear and explicit code.
When you type 0, that is an integer constant. If you then assign it to a float, it gets converted automatically: this is similar to a typecast, but the compiler does it for you.
When you type 0.0f that means "zero floating point" which is directly assigned to the float variable.
There is no meaningful difference between either method in this case.
The fact that you are asking this question, indicates that you should be explicit, despite the extra keystroke. The last thing any programmer wants to do when starting to work on some code is say "WTF is happening here". Code is read more often than it is written, and you've just demonstrated that someone with your level of experience may not know what that code does.
Yes it will work, and no, there's no compile-time or runtime downside to doing so, but code should be written for other people, not the compiler -- the compiler doesn't care what junk you write, it will do its best with it regardless. Other programmers, on the other hand, may throw up their hands and step away from the keyboard.
In both cases, the compiled code is identical. (Tested with LLVM 5.1, Xcode 5.1.1.)
The compiler is automatically converting the integer, float and double literals to CGFloats. Note that CGFloat is a float on 32-bit and a double on 64-bit, so the compiler will make a conversion whether you use 0.15f or 0.15.
I advise not worrying about this. My preference is to use the fewest characters, not because it is easier to type but because it is easier to read.

Floating point again

Yesterday I asked a floating point question, and I have another one. I am doing some computations where I use the results of the math.h (C language) sine, cosine and tangent functions.
One of the developers muttered that you have to be careful with the return values of these functions and that I should not make assumptions about the return values of the gcc math functions. I am not trying to start a discussion, but I really want to know what I need to watch out for when doing computations with the standard math functions.
You should not assume that the values returned will be consistent to high degrees of precision between different compiler/stdlib versions.
That's about it.
You should not expect sin(PI/6) to be equal to cos(PI/3), for example. Nor should you expect asin(sin(x)) to be equal to x, even if x is in the domain for sin. They will be close, but might not be equal.
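For instance, a small sketch (don't rely on either outcome):
#include <math.h>
#include <stdio.h>

int main(void)
{
    double pi = acos(-1.0);   /* nearest double to pi */
    /* Mathematically sin(pi/6) == cos(pi/3) == 0.5, but the rounded
       arguments and rounded results need not match exactly. */
    printf("%d\n", sin(pi / 6.0) == cos(pi / 3.0));   /* may print 0 */
    printf("%.17g %.17g\n", sin(pi / 6.0), cos(pi / 3.0));
    return 0;
}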
Floating point is straightforward. Just always remember that there is an uncertainty component to all floating point operations and functions. It is usually modelled as being random, even though it usually isn't, but if you treat it as random, you'll succeed in understanding your own code. For instance:
a=a/3*3;
This should be treated as if it was:
a=(a/3+error1)*3+error2;
If you want an estimate of the size of the errors, you need to dig into each operation/function to find out. Different compilers, parameter choices, etc. will yield different values. For instance, 0.09 - 0.089999 on a system with 5 digits of precision will yield an error somewhere between -0.000001 and 0.000001. This error is comparable in size with the actual result.
If you want to learn how to do floating point as precisely as possible, it's a field of study in its own right.
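A small sketch of the same idea; the first comparison is a classic, the second may go either way depending on the value chosen:
#include <stdio.h>

int main(void)
{
    double a = 0.1;
    /* Each literal and each operation is rounded separately, so identities
       from real arithmetic can fail. */
    printf("%d\n", a + 0.2 == 0.3);       /* prints 0 with IEEE-754 doubles */
    /* a/3*3 carries two rounding errors; they may or may not cancel. */
    printf("%d\n", a / 3.0 * 3.0 == a);
    return 0;
}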
The problem isn't with the standard math functions, so much as the nature of floating point arithmetic.
Very short version: don't compare two floating point numbers for equality, even with obvious, trivial identities like 10 == 10 / 3.0 * 3.0 or tan(x) == sin(x) / cos(x).
You should take care about precision:
The structure of a floating-point number
Are you on a 32-bit or 64-bit platform?
You should read the IEEE Standard for Binary Floating-Point Arithmetic.
There are some interesting libraries such as GMP or MPFR.
You should learn how to compare floating-point numbers.
etc ...
Agreed with all of the responses that say you should not compare for equality. What you can do, however, is check if the numbers are close enough, like so:
if (fabs(numberA - numberB) < CLOSE_ENOUGH)   /* fabs() from <math.h>; plain abs() takes an int */
{
// Equal for all intents and purposes
}
Where CLOSE_ENOUGH is some suitably small floating-point value.
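A fixed absolute threshold only works when you know the magnitude of the values in advance; a relative-tolerance variant is a common alternative (sketch, names made up):
#include <math.h>

int nearly_equal(double a, double b, double rel_tol)
{
    /* Scale the tolerance by the larger magnitude of the two operands. */
    return fabs(a - b) <= rel_tol * fmax(fabs(a), fabs(b));
}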
