C fundamentals: double variable not equal to double expression? - c

I am working with an array of doubles called indata (in the heap, allocated with malloc), and a local double called sum.
I wrote two different functions to compare values in indata, and obtained different results. Eventually I determined that the discrepancy was due to one function using an expression in a conditional test, and the other function using a local variable in the same conditional test. I expected these to be equivalent.
My function A uses:
if (indata[i]+indata[j] > max) hi++;
and my function B uses:
sum = indata[i]+indata[j];
if (sum>max) hi++;
After going through the same data set and max, I end up with different values of hi depending on which function I use. I believe function B is correct, and function A is misleading. Similarly when I try the snippet below
sum = indata[i]+indata[j];
if ((indata[i]+indata[j]) != sum) etc.
that conditional will evaluate to true.
While I understand that floating-point numbers do not necessarily provide an exact representation, why does that inexact representation change when evaluated as an expression vs. stored in a variable? Is it recommended best practice to always store a double expression in a variable like this prior to a conditional? Thanks!

I suspect you're using 32-bit x86, the only common architecture subject to excess precision. In C, expressions of type float and double are actually evaluated as float_t or double_t, whose relationships to float and double are reflected in the FLT_EVAL_METHOD macro. In the case of x86, both are defined as long double because the FPU is not actually capable of performing arithmetic at single or double precision. (It has mode bits intended to allow that, but the behavior is slightly wrong and thus can't be used.)
Assigning to an object of type float or double is one way to force rounding and get rid of the excess precision, but you can also just add a gratuitous cast to (double) if you prefer to leave it as an expression without assignments.
Note that forcing rounding to the desired precision is not equivalent to performing the arithmetic at the desired precision; instead of one rounding step (during the arithmetic) you now have two (during the arithmetic, and again to drop unwanted precision), and in cases where the first rounding gives you an exact-midpoint, the second rounding can go in the 'wrong' direction. This issue is generally called double rounding, and it makes excess precision significantly worse than nominal precision for certain types of calculations.

Related

Make C floating point literals float (rather than double)

It is well known that in C, floating point literals (e.g. 1.23) have type double. As a consequence, any calculation that involves them is promoted to double.
I'm working on an embedded real-time system that has a floating point unit that supports only single precision (float) numbers. All my variables are float, and this precision is sufficient. I don't need (nor can afford) double at all. But every time something like
if (x < 2.5) ...
is written, disaster happens: the slowdown can be up to two orders of magnitude. Of course, the direct answer is to write
if (x < 2.5f) ...
but this is so easy to miss (and difficult to detect until too late), especially when a 'configuration' value is #define'd in a separate file by a less disciplined (or just new) developer.
So, is there a way to force the compiler to treat all (floating point) literals as float, as if with suffix f? Even if it's against the specs, I don't care. Or any other solutions? The compiler is gcc, by the way.
-fsingle-precision-constant flag can be used. It causes floating-point constants to be loaded in single precision even when this is not exact.
Note: this will also use single-precision constants in operations on double-precision variables.
Use warnings instead: -Wdouble-promotion warns about implicit float to double promotion, as in your example. -Wfloat-conversion will warn about cases where you may still be assigning doubles to floats.
This is a better solution than simply forcing double values to the nearest float value. Your floating-point code is still compliant, and you won't get any nasty surprises if a double value holds a positive value, say, less than FLT_DENORM_MIN (assuming IEEE-754) or greater than FLT_MAX.
You can cast the defined constants to (float) wherever they are used; the optimizer should do its job. This is a portable solution:
#define LIMIT 2.5
if (x < (float)LIMIT) ...
The -Wunsuffixed-float-constants flag could be used too, maybe combined with some of the other options in the accepted answer above. However, this probably won't catch unsuffixed constants in system headers; you would need to add -Wsystem-headers to catch those too, which could generate a lot of warnings...

Implementing simple type inference

How do I implement basic type inference? Nothing fancy, just inferring whether a given value is an integer, double, or float. For instance, if I had a token for each type (WHOLE_NUMBER, FLOAT_NUMBER, DOUBLE_NUMBER) and I had an expression like 4f + 2 + 5f, how would I deduce what type that is? My current idea was to just use the first type as the inferred type, so that would be a float. However, this doesn't work in most cases. What would I have to do?
My current idea was to just use the first type as the inferred type
No. Usually, the expression's type is that of its "widest" term. If it contains a double, then it's a double. If not but contains a float, then it's a float. If it has only integers then it is integer...
This applies to each parenthesized sub-expression.
Unless you make an explicit cast.
In your example above, there are 2 floats and an int, so it is a float. The compiler should warn you though, as any implicit conversion it has to make may result in a loss of data.
The way I would do it would be to cast into the most "accurate" or specific type. For example, if you add a bunch of integers together, the result can always be represented by an integer. The moment a floating-point value is included in the expression, the result must be a float, as the result of the calculation might be fractional due to the floating-point term in the addition.
Similarly, if there are any doubles in the expression, the answer must be a double, as down-casting to a float might result in loss of precision. So, the steps required to infer the type are:
Does the expression contain any doubles? If so, the result is a double - cast any integers or floats to double as appropriate. If not...
Does the expression contain any floats? If so, the result is a float - cast any integers to float as appropriate. If not...
The result is an integer, as the expression is entirely in terms of integers.
Different programming languages handle these sorts of situations differently, and it might be appropriate to add compiler warnings in situations where these automatic casts could cause a precision error. In general, make sure the behaviour of your compiler/interpreter is well-defined and predictable, such that any developer needing alternate behaviour can (and knows when to) use explicit casts if they need to preserve the accuracy of a calculation.

Float precision

Due to the limited precision of the microcontroller, I defined a symbol containing the ratio of two floating-point numbers, instead of writing the result directly.
#define INTERVAL (0.01F/0.499F)
instead of
#define INTERVAL 0.02004008016032064F
But the first solution adds another operation, "/". Reasoning in terms of optimization and correctness of the result, which is the best solution?
They are the same, your compiler will evaluate 0.01F/0.499F at compile-time.
There is a mistake in your constant value 0.01F/0.499F = 0.02004008016032064F.
0.01F/0.499F is evaluated at compile time. The precision used at compile time depends on the compiler and likely exceeds the micro-controller's. Thus either approach will typically provide the same code.
In the unlikely event that the compiler's precision is about the same as the micro-controller's float (assuming typical binary floating-point), the values 0.01F and 0.499F will not be exact but will be within 0.5 ULP (unit in the last place). The quotient 0.01F/0.499F will then be within about sqrt(2)*0.5 ULP. Using 0.02004008016032064F will be within 0.5 ULP. So in select situations, the constant will be better than the quotient.
Under rarer circumstances, the float type's precision will exceed that of the constant 0.02004008016032064F, and the quotient would be better.
In the end, I recommend coding with whatever values are actually used to drive the equation; e.g., if 0.01 and 0.499 are the values of two resistors, use those two values.

Rules-of-thumb for minimising floating-point errors in C?

Regarding minimising the error in floating-point operations, if I have an operation such as the following in C:
float a = 123.456;
float b = 456.789;
float r = 0.12345;
a = a - (r * b);
Will the result of the calculation change if I split the multiplication and subtraction steps out, i.e.:
float c = r * b;
a = a - c;
I am wondering whether a CPU would then treat these calculations differently and thereby the error may be smaller in one case?
If not, which I presume anyway, are there any good rules-of-thumb to mitigate against floating-point error? Can I massage data in a way that will help?
Please don't just say "use higher precision" - that's not what I'm after.
EDIT
For information about the data, in the general sense errors seem to be worse when the operation results in a very large number like 123456789. Small numbers, such as 1.23456789, seem to yield more accurate results after operations. Am I imagining this, or would scaling larger numbers help accuracy?
Note: this answer starts with a lengthy discussion of the distinction between a = a - (r * b); and float c = r * b; a = a - c; with a c99-compliant compiler. The part of the question about the goal of improving accuracy while avoiding extended precision is covered at the end.
Extended floating-point precision for intermediate results
If your C99 compiler defines FLT_EVAL_METHOD as 0, then the two computations can be expected to produce exactly the same result. If the compiler defines FLT_EVAL_METHOD to 1 or 2, then a = a - (r * b); will be more precise for some values of a, r and b, because all intermediate computations will be done at an extended precision (double for the value 1 and long double for the value 2).
The program cannot set FLT_EVAL_METHOD, but you can use commandline options to change the way your compiler computes with floating-point, and that will make it change its definition accordingly.
Contraction of some intermediate results
Depending on whether you use #pragma STDC FP_CONTRACT in your program and on your compiler's default value for this pragma, some compound floating-point expressions can be contracted into single instructions that behave as if the intermediate result were computed with infinite precision. This happens to be a possibility for your example when targeting a modern processor, as the fused multiply-add instruction will compute a directly and as accurately as allowed by the floating-point type.
However, you should bear in mind that the contraction only takes place at the compiler's option, without any guarantees. The compiler uses the FMA instruction to optimize speed, not accuracy, so the transformation may not take place at lower optimization levels. Sometimes several transformations are possible (e.g. a * b + c * d can be computed either as fmaf(c, d, a*b) or as fmaf(a, b, c*d)) and the compiler may choose one or the other.
In short, the contraction of floating-point computations is not intended to help you achieve accuracy. You might as well make sure it is disabled if you like reproducible results.
However, in the particular case of the fused-multiply-add compound operation, you can use the C99 standard function fmaf() to tell the compiler to compute the multiplication and addition in a single step with a single rounding. If you do this, then the compiler will not be allowed to produce anything else than the best result for a.
float fmaf(float x, float y, float z);
DESCRIPTION
The fma() functions compute (x*y)+z, rounded as one ternary operation:
they compute the value (as if) to infinite precision and round once to
the result format, according to the current rounding mode.
Note that if the FMA instruction is not available, your compiler's implementation of the function fmaf() will at best just use higher precision, and if this happens on your compilation platform, you might just as well use the type double for the accumulator: it will be faster and more accurate than using fmaf(). In the worst case, a flawed implementation of fmaf() will be provided.
Improving accuracy while only using single-precision
Use Kahan summation if your computation involves a long chain of additions. Some accuracy can be gained by simply summing the r*b terms computed as single-precision products, assuming there are many of them. If you wish to gain more accuracy, you might want to compute r*b itself exactly as the sum of two single-precision numbers, but if you do this you might as well switch to double-single arithmetic entirely. Double-single arithmetic would be the same as the double-double technique succinctly described here, but with single-precision numbers instead.

Printing a float in C while avoiding variadic parameter promotion to double

How can I print (that is, to stdout) a float in C without having it be promoted to a double when passed to printf?
The issue here is that variadic functions in C promote all float parameters to double, which incurs two unnecessary conversions. For example, if you turn on -Wdouble-promotion in GCC and compile
float f = 0.f;
printf("%f", f);
you will get
warning: implicit conversion from 'float' to 'double' when passing argument to function
I have relatively little processing power to play with (a 72MHz ARM Cortex-M3), and I am definitely bottlenecking on ASCII output of floating point data. As the architecture lacks a hardware FPU to begin with, having to convert between single and double precision does not help matters.
Is there a way to print a float more efficiently in straight C?
Avoiding the promotion will not save you anything, since the internal double (or more likely long double) arithmetic printf will perform is going to consume at least 1000x as much time. Accurately printing floating point values is not easy.
If you don't care about accuracy though, and just need to print approximate values quickly, you can roll your own loop to do the printing. As long as your values aren't too large to fit in an integer type, first convert and print the non-fractional part as an integer, then subtract that off and loop multiplying by 10 and taking off the integer part to print the fractional part one digit at a time (buffer it in a string for better performance).
Or you could just do something like:
printf("%d.%.6d", (int)x, (int)((x-(int)x)*1000000));
Unfortunately, printf does not have support for handling plain floats.
This means that you would have to write your own print function. If you don't need the full expressive power of printf, you could easily convert your floating-point value to an integral part and a part representing a number of decimals, and print out both using integers.
If, on the other hand, you simply would like to get rid of the warning, you could explicitly cast the float to a double.
I think that it doesn't matter: printf is already such a time-consuming, nasty thing that those conversions should not matter. The time spent converting float to double should be far less than that spent converting any number to ASCII (you should/could profile your code to get a definitive answer there). The only remaining solution would be to write your own custom output routine which converts float to ASCII and then uses puts (or similar).
First approach: Use ftoa instead of printf. Profile.
For increased output flexibility, I would go into the source code of your compiler's stdlib, perhaps some derivative of gcc anyway, locate the printf implementation and copy over the relevant code for double -> ascii conversion. Rewrite it to float -> ascii.
Next, manually change one or two prominent call sites to your new (non-variadic) version and profile it.
If it solves your problem, you could think of rewriting your own printf, based on the version from stdlib, whereby instead of float you pass float*. That should get rid of the automatic promotion.