Why is the function floor giving different results in this case? - c

In this example, the behaviour of floor differs and I do not understand why:
printf("floor(34000000.535 * 100 + 0.5) : %lf \n", floor(34000000.535 * 100 + 0.5));
printf("floor(33000000.535 * 100 + 0.5) : %lf \n", floor(33000000.535 * 100 + 0.5));
The output for this code is:
floor(34000000.535 * 100 + 0.5) : 3400000053.000000
floor(33000000.535 * 100 + 0.5) : 3300000054.000000
Why does the first result not equal 3400000054.0, as we might expect?

A double in C cannot represent every number that can be written in decimal text.
A double can typically represent about 2^64 distinct values. Neither 34000000.535 nor 33000000.535 is in that set when double is encoded as a binary floating-point number. Instead, the closest representable number is used.
Text:           34000000.535
Closest double: 34000000.534999996423...
Text:           33000000.535
Closest double: 33000000.535000000149...
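You can see these nearest values directly by printing the constants with more than the default six digits (a minimal check, assuming the common IEEE-754 binary64 double):

#include <stdio.h>

int main(void) {
    // The decimal literals below are converted to the nearest binary64 values.
    printf("%.9f\n", 34000000.535);  // prints 34000000.534999996
    printf("%.9f\n", 33000000.535);  // prints 33000000.535000000
}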
With double as a binary floating-point type, multiplying by a non-power-of-2 such as 100.0 can introduce additional rounding. Yet in these cases it still yields two products, one just below xxx53.5 and the other right at xxx53.5.
Adding 0.5, a simple power of 2, does not incur further rounding issues here, as 0.5 is not extreme compared to 3x00000053.5.
Printing the intermediate results with higher precision shows the step-by-step process:
#include <stdio.h>
#include <float.h>
#include <math.h>

void fma_test(double a, double b, double c) {
    int n = DBL_DIG + 3;  // print a few digits beyond double's decimal precision
    printf("a b c        %.*e %.*e %.*e\n", n, a, n, b, n, c);
    printf("a*b          %.*e\n", n, a * b);
    printf("a*b+c        %.*e\n", n, a * b + c);
    printf("floor(a*b+c) %.*e\n", n, floor(a * b + c));
    puts("");
}

int main(void) {
    fma_test(34000000.535, 100, 0.5);
    fma_test(33000000.535, 100, 0.5);
}
Output
a b c        3.400000053499999642e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b          3.400000053499999523e+09
a*b+c        3.400000053999999523e+09
floor(a*b+c) 3.400000053000000000e+09

a b c        3.300000053500000015e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b          3.300000053500000000e+09
a*b+c        3.300000054000000000e+09
floor(a*b+c) 3.300000054000000000e+09
The issue is more complex than this simple answer suggests, as various platforms can 1) use higher-precision math, such as long double, for intermediate results or 2) rarely, use a decimal floating-point double. So your code's results may vary.
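One way to check what your platform does is to print FLT_EVAL_METHOD from <float.h> (standard since C99); a value of 0 means float and double expressions are evaluated at their own precision:

#include <float.h>
#include <stdio.h>

int main(void) {
    // 0: each operation uses the operand type's own precision
    // 1: float and double operations are evaluated as double
    // 2: all operations are evaluated as long double
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
}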

This question has already been answered here.
Basically, float numbers are just approximations. If we have a program like this:
float a = 0.2 + 0.3;
float b = 0.25 + 0.25;

if (a == b) {
    // might happen
}
if (a != b) {
    // also might happen
}
The only guaranteed thing is that a-b is relatively small.
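So instead of ==, compare with a tolerance. A common pattern is a relative-epsilon test (a sketch only; the right tolerance depends on how the values were computed):

#include <float.h>
#include <math.h>
#include <stdio.h>

// Returns 1 if a and b agree to within a small relative tolerance.
static int nearly_equal(float a, float b) {
    float diff = fabsf(a - b);
    float scale = fmaxf(fabsf(a), fabsf(b));
    return diff <= 4 * FLT_EPSILON * scale;  // the factor 4 is an arbitrary choice
}

int main(void) {
    float a = 0.2 + 0.3;
    float b = 0.25 + 0.25;
    printf("%d\n", nearly_equal(a, b));  // prints 1
}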

Related

Problem with float and int multiplication in C [duplicate]

This question already has answers here:
Is floating point math broken?
I'm using the online compiler https://www.onlinegdb.com/ and in the following code, when I multiply 2.1 by 100, the output becomes 209 instead of 210.
#include <stdio.h>
#include <stdint.h>

int main()
{
    float x = 1.8;
    x = x + 0.3;
    int coefficient = 100;
    printf("x: %2f\n", x);
    uint16_t y = (uint16_t)(x * coefficient);
    printf("y: %d\n", y);
    return 0;
}
Where am I going wrong? And what should I do to obtain 210?
I tried all the different type casts, but it still doesn't work.
The following assumes the compiler uses IEEE-754 binary32 and binary64 for float and double, which is overwhelmingly common.
float x = 1.8;
Since 1.8 is a double constant, the compiler converts 1.8 to the nearest double value, 1.8000000000000000444089209850062616169452667236328125. Then, to assign it to the float x, it converts that to the nearest float value, 1.7999999523162841796875.
x = x + 0.3;
The compiler converts 0.3 to the nearest double value, 0.299999999999999988897769753748434595763683319091796875. Then it adds x and that value using double arithmetic, which produces 2.09999995231628400205181605997495353221893310546875.
Then, to assign that to x, it converts it to the nearest float value, 2.099999904632568359375.
uint16_t y = (uint16_t)(x * coefficient);
Since x is float and coefficient is int, the compiler converts the coefficient to float and performs the multiplication using float arithmetic. This produces 209.9999847412109375.
Then the conversion to uint16_t truncates the number, producing 209.
One way to get 210 instead is to use uint16_t y = lroundf(x * coefficient);. (lroundf is declared in <math.h>.) However, to determine what the right way is, you should explain what these numbers are and why you are doing this arithmetic with them.
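For completeness, here is a runnable version of that suggestion (assuming rounding to nearest is actually the desired behavior):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

int main(void) {
    float x = 1.8;
    x = x + 0.3;                 // about 2.0999999 in float
    int coefficient = 100;
    // lroundf rounds to the nearest integer, so 209.99998... becomes 210.
    uint16_t y = (uint16_t)lroundf(x * coefficient);
    printf("y: %u\n", (unsigned)y);  // prints 210
    return 0;
}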
Floating-point numbers are not exact. When you add 1.8 + 0.3,
the FPU might generate a result slightly different from the expected 2.1 (by a margin smaller than the float epsilon).
Read more about floating-point number representation here: https://en.wikipedia.org/wiki/Machine_epsilon
What happens in your case is:
1.8 + 0.3 = 2.09999999..., and multiplying by 100 gives 209.99999...
Then you truncate it to an integer, resulting in 209.
You might also find this question relevant: Why float.Epsilon and not zero? A fix might be:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <math.h>  // for round()

int main()
{
    float x = 1.8;
    x = x + 0.3;
    uint16_t coefficient = 100;
    printf("x: %2f\n", x);
    uint16_t y = round(x * coefficient);
    printf("y: %" PRIu16 "\n", y);
    return 0;
}

Round 37.1-28.75 float calculation correctly to 8.4 instead of 8.3

I have a problem with floating-point rounding. I want to calculate with floating-point numbers and round the results to a given number of decimals, N. In this example I want to round to 1 decimal place.
The calculation 37.1 - 28.75 results in the floating-point value 8.349998 (instead of 8.35), which printf then rounds to 8.3 instead of 8.4 at 1 decimal place.
The mathematically exact result is 37.10 - 28.75 = 8.35000000, but due to floating-point imprecision it becomes 8.349998, which then rounds to 8.3 instead of 8.4.
Minimal reproducible example:

float a = 37.10;
float b = 28.75;
// a - b should be 8.35, printed as 8.4
printf("%.1f\n", a - b); // outputs 8.3 instead of 8.4
Is it valid to add the following to the result:

float result = a - b;
if (result > 0.0f)
{
    result += powf(10, -nr_of_decimals - 1) / 2;
}
else
{
    result -= powf(10, -nr_of_decimals - 1) / 2;
}
EDIT: corrected that I want 1 decimal place rounded output, not 2 decimal places
EDIT2: negative results are needed as well (28.75-37.1 = -8.4)
On my system I do actually get 8.35. It's possible that you have to set the rounding direction to "nearest" first, try this (compile with e.g. gcc ... -lm):
#include <fenv.h>
#include <stdio.h>

int main()
{
    float a = 37.10;
    float b = 28.75;
    float res = a - b;
    fesetround(FE_TONEAREST);
    printf("%.2f\n", res);
}
Binary floating point is, after all, binary, and if you do care about the correct decimal rounding this much, then your choices would be:
decimal floating point, or
fixed point.
I'd say the solution is to use fixed point, especially if you're on embedded, and forget about everything else.
With
int32_t a = 3710;
int32_t b = 2875;
the result of
a - b
will exactly be
835
every time; and then you just need to have a simple fixed point printing routine for the desired precision, and check the following digit after the last digit to see if it needs to be rounded up.
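A minimal sketch of such a printing routine (the function name and the hundredths scaling are assumptions for illustration):

#include <stdio.h>
#include <stdint.h>

// Print a value stored in hundredths (835 means 8.35),
// rounded half-up to one decimal place. Handles negatives.
static void print_hundredths_1dp(int32_t v) {
    const char *sign = v < 0 ? "-" : "";
    int32_t mag = v < 0 ? -v : v;
    int32_t tenths = (mag + 5) / 10;  // check the dropped digit; round up if >= 5
    printf("%s%ld.%ld\n", sign, (long)(tenths / 10), (long)(tenths % 10));
}

int main(void) {
    print_hundredths_1dp(3710 - 2875);  // prints 8.4
    print_hundredths_1dp(2875 - 3710);  // prints -8.4
}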
If you want to round to 2 decimals, you can add 0.005 to the result and then truncate it with floorf:
float f = 37.10f - 28.75f;
float r = floorf((f + 0.005f) * 100.f) / 100.f;
printf("%f\n", r);
The output is 8.350000
Why are you using floats instead of doubles?
Regarding your question:
Is it valid to add the following to the result:

float result = a - b;
if (result > 0.0f)
{
    result += powf(10, -nr_of_decimals - 1) / 2;
}
else
{
    result -= powf(10, -nr_of_decimals - 1) / 2;
}
It doesn't seem so, on my computer I get 8.350498 instead of 8.350000.
After your edit:
The calculation 37.1 - 28.75 results in the floating-point value 8.349998, which printf rounds to 8.3 instead of 8.4.
Then
float r = roundf((f + (f < 0.f ? -0.05f : +0.05f)) * 10.f) / 10.f;
is what you are looking for.
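Wrapped into a complete program (the same expression, just made runnable):

#include <stdio.h>
#include <math.h>

int main(void) {
    float f = 37.10f - 28.75f;  // about 8.349998 in float
    float r = roundf((f + (f < 0.f ? -0.05f : +0.05f)) * 10.f) / 10.f;
    printf("%.1f\n", r);        // prints 8.4
}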

Citardauq Formula not working precisely

I am trying to calculate the roots of a quadratic equation using the Citardauq Formula, which is a more numerically stable way to calculate those roots. However, when, for example, I enter the equation x^2+200x-0.000002=0, this program does not calculate the roots precisely. Why? I can't find any error in my code, and catastrophic cancellation should not occur here.
You can find why the Citardauq formula works here (second answer).
#include <stdio.h>
#include <math.h>

int main()
{
    double a, b, c, determinant;
    double root1, root2;

    printf("Introduce coefficients a b and c:\n");
    scanf("%lf %lf %lf", &a, &b, &c);

    determinant = b * b - 4 * a * c;

    if (0 > determinant)
    {
        printf("The equation has no real solution\n");
        return 0;
    }
    if (b > 0)
    {
        root1 = (-b - sqrt(determinant)) / (2 * a);
        root2 = (c / (a * root1));
        printf("The solutions are %.16lf and %.16lf\n", root1, root2);
    }
    else if (b < 0)
    {
        root1 = (-b + sqrt(determinant)) / (2 * a);
        root2 = (c / (a * root1));
        printf("The solutions are %.16lf and %.16lf\n", root1, root2);
    }
}
Welcome to numerical computations.
There are a few issues here:
1) As pointed out by some-programmer-dude, there is a problem with the precise representation of floating-point numbers:
Is floating point math broken?
For 0.1 in the standard binary64 format, the representation can be
written exactly as
0.1000000000000000055511151231257827021181583404541015625
2) Double precision (double) gives you only 52 explicitly stored bits of significand (53 bits of precision with the implicit leading bit), 11 bits of exponent, and 1 sign bit.
Floating point numbers in C use IEEE 754 encoding.
3) The precision of sqrt is also limited.
In your case, you can see that, from a precision point of view, this is not an easy equation.
An online calculator gives the solutions as:
1.0000007932831068e-8 -200.00000001
Your program is better:
Introduce coefficients a b and c:
1
200
-0.000002
The solutions are -200.0000000100000079 and 0.0000000100000000
So one of the roots is -200.000000010000. Forget about the rest of the digits.
This is exactly what one can expect since double has 15 decimal
digits of precision!
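That figure is available programmatically as DBL_DIG (a quick check of your own implementation):

#include <float.h>
#include <stdio.h>

int main(void) {
    printf("DBL_DIG     = %d\n", DBL_DIG);      // typically 15
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);  // typically 2.22045e-16
}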

Implementing single-precision division as double-precision multiplication

Question
For a C99 compiler implementing exact IEEE 754 arithmetic, do values of f, divisor of type float exist such that f / divisor != (float)(f * (1.0 / divisor))?
EDIT: By “implementing exact IEEE 754 arithmetic” I mean a compiler that rightfully defines FLT_EVAL_METHOD as 0.
Context
A C compiler that provides IEEE 754-compliant floating-point can only replace a single-precision division by a constant by a single-precision multiplication by the inverse if said inverse is itself representable exactly as a float.
In practice, this only happens for powers of two. So a programmer, Alex, may be confident that f / 2.0f will be compiled as if it had been f * 0.5f, but if it is acceptable for Alex to multiply by 0.10f instead of dividing by 10, Alex should express it by writing the multiplication in the program, or by using a compiler option such as GCC's -ffast-math.
This question is about transforming a single-precision division into a double-precision multiplication. Does it always produce the correctly rounded result? Is there a chance that it could be cheaper, and thus be an optimization that compilers might make (even without -ffast-math)?
I have compared (float)(f * 0.10) and f / 10.0f for all single-precision values of f between 1 and 2, without finding any counter-example. This should cover all divisions of normal floats producing a normal result.
Then I generalized the test to all divisors with the program below:
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    for (float divisor = 1.0; divisor != 2.0; divisor = nextafterf(divisor, 2.0))
    {
        double factor = 1.0 / divisor; // double-precision inverse
        for (float f = 1.0; f != 2.0; f = nextafterf(f, 2.0))
        {
            float cr = f / divisor;
            float opt = f * factor; // double-precision multiplication
            if (cr != opt)
                printf("For divisor=%a, f=%a, f/divisor=%a but (float)(f*factor)=%a\n",
                       divisor, f, cr, opt);
        }
    }
}
The search space is just large enough to make this interesting (2^46). The program is currently running. Can someone tell me whether it will print something, perhaps with an explanation why or why not, before it has finished?
Your program won't print anything, assuming round-ties-to-even rounding mode. The essence of the argument is as follows:
We're assuming that both f and divisor are between 1.0 and 2.0. So f = a / 2^23 and divisor = b / 2^23 for some integers a and b in the range [2^23, 2^24). The case divisor = 1.0 isn't interesting, so we can further assume that b > 2^23.
The only way that (float)(f * (1.0 / divisor)) could give the wrong result would be for the exact value f / divisor to be so close to a halfway case (i.e., a number exactly halfway between two single-precision floats) that the accumulated errors in the expression f * (1.0 / divisor) push us to the other side of that halfway case from the true value.
But that can't happen. For simplicity, let's first assume that f >= divisor, so that the exact quotient is in [1.0, 2.0). Now any halfway case for single precision in the interval [1.0, 2.0) has the form c / 2^24 for some odd integer c with 2^24 < c < 2^25. The exact value of f / divisor is a / b, so the difference f / divisor - c / 2^24 equals (2^24 a - b c) / (2^24 b); the numerator is a nonzero integer (it can't be zero, as shown in the next paragraph), so the absolute value of the difference is bounded below by 1 / (2^24 b), and hence is at least 1 / 2^48 (since b < 2^24). So we're more than 16 double-precision ulps away from any halfway case, and it should be easy to show that the error in the double-precision computation can never exceed 16 ulps. (I haven't done the arithmetic, but I'd guess it's easy to show an upper bound of 3 ulps on the error.)
So f / divisor can't be close enough to a halfway case to create problems. Note that f / divisor can't be an exact halfway case, either: since c is odd, c and 2^24 are relatively prime, so the only way we could have c / 2^24 = a / b is if b is a multiple of 2^24. But b is in the range (2^23, 2^24), so that's not possible.
The case where f < divisor is similar: the halfway cases then have the form c / 2^25 and the analogous argument shows that abs(f / divisor - c / 2^25) is greater than 1 / 2^49, which again gives us a margin of 16 double-precision ulps to play with.
It's certainly not possible if non-default rounding modes are possible. For example, in replacing 3.0f / 3.0f with 3.0f * C, a value of C less than the exact reciprocal would yield the wrong result in downward or toward-zero rounding modes, whereas a value of C greater than the exact reciprocal would yield the wrong result for upward rounding mode.
It's less clear to me whether what you're looking for is possible if you restrict to default rounding mode. I'll think about it and revise this answer if I come up with anything.
Random search resulted in an example.
Looks like when the result is a "denormal/subnormal" number, the inequality is possible. But then, maybe my platform is not IEEE 754 compliant?
f 0x1.7cbff8p-25
divisor -0x1.839p+116
q -0x1.f8p-142
q2 -0x1.f6p-142
#include <stdio.h>
#include <stdlib.h>  // for rand()

int MyIsFinite(float f) {
    union {
        float f;
        unsigned char uc[sizeof(float)];
        unsigned long ul;
    } x;
    x.f = f;
    return (x.ul & 0x7F800000L) != 0x7F800000L;
}

float floatRandom() {
    union {
        float f;
        unsigned char uc[sizeof(float)];
    } x;
    do {
        size_t i;
        for (i = 0; i < sizeof(x.uc); i++) x.uc[i] = rand();
    } while (!MyIsFinite(x.f));
    return x.f;
}

void testPC() {
    for (;;) {
        volatile float f, divisor, q, qd;
        do {
            f = floatRandom();
            divisor = floatRandom();
            q = f / divisor;
        } while (!MyIsFinite(q));
        qd = (float)(f * (1.0 / divisor));
        if (qd != q) {
            printf("%a %a %a %a\n", f, divisor, q, qd);
            return;
        }
    }
}
Eclipse PC Version: Juno Service Release 2
Build id: 20130225-0426
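The reported counterexample is easy to re-check directly; on a platform with strict IEEE-754 binary32/binary64 arithmetic (FLT_EVAL_METHOD 0) the two expressions should again disagree in the last subnormal bit:

#include <stdio.h>

int main(void) {
    volatile float f = 0x1.7cbff8p-25f;
    volatile float divisor = -0x1.839p+116f;
    float q  = f / divisor;                   // -0x1.f8p-142, a subnormal
    float qd = (float)(f * (1.0 / divisor));  // -0x1.f6p-142 via double rounding
    printf("q=%a qd=%a\n", q, qd);
}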

Floating point linear interpolation

To do a linear interpolation between two variables a and b given a fraction f, I'm currently using this code:
float lerp(float a, float b, float f)
{
return (a * (1.0 - f)) + (b * f);
}
I think there's probably a more efficient way of doing it. I'm using a microcontroller without an FPU, so floating point operations are done in software. They are reasonably fast, but it's still something like 100 cycles to add or multiply.
Any suggestions?
n.b. for the sake of clarity in the code above, 1.0 is written without the explicit float suffix (1.0f).
As Jason C points out in the comments, the version you posted is most likely the best choice, due to its superior precision near the edge cases:
float lerp(float a, float b, float f)
{
return a * (1.0 - f) + (b * f);
}
If we disregard precision for a while, we can simplify the expression as follows:
    a(1 − f) + bf
 = a − af + bf
 = a + f(b − a)
Which means we could write it like this:
float lerp(float a, float b, float f)
{
return a + f * (b - a);
}
In this version we've gotten rid of one multiplication, but lost some precision.
Presuming floating-point math is available, the OP's algorithm is a good one and is always superior to the alternative a + f * (b - a) due to precision loss when a and b significantly differ in magnitude.
For example:
// OP's algorithm
float lint1 (float a, float b, float f) {
return (a * (1.0f - f)) + (b * f);
}
// Algebraically simplified algorithm
float lint2 (float a, float b, float f) {
return a + f * (b - a);
}
In that example, presuming 32-bit floats, lint1(1.0e20, 1.0, 1.0) will correctly return 1.0, whereas lint2 will incorrectly return 0.0.
The majority of precision loss is in the addition and subtraction operators when the operands differ significantly in magnitude. In the above case, the culprits are the subtraction in b - a, and the addition in a + f * (b - a). The OP's algorithm does not suffer from this due to the components being completely multiplied before addition.
For the a=1e20, b=1 case, here is an example of differing results. Test program:
#include <stdio.h>
#include <math.h>
float lint1 (float a, float b, float f) {
return (a * (1.0f - f)) + (b * f);
}
float lint2 (float a, float b, float f) {
return a + f * (b - a);
}
int main () {
const float a = 1.0e20;
const float b = 1.0;
int n;
for (n = 0; n <= 1024; ++ n) {
float f = (float)n / 1024.0f;
float p1 = lint1(a, b, f);
float p2 = lint2(a, b, f);
if (p1 != p2) {
printf("%i %.6f %f %f %.6e\n", n, f, p1, p2, p2 - p1);
}
}
return 0;
}
Output, slightly adjusted for formatting:
f lint1 lint2 lint2-lint1
0.828125 17187500894208393216 17187499794696765440 -1.099512e+12
0.890625 10937500768952909824 10937499669441282048 -1.099512e+12
0.914062 8593750447104196608 8593749897348382720 -5.497558e+11
0.945312 5468750384476454912 5468749834720641024 -5.497558e+11
0.957031 4296875223552098304 4296874948674191360 -2.748779e+11
0.972656 2734375192238227456 2734374917360320512 -2.748779e+11
0.978516 2148437611776049152 2148437474337095680 -1.374390e+11
0.986328 1367187596119113728 1367187458680160256 -1.374390e+11
0.989258 1074218805888024576 1074218737168547840 -6.871948e+10
0.993164 683593798059556864 683593729340080128 -6.871948e+10
1.000000 1 0 -1.000000e+00
If you are on a micro-controller without an FPU then floating point is going to be very expensive. Could easily be twenty times slower for a floating point operation. The fastest solution is to just do all the math using integers.
The number of places after the fixed binary point (http://blog.credland.net/2013/09/binary-fixed-point-explanation.html?q=fixed+binary+point) is: XY_TABLE_FRAC_BITS.
Here's a function I use:
inline uint16_t unsignedInterpolate(uint16_t a, uint16_t b, uint16_t position) {
    uint32_t r1;
    uint16_t r2;

    /*
     * Only one multiply, and one divide/shift right. Shame about having to
     * cast to long int and back again.
     */
    r1 = (uint32_t)position * (b - a);    // position has XY_TABLE_FRAC_BITS fractional bits
    r2 = (r1 >> XY_TABLE_FRAC_BITS) + a;  // scale back down and rebase at a
    return r2;
}
With the function inlined it should be approx. 10-20 cycles.
If you've got a 32-bit micro-controller you'll be able to use bigger integers and get larger numbers or more accuracy without compromising performance. This function was used on a 16-bit system.
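A hypothetical 32-bit variant of the same routine might look like this (FRAC_BITS and the function name are made up for illustration; the 64-bit intermediate keeps the multiply from overflowing, and it assumes b >= a like the original):

#include <stdint.h>

#define FRAC_BITS 16  // fractional bits in 'position' (assumed value)

static inline uint32_t unsignedInterpolate32(uint32_t a, uint32_t b,
                                             uint32_t position) {
    uint64_t r = (uint64_t)position * (b - a);  // widen before multiplying
    return a + (uint32_t)(r >> FRAC_BITS);
}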
If you're coding for a microcontroller without floating-point operations, then it's better not to use floating-point numbers at all, and to use fixed-point arithmetic instead.
Since C++20 you can use std::lerp(), which is likely to be the best possible implementation for your target.
It is worth noting that the standard linear interpolation formulas f1(t) = a + t(b − a), f2(t) = b − (b − a)(1 − t), and f3(t) = a(1 − t) + bt are not guaranteed to be well-behaved when using floating-point arithmetic.
Namely, if a != b, it is not guaranteed that f1(1.0) == b or that f2(0.0) == a, while for a == b, f3(t) is not guaranteed to equal a when 0 < t < 1.
This function has worked for me on processors that support IEEE754 floating point when I need the results to behave well and to hit the endpoints exactly (I use it with double precision, but float should work as well):
double lerp(double a, double b, double t)
{
    if (t <= 0.5)
        return a + (b - a) * t;
    else
        return b - (b - a) * (1.0 - t);
}
If you want the final result to be an integer, it might be faster to use integers for the input as well.
int lerp_int(int a, int b, float f)
{
    // float diff = (float)(b - a);
    // float frac = f * diff;
    // return a + (int)frac;
    return a + (int)(f * (float)(b - a));
}
This does two casts and one float multiply. If a cast is faster than a float add/subtract on your platform, and if an integer answer is useful to you, this might be a reasonable alternative.
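A quick check of that trade-off (restating the function so the snippet stands alone; the cast truncates toward zero, so this is not rounded interpolation):

#include <stdio.h>

static int lerp_int(int a, int b, float f) {
    return a + (int)(f * (float)(b - a));  // the cast truncates toward zero
}

int main(void) {
    printf("%d\n", lerp_int(10, 20, 0.25f));  // 12: exact answer is 12.5, truncated
    printf("%d\n", lerp_int(10, 20, 1.00f));  // 20: this endpoint is hit exactly
}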
