I am trying to calculate the roots of a quadratic equation using the Citardauq Formula, which is a more numerically stable way to calculate those roots. However, when, for example, I enter the equation x^2+200x-0.000002=0 this program does not calculate the roots precisely. Why? I don't find any error in my code and the catastrophic cancellation should not occur here.
You can find why the Citardauq formula works here (second answer).
#include <stdio.h>
#include <math.h>
int main()
{
double a, b, c, determinant;
double root1, root2;
printf("Introduce coefficients a b and c:\n");
scanf("%lf %lf %lf", &a, &b, &c);
determinant = b * b - 4 * a * c;
if (0 > determinant)
{
printf("The equation has no real solution\n");
return 0;
}
if (b > 0)
{
root1 = (-b - sqrt(determinant)) / (2 * a);
root2 = (c / (a * root1));
printf("The solutions are %.16lf and %.16lf\n", root1, root2);
}
else if (b < 0)
{
root1 = (-b + sqrt(determinant)) / (2 * a);
root2 = (c / (a * root1));
printf("The solutions are %.16lf and %.16lf\n", root1, root2);
}
}
Welcome to numerical computations.
There are a few issues here:
1) As pointed out by some-programmer-dude, floating point numbers cannot always be represented exactly:
Is floating point math broken?
For 0.1 in the standard binary64 format, the representation can be
written exactly as
0.1000000000000000055511151231257827021181583404541015625
2) Double precision (double) gives you only 52 explicitly stored significand bits (53 bits of precision counting the implicit leading bit), 11 exponent bits, and 1 sign bit.
Floating point numbers in C use IEEE 754 encoding.
3) sqrt precision is also limited.
In your case, the solution is as follows:
You can see that, from a precision point of view, this is not an easy equation.
Online calculator 1
gives the solutions as:
1.0000007932831068e-8 -200.00000001
Your program is better:
Introduce coefficients a b and c:
1
200
-0.000002
The solutions are -200.0000000100000079 and 0.0000000100000000
So one of the roots is -200.000000010000. Forget about the rest of the digits.
This is exactly what one can expect, since double has about 15 to 17 significant decimal digits of precision!
Related
Given the alternating harmonic series 1 - 1/2 + 1/3 - 1/4 + ... = ln(2), is it possible to get a value of 0.69314718056 using only float values and only basic operations (+, -, *, /)? Are there any algorithms which can increase the precision of this calculation without going to unreasonably high values of n (the current reasonable limit is 1e10)?
What I currently have: this nets me 8 correct digits -> 0.6931471825
EDIT
The goal is to compute the most precise summation value using only float datatypes
#include <math.h>
#include <stdio.h>
int main(void)
{
float sum = 0;
int n = 1e9;
double ans = log(2);
int i;
float r = 0;
for (i = n; i > 0; i--) {
r = i - (2*(i/2));
if(r == 0){
sum -= 1.0000000 / i;
}else{
sum += 1.0000000 / i;
}
}
printf("\n%.10f", sum);
printf("\n%.10f", ans);
return 0;
}
On systems where a float is a single-precision IEEE floating point number, it has 24 bits of precision, which is roughly 7 decimal digits (log10(2^24) ≈ 7.22).
If you change
double ans = log(2);
to
float ans = log(2);
You'll see you already get the best answer possible.
0.6931471 82464599609375 From log(2), cast to float
0.6931471 82464599609375 From your algorithm
0.6931471 8055994530941723... Actual value
\_____/
7 digits
In fact, if you use %A instead of %f, you'll see you get the same answer to the bit.
0X1.62E43P-1 // From log(2), cast to float
0X1.62E43P-1 // From your algorithm
@ikegami already showed this answer in decimal and hex, but to make it even more clear, here are the numbers in binary.
ln(2) is actually:
0.1011000101110010000101111111011111010001110011111…
Rounded to 24 bits, that is:
0.101100010111001000011000
Converted back to decimal, that is:
0.693147182464599609375
...which is the number you got. You simply can't do any better than that, in the 24 bits of precision you've got available in a single-precision float.
So I have this question in statistics that I need to solve using C programming. We have to calculate the values of MSE for various values of theta (the population parameter of an exponential distribution) and n (the sample size). We hold theta constant and calculate the MSE for various values of n, and then hold n constant and calculate the MSE for various values of theta.
Then we tabulate the results.
This is my program
# include <stdio.h>
# include <math.h>
# include <stdlib.h>
int main(void)
{
int n ,N=5;
float theta,msum =0.0,mse; //parameters
int i,j; // loop
printf("Enter the value of n ");
scanf("%d", &n);
printf("Enter the value of theta ");
scanf("%f ", &theta);
//float u[n], x[n];
//first we fix theta and find MSE for different values of n
for(i=0;i<N;i++)
{
float sum = 0.0;
for(j=0;j<n;j++)
{
//x[j] = (-1/theta)*log(1-(rand()/RAND_MAX));
sum += (-1/theta)*log(1-(rand()/RAND_MAX)); //generates random number from unifrom dist and then converts it to exponential using inverse cdf function
printf("%d%d", i, j);
}
float thetahat = n/sum;
msum += (thetahat - theta)*(thetahat - theta);
}
mse = msum/N;
printf("The MSE with n=%d and theta=%f is %f", n, theta, mse);
return 0;
}
However, this program is not giving any output. I tried multiple IDEs.
Error count is zero. What am I doing wrong?
Use floating point division
rand()/RAND_MAX is int division with a quotient of 0 or 1. Use 1.0 * rand() / RAND_MAX to coax a floating point division.
Avoid log(0)
log(1-(rand()/RAND_MAX)) risks log(0), even with 1.0 * rand() / RAND_MAX. I suspect log(1.0 * (RAND_MAX + 1LL - rand()) / (RAND_MAX + 1LL)) will achieve your goal.
Why the space?
The trailing space in scanf("%f ", &theta) obliges scanf() to not return until non-white-space inputs occurs after the number. Drop the space and check the return value.
if (scanf("%f", &theta) != 1) {
; // Handle bad input
}
double vs. float
Code curiously uses float objects, yet double function calls.
Use double as the default floating point type in C unless you have a compelling need for float.
I have a problem with floating point rounding. I want to calculate with floating point numbers and round the results to a given number N of decimals. In this example I want to round to 1 decimal place.
The calculation 37.1-28.75 results in the floating point value 8.349998 (instead of 8.35), which makes printf round to 8.3 instead of 8.4 for 1 decimal place.
The actual mathematical result is 37.10-28.75=8.35000000, but due to floating point imprecision it becomes 8.349998, which is then rounded to 8.3 instead of 8.4 when rounding to 1 decimal place.
Minimum reproducible example:
float a = 37.10;
float b = 28.75;
//a-b = 8.35 = 8.4
printf("%.1f\n", a - b); //outputs 8.3 instead of 8.4
Is it valid to add following to the result:
float result = a - b;
if (result > 0.0f)
{
result += powf(10, -nr_of_decimals - 1) / 2;
}
else
{
result -= powf(10, -nr_of_decimals - 1) / 2;
}
EDIT: corrected that I want 1 decimal place rounded output, not 2 decimal places
EDIT2: negative results are needed as well (28.75-37.1 = -8.4)
On my system I do actually get 8.35. It's possible that you have to set the rounding direction to "nearest" first, try this (compile with e.g. gcc ... -lm):
#include <fenv.h>
#include <stdio.h>
int main()
{
float a = 37.10;
float b = 28.75;
float res = a - b;
fesetround(FE_TONEAREST);
printf("%.2f\n", res);
}
Binary floating point is, after all, binary, and if you do care about the correct decimal rounding this much, then your choices would be:
decimal floating point, or
fixed point.
I'd say the solution is to use fixed point, especially if you're on embedded, and forget about everything else.
With
int32_t a = 3710;
int32_t b = 2875;
the result of
a - b
will exactly be
835
every time; and then you just need to have a simple fixed point printing routine for the desired precision, and check the following digit after the last digit to see if it needs to be rounded up.
If you want to round to 2 decimals, you can add 0.005 to the result and then offset it with floorf:
float f = 37.10f - 28.75f;
float r = floorf((f + 0.005f) * 100.f) / 100.f;
printf("%f\n", r);
The output is 8.350000
Why are you using floats instead of doubles?
Regarding your question:
Is it valid to add following to the result:
float result = a - b;
if (result > 0.0f)
{
result += powf(10, -nr_of_decimals - 1) / 2;
}
else
{
result -= powf(10, -nr_of_decimals - 1) / 2;
}
It doesn't seem so, on my computer I get 8.350498 instead of 8.350000.
After your edit:
Calculation 37.1-28.75 will result into floating point 8.349998, which will result printf rounding to 8.3 instead of 8.4.
Then
float r = roundf((f + (f < 0.f ? -0.05f : +0.05f)) * 10.f) / 10.f;
is what you are looking for.
In this example, the behaviour of floor differs and I do not understand why:
printf("floor(34000000.535 * 100 + 0.5) : %lf \n", floor(34000000.535 * 100 + 0.5));
printf("floor(33000000.535 * 100 + 0.5) : %lf \n", floor(33000000.535 * 100 + 0.5));
The output for this code is:
floor(34000000.535 * 100 + 0.5) : 3400000053.000000
floor(33000000.535 * 100 + 0.5) : 3300000054.000000
Why does the first result not equal to 3400000054.0 as we could expect?
double in C does not represent every possible number that can be expressed in text.
double can typically represent about 2^64 different numbers. Neither 34000000.535 nor 33000000.535 is in that set when double is encoded as a binary floating point number. Instead, the closest representable number is used.
Text 34000000.535
closest double 34000000.534999996423...
Text 33000000.535
closest double 33000000.535000000149...
With double as a binary floating point number, multiplying by a non-power-of-2, like 100.0, can introduce additional rounding differences. Yet in these cases it still results in products where one is at or above xxx.5 and the other is below.
Adding 0.5, a simple power of 2, does not incur rounding issues, as the value is not extreme compared to 3x00000053.5.
Printing the intermediate results with higher precision shows the step-by-step process.
#include <stdio.h>
#include <float.h>
#include <math.h>
void fma_test(double a, double b, double c) {
int n = DBL_DIG + 3;
printf("a b c %.*e %.*e %.*e\n", n, a, n, b, n, c);
printf("a*b %.*e\n", n, a*b);
printf("a*b+c %.*e\n", n, a*b+c);
printf("a*b+c %.*e\n", n, floor(a*b+c));
puts("");
}
int main(void) {
fma_test(34000000.535, 100, 0.5);
fma_test(33000000.535, 100, 0.5);
}
Output
a b c 3.400000053499999642e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b 3.400000053499999523e+09
a*b+c 3.400000053999999523e+09
a*b+c 3.400000053000000000e+09
a b c 3.300000053500000015e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b 3.300000053500000000e+09
a*b+c 3.300000054000000000e+09
a*b+c 3.300000054000000000e+09
The issue is more complex than this simple answer suggests, as various platforms can 1) use higher precision math like long double or 2) rarely, use a decimal floating point double. So the code's results may vary.
This question has already been answered here.
Basically, floating point numbers are just approximations. If we have a program like this:
float a = 0.2 + 0.3;
float b = 0.25 + 0.25;
if (a == b) {
//might happen
}
if (a != b) {
// also might happen
}
The only guaranteed thing is that a-b is relatively small.
Question
For a C99 compiler implementing exact IEEE 754 arithmetic, do values of f, divisor of type float exist such that f / divisor != (float)(f * (1.0 / divisor))?
EDIT: By “implementing exact IEEE 754 arithmetic” I mean a compiler that rightfully defines FLT_EVAL_METHOD as 0.
Context
A C compiler that provides IEEE 754-compliant floating-point can only replace a single-precision division by a constant by a single-precision multiplication by the inverse if said inverse is itself representable exactly as a float.
In practice, this only happens for powers of two. So a programmer, Alex, may be confident that f / 2.0f will be compiled as if it had been f * 0.5f, but if it is acceptable for Alex to multiply by 0.10f instead of dividing by 10, Alex should express it by writing the multiplication in the program, or by using a compiler option such as GCC's -ffast-math.
This question is about transforming a single-precision division into a double-precision multiplication. Does it always produce the correctly rounded result? Is there a chance that it could be cheaper, and thus be an optimization that compilers might make (even without -ffast-math)?
I have compared (float)(f * 0.10) and f / 10.0f for all single-precision values of f between 1 and 2, without finding any counter-example. This should cover all divisions of normal floats producing a normal result.
Then I generalized the test to all divisors with the program below:
#include <float.h>
#include <math.h>
#include <stdio.h>
int main(void){
for (float divisor = 1.0; divisor != 2.0; divisor = nextafterf(divisor, 2.0))
{
double factor = 1.0 / divisor; // double-precision inverse
for (float f = 1.0; f != 2.0; f = nextafterf(f, 2.0))
{
float cr = f / divisor;
float opt = f * factor; // double-precision multiplication
if (cr != opt)
printf("For divisor=%a, f=%a, f/divisor=%a but (float)(f*factor)=%a\n",
divisor, f, cr, opt);
}
}
}
The search space is just large enough to make this interesting (2^46). The program is currently running. Can someone tell me whether it will print something, perhaps with an explanation why or why not, before it has finished?
Your program won't print anything, assuming round-ties-to-even rounding mode. The essence of the argument is as follows:
We're assuming that both f and divisor are between 1.0 and 2.0. So f = a / 2^23 and divisor = b / 2^23 for some integers a and b in the range [2^23, 2^24). The case divisor = 1.0 isn't interesting, so we can further assume that b > 2^23.
The only way that (float)(f * (1.0 / divisor)) could give the wrong result would be for the exact value f / divisor to be so close to a halfway case (i.e., a number exactly halfway between two single-precision floats) that the accumulated errors in the expression f * (1.0 / divisor) push us to the other side of that halfway case from the true value.
But that can't happen. For simplicity, let's first assume that f >= divisor, so that the exact quotient is in [1.0, 2.0). Now any halfway case for single precision in the interval [1.0, 2.0) has the form c / 2^24 for some odd integer c with 2^24 < c < 2^25. The exact value of f / divisor is a / b, so the absolute value of the difference f / divisor - c / 2^24 is bounded below by 1 / (2^24 b), so is at least 1 / 2^48 (since b < 2^24). So we're more than 16 double-precision ulps away from any halfway case, and it should be easy to show that the error in the double precision computation can never exceed 16 ulps. (I haven't done the arithmetic, but I'd guess it's easy to show an upper bound of 3 ulps on the error.)
So f / divisor can't be close enough to a halfway case to create problems. Note that f / divisor can't be an exact halfway case, either: since c is odd, c and 2^24 are relatively prime, so the only way we could have c / 2^24 = a / b is if b is a multiple of 2^24. But b is in the range (2^23, 2^24), so that's not possible.
The case where f < divisor is similar: the halfway cases then have the form c / 2^25 and the analogous argument shows that abs(f / divisor - c / 2^25) is greater than 1 / 2^49, which again gives us a margin of 16 double-precision ulps to play with.
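Written out, the lower bound used in both cases follows from putting the difference over a common denominator. The numerator is a nonzero integer (nonzero because the quotient is never an exact halfway case), so its absolute value is at least 1. The ulp sizes quoted at the end are those of binary64 in [1, 2) and [1/2, 1) respectively.

```latex
% Case f >= divisor: halfway cases are c / 2^{24} with c odd.
\left| \frac{a}{b} - \frac{c}{2^{24}} \right|
  = \frac{\left| 2^{24} a - b c \right|}{2^{24}\, b}
  \ge \frac{1}{2^{24}\, b}
  > \frac{1}{2^{48}}
  = 16 \cdot 2^{-52}
  \qquad (b < 2^{24})

% Case f < divisor: halfway cases are c / 2^{25} with c odd.
\left| \frac{a}{b} - \frac{c}{2^{25}} \right|
  = \frac{\left| 2^{25} a - b c \right|}{2^{25}\, b}
  \ge \frac{1}{2^{25}\, b}
  > \frac{1}{2^{49}}
  = 16 \cdot 2^{-53}
```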
It's certainly not possible if non-default rounding modes are possible. For example, in replacing 3.0f / 3.0f with 3.0f * C, a value of C less than the exact reciprocal would yield the wrong result in downward or toward-zero rounding modes, whereas a value of C greater than the exact reciprocal would yield the wrong result for upward rounding mode.
It's less clear to me whether what you're looking for is possible if you restrict to default rounding mode. I'll think about it and revise this answer if I come up with anything.
Random search resulted in an example.
Looks like when the result is a "denormal/subnormal" number, the inequality is possible. But then, maybe my platform is not IEEE 754 compliant?
f 0x1.7cbff8p-25
divisor -0x1.839p+116
q -0x1.f8p-142
q2 -0x1.f6p-142
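#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
int MyIsFinite(float f) {
union {
float f;
unsigned char uc[sizeof (float)];
uint32_t ul; // exactly 32 bits; unsigned long may be 64 bits wide and read past the float
} x;
x.f = f;
return (x.ul & 0x7F800000UL) != 0x7F800000UL;
}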
float floatRandom() {
union {
float f;
unsigned char uc[sizeof (float)];
} x;
do {
size_t i;
for (i=0; i<sizeof(x.uc); i++) x.uc[i] = rand();
} while (!MyIsFinite(x.f));
return x.f;
}
void testPC() {
for (;;) {
volatile float f, divisor, q, qd;
do {
f = floatRandom();
divisor = floatRandom();
q = f / divisor;
} while (!MyIsFinite(q));
qd = (float) (f * (1.0 / divisor));
if (qd != q) {
printf("%a %a %a %a\n", f, divisor, q, qd);
return;
}
}
}
Eclipse PC Version: Juno Service Release 2
Build id: 20130225-0426