I am trying to repeatedly multiply the decimal part of a double about 500 times. The number starts to lose precision as the iterations go on. Is there any trick to keep the continued multiplication accurate?
double x = 0.3;
double binary = 2.0;
for (int i = 0; i < 500; i++) {
    x = x * binary;
    printf("x equals to : %f\n", x);
    if (x >= 1.0)
        x = x - 1.0;
}
OK, after reading some of the things you posted, I am wondering how I could remove this unwanted part from my number to keep the multiplication stable. For instance, in my example the decimal parts will change in this manner: 0.3, 0.6, 0.2, 0.4, 0.8, ... Can we cut off the rest to keep these numbers?
With typical floating point (IEEE 754 binary64), double x = 0.3; results in x holding a value more like 0.29999999999999998890..., so the code is off from the very beginning.
Scale x by 10 so every intermediate value is a small integer, which binary floating point represents exactly - or use a decimal64 double.
#include <stdio.h>

int main(void) {
    double x = 3.0;
    double binary = 2.0;
    printf("x equals to : %.20f\n", x);
    for (int i = 0; i < 500; i++) {
        x = x * binary;
        printf("x equals to : %.20f\n", x / 10);
        if (x >= 10.0)
            x = x - 10;
    }
    return 0;
}
In general, floating point math is not completely precise, as shown in the other answers and in many online resources. The problem is that certain numbers cannot be represented exactly in binary. 0.3 is such a number, whereas natural numbers (up to 2^53 for a double) can be represented exactly. So you could change your program to this:
double x = 3.0;
double binary = 2.0;
for (int i = 0; i < 500; i++) {
    x = x * binary;
    printf("x equals to : %f\n", x / 10.0);
    if (x >= 10.0)
        x = x - 10.0;
}
Although your program is doing some very unusual things, the main answer to your question is that this is simply how floating point numbers work: they are imprecise.
http://floating-point-gui.de/basic/
Related
I was calculating e^x using its Taylor series and noticed that when we calculate it for negative x, the absolute error is large. Is it because we don't have enough precision to calculate it?
(I know that to prevent it we can use e^(-x)=1/e^x)
#include <stdio.h>
#include <math.h>

double Exp(double x);

int main(void)
{
    double x;
    printf("x=");
    scanf("%le", &x);
    printf("%le", Exp(x));
    return 0;
}

double Exp(double x)
{
    double h, eps = 1.e-16, Sum = 1.0;
    int i = 2;
    h = x;
    do
    {
        Sum += h;
        h *= x / i;
        i++;
    } while (fabs(h) > eps);
    return Sum;
}
For example:
x = -40: the true value is 4.24835e-18, but the program gives me 3.116952e-01. The absolute error is ~0.311.
x = -50: the true value is 1.92875e-22, but the program gives me 2.041833e+03. The absolute error is ~2041.833.
The problem is caused by rounding errors in the middle phase of the algorithm.
The term h grows quickly, as 40/2 * 40/3 * 40/4 * ..., and oscillates in sign. The values of i, h and Sum for x = -40 over consecutive iterations are shown below (some data points omitted for brevity):
x=-40
i=2 h=800 Sum=-39
i=3 h=-10666.7 Sum=761
i=4 h=106667 Sum=-9905.67
i=5 h=-853333 Sum=96761
i=6 h=5.68889e+06 Sum=-756572
...
i=37 h=-1.37241e+16 Sum=6.63949e+15
i=38 h=1.44464e+16 Sum=-7.08457e+15
i=39 h=-1.48168e+16 Sum=7.36181e+15
i=40 h=1.48168e+16 Sum=-7.45499e+15
i=41 h=-1.44554e+16 Sum=7.36181e+15
i=42 h=1.37671e+16 Sum=-7.09361e+15
i=43 h=-1.28066e+16 Sum=6.67346e+15
i=44 h=1.16423e+16 Sum=-6.13311e+15
i=45 h=-1.03487e+16 Sum=5.50923e+15
i=46 h=8.99891e+15 Sum=-4.83952e+15
...
i=97 h=-2610.22 Sum=1852.36
i=98 h=1065.4 Sum=-757.861
i=99 h=-430.463 Sum=307.534
...
i=138 h=1.75514e-16 Sum=0.311695
i=139 h=-5.05076e-17 Sum=0.311695
3.116952e-01
The peak magnitude of Sum is about 7e15. This is where the precision is lost: type double carries roughly 16 significant digits (a relative accuracy of about 1e-16), so at magnitude 7e15 the representation error is on the order of 0.1 to 1.
As the expected result (the value of exp(-40)) is close to zero, the final absolute error is close to the maximal absolute error of the partial sums.
For x = -50 the peak magnitude of Sum is about 1.5e20, which puts the absolute error due to the finite representation of double at about 1e3 to 1e4, close to the observed one.
Not much can be fixed without significant changes to the algorithm that avoid forming those huge partial sums. Alternatively, compute exp(-x) as 1/exp(x).
For negative x, adding the alternating +/- terms creates a computational problem even in the first sum 1.0 + x, as the final sum's error can be expected to be as bad as the least significant bit of 1.0, or about 1 part in 10^16. This implies that x_min, as in Exp(x_min) == 1.0e-16, is the minimum useful computational value (i.e., x about -36).
A simple solution is to form a good Exp(positive_x) and for negative values ...
double Exp(double x) {
    if (x < 0) {
        return 1.0 / Exp(-x);
    }
    ...
A good (and simple) Exp(positive_x) adds terms until term + 1.0 is still 1.0, since additional small terms no longer change the sum significantly. It works well for all x (very small error), though it could use improvement when the result should be a subnormal.
#include <math.h>

double my_exp(double x) {
    if (x < 0) {
        return 1.0 / my_exp(-x);    // avoid the alternating series entirely
    }
    double sum = 1.0;
    unsigned n = 1;
    double term = 1.0;
    do {
        term *= x / n++;
        sum += term;
        if (!isfinite(term)) {
            return term;            // overflow: pass the infinity through
        }
    } while (1.0 != term + 1.0);    // stop once term no longer affects the sum
    return sum;
}
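A quick sanity check (a sketch, assuming my_exp() above is in scope) is to compare it against the library exp() at the problem points from the question:

#include <math.h>
#include <stdio.h>

double my_exp(double x);   // the function above

int main(void) {
    const double tests[] = {-50.0, -40.0, -1.0, 0.0, 1.0, 40.0, 50.0};
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        printf("x = %5.1f  my_exp = %.15e  exp = %.15e\n",
               tests[i], my_exp(tests[i]), exp(tests[i]));
    }
    return 0;
}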
I wrote code that calculates sin using its Maclaurin series, and it works, but when I try to compute it for large x values and offset that by giving a large order N (the length of the sum), it eventually overflows and doesn't give correct results. This is the code, and I would like to know whether there is an additional way to optimize it so it works for large x values too (it already works great for small x values and really big N values).
Here is the code:
long double calcMaclaurinPolynom(double x, int N){
    long double result = 0;
    long double atzeretCounter = 2;
    int sign = 1;
    long double fraction = x;
    for (int i = 0; i <= N; i++)
    {
        result += sign * fraction;
        sign = -sign;
        fraction = fraction * ((x * x) / (atzeretCounter * (atzeretCounter + 1)));
        atzeretCounter += 2;
    }
    return result;
}
The major issue is using the series outside the range where it converges well.
Since OP said "converted x to radX = (x*PI)/180", the OP is starting with degrees rather than radians, and that is lucky. The first step in finding my_sin(x) is range reduction, and when starting with degrees, the reduction is exact. So reduce the range before converting to radians.
long double calcMaclaurinPolynom(double x /* degrees */, int N){
    // Reduce to range -360 to 360
    // This reduction is exact, no round-off error
    x = fmod(x, 360);

    // Reduce to range -180 to 180
    if (x >= 180) {
        x -= 180;
        x = -x;
    } else if (x <= -180) {
        x += 180;
        x = -x;
    }

    // Reduce to range -90 to 90
    if (x >= 90) {
        x = 180 - x;
    } else if (x <= -90) {
        x = -180 - x;
    }

    // Now convert to radians.
    x = x*PI/180;

    // continue with regular code
Alternatively, use remquo() (available since C99). Search SO for sample code.
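For reference, a minimal sketch of what a remquo()-based reduction might look like (an illustration, not code from this answer; it dispatches to the library sin() and cos() on the reduced argument, where the Maclaurin routine could be substituted):

#include <math.h>

// remquo(x, 90.0, &q) returns a remainder in [-45, +45] (exact in degrees)
// and stores at least the low 3 bits of the rounded quotient x/90 in q.
// q & 3 then selects the quadrant; this relies on two's-complement
// wraparound for negative quotients.
double sin_degrees(double x) {
    int q;
    double r = remquo(x, 90.0, &q);
    double rad = r * 3.141592653589793 / 180.0;
    switch (q & 3) {
        case 0:  return  sin(rad);   // x ~   0 + r degrees
        case 1:  return  cos(rad);   // x ~  90 + r degrees
        case 2:  return -sin(rad);   // x ~ 180 + r degrees
        default: return -cos(rad);   // x ~ 270 + r degrees
    }
}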
As @user3386109 commented above, there is no need to "convert back to degrees".
[Edit]
With typical summation series, summing the least significant terms first improves the precision of the answer. With OP's code this can be done with
for (int i = N; i >= 0; i--)
Alternatively, rather than iterating a fixed number of times, loop until the term has no significance to the sum. The following uses recursion to sum the least significant terms first. With range reduction in the -90 to 90 range, the number of iterations is not excessive.
static double sin_d_helper(double term, double xx, unsigned i) {
    if (1.0 + term == 1.0)
        return term;
    return term - sin_d_helper(term * xx / ((i + 1) * (i + 2)), xx, i + 2);
}

#include <math.h>
double sin_d(double x_degrees) {
    // range reduction and d --> r conversion from above
    double x_radians = ...
    return x_radians * sin_d_helper(1.0, x_radians * x_radians, 1);
}
You can avoid the sign variable by incorporating the sign into the fraction update, as in (-x*x); see the sketch below.
With your algorithm you do not have problems with integer overflow in the factorials.
As soon as x*x < (2*k)*(2*k+1) the error - assuming exact evaluation - is bounded by abs(fraction), i.e., the size of the next term in the series.
For large x the biggest source of error is truncation, i.e., floating point errors that are magnified via cancellation between the terms of the alternating series. For k about x/2, the terms around the k-th term have the biggest size and have to be cancelled by other big terms.
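A sketch of that sign-folding (hypothetical helper name, same recurrence as the question's code):

long double sinMaclaurin(double x, int N) {
    long double term = x;
    long double result = 0;
    long double mxx = -(long double)x * x;   // negated square carries the sign
    long double k = 2;
    for (int i = 0; i <= N; i++) {
        result += term;
        term *= mxx / (k * (k + 1));         // no separate sign variable needed
        k += 2;
    }
    return result;
}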
Halving-and-Squaring
One easy method to deal with large x without using the value of pi is to employ the double-angle identities
sin(2*x)=2*sin(x)*cos(x)
cos(2*x)=2*cos(x)^2-1=cos(x)^2-sin(x)^2
and first reduce x by halving, simultaneously evaluating the Maclaurin series for sin(x/2^n) and cos(x/2^n), and then employ trigonometric squaring (literally squaring cos(x)+i*sin(x) as a complex number) to recover the values for the original argument.
cos(x/2^(n-1)) = cos(x/2^n)^2-sin(x/2^n)^2
sin(x/2^(n-1)) = 2*sin(x/2^n)*cos(x/2^n)
then
cos(x/2^(n-2)) = cos(x/2^(n-1))^2-sin(x/2^(n-1))^2
sin(x/2^(n-2)) = 2*sin(x/2^(n-1))*cos(x/2^(n-1))
etc.
See https://stackoverflow.com/a/22791396/3088138 for the simultaneous computation of sin and cos values, then encapsulate it with
def CosSinForLargerX(x, n):
    k = 0
    while abs(x) > 1:
        k += 1; x /= 2
    c, s = getCosSin(x, n)
    r2 = 1                     # so the k == 0 case does not divide by zero
    for i in range(k):
        s2 = s*s; c2 = c*c; r2 = s2 + c2
        s = 2*c*s
        c = c2 - s2
    return c/r2, s/r2
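For a C flavour of the same idea, a sketch might look like this (maclaurin_sin_cos is a stand-in for the simultaneous series evaluation from the linked answer, with a fixed term count that is ample for |x| <= 1):

#include <math.h>

// Sum the sin and cos Maclaurin series simultaneously for |x| <= 1.
static void maclaurin_sin_cos(double x, double *s, double *c) {
    double ts = x, tc = 1.0;          // running terms of the two series
    *s = x; *c = 1.0;
    for (int k = 1; k < 20; k++) {    // 20 terms: beyond double precision at |x| <= 1
        tc *= -x * x / ((2*k - 1) * (2*k));
        ts *= -x * x / ((2*k) * (2*k + 1));
        *c += tc;
        *s += ts;
    }
}

// Halve x until |x| <= 1, evaluate the series, then double the angle back up.
void sin_cos_large(double x, double *s, double *c) {
    int k = 0;
    while (fabs(x) > 1.0) { k++; x /= 2; }
    maclaurin_sin_cos(x, s, c);
    while (k--) {
        double s2 = 2.0 * *s * *c;        // sin(2a) = 2 sin(a) cos(a)
        double c2 = *c * *c - *s * *s;    // cos(2a) = cos(a)^2 - sin(a)^2
        *s = s2; *c = c2;
    }
}

Each squaring step amplifies the accumulated error somewhat, which is the price paid for avoiding the value of pi.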
I am trying to compute the Maclaurin series for e^(-x) = 1 - x + (x^2 / 2!) - (x^3 / 3!) + ...
My values seem to work up to a certain point and then deviate completely. Is there something wrong with rounding or am I using the wrong type of variable for such a question?
int i;
double sum = 0;
double x = 8.3;
for (i = 0; i < 26; i++)
{
    sum = sum + (((pow(-1, i)) * (pow(x, i))) / factorial(i));
    printf("Sum = %.12f\n\n\n", sum);
}
return 0;
I don't understand why, but up to the 12th term the values are correct; after that, they begin to differ completely.
Presumably your factorial function, which you're not showing, is performing integer arithmetic. After 12! you're going to overflow a 32-bit integer. Switch to using double in the factorial function too.
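For illustration, a corrected version might look like this (a sketch: factorial() is the function the question does not show, so this is one plausible repair rather than the original code):

#include <math.h>
#include <stdio.h>

// Computing the factorial in double sidesteps the 32-bit overflow at 13!.
static double factorial(int n) {
    double f = 1.0;
    for (int k = 2; k <= n; k++)
        f *= k;
    return f;
}

int main(void) {
    double sum = 0.0, x = 8.3;
    for (int i = 0; i < 26; i++) {
        sum += pow(-1, i) * pow(x, i) / factorial(i);
        printf("Sum = %.12f\n", sum);
    }
    return 0;
}

A running term updated with term *= -x / i each iteration would avoid pow() and factorial() altogether, and with them this class of overflow.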
Why is it that when I run the C code
float x = 4.2;
int y = 0;
y = x * 100;
printf("%i\n", y);
I get 419 back? Shouldn't it be 420?
This has me stumped.
To illustrate, look at the intermediate values:
int main()
{
    float x = 4.2;
    int y;
    printf("x = %f\n", x);
    printf("x * 100 = %f\n", x * 100);
    y = x * 100;
    printf("y = %i\n", y);
    return 0;
}
x = 4.200000 // Original x
x * 100 = 419.999981 // Floating point multiplication precision
y = 419 // Assign to int truncates
Per @Lutzi's excellent suggestion, this is illustrated more clearly if we print all the float values with higher precision than they actually carry:
...
printf("x = %.20f\n", x);
printf("x * 100 = %.20f\n", x * 100);
...
And then you can see that the value assigned to x isn't perfectly precise to start with:
x = 4.19999980926513671875
x * 100 = 419.99998092651367187500
y = 419
A floating point number is stored as an approximation - not the exact decimal value you wrote - and this representation is why the result gets truncated when you convert it to an integer. You can see more information about the representation here.
[Figure: example bit representation of a single-precision floating point number - 1 sign bit, 8 exponent bits, 23 mantissa bits]
float isn't large enough to store 4.2 precisely. If you print x with enough precision you'll probably see it come out as 4.19999995 or so. Multiplying by 100 yields 419.999995, and the integer assignment truncates rather than rounds. It should work if you make x a double.
4.2 is not in the finite number space of a float, so the system uses the closest possible approximation, which is slightly below 4.2. If you now multiply this by 100 (which is exactly representable as a float), you get 419.99something. Converting that to an int for the %i printout performs no rounding, only truncation - so you get 419.
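A sketch of the two usual remedies (not from the answers above): round explicitly instead of truncating, or switch to double, whose closest value to 4.2 times 100 lands at or just above 420:

#include <math.h>
#include <stdio.h>

int main(void) {
    float x = 4.2f;
    int y_rounded = (int)lround(x * 100);   // round to nearest instead of truncating
    double xd = 4.2;
    int y_truncated = (int)(xd * 100);      // double's error lands on the high side here
    printf("%d %d\n", y_rounded, y_truncated);   // both print 420
    return 0;
}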
I am calculating g with e and s, which are all doubles. After that I want to cut off all digits after the second and save the result in x, for example:
g = 2.123 => x = 2.12
g = 5.34995 => x = 5.34
and so on. I use...
g = 0.5*e + 0.5*s;
x = floor(g*100)/100;
...and it works fine most of the time. But sometimes I get strange results. For example:
e = 3.0
s = 1.6
g = 2.30
but x = 2.29!!!
So I tried to track down the error:
g = 0.5*e + 0.5*s;
NSLog(@"%f", g);
gives me g = 2.30
g = g * 100;
NSLog(@"%f", g);
gives me g = 230.0
x = floor(g);
NSLog(@"%f", x);
results in x = 229.0 !!!
I don't get it! Help please! :-)
This will be due to floating point calculations.
Your calculation
g * 100
already brings back
229.99999999999997
which is where your issue stems from.
Have a look at INFO: Precision and Accuracy in Floating-Point Calculations
Also have a look at Floating point
Accuracy problems
The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision with which computers generally represent numbers.
As others have already mentioned, this is due to the limited precision of floating point numbers in computers. These imprecisions show up everywhere a hard yes/no decision about a floating point number is made. In order to resolve the problem, you can add/subtract a small number to find an answer that is correct up to a certain accuracy.
You may find functions like these useful:
#include <math.h>
#include <stdbool.h>

#define ACC 1e-7
double floorAcc( double x ) { return floor(x + ACC); }
double ceilAcc( double x ) { return ceil(x - ACC); }
bool isLessThanAcc( double x, double y ) { return (x + ACC) < y; }
bool isEqualAcc( double x, double y ) { return (x + ACC) > y && (x - ACC) < y; }
Of course, these work only in a limited number range. When working with very small or very large numbers, you need to pick another value for ACC.
Note that the value of 'ACC' is in general dependent on the accuracy of the numbers in your application, not on the value of x. For example, comparing two numbers a and b for equality can be done in two ways: isEqualAcc(a, b) and isEqualAcc(a-b, 0). You would want the same result from both ways, even though in the second way the number x is likely much smaller.
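Applied to the failing case from the question, a usage sketch (assuming the definitions above are in scope) would be:

#include <math.h>
#include <stdio.h>

int main(void) {
    double e = 3.0, s = 1.6;
    double g = 0.5 * e + 0.5 * s;             // actually 2.29999999999999982...
    double plain = floor(g * 100) / 100;      // 2.29 - the surprising result
    double fixed = floorAcc(g * 100) / 100;   // 2.30 - the tolerance absorbs the error
    printf("plain = %.2f, fixed = %.2f\n", plain, fixed);
    return 0;
}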
Here is a possible approach using intermediate integer results:
double e = 3.0;
double s = 1.6;
NSInteger e1 = e * .5 * 100.0; // 150
NSInteger s1 = s * .5 * 100.0; // 80
double x = (e1 + s1)/100.0; // 2.3