I'm doing some trigonometry calculations in C/C++ and am running into problems with rounding errors. For example, on my Linux system:
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[]) {
printf("%e\n", sin(M_PI));
return 0;
}
This program gives the following output:
1.224647e-16
when the correct answer is of course 0.
How much rounding error can I expect when using trig functions? How can I best handle that error? I'm familiar with the Units in Last Place technique for comparing floating point numbers, from Bruce Dawson's Comparing Floating Point Numbers, but that doesn't seem to work here, since 0 and 1.22e-16 are quite a few ULPs apart.
The answer is only 0 for sin(pi) - did you include all the digits of Pi ?
-Has anyone else noticed a distinct lack of, irony/sense of humour around here?
An IEEE double stores 52 bits of mantissa, with the "implicit leading
one" forming a 53 bit number. An error in the bottom bit of a result
therefore makes up about 1/2^53 of the scale of the numbers. Your output is
of the same order as 1.0, so that comes out to just about exactly one
part in 10^16 (because 53*log(2)/log(10) == 15.9).
So yes. This is about the limit of the precision you can expect. I'm
not sure what the ULP technique you're using is, but I suspect you're
applying it wrong.
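As a rough check of the arithmetic above, the following sketch derives the same figures from <float.h>. The helper names are made up for illustration; DBL_MANT_DIG is a standard macro, and the quoted values assume an IEEE 754 binary64 double.

```c
#include <float.h>
#include <math.h>

/* Relative weight 2^-DBL_MANT_DIG, i.e. 2^-53 for an IEEE 754
   binary64 double (assumed here). */
double bottom_bit_weight(void)
{
    return ldexp(1.0, -DBL_MANT_DIG);
}

/* Approximate number of decimal digits carried by the significand. */
double decimal_digits(void)
{
    return DBL_MANT_DIG * log10(2.0);
}
```

On such a system bottom_bit_weight() is about 1.1e-16, the same order as the 1.224647e-16 printed in the question, and decimal_digits() is a little under 16.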
Sine of π is 0.0.
Sine of M_PI is about 1.224647e-16.
M_PI is not π.
program gives ... 1.224647e-16 when the correct answer is of course 0.
Code gave a correct answer to 7 significant places.
The following does not print the sine of π. It prints the sine of a number close to π, as the digits below show.
π // 3.141592653589793 2384626433832795...
printf("%.21f\n", M_PI); // 3.141592653589793 115998
printf("%.21f\n", sin(M_PI));// 0.000000000000000 122465
Note: the slope of sin(x) is -1.0 at x = π, so sin(M_PI) is approximately the difference π - M_PI, as expected.
am running into problems with rounding errors
The rounding problem occurs when using M_PI to represent π. M_PI is the double constant closest to π, yet since π is irrational and all finite double values are rational, they must differ, even if only by a small amount. So this is not directly a rounding issue with sin(), cos(), or tan(). sin(M_PI) simply exposes the issue that started with using M_PI, an inexact π.
This problem, with different non-zero results of sin(M_PI), also occurs if the code uses a different FP type such as float, long double, or a double with something other than 53 binary bits of precision. This is not a precision issue so much as an irrational/rational one.
@Josh Kelley - ok serious answer.
In general you should never compare the results of operations involving floats or doubles for exact equality.
The only exception is assignment.
float a=10.0;
float b=10.0;
then a==b
Otherwise you always have to write some function like bool IsClose(float a,float b, float error) to allow you to check if two numbers are within 'error' of each other.
Remember to also check signs/use fabs - you could have -1.224647e-16
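A minimal sketch of such a comparison function, using double and an absolute tolerance. The name is_close and its parameters are just illustrative.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true when a and b differ by at most max_error.
   fabs() takes care of the sign of the difference. */
bool is_close(double a, double b, double max_error)
{
    return fabs(a - b) <= max_error;
}
```

With this, is_close(sin(M_PI), 0.0, 1e-12) is true whether the result comes out as 1.224647e-16 or -1.224647e-16. An absolute tolerance suits results expected near zero; for large magnitudes a relative tolerance is usually the better choice.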
There are two sources of error. The sin() function and the approximated value of M_PI. Even if the sin() function were 'perfect', it would not return zero unless the value of M_PI were also perfect - which it is not.
I rather think that will be system-dependent. I don't think the Standard has anything to say on how accurate the transcendental functions will be. Unfortunately, I don't remember seeing any discussion of function precision, so you'll probably have to figure it out yourself.
Unless your program requires significant digits out to the 16th decimal place or more, you probably can do the rounding manually. From my experience programming games we always rounded our decimals to a tolerable significant digit. For example:
#include <math.h>
#include <stdio.h>
/* Round val to the given number of decimal places (halves round up).
   Named round_places so it does not clash with round() from <math.h>. */
double round_places(double val, unsigned places)
{
    double scale = pow(10.0, (double)places);
    return floor(val * scale + 0.5) / scale;
}
int main(void)
{
    /* Named pi rather than M_PI so it cannot clash with the <math.h> macro. */
    const double pi = 2 * acos(0.0);
    printf("Value %f\n", round_places(sin(pi), 10));
    return 0;
}
I get the exact same result on my system - I'd say it is close enough
I would solve the problem by changing the format string to "%f\n" :)
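However, this gives you a "better" result, or at least on my system it prints -3.661369e-245. Be aware that passing a long double to %e is a format mismatch (undefined behavior; %Le is the matching specifier), so the printed value is not meaningful.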
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[]) {
printf("%e\n", (long double)sin(M_PI));
return 0;
}
Maybe too low accuracy of implementation
M_PI = 3.14159265358979323846 (M_PI is not π)
http://fresh2refresh.com/c/c-function/c-math-h-library-functions/
It is an inaccuracy in the implementation; see Stephen C. Steel's comment under Andy Ross's answer above, and chux's answer.
I'd like to understand how to calculate the forward, and backward error of a function using the C double (64bit) type.
For example, how would I identify the forward error of the following function:
double func(double x){
return (pow(x,2.0)/cos(x));
}
If the relative error is known to be = 10^-15.
I know that the forward error is the difference in value between the exact answer f(x), and the computed answer ^f(x).
And the backward error is the difference in value between the value ^x, used to compute ^f(x), and the true value of x that would give the calculated value from ^f(x).
The problem I have is that I have no idea how to calculate these errors in practice.
Thank you.
Sample forward difference using extended precision.
Use volatile to prevent double code from using extended precision calculations.
#include <assert.h>
#include <float.h>
#include <math.h>
long double func_test_forward(volatile double x) {
#ifdef LDBL_DIG
assert(LDBL_DIG > DBL_DIG);
#endif
volatile double y = func(x);
long double ly = powl(x, 2.0)/cosl(x);
return y - ly;
}
func() is a problematic function. Ignore the pow(x,2.0) part as that is well behaved. The rest of the function is 1/cos(x) or secant(x) with poles every odd multiple of π/2.
Assuming a good cos(x), that function will never return 0.0. (Mathematically, only cosine(odd*π/2) returns 0.0 and no double is exactly an odd multiple of π/2 - all finite double are rational, π is not.) But 1/cos(x) will have extreme values for values near odd*π/2, but even so, cos() will have small relative error. In theory: +/-1 ULP.
Along with the pow() and the division, each contributing 0.5 ULP, a good math library gives, for a non-overflowed result, a total error of no more than 2 ULP. Overflow can obviously occur with values of x > sqrt(DBL_MAX).
Now assuming a not so good cos(x), select values near odd*π/2 may simply return 0.0 and give a secant of INF, so the forward error is infinite.
"Argument reduction for huge arguments: Good to the last bit" goes into how well trig functions are calculated.
Attempting to divide two floats in C, using the code below:
#include <stdio.h>
#include <math.h>
int main(){
float fpfd = 122.88e6;
float flo = 10e10;
float int_part, frac_part;
int_part = (int)(flo/fpfd);
frac_part = (flo/fpfd) - int_part;
printf("\nInt_Part = %f\n", int_part);
printf("Frac_Part = %f\n", frac_part);
return(0);
}
To build and run this code, I use the commands:
>> gcc test_prog.c -o test_prog -lm
>> ./test_prog
I then get this output:
Int_Part = 813.000000
Frac_Part = 0.802063
Now, this Frac_Part value appears to be incorrect. I have tried the same equation on a calculator first and then in Wolfram Alpha and they both give me:
Frac_Part = 0.802083
Notice the number at the fifth decimal place is different.
This may seem insignificant to most, but for the calculations I am doing it is of paramount importance.
Can anyone explain to me why the C code is making this error?
When you have inadequate precision from floating point operations, the first most natural step is to just use floating point types of higher precision, e.g. use double instead of float. (As pointed out immediately in the other answers.)
Second, examine the different floating point operations and consider their precisions. The one that stands out to me as being a source of error is the method above of separating a float into integer part and fractional part, by simply casting to int and subtracting. This is not ideal, because, when you subtract the integer part from the original value, you are doing arithmetic where the three numbers involved (two inputs and result) have very different scales, and this will likely lead to precision loss.
I would suggest to use the C <math.h> function modf instead to split floating point numbers into integer and fractional part. http://www.techonthenet.com/c_language/standard_library_functions/math_h/modf.php
(In greater detail: a float has a 24-bit significand, so near 813.8 adjacent float values are 2^-14, about 0.00006, apart. The quotient flo/fpfd is rounded to the nearest of those values, leaving an error of up to about 0.00003. Subtracting the integer part then removes the leading digits, and that same absolute error becomes a much larger share of the small fractional result.)
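A minimal sketch of the modf suggestion, with the quotient formed in double as the first step recommends. The wrapper name split_quotient is illustrative; modf itself is the standard <math.h> function.

```c
#include <math.h>

/* Split num/den into whole and fractional parts with modf.
   The fractional part is returned; the whole part goes to *whole. */
double split_quotient(double num, double den, double *whole)
{
    return modf(num / den, whole);
}
```

Called as split_quotient(10e10, 122.88e6, &whole), this should leave 813 in whole and return a fraction that agrees with 0.802083 to the digits shown by the calculator.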
float is the single-precision floating point type; you should try double instead. The following code gives me the right result:
#include <stdio.h>
#include <math.h>
int main(){
double fpfd = 122.88e6;
double flo = 10e10;
double int_part, frac_part;
int_part = (int)(flo/fpfd);
frac_part = (flo/fpfd) - int_part;
printf("\nInt_Part = %f\n", int_part);
printf("Frac_Part = %f\n", frac_part);
return(0);
}
Why?
As I said, float is single-precision floating point and is smaller than double (on most architectures, sizeof(float) < sizeof(double)).
By using double instead of float you get more bits to store the mantissa and the exponent of the number (see Wikipedia).
float has only 6~9 significant digits, it's not precise enough for most uses in practice. Changing all float variables to double (which provides 15~17 significant digits) gives output:
Int_Part = 813.000000
Frac_Part = 0.802083
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
int n,i,ele;
n=5;
ele=pow(n,2);
printf("%d",ele);
return 0;
}
The output is 24.
I'm using GNU/GCC in Code::Blocks.
What is happening?
I know the pow function returns a double , but 25 fits an int type so why does this code print a 24 instead of a 25? If n=4; n=6; n=3; n=2; the code works, but with the five it doesn't.
Here is what may be happening here. You should be able to confirm this by looking at your compiler's implementation of the pow function:
Assuming you have the correct #include's, (all the previous answers and comments about this are correct -- don't take the #include files for granted), the prototype for the standard pow function is this:
double pow(double, double);
and you're calling pow like this:
pow(5,2);
The pow function goes through an algorithm (probably using logarithms), thus uses floating point functions and values to compute the power value.
The pow function does not go through a naive "multiply the value of x a total of n times", since it has to also compute pow using fractional exponents, and you can't compute fractional powers that way.
So more than likely, the computation of pow using the parameters 5 and 2 resulted in a slight rounding error. When you assigned to an int, you truncated the fractional value, thus yielding 24.
If you are using integers, you might as well write your own "intpow" or similar function that simply multiplies the value the requisite number of times. The benefits of this are:
You won't get into the situation where you may get subtle rounding errors using pow.
Your intpow function will more than likely run faster than an equivalent call to pow.
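A minimal sketch of such a function, doing exactly the repeated multiplication described. The name int_pow and the choice of long long are mine, and there is no overflow checking.

```c
/* Hypothetical intpow: multiplies base by itself exp times.
   No overflow checking is done; the caller must keep the result in range. */
long long int_pow(long long base, unsigned int exp)
{
    long long result = 1;
    while (exp-- > 0) {
        result *= base;
    }
    return result;
}
```

With this, ele = (int)int_pow(n, 2); gives 25 for n = 5, with no floating point involved at any step.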
You want int result from a function meant for doubles.
You should perhaps use
ele=(int)(0.5 + pow(n,2));
/* ^ ^ */
/* casting and rounding */
Floating-point arithmetic is not exact.
Although small values can be added and subtracted exactly, the pow() function normally works by multiplying logarithms, so even if the inputs are both exact, the result is not. Assigning to int always truncates, so if the inexactness is negative, you'll get 24 rather than 25.
The moral of this story is to use integer operations on integers, and be suspicious of <math.h> functions when the actual arguments are to be promoted or truncated. It's unfortunate that GCC doesn't warn unless you add -Wfloat-conversion (it's not in -Wall -Wextra, probably because there are many cases where such conversion is anticipated and wanted).
For integer powers, it's always safer and faster to use multiplication (division if negative) rather than pow() - reserve the latter for where it's needed! Do be aware of the risk of overflow, though.
When you use pow with variables, its result is double. Assigning to an int truncates it.
So you can avoid this error by assigning result of pow to double or float variable.
So basically, pow(x, y) translates to exp(log(x) * y), which produces a result that isn't precisely x^y, just a near approximation as a floating point value. So, for example, 5^2 may come out as something like 24.9999996 or 25.00002.
I am coding a graphical program in C and my cartesian values are in between [-1,1], I am having trouble rounding off values so that I can use them for plotting and further calculations. I know how to round values greater than 1 with decimals but this I haven't done before.
So how would I go about rounding values? For example,
.7213=.7
.7725= .8
.3666667=.4
.25=.2 or .3
.24=.2
Any suggestions would be gladly appreciated. :)
You don't round floating point values (and can't, to any high degree of accuracy, because of how they are stored); you can only output them to different degrees of precision. If you want all your float values rounded to 1 decimal place before using them in calculations, do the calculations with integers, with everything multiplied by 10, and divide by 10 just before you display the result.
In most languages, people often implement such rounding in an ad hoc way using *10, integral rounding, and /10. For example:
$ cat round.c
#include <stdio.h>
#include <stdint.h>
int main()
{
fprintf(stderr, "%f\n", ((double) ((uint64_t) (10*0.7777))) / 10);
return 0;
}
$ gcc round.c
$ ./a.out
0.700000
Suppose I have an irrational number like \sqrt{3}. As it is irrational, it has no finite decimal representation, so when you try to express it with an IEEE 754 double, you introduce an error.
A decimal representation with a lot of digits is:
1.732050807568877293527446341505872366942805253810380628055806979451933016908800037081146186757248575675...
Now, when I calculate \sqrt{3}, I get 1.732051:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
int main() {
double myVar = sqrt (3);
printf("as double:\t%f\n", myVar);
}
According to Wolfram|Alpha, I have an error of 1.11100... × 10^-7.
Is there any way I can calculate the error myself?
(I don't mind switching to C++, Python or Java. I could probably also use Mathematica, if there is no simple alternative)
Just to clarify: I don't want a solution that works only for sqrt{3}. I would like to get a function that gives me the error for any number. If that is not possible, I would at least like to know how Wolfram|Alpha gets more values.
My try
While writing this question, I found this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
#include <float.h> // needed for higher precision
int main() {
long double r = sqrtl(3.0L);
printf("Precision: %d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
}
With this one, I can get the error down to 2.0 * 10^-18 according to Wolfram|Alpha. So I thought this might be close enough to get a good estimation of the error. I wrote this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
#include <float.h>
int main() {
double myVar = sqrt (3);
long double r = sqrtl(3.0L);
long double error = abs(r-myVar) / r;
printf("Double:\t\t%f\n", myVar);
printf("Precision:\t%d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
printf("Error:\t\t%.*Lg\n", LDBL_DIG, error);
}
But it outputs:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 0
How can I fix that to get the error?
What Every Computer Scientist Should Know About Floating-Point Arithmetic by Goldberg is the definitive guide you are looking for.
https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf
printf rounds doubles to 6 places when you use %f without a precision.
e.g.
double x = 1.3;
long double y = 1.3L;
long double err = y - (double) x;
printf("Error %.20Lf\n", err);
My output: -0.00000000000000004445
If the result is 0, your long double and double are the same.
One way to obtain an interval that is guaranteed to contain the real value of the computation is to use interval arithmetic. Then, comparing the double result to the interval tells you how far the double computation is, at worst, from the real computation.
Frama-C's value analysis can do this for you with option -all-rounding-modes.
double Frama_C_sqrt(double x);
double sqrt(double x)
{
return Frama_C_sqrt(x);
}
double y;
int main(){
y = sqrt(3.0);
}
Analyzing the program with:
frama-c -val t.c -float-normal -all-rounding-modes
[value] Values at end of function main:
y ∈ [1.7320508075688772 .. 1.7320508075688774]
This means that the real value of sqrt(3), and thus the value that would be in variable y if the program computed with real numbers, is within the double bounds [1.7320508075688772 .. 1.7320508075688774].
Frama-C's value analysis does not support the long double type, but if I understand correctly, you were only using long double as reference to estimate the error made with double. The drawback of that method is that long double is itself imprecise. With interval arithmetic as implemented in Frama-C's value analysis, the real value of the computation is guaranteed to be within the displayed bounds.
You have a mistake in printing Double: 1.732051 here printf("Double:\t\t%f\n", myVar);
The actual value of double myVar is
1.732050807568877281 //18 digits
so 1.732050807568877281-1.732050807568877281 is zero
According to the C standard printf("%f", d) will default to 6 digits after the decimal point. This is not the full precision of your double.
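One way to convince yourself how many digits are needed is to print a value and read it back. The sketch below does that with snprintf and strtod; round_trips is a hypothetical helper, and the claim that 17 significant digits always suffice assumes an IEEE 754 binary64 double and a correctly rounding C library.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if printing value with the given number of significant digits
   and reading it back recovers exactly the same double, 0 otherwise. */
int round_trips(double value, int digits)
{
    char buf[64];
    snprintf(buf, sizeof buf, "%.*g", digits, value);
    return strtod(buf, NULL) == value;
}
```

For the double nearest sqrt(3), seven significant digits (what the default %f shows here) do not survive the round trip, while 17 do.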
It might be that double and long double happen to be the same on your architecture. I have different sizes for them on my architecture and get a non-zero error in your example code.
You want fabsl instead of abs when calculating the error, at least when using C. (In C, abs is integer.) With this substitution, I get:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 5.79643049346087304e-17
(Calculated on Mac OS X 10.8.3 with Apple clang 4.0.)
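The corrected calculation can be wrapped in a small helper. This is a sketch with a made-up name; it assumes long double is wider than double, since otherwise the reference is no more accurate than the value being checked.

```c
#include <math.h>

/* Relative error of a double result against a long double reference.
   fabsl keeps everything in floating point; abs() would convert to int. */
long double relative_error(double approx, long double reference)
{
    return fabsl(reference - (long double)approx) / fabsl(reference);
}
```

Calling relative_error(sqrt(3.0), sqrtl(3.0L)) and printing the result with %Lg corresponds to the Error line above.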
Using long double to estimate the errors in double is a reasonable approach for a few simple calculations, except:
If you are calculating the more accurate long double results, why bother with double?
Error behavior in sequences of calculations is hard to describe and can grow to the point where long double is not providing an accurate estimate of the exact result.
There exist perverse situations where long double gets less accurate results than double. (Mostly encountered when somebody constructs an example to teach students a lesson, but they exist nonetheless.)
In general, there is no simple and efficient way to calculate the error in a floating-point result in a sequence of calculations. If there were, it would be effectively a means of calculating a more accurate result, and we would use that instead of the floating-point calculations alone.
In special cases, such as when developing math library routines, the errors resulting from a particular sequence of code are studied carefully (and the code is redesigned as necessary to have acceptable error behavior). More often, error is estimated either by performing various “experiments” to see how much results fluctuate with varying inputs or by studying general mathematical behavior of systems.
You also asked “I would like to get a function that gives me the error for any number.” Well, that is easy: given any number x and the calculated result x', the error is exactly x' – x. The actual problem is you probably do not have a description of x that can be used to evaluate that expression easily. In your example, x is sqrt(3). Obviously, then, the error is x' – sqrt(3), and x' is exactly 1.732050807568877193176604123436845839023590087890625. Now all you need to do is evaluate sqrt(3). In other words, numerically evaluating the error is about as hard as numerically evaluating the original number.
Is there some class of numbers you want to perform this analysis for?
Also, do you actually want to calculate the error or just a good bound on the error? The latter is somewhat easier, although it remains hard for sequences of calculations. For all elementary operations, IEEE 754 requires the produced result to be the result that is nearest the mathematically exact result (in the appropriate direction for the rounding mode being used). In round-to-nearest mode, this implies that each result is at most 1/2 ULP (unit of least precision) away from the exact result. For operations such as those found in the standard math library (sine, logarithm, et cetera), most libraries will produce results within a few ULP of the exact result.