I'd like to understand how to calculate the forward and backward error of a function using the C double (64-bit) type.
For example, how would I identify the forward error of the following function:
double func(double x){
return (pow(x,2.0)/cos(x));
}
if the relative error is known to be 10^-15?
I know that the forward error is the difference between the exact answer f(x) and the computed answer ^f(x).
And the backward error is the difference between x and the value ^x for which the exact f(^x) equals the computed ^f(x).
The problem I have is that I have no idea how to calculate these errors in practice.
Thank you.
Sample forward difference using extended precision.
Use volatile to prevent double code from using extended precision calculations.
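#include <assert.h>
#include <float.h>
#include <math.h>
double func(double x);  /* the function from the question */
/* Returns computed minus reference; only meaningful if long double is wider than double. */
long double func_test_forward(volatile double x) {
#ifdef LDBL_DIG
    assert(LDBL_DIG > DBL_DIG);
#endif
    volatile double y = func(x);              /* result in double precision */
    long double ly = powl(x, 2.0L)/cosl(x);   /* reference in extended precision */
    return y - ly;
}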
func() is a problematic function. Ignore the pow(x,2.0) part as that is well behaved. The rest of the function is 1/cos(x) or secant(x) with poles every odd multiple of π/2.
Assuming a good cos(x), cos() will never return 0.0. (Mathematically, only cosine(odd*π/2) is 0.0, and no double is exactly an odd multiple of π/2: all finite doubles are rational, π is not.) 1/cos(x) will have extreme values for x near odd*π/2, but even so, cos() will have small relative error. In theory: +/-1 ULP.
Along with the pow() and the division, each contributing 0.5 ULP, a good math library with a non-overflowed result gives a total of no more than 2 ULP error. Overflow can obviously occur with values of |x| > sqrt(DBL_MAX).
Now assuming a not so good cos(x), select values near odd*π/2 may simply return 0.0, giving a secant of INF, so the forward error is infinite.
"Argument reduction for huge arguments: Good to the last bit" gets into how well trig functions are calculated.
I've got an assignment for FOP to make a scientific calculator, and we haven't been taught about the math.h library! My basic approach for one of the functions, SIN, was this,
but I'm failing to make it work:
#include <stdio.h>
int main()
{
int input;
float pi;
double degree;
double sinx;
long int powerseven;
long int powerfive;
long int powerthree;
input = 5;
degree= (input*pi)/180;
pi=3.142;
powerseven=(degree*degree*degree*degree*degree*degree*degree);
powerfive=(degree*degree*degree*degree*degree);
powerthree=(degree*degree*degree);
sinx = (degree - (powerthree/6) + (powerfive/120) - (powerseven/5040));
printf("%ld", sinx);
getchar();
}
Your code almost works. You have a few problems:
You are using pi before initializing it. I suggest using a more accurate value of pi such as 3.14159265359.
powerseven, powerfive and powerthree should be defined as double instead of as long int. You are losing precision by storing these values in an integer type. Also, when you divide an integer value by an integer value (such as powerthree/6) the remainder is lost. For instance, 9/6 is 1.
Since sinx is a double you should be using printf("%f", sinx);
vacawama covered most of the technical C-language reasons your program isn't working. I'll attempt to cover some algorithmic ones. Using a fixed finite number of Taylor series terms to compute sine is going to lose precision quickly as the argument gets farther away from the point at which you did the series expansion, i.e. zero.
To avoid this problem, you want to use the periodicity of the sine function to reduce your argument to a bounded interval. If your input is in radians, this is actually a difficult problem in itself, since pi is not exactly representable in floating point. But as long as you're working in degrees, you can perform argument reduction by repeatedly subtracting the greatest power-of-two multiple of 360 that's not greater than the argument, until your result is in the interval [0,360). (If you could use the standard library, you could just use fmod for this.)
Once your argument is in a bounded interval, you can just choose an approximation that's sufficiently precise on that interval. A Taylor series approximation is certainly one approach you can use at this point, but not the only one.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
int n,i,ele;
n=5;
ele=pow(n,2);
printf("%d",ele);
return 0;
}
The output is 24.
I'm using GNU/GCC in Code::Blocks.
What is happening?
I know the pow function returns a double, but 25 fits an int type, so why does this code print 24 instead of 25? With n=4, n=6, n=3 or n=2 the code works, but with five it doesn't.
Here is what may be happening here. You should be able to confirm this by looking at your compiler's implementation of the pow function:
Assuming you have the correct #include's, (all the previous answers and comments about this are correct -- don't take the #include files for granted), the prototype for the standard pow function is this:
double pow(double, double);
and you're calling pow like this:
pow(5,2);
The pow function goes through an algorithm (probably using logarithms), thus uses floating point functions and values to compute the power value.
The pow function does not go through a naive "multiply the value of x a total of n times", since it has to also compute pow using fractional exponents, and you can't compute fractional powers that way.
So more than likely, the computation of pow using the parameters 5 and 2 resulted in a slight rounding error. When you assigned to an int, you truncated the fractional value, thus yielding 24.
If you are using integers, you might as well write your own "intpow" or similar function that simply multiplies the value the requisite number of times. The benefits of this are:
You won't get into the situation where you may get subtle rounding errors using pow.
Your intpow function will more than likely run faster than an equivalent call to pow.
You want int result from a function meant for doubles.
You should perhaps use
ele=(int)(0.5 + pow(n,2));
/* ^ ^ */
/* casting and rounding */
Floating-point arithmetic is not exact.
Although small values can be added and subtracted exactly, the pow() function normally works by multiplying logarithms, so even if the inputs are both exact, the result is not. Assigning to int always truncates, so if the inexactness is negative, you'll get 24 rather than 25.
The moral of this story is to use integer operations on integers, and be suspicious of <math.h> functions when the actual arguments are to be promoted or truncated. It's unfortunate that GCC doesn't warn unless you add -Wfloat-conversion (it's not in -Wall -Wextra, probably because there are many cases where such conversion is anticipated and wanted).
For integer powers, it's always safer and faster to use multiplication (division if negative) rather than pow() - reserve the latter for where it's needed! Do be aware of the risk of overflow, though.
When you use pow with variables, its result is double. Assigning to an int truncates it.
So you can avoid this error by assigning result of pow to double or float variable.
So basically, it translates to exp(log(x) * y), which produces a result that isn't precisely the same as x^y, just a near approximation as a floating-point value. So, for example, 5^2 might come out as 24.9999996 or 25.00002.
The following C code
int main(){
int n=10;
int t1=pow(10,2);
int t2=pow(n,2);
int t3=2*pow(n,2);
printf("%d\n",t1);
printf("%d\n",t2);
printf("%d\n",t3);
return (0);
}
gives the following output
100
99
199
I am using a devcpp compiler.
It does not make any sense, right?
Any ideas?
(That pow(10,2) is maybe something
like 99.9999 does not explain the first
output. Moreover, I got the same
output even if I include math.h)
You are using a poor-quality math library. A good math library returns exact results for values that are exactly representable.
Generally, math library routines must be approximations both because floating-point formats cannot exactly represent the exact mathematical results and because computing the various functions is difficult. However, for pow, there are a limited number of results that are exactly representable, such as 10^2. A good math library will ensure that these results are returned correctly. The library you are using fails to do that.
Store the result computations as doubles. Print as double, using %f instead of %d. You will see that the 99 is really more like 99.999997, and this should make more sense.
In general, when working with any floating point math, you should assume results will be approximate; that is, a little off in either direction. So when you want exact results - like you did here - you're going to have trouble.
You should always understand the return type of functions before you use them. See, e.g. cplusplus.com:
double pow (double base, double exponent); /* C90 */
From other answers I understand there are situations when you can expect pow or other floating-point math to be precise. Once you understand the necessary imprecision that plagues floating point math, please consult these.
Your variables t1, t2 and t3 must be of type double because pow() returns double.
But if you do want them to be of type int, use round() function.
int t1 = pow(10,2);
int t2 = round(pow(n,2));
int t3 = 2 * round(pow(n,2));
round() turns a returned value of 99.9... into 100.0, so t2 is 100 and t3 is 200 once converted to int.
The output will be:
100
100
200
This works because the round function returns the integer value nearest to x, rounding halfway cases away from zero, regardless of the current rounding direction.
UPDATE: Here is comment from math.h:
/* Excess precision when using a 64-bit mantissa for FPU math ops can
cause unexpected results with some of the MSVCRT math functions. For
example, unless the function return value is stored (truncating to
53-bit mantissa), calls to pow with both x and y as integral values
sometimes produce a non-integral result. ... */
Suppose I have an irrational number like \sqrt{3}. As it is irrational, it has no finite decimal representation. So when you try to express it with an IEEE 754 double, you will introduce an error.
A decimal representation with a lot of digits is:
1.7320508075688772935274463415058723669428052538103806280558069794519330169088
00037081146186757248575675...
Now, when I calculate \sqrt{3}, I get 1.732051:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
int main() {
double myVar = sqrt (3);
printf("as double:\t%f\n", myVar);
}
According to Wolfram|Alpha, I have an error of 1.11100... × 10^-7.
Is there any way I can calculate the error myself?
(I don't mind switching to C++, Python or Java. I could probably also use Mathematica, if there is no simple alternative)
Just to clarify: I don't want a solution that works only for sqrt{3}. I would like to get a function that gives me the error for any number. If that is not possible, I would at least like to know how Wolfram|Alpha gets more values.
My try
While writing this question, I found this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
#include <float.h> // needed for higher precision
int main() {
long double r = sqrtl(3.0L);
printf("Precision: %d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
}
With this one, I can get the error down to 2.0 * 10^-18 according to Wolfram|Alpha. So I thought this might be close enough to get a good estimation of the error. I wrote this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
#include <float.h>
int main() {
double myVar = sqrt (3);
long double r = sqrtl(3.0L);
long double error = abs(r-myVar) / r;
printf("Double:\t\t%f\n", myVar);
printf("Precision:\t%d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
printf("Error:\t\t%.*Lg\n", LDBL_DIG, error);
}
But it outputs:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 0
How can I fix that to get the error?
"What Every Computer Scientist Should Know About Floating-Point Arithmetic" by Goldberg is the definitive guide you are looking for.
https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf
printf rounds doubles to 6 places when you use %f without a precision.
e.g.
double x = 1.3;
long double y = 1.3L;
long double err = y - (double) x;
printf("Error %.20Lf\n", err);
My output: -0.00000000000000004445
If the result is 0, your long double and double are the same.
One way to obtain an interval that is guaranteed to contain the real value of the computation is to use interval arithmetic. Then, comparing the double result to the interval tells you how far the double computation is, at worst, from the real computation.
Frama-C's value analysis can do this for you with option -all-rounding-modes.
double Frama_C_sqrt(double x);
double sqrt(double x)
{
return Frama_C_sqrt(x);
}
double y;
int main(){
y = sqrt(3.0);
}
Analyzing the program with:
frama-c -val t.c -float-normal -all-rounding-modes
[value] Values at end of function main:
y ∈ [1.7320508075688772 .. 1.7320508075688774]
This means that the real value of sqrt(3), and thus the value that would be in variable y if the program computed with real numbers, is within the double bounds [1.7320508075688772 .. 1.7320508075688774].
Frama-C's value analysis does not support the long double type, but if I understand correctly, you were only using long double as reference to estimate the error made with double. The drawback of that method is that long double is itself imprecise. With interval arithmetic as implemented in Frama-C's value analysis, the real value of the computation is guaranteed to be within the displayed bounds.
The mistake is in how you print the double: printf("Double:\t\t%f\n", myVar); shows only 1.732051.
The actual value of double myVar is
1.732050807568877281 //18 digits
so 1.732050807568877281-1.732050807568877281 is zero
According to the C standard printf("%f", d) will default to 6 digits after the decimal point. This is not the full precision of your double.
It might be that double and long double happen to be the same on your architecture. I have different sizes for them on my architecture and get a non-zero error in your example code.
You want fabsl instead of abs when calculating the error, at least when using C. (In C, abs is integer.) With this substitution, I get:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 5.79643049346087304e-17
(Calculated on Mac OS X 10.8.3 with Apple clang 4.0.)
Using long double to estimate the errors in double is a reasonable approach for a few simple calculations, except:
If you are calculating the more accurate long double results, why bother with double?
Error behavior in sequences of calculations is hard to describe and can grow to the point where long double is not providing an accurate estimate of the exact result.
There exist perverse situations where long double gets less accurate results than double. (Mostly encountered when somebody constructs an example to teach students a lesson, but they exist nonetheless.)
In general, there is no simple and efficient way to calculate the error in a floating-point result in a sequence of calculations. If there were, it would be effectively a means of calculating a more accurate result, and we would use that instead of the floating-point calculations alone.
In special cases, such as when developing math library routines, the errors resulting from a particular sequence of code are studied carefully (and the code is redesigned as necessary to have acceptable error behavior). More often, error is estimated either by performing various “experiments” to see how much results fluctuate with varying inputs or by studying general mathematical behavior of systems.
You also asked “I would like to get a function that gives me the error for any number.” Well, that is easy: given any number x and the calculated result x', the error is exactly x' – x. The actual problem is that you probably do not have a description of x that can be used to evaluate that expression easily. In your example, x is sqrt(3). Obviously, then, the error is x' – sqrt(3), and x' is exactly 1.732050807568877193176604123436845839023590087890625. Now all you need to do is evaluate sqrt(3). In other words, numerically evaluating the error is about as hard as numerically evaluating the original number.
Is there some class of numbers you want to perform this analysis for?
Also, do you actually want to calculate the error or just a good bound on the error? The latter is somewhat easier, although it remains hard for sequences of calculations. For all elementary operations, IEEE 754 requires the produced result to be the result that is nearest the mathematically exact result (in the appropriate direction for the rounding mode being used). In round-to-nearest mode, this implies that each result is at most 1/2 ULP (unit of least precision) away from the exact result. For operations such as those found in the standard math library (sine, logarithm, et cetera), most libraries will produce results within a few ULP of the exact result.
I'm doing some trigonometry calculations in C/C++ and am running into problems with rounding errors. For example, on my Linux system:
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[]) {
printf("%e\n", sin(M_PI));
return 0;
}
This program gives the following output:
1.224647e-16
when the correct answer is of course 0.
How much rounding error can I expect when using trig functions? How can I best handle that error? I'm familiar with the Units in Last Place technique for comparing floating point numbers, from Bruce Dawson's Comparing Floating Point Numbers, but that doesn't seem to work here, since 0 and 1.22e-16 are quite a few ULPs apart.
The answer is only 0 for sin(pi) - did you include all the digits of Pi ?
-Has anyone else noticed a distinct lack of, irony/sense of humour around here?
An IEEE double stores 52 bits of mantissa, with the "implicit leading
one" forming a 53 bit number. An error in the bottom bit of a result
therefore makes up about 1/2^53 of the scale of the numbers. Your output is
of the same order as 1.0, so that comes out to just about exactly one
part in 10^16 (because 53*log(2)/log(10) == 15.9).
So yes. This is about the limit of the precision you can expect. I'm
not sure what the ULP technique you're using is, but I suspect you're
applying it wrong.
Sine of π is 0.0.
Sine of M_PI is about 1.224647e-16.
M_PI is not π.
program gives ... 1.224647e-16 when the correct answer is of course 0.
Code gave a correct answer to 7 significant places.
The following does not print the sine of π. It prints the sine of a number close to π.
π // 3.141592653589793 2384626433832795...
printf("%.21f\n", M_PI); // 3.141592653589793 115998
printf("%.21f\n", sin(M_PI));// 0.000000000000000 122465
Note: With the math function sine(x), the slope of the curve is -1.0 at x = π, so sin(M_PI) is approximately the difference π - M_PI, as expected.
am running into problems with rounding errors
The rounding problem occurs when using M_PI to represent π. M_PI is the double constant closest to π, yet since π is irrational and all finite doubles are rational, they must differ, even if only by a small amount. So this is not a direct rounding issue with sin(), cos(), tan(). sin(M_PI) simply exposed the issue that started with using M_PI, an inexact π.
This problem, with different non-zero results of sin(M_PI), also occurs if code uses a different FP type like float, long double, or a double with something other than 53 binary bits of precision. This is not a precision issue so much as an irrational/rational one.
@Josh Kelley - ok, serious answer.
In general you should never compare the results of any operation involving floats or doubles with each other for exact equality.
The only exception is assignment.
float a=10.0;
float b=10.0;
then a==b
Otherwise you always have to write some function like bool IsClose(float a,float b, float error) to allow you to check if two numbers are within 'error' of each other.
Remember to also check signs/use fabs - you could have -1.224647e-16
There are two sources of error. The sin() function and the approximated value of M_PI. Even if the sin() function were 'perfect', it would not return zero unless the value of M_PI were also perfect - which it is not.
I rather think that will be system-dependent. I don't think the Standard has anything to say on how accurate the transcendental functions will be. Unfortunately, I don't remember seeing any discussion of function precision, so you'll probably have to figure it out yourself.
Unless your program requires significant digits out to the 16th decimal place or more, you probably can do the rounding manually. From my experience programming games we always rounded our decimals to a tolerable significant digit. For example:
#include <math.h>
#include <stdio.h>

/* Named MY_PI and round_places so they do not clash with M_PI and round() from <math.h>. */
static const double MY_PI = 3.14159265358979323846;

/* Round val to the given number of decimal places. */
double round_places(double val, unsigned places)
{
    double scale = pow(10.0, (double)places);
    return round(val * scale) / scale;   /* round() also handles negative values */
}

int main(void)
{
    printf("Value %f\n", round_places(sin(MY_PI), 10));
    return 0;
}
I get the exact same result on my system - I'd say it is close enough
I would solve the problem by changing the format string to "%f\n" :)
However, this gives you a "better" result, or at least on my system it does give -3.661369e-245
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[]) {
printf("%e\n", (long double)sin(M_PI));
return 0;
}
Maybe too low accuracy of implementation
M_PI = 3.14159265358979323846 (M_PI is not π)
http://fresh2refresh.com/c/c-function/c-math-h-library-functions/
It is an inaccuracy in implementation; see Stephen C. Steel's comment under Andy Ross's answer above and chux's answer.