Suppose I have an irrational number like \sqrt{3}. As it is irrational, it has no finite decimal representation. So when you try to express it with an IEEE 754 double, you will introduce an error.
A decimal representation with a lot of digits is:
1.732050807568877293527446341505872366942805253810380628055806979451933016908800037081146186757248575675...
Now, when I calculate \sqrt{3}, I get 1.732051:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
int main() {
double myVar = sqrt (3);
printf("as double:\t%f\n", myVar);
}
According to Wolfram|Alpha, I have an error of 1.11100... × 10^-7.
Is there any way I can calculate the error myself?
(I don't mind switching to C++, Python or Java. I could probably also use Mathematica, if there is no simple alternative)
Just to clarify: I don't want a solution that works only for \sqrt{3}. I would like to get a function that gives me the error for any number. If that is not possible, I would at least like to know how Wolfram|Alpha gets more digits.
My try
While writing this question, I found this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
#include <float.h> // needed for higher precision
int main() {
long double r = sqrtl(3.0L);
printf("Precision: %d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
}
With this one, I can get the error down to 2.0 * 10^-18 according to Wolfram|Alpha. So I thought this might be close enough to get a good estimation of the error. I wrote this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
#include <float.h>
int main() {
double myVar = sqrt (3);
long double r = sqrtl(3.0L);
long double error = abs(r-myVar) / r;
printf("Double:\t\t%f\n", myVar);
printf("Precision:\t%d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
printf("Error:\t\t%.*Lg\n", LDBL_DIG, error);
}
But it outputs:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 0
How can I fix that to get the error?
What Every Computer Scientist Should Know About Floating-Point Arithmetic by Goldberg is the definitive guide you are looking for.
https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf
printf rounds doubles to 6 places when you use %f without a precision.
e.g.
double x = 1.3;
long double y = 1.3L;
long double err = y - (double) x;
printf("Error %.20Lf\n", err);
My output: -0.00000000000000004445
If the result is 0, your long double and double are the same.
One way to obtain an interval that is guaranteed to contain the real value of the computation is to use interval arithmetic. Then, comparing the double result to the interval tells you how far the double computation is, at worst, from the real computation.
Frama-C's value analysis can do this for you with option -all-rounding-modes.
double Frama_C_sqrt(double x);
double sqrt(double x)
{
return Frama_C_sqrt(x);
}
double y;
int main(){
y = sqrt(3.0);
}
Analyzing the program with:
frama-c -val t.c -float-normal -all-rounding-modes
[value] Values at end of function main:
y ∈ [1.7320508075688772 .. 1.7320508075688774]
This means that the real value of sqrt(3), and thus the value that would be in variable y if the program computed with real numbers, is within the double bounds [1.7320508075688772 .. 1.7320508075688774].
Frama-C's value analysis does not support the long double type, but if I understand correctly, you were only using long double as reference to estimate the error made with double. The drawback of that method is that long double is itself imprecise. With interval arithmetic as implemented in Frama-C's value analysis, the real value of the computation is guaranteed to be within the displayed bounds.
The line Double: 1.732051 is misleading: it comes from printf("Double:\t\t%f\n", myVar); which prints only 6 decimal places.
The actual value of double myVar is
1.732050807568877281 //18 digits
so 1.732050807568877281-1.732050807568877281 is zero
According to the C standard printf("%f", d) will default to 6 digits after the decimal point. This is not the full precision of your double.
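To see every digit that the double actually carries, a precision can be given explicitly. The sketch below uses "%.17g", since 17 significant digits are enough for an IEEE 754 double to survive a round trip through text. The helper names format_double_full and round_trips are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format x with 17 significant digits, enough for
   an IEEE 754 double to be recovered exactly from the text. */
int format_double_full(double x, char *buf, size_t size) {
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%.17g", x);
}

/* Returns 1 if parsing the formatted text gives back exactly x. */
int round_trips(double x) {
    char buf[64];
    format_double_full(x, buf, sizeof buf);
    return strtod(buf, NULL) == x;
}
```

Formatting myVar this way, instead of with a bare %f, shows the digits beyond the sixth decimal place that the question's output hides.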
It might be that double and long double happen to be the same on your architecture. I have different sizes for them on my architecture and get a non-zero error in your example code.
You want fabsl instead of abs when calculating the error, at least when using C. (In C, abs is integer.) With this substitution, I get:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 5.79643049346087304e-17
(Calculated on Mac OS X 10.8.3 with Apple clang 4.0.)
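A minimal sketch of that substitution, written as a reusable function, could look like this. The name relative_error is hypothetical. The estimate is only as good as the long double reference, and it is exactly 0 on platforms where long double and double share the same format.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: relative error of a double against a long double
   reference. fabsl operates on long double; abs would convert its
   argument to int and discard the tiny difference. */
long double relative_error(double approx, long double reference) {
    assert(reference != 0.0L);
    return fabsl(reference - (long double)approx) / fabsl(reference);
}
```

With the question's values this would be called as relative_error(sqrt(3), sqrtl(3.0L)) and printed with "%.*Lg".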
Using long double to estimate the errors in double is a reasonable approach for a few simple calculations, except:
If you are calculating the more accurate long double results, why bother with double?
Error behavior in sequences of calculations is hard to describe and can grow to the point where long double is not providing an accurate estimate of the exact result.
There exist perverse situations where long double gets less accurate results than double. (Mostly encountered when somebody constructs an example to teach students a lesson, but they exist nonetheless.)
In general, there is no simple and efficient way to calculate the error in a floating-point result in a sequence of calculations. If there were, it would be effectively a means of calculating a more accurate result, and we would use that instead of the floating-point calculations alone.
In special cases, such as when developing math library routines, the errors resulting from a particular sequence of code are studied carefully (and the code is redesigned as necessary to have acceptable error behavior). More often, error is estimated either by performing various “experiments” to see how much results fluctuate with varying inputs or by studying general mathematical behavior of systems.
You also asked “I would like to get a function that gives me the error for any number.” Well, that is easy: given any number x and the calculated result x', the error is exactly x' – x. The actual problem is that you probably do not have a description of x that can be used to evaluate that expression easily. In your example, x is sqrt(3). Obviously, then, the error is x' – sqrt(3), and x' is exactly 1.732050807568877193176604123436845839023590087890625. Now all you need to do is evaluate sqrt(3). In other words, numerically evaluating the error is about as hard as numerically evaluating the original number.
Is there some class of numbers you want to perform this analysis for?
Also, do you actually want to calculate the error or just a good bound on the error? The latter is somewhat easier, although it remains hard for sequences of calculations. For all elementary operations, IEEE 754 requires the produced result to be the result that is nearest the mathematically exact result (in the appropriate direction for the rounding mode being used). In round-to-nearest mode, this implies that each result is at most 1/2 ULP (unit of least precision) away from the exact result. For operations such as those found in the standard math library (sine, logarithm, et cetera), most libraries will produce results within a few ULP of the exact result.
Related
I am implementing a fractional delay line algorithm.
One of the tasks involved is the decomposition of a floating-point value into its integral and fractional part.
I know there are a lot of posts about this topic on SO and I probably read most of them.
However I haven’t found one post that deals with the specifics of this scenario.
The algorithm must be using 64-bit floating-point values.
Input floating-point values are guaranteed to always be positive. (delay times cannot be negative)
The output integer part has to be represented by an integer datatype.
The integer datatype must have enough bits so that the double-to-integer conversion occurs without the risk of overflowing.
Issues resulting from floating-point values lacking an exact internal representation must be avoided.
(i.e. 9223372036854775809.0 might be internally represented as 9223372036854775808.9999998 and when cast to integer it erroneously becomes 9223372036854775808)
The implementation should work regardless of rounding mode or compiler optimization settings.
So I wrote a function:
double my_modf(double x, int64_t *intPartOut);
As you can see its signature is similar to the modf() function in the C standard library.
The first implementation I came up with is:
double my_modf(double x, int64_t *intPartOut)
{
double y;
double fracPart = modf(x, &y);
*intPartOut = (int64_t)y;
return fracPart;
}
I have also been experimenting with this implementation which - at least on my machine - runs faster than the previous, however I doubt its robustness.
double my_modf(double x, int64_t *intPartOut)
{
int64_t y = (int64_t)x;
*intPartOut = y;
return x - y;
}
...and this is my latest attempt:
double my_modf(double x, int64_t *intPartOut)
{
*intPartOut = llround(x);
return x - floor(x);
}
I can't make up my mind as to which implementation would be best to use, or if there are other implementations that I haven't considered that would better accomplish the following goals.
I am looking for the (1) most robust and (2) most efficient implementation to decompose a floating-point number into its integral and fractional part, keeping into consideration the list of points mentioned above.
Given that the maximum value of the integer part of the floating-point input x is 2^63-1 and that x is non-negative, then both:
double my_modf(double x, int64_t *intPartOut)
{
double y;
double fracPart = modf(x, &y);
*intPartOut = y;
return fracPart;
}
and:
double my_modf(double x, int64_t *intPartOut)
{
int64_t y = x;
*intPartOut = y;
return x - y;
}
will correctly return the integer part in intPartOut and the fractional part in the return value regardless of rounding mode.
GCC 9.2 for x86_64 does a better job optimizing the latter version, and so does Apple Clang 11.0.0.
llround will not return the integer part as desired because it rounds to the nearest integer rather than truncating.
Issues about x containing errors cannot be resolved with the information provided in the question. The routines shown above have no error; they return exactly the integer and fractional parts of their input.
Updated answer after reading your comment below.
If you are already sure the values are within [0, 2^63-1] then a simple cast will be faster than llround() since this function may also check for overflow (on my system, the manual page states so, however the C standard does not require it).
On my machine for example (x86-64 Nehalem) casting is a single instruction (cvttsd2si) and llround() is obviously more than one.
Am I guaranteed to get the right result with a simple cast (truncation) or is it safer to round?
Depends on what you mean with "right". If the value in the double can be correctly represented by an int64_t, then sure you're going to get exactly the same value. However, if the value cannot be precisely represented by the double then truncation is automatically performed when casting. If you want to round the value in a different way that's another story and you'll have to use one of ceil(), floor() or round().
If you also are sure that no values will be +/- Infinity or NaN (and in that case you can use -Ofast), then your second implementation should be the fastest if you want truncation, while the third should be the fastest if you want to floor() the value.
I'd like to understand how to calculate the forward, and backward error of a function using the C double (64bit) type.
For example, how would I identify the forward error of the following function:
double func(double x){
return (pow(x,2.0)/cos(x));
}
If the relative error is known to be = 10^-15.
I know that the forward error is the difference in value between the exact answer f(x), and the computed answer ^f(x).
And the backward error is the difference in value between the value ^x, used to compute ^f(x), and the true value of x that would give the calculated value from ^f(x).
The problem I have is that I have no idea how to calculate these errors in practice.
Thank you.
Sample forward difference using extended precision.
Use volatile to prevent double code from using extended precision calculations.
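#include <assert.h>
#include <float.h>
#include <math.h>
double func(double x); // the function under test, defined in the question
// Returns the double result minus the long double reference result.
long double func_test_forward(volatile double x) {
#ifdef LDBL_DIG
assert(LDBL_DIG > DBL_DIG); // long double must be wider than double
#endif
volatile double y = func(x);
long double ly = powl(x, 2.0L)/cosl(x);
return y - ly;
}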
func() is a problematic function. Ignore the pow(x,2.0) part as that is well behaved. The rest of the function is 1/cos(x) or secant(x), with poles at every odd multiple of π/2.
Assuming a good cos(x), that function will never return 0.0. (Mathematically, only cosine(odd*π/2) returns 0.0 and no double is exactly an odd multiple of π/2 - all finite doubles are rational, π is not.) 1/cos(x) will have extreme values for x near odd*π/2, but even so, cos() will have small relative error. In theory: +/-1 ULP.
Along with the pow() and the division, each contributing 0.5 ULP, a good math library with a non-overflowed result gives a total of no more than 2 ULP error. Overflow can obviously occur with values of x > sqrt(DBL_MAX).
Now assuming a not so good cos(x), select values near odd*π/2 may simply return 0.0 and a secant of INF, so the forward error is infinity.
Argument reduction for huge arguments: Good to the last bit gets into how well trig functions are calculated.
I've got an assignment for FOP to make a scientific calculator, and we haven't been taught about the math.h library! My basic approach for one of the functions, SIN, was this, but I'm failing to make it work:
#include <stdio.h>
int main()
{
int input;
float pi;
double degree;
double sinx;
long int powerseven;
long int powerfive;
long int powerthree;
input = 5;
degree= (input*pi)/180;
pi=3.142;
powerseven=(degree*degree*degree*degree*degree*degree*degree);
powerfive=(degree*degree*degree*degree*degree);
powerthree=(degree*degree*degree);
sinx = (degree - (powerthree/6) + (powerfive/120) - (powerseven/5040));
printf("%ld", sinx);
getchar();
}
Your code almost works. You have a few problems:
You are using pi before initializing it. I suggest using a more accurate value of pi such as 3.14159265359.
powerseven, powerfive and powerthree should be defined as double instead of as long int. You are losing precision by storing these values in an integer type. Also, when you divide an integer value by an integer value (such as powerthree/6) the remainder is lost. For instance, 9/6 is 1.
Since sinx is a double you should be using printf("%f", sinx);
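Putting those fixes together, a corrected version of the calculation might look like the sketch below. The function name sin_degrees_taylor and the input range check are assumptions added for illustration. It keeps the question's four-term series and avoids math.h.

```c
#include <assert.h>

/* Sketch: sine of an angle in degrees from the first four Taylor terms
   x - x^3/3! + x^5/5! - x^7/7!, using only double arithmetic. */
double sin_degrees_taylor(double degrees) {
    assert(degrees >= -90.0 && degrees <= 90.0); /* series is accurate only near zero */
    const double pi = 3.14159265359;   /* initialized before it is used */
    double x = degrees * pi / 180.0;   /* degrees to radians */
    double x2 = x * x;
    double power3 = x * x2;            /* powers stay double, no truncation */
    double power5 = power3 * x2;
    double power7 = power5 * x2;
    return x - power3 / 6.0 + power5 / 120.0 - power7 / 5040.0;
}
```

The result would be printed with printf("%f\n", sin_degrees_taylor(5.0)); for the question's input of 5 degrees.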
vacawama covered most of the technical C-language reasons your program isn't working. I'll attempt to cover some algorithmic ones. Using a fixed finite number of Taylor series terms to compute sine is going to lose precision quickly as the argument gets farther away from the point at which you did the series expansion, i.e. zero.
To avoid this problem, you want to use the periodicity of the sine function to reduce your argument to a bounded interval. If your input is in radians, this is actually a difficult problem in itself, since pi is not representable in floating point. But as long as you're working in degrees, you can perform argument reduction by repeatedly subtracting the greatest power-of-two multiple of 360 that's less than the argument, until your result is in the interval [0,360). (If you could use the standard library, you could just use fmod for this.)
Once your argument is in a bounded interval, you can just choose an approximation that's sufficiently precise on that interval. A Taylor series approximation is certainly one approach you can use at this point, but not the only one.
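The reduction in degrees described above can be sketched without math.h as follows. The name reduce_degrees is hypothetical, and the sketch assumes a finite, non-negative input.

```c
#include <assert.h>
#include <float.h>

/* Sketch: reduce a non-negative angle in degrees to [0, 360) by
   repeatedly subtracting the largest power-of-two multiple of 360 that
   does not exceed it. Each subtraction is exact, because the multiple
   is at least half of the current value. */
double reduce_degrees(double degrees) {
    assert(degrees >= 0.0 && degrees <= DBL_MAX);   /* finite, non-negative */
    while (degrees >= 360.0) {
        double multiple = 360.0;
        while (multiple * 2.0 <= degrees) {
            multiple *= 2.0;
        }
        degrees -= multiple;
    }
    return degrees;
}
```

For example, reduce_degrees(725.0) subtracts 720 once and returns 5.0.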
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
int n,i,ele;
n=5;
ele=pow(n,2);
printf("%d",ele);
return 0;
}
The output is 24.
I'm using GNU/GCC in Code::Blocks.
What is happening?
I know the pow function returns a double, but 25 fits an int type, so why does this code print 24 instead of 25? With n=4, n=6, n=3 or n=2 the code works, but with five it doesn't.
Here is what may be happening here. You should be able to confirm this by looking at your compiler's implementation of the pow function:
Assuming you have the correct #include's, (all the previous answers and comments about this are correct -- don't take the #include files for granted), the prototype for the standard pow function is this:
double pow(double, double);
and you're calling pow like this:
pow(5,2);
The pow function goes through an algorithm (probably using logarithms), thus uses floating point functions and values to compute the power value.
The pow function does not go through a naive "multiply the value of x a total of n times", since it has to also compute pow using fractional exponents, and you can't compute fractional powers that way.
So more than likely, the computation of pow using the parameters 5 and 2 resulted in a slight rounding error. When you assigned to an int, you truncated the fractional value, thus yielding 24.
If you are using integers, you might as well write your own "intpow" or similar function that simply multiplies the value the requisite number of times. The benefits of this are:
You won't get into the situation where you may get subtle rounding errors using pow.
Your intpow function will more than likely run faster than an equivalent call to pow.
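A minimal sketch of such a function, assuming a non-negative exponent and a result that fits in long long, could be:

```c
#include <assert.h>

/* Hypothetical helper: base raised to a non-negative integer exponent
   by repeated multiplication. No overflow check is performed. */
long long intpow(long long base, int exponent) {
    assert(exponent >= 0);
    long long result = 1;
    for (int i = 0; i < exponent; i++) {
        result *= base;
    }
    return result;
}
```

With this, ele = (int)intpow(n, 2); gives 25 for n = 5 without involving floating point at all.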
You want int result from a function meant for doubles.
You should perhaps use
ele=(int)(0.5 + pow(n,2));
/* ^ ^ */
/* casting and rounding */
Floating-point arithmetic is not exact.
Although small values can be added and subtracted exactly, the pow() function normally works by multiplying logarithms, so even if the inputs are both exact, the result is not. Assigning to int always truncates, so if the inexactness is negative, you'll get 24 rather than 25.
The moral of this story is to use integer operations on integers, and be suspicious of <math.h> functions when the actual arguments are to be promoted or truncated. It's unfortunate that GCC doesn't warn unless you add -Wfloat-conversion (it's not in -Wall -Wextra, probably because there are many cases where such conversion is anticipated and wanted).
For integer powers, it's always safer and faster to use multiplication (division if negative) rather than pow() - reserve the latter for where it's needed! Do be aware of the risk of overflow, though.
When you use pow with variables, its result is double. Assigning to an int truncates it.
So you can avoid this error by assigning result of pow to double or float variable.
So basically, it translates to exp(log(x) * y), which will produce a result that isn't precisely the same as x^y, just a near approximation as a floating point value. So for example 5^2 will become 24.9999996 or 25.00002.
The following C code
int main(){
int n=10;
int t1=pow(10,2);
int t2=pow(n,2);
int t3=2*pow(n,2);
printf("%d\n",t1);
printf("%d\n",t2);
printf("%d\n",t3);
return (0);
}
gives the following output
100
99
199
I am using a devcpp compiler.
It does not make any sense, right?
Any ideas?
(That pow(10,2) is maybe something
like 99.9999 does not explain the first
output. Moreover, I got the same
output even if I include math.h)
You are using a poor-quality math library. A good math library returns exact results for values that are exactly representable.
Generally, math library routines must be approximations both because floating-point formats cannot exactly represent the exact mathematical results and because computing the various functions is difficult. However, for pow, there are a limited number of results that are exactly representable, such as 10^2. A good math library will ensure that these results are returned correctly. The library you are using fails to do that.
Store the results of the computations as doubles. Print as double, using %f instead of %d. You will see that the 99 is really more like 99.999997, and this should make more sense.
In general, when working with any floating point math, you should assume results will be approximate; that is, a little off in either direction. So when you want exact results - like you did here - you're going to have trouble.
You should always understand the return type of functions before you use them. See, e.g. cplusplus.com:
double pow (double base, double exponent); /* C90 */
From other answers I understand there are situations when you can expect pow or other floating-point math to be precise. Once you understand the necessary imprecision that plagues floating point math, please consult these.
Your variables t1, t2 and t3 must be of type double because pow() returns double.
But if you do want them to be of type int, use round() function.
int t1 = pow(10,2);
int t2 = round(pow(n,2));
int t3 = 2 * round(pow(n,2));
It rounds the returned value 99.9... to 100.0 before the conversion. Then t2 == 100 because it is of type int, and t3 == 200.
The output will be:
100
100
200
This works because the round function returns the integer value nearest to x, rounding half-way cases away from zero, regardless of the current rounding direction.
UPDATE: Here is comment from math.h:
/* Excess precision when using a 64-bit mantissa for FPU math ops can
cause unexpected results with some of the MSVCRT math functions. For
example, unless the function return value is stored (truncating to
53-bit mantissa), calls to pow with both x and y as integral values
sometimes produce a non-integral result. ... */