Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>
double feval(double x) {
/* Insert your code here */
if (x > 1e299)
{;
return x-1;
}
if (x < 1e-6)
{
long double g;
g = x;
printf("x = %Lf\n", g);
long double a;
a = pow(x,2);
printf("x squared = %Lf\n", a);
return sqrt(g*g+1.)- 1.;
}
else
{
printf("x = %f\n", x);
printf("Used third \n");
return sqrt(pow(x,2)+1.)-1;
}
}
int main(void)
{
double x;
printf("Input: ");
scanf("%lf", &x);
double b;
b = feval(x);
printf("%f\n", b);
return 0;
}

For small inputs, you're getting rounding error when you compute 1+x^2. If x=1e-7f, x*x will happily fit into a 32-bit floating point number (with a little bit of error, because 1e-7 does not have an exact floating point representation), but x*x will be so much smaller than 1 that floating point precision will not be sufficient to represent 1+x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.

There are two issues when implementing this in the naive way: overflow or underflow in the intermediate computation of x * x, and subtractive cancellation during the final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot (x, y) that performs the computation sqrt (x * x + y * y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fix issues with subtractive cancellation is to transform the computation algebraically such that it is transformed into multiplications and / or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
return (x / (1.0f + hypotf (x, 1.0f))) * x;
}

A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1)
/(sqrt(x*x+1)+1)
= (x*x+1-1) / (sqrt(x*x+1)+1)
= x*x/(sqrt(x*x+1)+1)
The last formula can be used as an implementation. For very small x, sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be exactly 2), but we don't lose precision in evaluating it.

The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.

Related

How to find remainder of a double in C? Modulo only works for integers

This is what I've found so far online,
int main(void)
{
long a = 12345;
int b = 10;
int remain = a - (a / b) * b;
printf("%i\n", remain);
}
First I wonder how the formula works. Maybe I can't do math, but the priority of operations here seems a bit odd. If I run this code, the expected answer of 5 is printed. But I don't get how (a / b) * b doesn't cancel out to 'a', leading to a - a = 0.
Now, this only works for int and long; as soon as doubles are involved it doesn't work anymore. Can anyone tell me why? Is there an alternative to modulo that works for double?
Also, I'm not sure I understand up to what value a long can go. I found online that the upper limit was 2147483647, but when I input bigger numbers such as the one in 'a', the code runs without any issue up to a certain point...
Thanks for your help I'm new to coding and trying to learn!

Given two finite double numbers x and y, with y not equal to zero, fmod(x, y) produces the remainder of x when divided by y. Specifically, it returns x − ny, where n is an integer chosen so that x − ny has the same sign as x and is smaller in magnitude than y. (So, if x is positive, 0 ≤ fmod(x, y) < |y|, and, if x is negative, −|y| < fmod(x, y) ≤ 0.)
fmod is declared in <math.h>.
A properly implemented fmod returns an exact result; there is no floating-point error, since the specified result is always representable.
The C standard also specifies remquo, which returns the remainder and some low bits (at least three) of the quotient n, and remainder; both use a variation on the definition of the remainder. It also specifies variants of these functions for float and long double.

A naive implementation. It has limited range (x must fit in a long long) and adds additional floating point imprecision (as it does some arithmetic):
#include <stdio.h>
double naivemod(double x)
{
return x - (long long)x;
}
int main(void)
{
printf("%.50f\n", naivemod(345345.567567756));
printf("%.50f\n", naivemod(.0));
printf("%.50f\n", naivemod(10.5));
printf("%.50f\n", naivemod(-10.0/3));
}

Power function giving different answer than math.pow function in C

I was trying to write a program to calculate the value of x^n using a while loop:
#include <stdio.h>
#include <math.h>
int main()
{
float x = 3, power = 1, copyx;
int n = 22, copyn;
copyx = x;
copyn = n;
while (n)
{
if ((n % 2) == 1)
{
power = power * x;
}
n = n / 2;
x *= x;
}
printf("%g^%d = %f\n", copyx, copyn, power);
printf("%g^%d = %f\n", copyx, copyn, pow(copyx, copyn));
return 0;
}
Up until the value of 15 for n, the answer from my created function and the pow function (from math.h) gives the same value; but, when the value of n exceeds 15, then it starts giving different answers.
I cannot understand why there is a difference in the answer. Is it that I have written the function in the wrong way or it is something else?

You are mixing up two different types of floating-point data. The pow function uses the double type but your loop uses the float type (which has less precision).
You can make the results coincide by either using the double type for your x, power and copyx variables, or by calling the powf function (which uses the float type) instead of pow.
The latter adjustment (using powf) gives the following output (clang-cl compiler, Windows 10, 64-bit):
3^22 = 31381059584.000000
3^22 = 31381059584.000000
And, changing the first line of your main to double x = 3, power = 1, copyx; gives the following:
3^22 = 31381059609.000000
3^22 = 31381059609.000000
Note that, with larger and larger values of n, you are increasingly likely to get divergence between the results of your loop and the value calculated using the pow or powf library functions. On my platform, the double version gives the same results, right up to the point where the value overflows the range and becomes Infinity. However, the float version starts to diverge around n = 55:
3^55 = 174449198498104595772866560.000000
3^55 = 174449216944848669482418176.000000
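The two variants described above can be placed side by side. This is a sketch with hypothetical names (ipow_float, ipow_double); both use the same square-and-multiply loop as the question and differ only in the floating-point type.

```c
/* Square-and-multiply in single precision, as in the question. */
float ipow_float(float x, unsigned n)
{
    float power = 1.0f;
    while (n) {
        if (n & 1u)
            power *= x;   /* this bit of the exponent is set */
        n >>= 1;
        x *= x;           /* x, x^2, x^4, x^8, ... */
    }
    return power;
}

/* The same loop in double precision. */
double ipow_double(double x, unsigned n)
{
    double power = 1.0;
    while (n) {
        if (n & 1u)
            power *= x;
        n >>= 1;
        x *= x;
    }
    return power;
}
```

For 3^22 the float version yields 31381059584 and the double version yields 31381059609, matching the outputs shown above.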

When I run your code I get this:
3^22 = 31381059584.000000
3^22 = 31381059609.000000
This would be because pow returns a double but your code uses float. When I changed to powf I got identical results:
3^22 = 31381059584.000000
3^22 = 31381059584.000000
So simply use double everywhere if you need high resolution results.

Floating point math is imprecise (and float is worse than double, having even fewer bits to store the data in; using double might delay the imprecision longer). The pow function (usually) uses an exponentiation algorithm that minimizes precision loss, and/or delegates to a chip-level instruction that may do stuff more efficiently, more precisely, or both. There could be more than one implementation of pow too, depending on whether you tell the compiler to use strictly conformant floating point math, the fastest possible, the hardware instruction, etc.
Your code is fine (though using double would get more precise results), but matching the improved precision of math.h's pow is non-trivial; by the time you've done so, you'll have reinvented it. That's why you use the library function.
That said, for logically integer math as you're using here, precision loss from your algorithm likely doesn't matter, it's purely the float vs. double issue where you lose precision from the type itself. As a rule, default to using double, and only switch to float if you're 100% sure you don't need the precision and can't afford the extra memory/computation cost of double.

Precision
float x = 3, power = 1; ... power = power * x forms a float product.
pow(x, y) forms a double result and good implementations internally use even wider math.
OP's loop method incurs rounded results after the 15th iteration. These roundings slowly compound the inaccuracy of the final result.
3^16 is a 26-bit odd number.
float encodes all odd numbers exactly until typically 2^24. Larger values are all even and of only 24 significant binary digits.
double encodes all odd numbers exactly until typically 2^53.
To do a fair comparison, use:
double objects and pow() or
float objects and powf().
For large powers, the pow(f)() function is certain to provide better answers than a loop, as such functions often use internally extended precision and well-managed rounding vs. the loop approach.
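The limits on exactly representable integers can be checked with a round trip through each type. The helper names below are hypothetical, and the sketch assumes IEEE-754 float and double and inputs well inside the long long range.

```c
/* Returns 1 if the integer v survives a round trip through float unchanged. */
int float_holds_exactly(long long v)
{
    float f = (float)v;          /* rounds to 24 significant bits */
    return (long long)f == v;
}

/* Returns 1 if the integer v survives a round trip through double unchanged. */
int double_holds_exactly(long long v)
{
    double d = (double)v;        /* rounds to 53 significant bits */
    return (long long)d == v;
}
```

Here 3^16 = 43046721 fails the float check but passes the double check, which is why the float loop starts to round once x reaches 3^16.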

What is a more accurate algorithm I can use to calculate the sine of a number?

I have this code that calculates a guess for sine and compares it to the standard C library's (glibc's in my case) result:
#include <stdio.h>
#include <math.h>
double double_sin(double a)
{
a -= (a*a*a)/6;
return a;
}
int main(void)
{
double clib_sin = sin(.13),
my_sin = double_sin(.13);
printf("%.16f\n%.16f\n%.16f\n", clib_sin, my_sin, clib_sin-my_sin);
return 0;
}
The accuracy for double_sin is poor (about 5-6 digits). Here's my output:
0.1296341426196949
0.1296338333333333
0.0000003092863615
As you can see, after .12963, the results differ.
Some notes:
I don't think the Taylor series will work for this specific situation, the factorials required for greater accuracy aren't able to be stored inside an unsigned long long.
Lookup tables are not an option, they take up too much space and generally don't provide any information on how to calculate the result.
If you use magic numbers, please explain them (although I would prefer if they were not used).
I would greatly prefer an algorithm is easily understandable and able to be used as a reference over one that is not.
The result does not have to be perfectly accurate. A minimum would be the requirements of IEEE 754, C, and/or POSIX.
I'm using the IEEE-754 double format, which can be relied on.
The range supported needs to be at least from -2*M_PI to 2*M_PI. It would be nice if range reduction were included.
What is a more accurate algorithm I can use to calculate the sine of a number?
I had an idea about something similar to Newton-Raphson, but for calculating sine instead. However, I couldn't find anything on it and am ruling this possibility out.

You can actually get pretty close with the Taylor series. The trick is not to calculate the full factorial on each iteration.
The Taylor series looks like this:
sin(x) = x^1/1! - x^3/3! + x^5/5! - x^7/7! + ...
Looking at the terms, you calculate the next term by multiplying the numerator by x^2, multiplying the denominator by the next two numbers in the factorial, and switching the sign. Then you stop when adding the next term doesn't change the result.
So you could code it like this:
double double_sin(double x)
{
double result = 0;
double factor = x;
int i;
for (i=2; result+factor!=result; i+=2) {
result += factor;
factor *= -(x*x)/(i*(i+1));
}
return result;
}
My output:
0.1296341426196949
0.1296341426196949
-0.0000000000000000
EDIT:
The accuracy can be increased further if the terms are added in the reverse direction, however this means computing a fixed number of terms:
#define FACTORS 30
double double_sin(double x)
{
double result = 0;
double factor = x;
int i, j;
double factors[FACTORS];
for (i=2, j=0; j<FACTORS; i+=2, j++) {
factors[j] = factor;
factor *= -(x*x)/(i*(i+1));
}
for (j=FACTORS-1;j>=0;j--) {
result += factors[j];
}
return result;
}
This implementation loses accuracy if x falls outside the range of 0 to 2*PI. This can be fixed by calling x = fmod(x, 2*M_PI); at the start of the function to normalize the value.

Taylor Series to calculate cosine (getting output -0.000 for cosine(90))

I have written the following function for the Taylor series to calculate cosine.
double cosine(int x) {
x %= 360; // make it less than 360
double rad = x * (PI / 180);
double cos = 0;
int n;
for(n = 0; n < TERMS; n++) {
cos += pow(-1, n) * pow(rad, 2 * n) / fact(2 * n);
}
return cos;
}
My issue is that when I input 90 I get the answer -0.000000. (Why am I getting -0.000 instead of 0.000?)
Can anybody explain why and how i can solve this issue?
I think it's due to the precision of double.
Here is the main() :
int main(void){
int y;
//scanf("%d",&y);
y=90;
printf("sine(%d)= %lf\n",y, sine(y));
printf("cosine(%d)= %lf\n",y, cosine(y));
return 0;
}

It's totally expected that you will not be able to get exact zero outputs for cosine of anything with floating point, regardless of how good your approach to computing it is. This is fundamental to how floating point works.
The mathematical zeros of cosine are odd multiples of pi/2. Because pi is irrational, it's not exactly representable as a double (or any floating point form), and the difference between the nearest neighboring values that are representable is going to be at least pi/2 times DBL_EPSILON, roughly 3e-16 (or corresponding values for other floating point types). For some odd multiples of pi/2, you might "get lucky" and find that it's really close to one of the two neighbors, but on average you're going to find it's about 1e-16 away. So your input is already wrong by 1e-16 or so.
Now, cosine has slope +1 or -1 at its zeros, so the error in the output will be roughly proportional to the error in the input. But to get an exact zero, you'd need error smaller than the smallest positive normal double, which is around 2e-308 (subnormals reach down to about 5e-324). That's nearly 300 orders of magnitude smaller than the error in the input.
While you could in theory "get lucky" and have some multiple of pi/2 that's really really close to the nearest representable double, the likelihood of this, just modelling it as random, is astronomically small. I believe there are even proofs that there is no double x for which the correctly-rounded value of cos(x) is an exact zero. For single-precision (float) this can be determined easily by brute force; for double that's probably also doable but a big computation.
As to why printf is printing -0.000000, it's just that the default for %f is 6 places after the decimal point, which is nowhere near enough to see the first significant digit. Using %e or %g, optionally with a large precision modifier, would show you an approximation of the result you got that actually retains some significance and give you an idea whether your result is good.
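The difference between %f and %e for a tiny negative value can be seen with a sketch like the following. The helper name format_value is hypothetical, and -1e-9 is just a stand-in for whatever small negative number the series produces.

```c
#include <stdio.h>

/* Hypothetical helper: format v with %f (6 decimals) and with %e. */
void format_value(double v, char *fixed, size_t nfixed, char *sci, size_t nsci)
{
    snprintf(fixed, nfixed, "%f", v);  /* tiny values collapse to +-0.000000 */
    snprintf(sci, nsci, "%e", v);      /* keeps the leading significant digits */
}
```

Calling it with -1e-9 fills the first buffer with "-0.000000" and the second with "-1.000000e-09".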

My issue is that when i input 90 i get the answer -0.000000. (why am i getting -0.000 instead of 0.000?)
cosine(90) is not precise enough to result in a value of 0.0. Use printf("cosine(%d)= %le\n",y, cosine(y)); (note the e) to see a more informative view of the result. Instead, cosine(90) is generating a negative result in the range [-0.0005 ... -0.0] and that is rounded to "-0.000" for printing.
Can anybody explain why and how i can solve this issue?
OP's cosine() lacks sufficient range reduction, which for degrees can be exact.
x %= 360; was a good first step, yet perform a better range reduction to a 90° width like [-45°...45°], [45°...135°], etc.
Also recommend: Use a Taylor series with sufficient terms (e.g. 10) and a good machine PI [1]. Form the terms more carefully than pow(rad, 2 * n) / fact(2 * n), which injects excessive error.
Example1, example2.
Other improvements possible, yet something to get OP started.
[1] #define PI 3.1415926535897932384626433832795
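One way these suggestions might fit together is sketched below. The name cosine_deg, the choice of 12 terms, and the reduction to [0°, 90°] (rather than the 90°-wide intervals suggested above) are assumptions made to keep the example short. The reduction is done on the integer degree value, so it is exact.

```c
#define PI 3.1415926535897932384626433832795
#define TERMS 12  /* assumed term count, ample for angles up to 90 degrees */

/* Hypothetical sketch: cosine of an integer angle in degrees. */
double cosine_deg(int deg)
{
    int d = deg % 360;
    double sign = 1.0;
    double rad, term = 1.0, sum = 0.0;
    int n;

    if (d < 0) d += 360;                       /* now in [0, 360) */
    if (d > 180) d = 360 - d;                  /* cos(360 - d) = cos(d) */
    if (d > 90) { d = 180 - d; sign = -1.0; }  /* cos(180 - d) = -cos(d) */
    if (d == 90) return 0.0;                   /* exact zero, no series needed */

    rad = d * (PI / 180.0);
    for (n = 0; n < TERMS; n++) {
        sum += term;
        /* next term: multiply by -rad^2 / ((2n+1)(2n+2)); no pow, no factorial */
        term *= -(rad * rad) / ((2.0 * n + 1.0) * (2.0 * n + 2.0));
    }
    return sign * sum;
}
```

With this reduction cosine_deg(90) returns exactly 0.0, so it prints as 0.000000 rather than -0.000000.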

Sine function using Taylor expansion (C Programming)

Here is the question..
This is what I've done so far,
#include <stdio.h>
#include <math.h>
long int factorial(int m)
{
if (m==0 || m==1) return (1);
else return (m*factorial(m-1));
}
double power(double x,int n)
{
double val=1;
int i;
for (i=1;i<=n;i++)
{
val*=x;
}
return val;
}
double sine(double x)
{
int n;
double val=0;
for (n=0;n<8;n++)
{
double p = power(-1,n);
double px = power(x,2*n+1);
long fac = factorial(2*n+1);
val += p * px / fac;
}
return val;
}
int main()
{
double x;
printf("Enter angles in degrees: ");
scanf("%lf",&x);
printf("\nValue of sine of %.2f is %.2lf\n",x,sine(x * M_PI / 180));
printf("\nValue of sine of %.2f from library function is %.2lf\n",x,sin(x * M_PI / 180));
return 0;
}
The problem is that the program works perfectly fine from 0 to 180 degrees, but beyond that it gives error.. Also when I increase the value of n in for (n=0;n<8;n++) beyond 8, i get significant error.. There is nothing wrong with the algorithm, I've tested it in my calculator, and the program seems to be fine as well.. I think the problem is due to the range of the data type.. what should i correct to get rid of this error?
Thanks..

You are correct that the error is due to the range of the data type. In sine(), you are calculating the factorial of 15, which is a huge number and does not fit in 32 bits (which is presumably what long int is implemented as on your system). To fix this, you could either:
Redefine factorial to return a double.
Rework your code to combine power and factorial into one loop, which alternately multiplies by x, and divides by i. This will be messier-looking but will avoid the possibility of overflowing a double (granted, I don't think that's a problem for your use case).
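A sketch of the first option, with the hypothetical name factorial_d. A double holds 15! exactly, since the value is below 2^53.

```c
/* Hypothetical replacement for factorial(): returns m! as a double. */
double factorial_d(int m)
{
    double result = 1.0;
    int i;
    for (i = 2; i <= m; i++)
        result *= i;   /* exact while the product stays below 2^53 */
    return result;
}
```

For instance, factorial_d(15) returns 1307674368000, which does not fit in a 32-bit integer.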

15! is indeed beyond range that a 32bit integer can hold. I'd use doubles throughout if I were you.
The Taylor series for sin(x) converges more slowly for large values of x. For x outside [-π, π], I'd add/subtract multiples of 2*π to get as small an x as possible.

You need range reduction. Note that a Taylor series is best near zero and that in the negative range it is the (negative) mirror image of its positive range. So, in short: reduce the range (modulo 2 PI) to wrap it into the range where you have the highest accuracy. The range beyond 1/2 PI gets less accurate, so you also want to use the formula: sin(1/2 PI + x) = sin(1/2 PI - x). For negative values use the formula: sin(-x) = -sin(x). Now you only need to evaluate the interval 0 to 1/2 PI while spanning the whole range. Of course, for VERY large values the accuracy of the reduction modulo 2 PI will suffer.

You may be having a problem with 15!.
I would print out the values for p, px, fac, and the value for the term for each iteration, and check them out.

You're only including 8 terms in an infinite series. If you think about it for a second in terms of a polynomial, you should see that you don't have a good enough fit for the entire curve.
The fact is that you only need to write the function for 0 <= x <= pi; all other values will follow using these relationships:
sin(-x) = -sin(x)
and
sin(x+pi) = -sin(x)
and
sin(x+2*n*pi) = sin(x)
I'd recommend that you normalize your input angle using these to make your function work for all angles as written.
There's a lot of inefficiency built into your code (e.g. you keep recalculating factorials that would easily fit in a table lookup; you use power() to oscillate between -1 and +1). But first make it work correctly, then make it faster.
