Retaining maximum possible accuracy on float interpolation


We know that points X1 and X2 have respective points Y1 and Y2, so we can calculate Y for any X with:
(X - X1) / (X2 - X1) = (Y - Y1) / (Y2 - Y1)
We can get simple formula (A) out of that:
Y = (X - X1) * (Y2 - Y1) / (X2 - X1) + Y1;
This should be mathematically equivalent (B):
Y = (X - X1) / (X2 - X1) * (Y2 - Y1) + Y1;
For integer math, formula A performs better as long as the result of the multiplication (X - X1) * (Y2 - Y1) stays within the range of the type. Formula B won't work, because if X1 <= X < X2, the integer division (X - X1) / (X2 - X1) always truncates to 0.
For floating point both should work, but I'm thinking that B would offer better accuracy since multiplication result will remain smaller.
Is my assumption about floating point accuracy correct?
Is there some floating point quirk that I am not taking into consideration?
Assume IEEE 754 floating point representation.
Note 1: I am interested in the floating point case; integer math is pretty straightforward.
Note 2: Variables on FP formula may have non-integer values, but NaN and Infs are not within the scope of the question.

To solve the below for Y
(X - X1) / (X2 - X1) = (Y - Y1) / (Y2 - Y1)
both (A) and (B) will behave similarly:
(A) Y = (X - offsetX) * deltaY / deltaX + offsetY;
(B) Y = (X - offsetX) / deltaX * deltaY + offsetY;
If the points are originally whole numbers, "B ... multiplication result will remain smaller." may hold, but otherwise |deltaX| and |deltaY| could both be less than 1, and then this assumption may fail.
To improve accuracy, consider the effect of subtracting two nearly equal numbers (or adding two numbers of similar magnitude that differ in sign): the leading digits cancel. Code could choose either X1,Y1 or X2,Y2 as the offset by swapping the roles of point1 and point2. Choosing the offset closest to X,Y will improve accuracy.
With FP math, * and / stress the exponent range of the FP type: the product can be expected to be within a bit of the mathematically correct answer, but the result may overflow (or underflow).
+ and - stress the precision: The range is rarely an issue, but there may be large cancellation in the significands used to form the sum.
If all coordinate values are originally integers, I recommend using integer math that is twice as wide as the inputs and deriving the best answer from that.
If the final result is to be converted to an integer, ensure the code rounds rather than truncates: iy = (int) round(Y);
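The wide-integer suggestion above could be sketched as follows. The function name lerp_i32 is hypothetical, and the sketch assumes that x1 != x2, that each coordinate difference fits in a 32-bit signed integer (so the 64-bit product cannot overflow), and that the result fits in an int32_t.

```c
#include <stdint.h>

/* Hypothetical helper: integer linear interpolation.
   The product is formed in 64 bits and the quotient is rounded to nearest
   (ties away from zero) instead of being truncated.
   Assumes x1 != x2 and that x - x1, y2 - y1 and x2 - x1 each fit in 32 bits. */
int32_t lerp_i32(int32_t x, int32_t x1, int32_t y1, int32_t x2, int32_t y2)
{
    int64_t num = ((int64_t)x - x1) * ((int64_t)y2 - y1);
    int64_t den = (int64_t)x2 - x1;

    if (den < 0) {              /* normalize so the divisor is positive */
        num = -num;
        den = -den;
    }

    int64_t half = den / 2;
    int64_t q = (num >= 0) ? (num + half) / den
                           : -((-num + half) / den);
    return (int32_t)(y1 + q);
}
```

For example, lerp_i32(2, 0, 0, 3, 10) rounds 6.67 to 7, where plain integer division would give 6.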

Assuming no underflow or overflow occurs, they should be roughly equivalent in terms of accuracy: both multiplication and division incur the same relative error, and because the error is roughly multiplicative, the order in which you perform the operations won't make much difference.
If you know something about the relative magnitudes of the terms involved, you might be able to rearrange terms such that the subtractions are exact, which might reduce the error slightly.

In general, multiplications and divisions rarely cause a significant loss of precision. Because these are floating point numbers, with separate fields for the scale and significant digits, getting large intermediate results in itself isn't an issue. 2e100/3e100 and 2/3 are (for all intents and purposes) equally accurate.
On the other hand, additions or subtractions with a result much smaller in magnitude than the operands are much more common causes of loss of precision.
With this in mind, the two forms are basically equivalent. If your numbers are 'mainstream' (i.e. multiplication doesn't cause over/underflow), then you won't encounter any problems with either form. If you can't assume your numbers are mainstream, then you have to take all kinds of special precautions to get a good result.
Now, rather than consider the two forms (A) and (B), I would suggest selecting between (A) and (C):
Y = (X - X1) * (Y2 - Y1) / (X2 - X1) + Y1; (A)
Y = (X - X2) * (Y2 - Y1) / (X2 - X1) + Y2; (C)
and choosing the form for which the first factor X - X1 or X - X2 is smaller in magnitude. That way, if Y turns out to be small, you minimize the loss of precision.
For example, let's use
(X1,Y1) = (-100, -100)
(X2,Y2) = (0, 0)
X = 0.76
with three digits of precision. Then we get for (A):
Y = (0.76 - -100) * (0 - -100) / (0 - -100) + -100
= 101 * 100 / 100 - 100 (100.76 rounds to 101 at three digits)
= 1
while for (C), we get:
Y = (0.76 - 0) * (0 - -100) / (0 - -100) + 0
= 0.76 * 100 / 100 + 0
= 0.76
So, the quick answer to your question is:
Size of intermediate results in itself doesn't matter. It is not a reason to prefer (B) over (A).
Always consider addition and subtraction as more likely sources of loss of precision.
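As an illustration of choosing between (A) and (C), here is a minimal sketch. The name lerp_nearest is made up for this example, and it assumes finite inputs with x1 != x2.

```c
#include <math.h>

/* Hypothetical helper: interpolate from whichever endpoint is nearer to x,
   i.e. form (A) when x is closer to x1 and form (C) when it is closer to x2.
   Assumes x1 != x2 and finite inputs. */
double lerp_nearest(double x, double x1, double y1, double x2, double y2)
{
    double dy = y2 - y1;
    double dx = x2 - x1;

    if (fabs(x - x1) <= fabs(x - x2))
        return (x - x1) * dy / dx + y1;   /* form (A) */
    return (x - x2) * dy / dx + y2;       /* form (C) */
}
```

A side effect of this choice is that the endpoints are reproduced exactly: at x == x1 or x == x2 the first factor is zero, so the function returns y1 or y2 without any rounding.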

Related

Need help fixing an algorithm that approximates pi

I'm trying to write the C code for an algorithm that approximates pi. It's supposed to get the volume of a cube and the volume of a sphere inside that cube (the sphere's radius is 1/2 of the cube's side). Then I am supposed to divide the sphere's volume by the cube's and multiply by 6 to get pi.
It's working, but it's doing something weird in the part that is supposed to get the volumes. I figure it's something to do with the delta I chose for the approximations.
With a cube of side 4, instead of giving me a volume of 64 it's giving me 6400. With the sphere, instead of 33 it's giving me 3334-point-something.
Can someone figure it out? Here is the code (I commented the relevant parts):
#include <stdio.h>
int in_esfera(double x, double y, double z, double r_esfera){
    double dist = (x-r_esfera)*(x-r_esfera) + (y-r_esfera)*(y-r_esfera) + (z-r_esfera)*(z-r_esfera);
    return dist <= (r_esfera)*(r_esfera) ? 1 : 0;
}
double get_pi(double l_cubo){
    double r_esfera = l_cubo/2;
    double total = 0;
    double esfera = 0;
    //this is delta, for the precision. If I set it to 1E anything less than -1 the program continues endlessly. Is this normal?
    double delta = (1E-1);
    for(double x = 0; x < l_cubo; x+=delta){
        printf("x => %f; delta => %.6f\n",x,delta);
        for(double y = 0; y <l_cubo; y+=delta){
            printf("y => %f; delta => %.6f\n",y,delta);
            for(double z = 0; z < l_cubo; z+=delta){
                printf("z => %f; delta => %.6f\n",z,delta);
                total+=delta;
                if(in_esfera(x,y,z,r_esfera))
                    esfera+=delta;
            }
        }
    }
    //attempt at fixing this
    //esfera/=delta;
    //total/=delta;
    //
    //This printf displays the volumes. Notice how the place of the point is off. If delta isn't a power of 10 the values are completely wrong.
    printf("v_sphere = %.8f; v_cube = %.8f\n",esfera,total);
    return (esfera)/(total)*6;
}
void teste_pi(){
    double l_cubo = 4;
    double pi = get_pi(l_cubo);
    printf("%.8f\n",pi);
}
int main(){
    teste_pi();
}

total+=delta;
if(in_esfera(x,y,z,r_esfera))
    esfera+=delta;
total and esfera are three-dimensional volumes whereas delta is a one-dimensional length. If you were tracking units you'd have m^3 on the left and m on the right. The units are incompatible.
To fix it, cube delta so that you're conceptually accumulating tiny cubes instead of tiny lines.
total+=delta*delta*delta;
if(in_esfera(x,y,z,r_esfera))
    esfera+=delta*delta*delta;
Doing that fixes the output, and also works for any value of delta:
v_sphere = 33.37400000; v_cube = 64.00000000
3.12881250
Note that this algorithm "works" for arbitrary delta values, but it has severe accuracy issues. It's incredibly prone to rounding problems. It works best when delta is a power of two: 1/64.0 is better than 1/100.0, for example:
v_sphere = 33.50365448; v_cube = 64.00000000
3.14096761
Also, if you want your program to run faster get rid of all those printouts! Or at least the ones in the inner loops...
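One way to sidestep the rounding problems mentioned above is to drive the loops with integer counters and count cells instead of accumulating floating-point sums. This is a different technique from the original code, shown only as a sketch; the function estimate_pi and its parameter n (the number of cells per side, so delta = side / n) are hypothetical.

```c
/* Sketch: count grid cells with integer loop counters so that no
   floating-point loop variable accumulates rounding error.
   n is the number of cells along each side of the cube (n > 0). */
double estimate_pi(int n)
{
    const double side = 4.0;        /* cube side, as in the question */
    const double r = side / 2.0;    /* sphere radius */
    const double delta = side / n;
    long long inside = 0;

    for (int i = 0; i < n; i++) {
        double dx = i * delta - r;
        for (int j = 0; j < n; j++) {
            double dy = j * delta - r;
            for (int k = 0; k < n; k++) {
                double dz = k * delta - r;
                if (dx * dx + dy * dy + dz * dz <= r * r)
                    inside++;
            }
        }
    }

    /* Both volumes share the factor delta^3, so it cancels in the ratio. */
    return 6.0 * (double)inside / ((double)n * n * n);
}
```

Because each coordinate is computed as i * delta rather than by repeated addition, the number of loop iterations is exactly n per axis for any n, which avoids the extra or missing iteration that can occur when delta is not exactly representable.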

The thing is that multiplication over integers like a * b * c is the same as adding 1 + 1 + 1 + 1 + ... + 1 a * b * c times, right?
You're adding delta + delta + ... (x / delta) * (y / delta) * (z / delta) times. Or, in other words, (x * y * z) / (delta ** 3) times.
Now, that sum of deltas is the same as this:
delta * (1 + 1 + 1 + 1 + ...)
^^^^^^^^^^^^^^^^^^^^ (x * y * z) / (delta**3) times
So, if delta is a power of 10, (x * y * z) / (delta**3) will be an integer, and it'll be equal to the sum of 1's in parentheses (because it's the same as the product x * y * (z / (delta**3)), where the last term is an integer - see the very first sentence of this answer). Thus, your result will be the following:
delta * ( (x * y * z) / (delta ** 3) ) == (x * y * z) / (delta**2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the sum of ones
That's how you ended up calculating the product divided by delta squared.
To solve this, multiply all volumes by delta * delta.
However, I don't think it's possible to use this logic for deltas that aren't a power of 10. And indeed, the code will go all kinds of haywire for delta == 0.21 and l_cubo == 2, for example: you'll get 9.261000000000061 instead of 8.

Comparing the ratio of two values to 1

I'm working through a basic 'Programming in C' book.
I have written the following code, based on the book, to calculate the square root of a number:
#include <stdio.h>
float absoluteValue (float x)
{
    if(x < 0)
        x = -x;
    return (x);
}
float squareRoot (float x, float epsilon)
{
    float guess = 1.0;
    while(absoluteValue(guess * guess - x) >= epsilon)
    {
        guess = (x/guess + guess) / 2.0;
    }
    return guess;
}
int main (void)
{
    printf("SquareRoot(2.0) = %f\n", squareRoot(2.0, .00001));
    printf("SquareRoot(144.0) = %f\n", squareRoot(144.0, .00001));
    printf("SquareRoot(17.5) = %f\n", squareRoot(17.5, .00001));
    return 0;
}
An exercise in the book says that the current criterion used for termination of the loop in squareRoot() is not suitable for use when computing the square root of a very large or a very small number.
Instead of comparing the difference between the value of x and the value of guess^2, the program should compare the ratio of the two values to 1. The closer this ratio gets to 1, the more accurate the approximation of the square root.
If the ratio is just guess^2/x, shouldn't my code inside of the while loop:
guess = (x/guess + guess) / 2.0;
be replaced by:
guess = ((guess * guess) / x ) / 1 ; ?
This compiles but nothing is printed out into the terminal. Surely I'm doing exactly what the exercise is asking?

To calculate the ratio, just compute (guess * guess / x), which can be either higher or lower than 1 depending on your implementation. Similarly, your margin of error (in percent) would be absoluteValue((guess * guess / x) - 1) * 100.
All they want you to check is how close the square root is. By squaring the number you get and dividing it by the number you took the square root of you are just checking how close you were to the original number.
Example:
sqrt(4) = 2
2 * 2 / 4 = 1 (this is exact so we get 1 (2 * 2 = 4 = 4))
margin of error = (1 - 1) * 100 = 0% margin of error
Another example:
sqrt(4) = 1.999 (lets just say you got this)
1.999 * 1.999 = 3.996
3.996/4 = .999 (so we are close but not exact)
To check margin of error:
.999 - 1 = -.001
absoluteValue(-.001) = .001
.001 * 100 = .1% margin of error

How about applying a little algebra? Your current criterion is:
|guess^2 - x| >= epsilon
You are elsewhere assuming that guess is nonzero, so it is algebraically safe to convert that to
|1 - x / guess^2| >= epsilon / guess^2
epsilon is just a parameter governing how close the match needs to be, and the above reformulation shows that it must be expressed in terms of the floating-point spacing near guess^2 to yield equivalent precision for all evaluations. But of course that's not possible because epsilon is a constant. This is, in fact, exactly why the original criterion gets less effective as x diverges from 1.
Let us instead write the alternative expression
|1 - x / guess^2| >= delta
Here, delta expresses the desired precision in terms of the spacing of floating point values in the vicinity of 1, which is related to a fixed quantity sometimes called the "machine epsilon". You can directly select the required precision via your choice of delta, and you will get the same precision for all x, provided that no arithmetic operations overflow.
Now just convert that back into code.
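Converted back into code, the relative criterion might look like the sketch below. The name squareRootRel and the parameter name delta are illustrative, fabsf stands in for the question's absoluteValue, and the sketch assumes x > 0 and a delta comfortably larger than FLT_EPSILON.

```c
#include <math.h>

/* Sketch: Newton's iteration for sqrt(x) with a relative stopping test,
   looping while |1 - x / guess^2| >= delta.
   Assumes x > 0 and delta well above FLT_EPSILON (e.g. 1e-5f). */
float squareRootRel(float x, float delta)
{
    float guess = 1.0f;

    while (fabsf(1.0f - x / (guess * guess)) >= delta)
        guess = (x / guess + guess) / 2.0f;

    return guess;
}
```

With this test, squareRootRel(1e-10f, 1e-5f) and squareRootRel(1e10f, 1e-5f) are computed to about the same number of significant digits, which is not the case with the absolute test on guess * guess - x.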

Suggest a different point of view.
With the iteration guess_next = (x/guess + guess) / 2.0;, once the approximation is in the neighborhood of the root, the number of accurate bits roughly doubles with each step. For example, log2(FLT_EPSILON) is about -23, so about 6 iterations are needed. (Think 23, 12, 6, 3, 2, 1.)
The trouble with using guess * guess is that it may underflow to 0.0 or overflow to infinity for a non-zero x.
To form a quality initial guess:
assert(x > 0.0f);
int expo;
float signif = frexpf(x, &expo);
float guess = ldexpf(signif, expo/2);
Now iterate N times (e.g. 6), with N based on FLT_EPSILON, FLT_DECIMAL_DIG or FLT_DIG.
for (int i = 0; i < N; i++) {
    guess = (x/guess + guess) / 2.0f;
}
The cost of perhaps an extra iteration is saved by avoiding an expensive termination condition calculation.
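Put together, the two fragments above could form a function like this sketch. The name sqrt_fixed and the fixed count of 6 iterations are assumptions for illustration; it expects a finite x > 0.

```c
#include <math.h>

/* Sketch: square root with an exponent-based initial guess and a fixed
   number of Newton steps, so no termination test is needed.
   Assumes x is finite and x > 0. */
float sqrt_fixed(float x)
{
    int expo;
    float signif = frexpf(x, &expo);          /* x == signif * 2^expo, 0.5 <= signif < 1 */
    float guess = ldexpf(signif, expo / 2);   /* roughly halve the exponent */

    for (int i = 0; i < 6; i++)               /* bit count of accuracy about doubles each step */
        guess = (x / guess + guess) / 2.0f;

    return guess;
}
```

Halving the exponent puts the initial guess within roughly a factor of two of the true root, which is why a small fixed iteration count is enough for float.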

If code wants to test whether a/b is close to 1.0f, simply allow a small multiple of FLT_EPSILON on either side, with a factor like 1 or 2.
float a = guess;
float b = x/guess;
assert(b);
float q = a/b;
#define FACTOR (1.0f /* some value 1.0f to maybe 2,3 or 4 */)
if (q >= 1.0f - FLT_EPSILON*FACTOR && q <= 1.0f + FLT_EPSILON*FACTOR) {
    close_enough();
}

First lesson in numerical analysis: for floating point numbers x+y has the potential for large relative errors, especially when the sum is near zero, but x*y has very limited relative errors.

Splines in integer arithmetic?

Splines (the piecewise cubic polynomial form) can be written as:
s = x - x[k]
y = y[k] + a[k]*s + b[k]*s*s + c[k]*s*s*s
where x[k] < x < x[k+1], the curve passes through each (x[k], y[k]) point, and a,b,c are arrays of coefficients describing the slope and shape. This all works fine in floating point, and there are plenty of ways to calculate a,b,c for different kinds of splines. However...
How can this be approximated in integer arithmetic?
One of the tricky parts is that any approximation should, ideally, be continuous, in other words using x=x[k+1] and the coefficients from the k-th segment, the result should be y[k+1] except for rounding errors. In other words, for a straight segment, y[k+1] == y[k] + a[k]*(x[k+1] - x[k]), and curvy segments only deviate from this in the middle but not at either end. This is guaranteed by construction in the case of floating point, but even a small coefficient change from rounding can throw it off quite a bit.
Another tricky part is that, in general, the magnitude of the higher-order coefficients is much smaller, but not always, especially not at sharp "corners". It may still make sense to scale them up by the typical size of s raised to the power of their order, so that they are not rounded off to zero as integers, but that would seem to trade off resolution in curvature for the maximum possible corner sharpness.
First try at an integer version:
y = y[k] + (a[k] + (b[k] + c[k]*s)*s)*s
Then use integer multiply (intended for 16bit values, 32bit arithmetic):
#define q (1<<16)
#define mult(x, y) (((x) * (y)) / q)
y = y[k] + mult(mult(mult(c[k], s) + b[k], s) + a[k], s)
This looks good in theory, but I'm not sure it's the best possible approach, or how to tell systematically what the best possible approach is.
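For reference, here is one possible sketch of the same Horner scheme with a 64-bit intermediate product and rounding. It is not claimed to be the best approach. It assumes that every quantity (y[k], the coefficients and s) is stored with 16 fractional bits, and that right-shifting a negative value is an arithmetic shift, which C leaves implementation-defined. The names Q_BITS, q_mul and spline_eval_q16 are made up for the example.

```c
#include <stdint.h>

#define Q_BITS 16   /* assumed fixed-point format: 16 fractional bits */

/* Multiply two fixed-point values using a 64-bit intermediate so the
   product cannot overflow, rounding to nearest instead of truncating.
   Assumes >> on a negative int64_t is an arithmetic shift. */
static int32_t q_mul(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * b;
    p += (int64_t)1 << (Q_BITS - 1);    /* add half a unit before shifting */
    return (int32_t)(p >> Q_BITS);
}

/* Horner evaluation of y0 + a*s + b*s^2 + c*s^3, all in the same format. */
int32_t spline_eval_q16(int32_t y0, int32_t a, int32_t b, int32_t c, int32_t s)
{
    return y0 + q_mul(q_mul(q_mul(c, s) + b, s) + a, s);
}
```

Rounding in q_mul halves the worst-case error of each multiply compared with truncation, but it does not by itself guarantee continuity at the knots; that still depends on how the coefficients are quantized.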

Extending a line between 2 points in 3D space

Let's say I have 2 points in 3D space, one at:
x=2, y=3, z=5
and the second one at:
x=6, y=7, z=10
What is the fastest way, in code, to calculate the coordinates of a third point from extending (for example, doubling) the distance between those two points (relative to point one)?

If you want a point extended as far beyond (x2,y2,z2) as that is beyond (x1,y1,z1):
x3 = x2 + (x2 - x1) (= 10)
y3 = y2 + (y2 - y1) (= 11)
z3 = z2 + (z2 - z1) (= 15)
or:
(x2 * 2 - x1, y2 * 2 -y1, z2 * 2 - z1)
Simple as that.
If you want something other than double the length, you can scale the (x2 - x1)-type terms. For example, if you want it 50% longer than the current line, multiply them by 0.5 (+50%). If you want it three times as long, multiply them by two (+200%).
In terms of code that can perform this extension, something like the following gives you an endpoint pDest that, along with p1, forms a line whose length is the given percentage of the length of p1-p2 (so 200 doubles it):
typedef struct {
    double x;
    double y;
    double z;
} tPoint3d;
void extend (tPoint3d *p1, tPoint3d *p2, double percent, tPoint3d *pDest) {
    percent -= 100.0; // what to ADD
    percent /= 100.0; // make multiplier
    pDest->x = p2->x + percent * (p2->x - p1->x); // scale each point
    pDest->y = p2->y + percent * (p2->y - p1->y);
    pDest->z = p2->z + percent * (p2->z - p1->z);
}

Computing fractional exponents in C

I'm trying to evaluate a^n, where a and n are rational numbers.
I don't want to use any predefined functions like sqrt() or pow()
So I'm trying to use Newton's Method to get an approximate solution using this approach:
3^0.2 = 3^(1/5) , so if x = 3^0.2, x^5 = 3.
Probably the best way to solve that (without a calculator but still
using the basic arithmetic operations) is to use "Newton's method".
Newton's method for solving the equation f(x)= 0 is to set up a
sequence of numbers x[n] defined by taking x[0] as some initial "guess"
and then x[n+1] = x[n] - f(x[n]) / f'(x[n]) where f'(x) is the derivative of f.
Posted on physicsforums
The problem with that method is that if I want to compute 5.2^0.33333, I'll need to find the roots of the equation x^100000 - 5.2^33333 = 0. I end up with huge numbers, and get inf and nan errors most of the time.
Can someone give me advice on how to solve this problem? Or, can someone provide another algorithm to compute a^n?

It seems your task is to calculate
(xN / xD)^(aN / aD), where xN, xD, aN, aD ∈ ℤ and xD, aD ≠ 0
using only multiplications, divisions, additions, and subtractions, with Newton's method as the suggested method to implement.
The equation we're trying to solve (for y) is
y = (xN / xD)^(aN / aD), where y ∈ ℝ
Newton's method finds a root of a function. If we want to use it to solve the above, we subtract the right side from the left side, to get a function whose zero gives us the y we want:
f(y) = y - (xN / xD)^(aN / aD) = 0
Not much help. I guess this is as far as you got? The point here is to not form that function just yet, because we don't have a way to calculate a rational power of a rational number!
First, let's decide that aD and xD are both positive. We can do that simply by negating both aN and aD if aD was negative (so the sign of aN/aD does not change), and negating both xN and xD if xD was negative. Remember, by definition neither xD nor aD is zero. Then, we can simply raise both sides to the aD'th power:
y^aD = (xN / xD)^aN = xN^aN / xD^aN
We can even eliminate the division by multiplying both sides by the last term:
y^aD × xD^aN = xN^aN
Now, this looks quite promising! The function we get from this is
f(y) = y^aD × xD^aN - xN^aN
Newton's method also requires the derivative, which is
df(y) = d f(y) / dy = aD × y^(aD - 1) × xD^aN = aD × y^aD × xD^aN / y
Newton's method itself relies on iterating
y[i+1] = y[i] - f(y[i]) / df(y[i])
If you work out the math, you'll find that the iteration is just
y[i+1] = y[i] - y[i] / aD + (y[i] × xN^aN) / (aD × y[i]^aD × xD^aN)
You don't need to keep all the y values in memory; it is enough to remember the last one, and stop iterating when their difference is small enough.
You do still have exponentiation above, but now they are integer exponentiations only, e.g.
xN^aN = xN × xN × .. × xN (aN factors)
which you can do very simply, for example just by multiplying the argument by itself the desired number of times, e.g. in C,
double ipow(const double base, const int exponent)
{
    double result = 1.0;
    int i;
    for (i = 0; i < exponent; i++)
        result *= base;
    return result;
}
There are more efficient methods to do integer exponentiation, but the above function should be perfectly acceptable for this.
The final problem is to pick the initial y so that you get convergence. You cannot use 0, because (a power of) y is used as a denominator in the division; you'd get division by zero error. Personally, I'd check whether the result ought to be positive or negative, and smaller than or greater than one in magnitude; two rules overall to pick a safe initial y.
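The pieces above can be combined into a small sketch. The function name rational_pow, the iteration cap and the stopping tolerance are assumptions made for illustration. The sketch requires xN > 0, xD > 0 and aD > 0, handles a negative aN by swapping xN and xD, and starts from y = 1. For a positive target, the first Newton step from 1 lands at or above the root and the iterates then decrease toward it. It also assumes (xN / xD)^aN fits in a double, so very large exponents such as 33333/100000 would still overflow and need a smaller fraction.

```c
#include <math.h>

/* Integer power by repeated multiplication, as in the answer above. */
static double ipow(double base, int exponent)
{
    double result = 1.0;
    for (int i = 0; i < exponent; i++)
        result *= base;
    return result;
}

/* Hypothetical helper: (xN / xD)^(aN / aD) by Newton's method.
   Assumes xN > 0, xD > 0, aD > 0, and that (xN / xD)^|aN| fits in a double. */
double rational_pow(int xN, int xD, int aN, int aD)
{
    if (aN < 0) {                       /* (xN/xD)^(-p) == (xD/xN)^p */
        int t = xN; xN = xD; xD = t;
        aN = -aN;
    }

    const double c = ipow((double)xN / xD, aN);   /* xN^aN / xD^aN */
    double y = 1.0;                                /* initial guess */

    for (int iter = 0; iter < 1000; iter++) {
        /* y[i+1] = y[i] - y[i]/aD + y[i] * c / (aD * y[i]^aD) */
        double next = y - y / aD + y * c / (aD * ipow(y, aD));
        double change = fabs(next - y);
        y = next;
        if (change <= 1e-15 * fabs(y))  /* successive values agree */
            break;
    }
    return y;
}
```

For instance, rational_pow(26, 5, 1, 3) approximates 5.2^(1/3) using nothing beyond the four basic arithmetic operations inside the loop (fabs is used only for the stopping test).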
Questions?

You can use the generalized binomial theorem. Substitute y=1 and x=a-1. You would want to truncate the infinite series after enough terms, based on the desired accuracy. To be able to link number of terms to accuracy, you would need to ensure that the x^r terms are decreasing in absolute value. So, depending on the value of a and n, you should apply the formula to compute one of a^n and a^(-n) and use that to get your desired result.
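One reading of this suggestion is the series a^n = (1 + x)^n = 1 + n*x + n*(n-1)/2! * x^2 + ... with x = a - 1, which involves only integer powers of x and converges when |a - 1| < 1. The sketch below follows that reading; the name binom_pow and the terms parameter are hypothetical, and it assumes 0 < a < 2.

```c
/* Sketch: a^n from the binomial series (1 + x)^n with x = a - 1.
   Each term is derived from the previous one via
   C(n, k) = C(n, k-1) * (n - k + 1) / k.
   Assumes 0 < a < 2 so that |x| < 1 and the series converges. */
double binom_pow(double a, double n, int terms)
{
    double x = a - 1.0;
    double term = 1.0;   /* k = 0 term: C(n, 0) * x^0 */
    double sum = 1.0;

    for (int k = 1; k < terms; k++) {
        term *= (n - (k - 1)) / k * x;
        sum += term;
    }
    return sum;
}
```

The closer a is to 1, the fewer terms are needed; for a outside (0, 2) one would first rescale the problem, for example by working with 1/a as the answer hints.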

A solution for raising an integer number to a power is:
int poweri (int x, unsigned int y)
{
    int temp;
    if (y == 0)
        return 1;
    temp = poweri (x, y / 2);
    if ((y % 2) == 0)
        return temp * temp;
    else
        return x * temp * temp;
}
However, the square root doesn't provide as clean a closed solution. There is a good bit of background to be found at wikipedia-square root and at Wolfram Mathworks Square Root Algorithms. Both provide several methods that will meet your needs; you just have to choose the one that fits your purpose.
With slight modification, this routine from wikipedia (modified to return the square root and refine accuracy) returns a surprisingly accurate square root. Yes, there will be howls about the use of a union, and it is only valid where integer and float storage are equivalent, but if you are hacking your own square root, this is relatively efficient:
float sqrt_f (float x)
{
    float xhalf = 0.5f*x;
    union
    {
        float x;
        int i;
    } u;
    u.x = x;
    u.i = 0x5f3759df - (u.i >> 1);
    /* The next line can be repeated any number of times to increase accuracy */
    // u.x = u.x * (1.5f - xhalf * u.x * u.x);
    int i = 10;
    while (i--)
        u.x *= 1.5f - xhalf * u.x * u.x;
    return 1.0f / u.x;
}
