Optimize floating-point calculations - c

I have a task to make the following function as precise as possible (speed is not the aim). I have to use float and the method of middle rectangles. Could you suggest something? Actually, I think it's all about minimizing float rounding errors. That's what I've done:
typedef float T;

T integrate(T left, T right, long N, T (*func)(T)) {
    long i = 0;
    T result = 0.0;
    T interval = right - left;
    for (i = 0; i < N; i++) {
        result += func(left + interval * (i + 0.5) / N) * interval / N;
    }
    return result;
}

There are lots of ways you could avoid or compensate for floating-point rounding (MM's suggestion, using Kahan summation, etc...). However, there's no reason to do so, because the rounding errors are absolutely dwarfed by the error of the integration scheme; you won't get a more accurate integral, you'll get a more accurate approximation of the incorrect result computed by the midpoint rule. Any such effort is entirely wasted except in extremely specialized circumstances.

I have a task to make the following function as precise as possible
You say that you have to use float, so I assume the question isn't about rounding, but rather about computing the integral more accurately.
I also assume that simply increasing N is not an option.
Instead of using the mid-point rule, my suggestion is to consider using a higher-order quadrature rule (trapezoid, Simpson's etc).
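For illustration, here is a minimal composite Simpson's rule sketch built on the question's typedef float T (the function name is mine, and it assumes the midpoint-only constraint can be relaxed); its error falls off as O(1/N^4) rather than the midpoint rule's O(1/N^2):

T integrate_simpson(T left, T right, long N, T (*func)(T)) {
    /* composite Simpson's rule; N must be even */
    T h = (right - left) / N;
    T sum = func(left) + func(right);
    for (long i = 1; i < N; i++) {
        /* interior points alternate weights 4, 2, 4, 2, ... */
        sum += (i % 2 ? 4 : 2) * func(left + h * i);
    }
    return sum * h / 3;
}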

Try this:
{
    long i = 0;
    T result = 0.0;
    T interval = right - left;
    for (i = 0; i < N; i++) {
        result += func(left + interval * (i + 0.5) / N);
    }
    return result * interval / N;
}
This factors interval / N out of the loop: the 2N per-iteration multiplications and divisions (each with its own rounding) become a single multiply and divide applied once to the finished sum.

If you want to compute an integral precisely, go read up on integration schemes; a home-grown routine won't give you any kind of precision.
The book "Numerical Recipes" (there are several versions, one for C) is highly regarded. I haven't looked at it personally.

Related

Simple integration that depends on floating point equality

I have the following very-crude integration calculator:
// definite integral in one variable
// using the basic trapezoid approach
float integrate(float start, float end, float step, float (*func)(float x))
{
    if (start >= (end - step))
        return 0;
    else {
        float x = start; // make it a bit more math-like
        float segment = step * (func(x) + func(x + step)) / 2;
        return segment + integrate(x + step, end, step, func);
    }
}
And an example usage:
#include <stdio.h>

static float square(float x) { return x * x; }

int main(void)
{
    // integral of x^2 from 0 to 2 should be ~2.67
    float start = 0.0, end = 2.0, step = 0.01;
    float answer = integrate(start, end, step, square);
    printf("The integral from %.2f to %.2f for X^2 = %.2f\n", start, end, answer);
}
$ run
The integral from 0.00 to 2.00 for X^2 = 2.67
What happens if the equality check at start >= (end-step) doesn't work? For example, what if it evaluates to 2.99997 instead of 3, so it does an extra iteration (or one too few)? Is there a way to prevent that, or do most math-type calculators just work in decimals or some extension of the 'normal' floating-point types?
If you are given step, one way to write a loop (and you should use a loop for this, not recursion) is:
float x;
for (float i = 0; (x = start + i*step) < end - step/2; ++i)
…
Some points about this:
We keep an integer count with i. As long as there are a reasonable number of steps, there will be no floating-point rounding error in this. (We could make i an int, but float can count integer values perfectly well, and using float avoids an int-to-float conversion in i*step.)
Instead of incrementing x (or start as it is passed by recursion) repeatedly, we recalculate it each time as start + i*step. This has only two possible rounding errors, in the multiplication and in the addition, so it avoids accumulating errors over repeated additions.
We use end - step/2 as the threshold. This allows us to catch the desired endpoint even if the calculated x drifts by as much as just under step/2 from the ideal point. And that is about the best we can do, because if it drifts farther than half a step away from the ideally spaced points, we cannot tell whether it has drifted +step/2 from end-step or -step/2 from end.
This presumes that step is an integer division of end-start, or pretty close to it, so that there are a whole number of steps in the loop. If it is not, the loop should be redesigned a bit to stop one step earlier and then calculate a step of partial width at the end.
At the beginning, I mentioned being given step. An alternative is you might be given a number of steps to use, and then the step width would be calculated from that. In that case, we would use an integer number of steps to control the loop. The loop termination condition would not involve floating-point rounding at all. We could calculate x as (float) i / NumberOfSteps * (end-start) + start.
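A minimal sketch of that last variant, assuming the trapezoid rule from the question (the function name and the parameter name num_steps are illustrative): the loop is controlled by an integer, and the endpoints are recomputed from i each pass instead of being accumulated.

float integrate_n(float start, float end, int num_steps, float (*func)(float))
{
    float sum = 0.0f;
    for (int i = 0; i < num_steps; i++) {
        /* recompute both endpoints from the counter; no accumulated drift */
        float x0 = (float) i       / num_steps * (end - start) + start;
        float x1 = (float) (i + 1) / num_steps * (end - start) + start;
        sum += (x1 - x0) * (func(x0) + func(x1)) / 2;
    }
    return sum;
}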
Two improvements can be made easily.
Using recursion is a bad idea. Each additional call creates a new stack frame. For a sufficiently large number of steps, you will trigger a stack overflow. Use a loop instead.
Normally, you would avoid the rounding problem by using start, end and n, the number of steps. The location of the kth interval would be at start + k * (end - start) / n;
So you could rewrite your function as
float integrate(float start, float end, int n, float (*func)(float x))
{
    float next = start;
    float sum = 0.0f;
    for (int k = 0; k < n; k++) {
        float x = next;
        next = start + (k + 1) * (end - start) / n;
        sum += 0.5f * (next - x) * (func(x) + func(next));
    }
    return sum;
}
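As a quick sanity check (values illustrative), integrating the square function from the question over [0, 2] with n = 200 steps should land near 8/3:

float answer = integrate(0.0f, 2.0f, 200, square); // ~2.6667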

What is a more accurate algorithm I can use to calculate the sine of a number?

I have this code that calculates a guess for sine and compares it to the standard C library's (glibc's in my case) result:
#include <stdio.h>
#include <math.h>

double double_sin(double a)
{
    a -= (a*a*a)/6;
    return a;
}

int main(void)
{
    double clib_sin = sin(.13),
           my_sin = double_sin(.13);
    printf("%.16f\n%.16f\n%.16f\n", clib_sin, my_sin, clib_sin - my_sin);
    return 0;
}
The accuracy for double_sin is poor (about 5-6 digits). Here's my output:
0.1296341426196949
0.1296338333333333
0.0000003092863615
As you can see, after .12963, the results differ.
Some notes:
I don't think the Taylor series will work for this specific situation, the factorials required for greater accuracy aren't able to be stored inside an unsigned long long.
Lookup tables are not an option, they take up too much space and generally don't provide any information on how to calculate the result.
If you use magic numbers, please explain them (although I would prefer if they were not used).
I would greatly prefer an algorithm that is easily understandable and usable as a reference over one that is not.
The result does not have to be perfectly accurate. A minimum would be the requirements of IEEE 754, C, and/or POSIX.
I'm using the IEEE-754 double format, which can be relied on.
The range supported needs to be at least from -2*M_PI to 2*M_PI. It would be nice if range reduction were included.
What is a more accurate algorithm I can use to calculate the sine of a number?
I had an idea about something similar to Newton-Raphson, but for calculating sine instead. However, I couldn't find anything on it and am ruling this possibility out.
You can actually get pretty close with the Taylor series. The trick is not to calculate the full factorial on each iteration.
The Taylor series looks like this:
sin(x) = x^1/1! - x^3/3! + x^5/5! - x^7/7! + ...
Looking at the terms, you calculate the next term by multiplying the numerator by x^2, multiplying the denominator by the next two numbers in the factorial, and switching the sign. For example, to go from the -x^3/3! term to the +x^5/5! term, you multiply by x^2, divide by 4*5, and flip the sign. Then you stop when adding the next term doesn't change the result.
So you could code it like this:
double double_sin(double x)
{
    double result = 0;
    double factor = x;
    int i;
    for (i = 2; result + factor != result; i += 2) {
        result += factor;
        factor *= -(x*x)/(i*(i+1));
    }
    return result;
}
My output:
0.1296341426196949
0.1296341426196949
-0.0000000000000000
EDIT:
The accuracy can be increased further if the terms are added in the reverse direction; however, this means computing a fixed number of terms:
#define FACTORS 30

double double_sin(double x)
{
    double result = 0;
    double factor = x;
    int i, j;
    double factors[FACTORS];
    for (i = 2, j = 0; j < FACTORS; i += 2, j++) {
        factors[j] = factor;
        factor *= -(x*x)/(i*(i+1));
    }
    for (j = FACTORS-1; j >= 0; j--) {
        result += factors[j];
    }
    return result;
}
This implementation loses accuracy if x falls outside the range of 0 to 2*PI. This can be fixed by calling x = fmod(x, 2*M_PI); at the start of the function to normalize the value.
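A small sketch of that normalization, wrapping the function above (the wrapper name is mine):

#include <math.h>

double double_sin_reduced(double x)
{
    /* bring x into (-2*PI, 2*PI) so the series is evaluated near zero */
    return double_sin(fmod(x, 2 * M_PI));
}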

How to approximate a Poisson distribution P(n, lambda)

I have to implement an algorithm to evaluate a sum of Poisson functions, each one with a multiplying constant:

S = sum_{k=0}^{cut} C(k) * lambda^k * exp(-lambda) / k!

where the C(k) are positive constants < 1, cut is a cutoff because in principle the sum runs over infinitely many k, and lambda is a number that may vary in my case from 20 to 100. I've tried a straightforward implementation in my code:
#include <quadmath.h>
... // definitions of lambda and c[k] ...
long double sum = 0;
for (int k = 0; k < cutoff; ++k)
{
    sum = sum + c[k] * powq(lambda, k) / tgamma(k + 1) * (1.0 / expq(lambda));
}
But I am not quite satisfied. I've searched in "Numerical Recipes" for a good approach to the evaluation of a Poisson distribution, but I didn't find anything about it.
Are there better ways to do this?
Edit: to be clear, I'm looking for the most precise way to approximate the probability of large events, given a Poisson distribution, without computing awkward (lambda^k)/k! factors!
Well, a simple improvement is to calculate lambda^k and (k+1)! incrementally and cache them, since their values from the previous iteration give the respective current ones with an O(1) calculation.
Also, since 1.0/exp(lambda) is a constant, you should calculate it once in advance.
#include <quadmath.h>
... // definitions of lambda and c[k] ...
const long double e_coeff = 1.0 / expq(lambda);
long double inv_k_factorial = 1.0l;
long double lambda_pow_k = 1.0l;
long double sum = 0.0l;
for (int k = 0; k < cutoff; ++k)
{
    sum += c[k] * lambda_pow_k * inv_k_factorial; // term for this k
    lambda_pow_k *= lambda;                       // lambda^(k+1) for the next pass
    inv_k_factorial /= (k + 1);                   // 1/(k+1)! for the next pass
}
sum *= e_coeff;
So now the three function calls and their respective overhead are completely gone from your loop.
Now, I've attempted to use the same data types as you did when writing your question. Since your comment indicates that lambda is greater than 1.0, there should be no relative error growth from a quickly diminishing lambda_pow_k. Any significance lost here depends on the limits of long double, which may or may not be good enough for your concrete needs.
Compilers are clever nowadays, so it might get optimized like that anyway, but I think it's best to leave less obvious optimizations to the optimizer. This way your code shouldn't suffer in performance even when handed to a non-optimizing compiler.
Since the Poisson probabilities obey the simple recurrence
P(k,lam) = lam/k * P(k-1,lam)
one possibility would be to use something like Horner's rule for polynomials. That is:
Sum_k C[k]*P(k,lam) = exp(-lam) * (C[0] + (lam/1)*(C[1] + (lam/2)*(C[2] + ...)))
or
P = C[cut]
for k = cut-1 .. 0
    P = P*lam/(k+1) + C[k]
P *= exp(-lam)
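In C, a sketch of that backward recurrence, using c[], cutoff, and lambda as in the question (so cut corresponds to cutoff - 1; the function name is mine):

#include <math.h>

long double poisson_weighted_sum(const long double *c, int cutoff, long double lambda)
{
    long double P = c[cutoff - 1];
    for (int k = cutoff - 2; k >= 0; --k)
        P = P * lambda / (k + 1) + c[k]; /* P = C[k] + (lam/(k+1)) * P */
    return P * expl(-lambda);            /* apply the common exp(-lambda) once */
}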

Round-off error when calculating a geometric mean [duplicate]

I need to compute the geometric mean of a large set of numbers, whose values are not a priori limited. The naive way would be
double geometric_mean(std::vector<double> const& data) // failure
{
    auto product = 1.0;
    for (auto x : data) product *= x;
    return std::pow(product, 1.0/data.size());
}
However, this may well fail because of underflow or overflow in the accumulated product (note: long double doesn't really avoid this problem). So, the next option is to sum up the logarithms:
double geometric_mean(std::vector<double> const& data)
{
    auto sum_log = 0.0;
    for (auto x : data) sum_log += std::log(x);
    return std::exp(sum_log/data.size());
}
This works, but calls std::log() for every element, which is potentially slow. Can I avoid that? For example by keeping track of (the equivalent of) the exponent and the mantissa of the accumulated product separately?
The "split exponent and mantissa" solution:
double geometric_mean(std::vector<double> const & data)
{
    double m = 1.0;
    long long ex = 0;
    double invN = 1.0 / data.size();

    for (double x : data)
    {
        int i;
        double f1 = std::frexp(x, &i);
        m *= f1;
        ex += i;
    }

    return std::pow(std::numeric_limits<double>::radix, ex * invN) * std::pow(m, invN);
}
If you are concerned that ex might overflow you can define it as a double instead of a long long, and multiply by invN at every step, but you might lose a lot of precision with this approach.
EDIT For large inputs, we can split the computation into several buckets:
double geometric_mean(std::vector<double> const & data)
{
    long long ex = 0;
    auto do_bucket = [&data, &ex](int first, int last) -> double
    {
        double ans = 1.0;
        for (; first != last; ++first)
        {
            int i;
            ans *= std::frexp(data[first], &i);
            ex += i;
        }
        return ans;
    };

    const int bucket_size = -std::log2(std::numeric_limits<double>::min());
    std::size_t buckets = data.size() / bucket_size;
    double invN = 1.0 / data.size();
    double m = 1.0;

    for (std::size_t i = 0; i < buckets; ++i)
        m *= std::pow(do_bucket(i * bucket_size, (i + 1) * bucket_size), invN);

    m *= std::pow(do_bucket(buckets * bucket_size, data.size()), invN);

    return std::pow(std::numeric_limits<double>::radix, ex * invN) * m;
}
I think I figured out a way to do it, combining the two routines from the question, similar to Peter's idea. Here is example code:
double geometric_mean(std::vector<double> const& data)
{
    const double too_large = 1.e64;
    const double too_small = 1.e-64;
    double sum_log = 0.0;
    double product = 1.0;
    for (auto x : data) {
        product *= x;
        if (product > too_large || product < too_small) {
            sum_log += std::log(product);
            product = 1;
        }
    }
    return std::exp((sum_log + std::log(product))/data.size());
}
The bad news is: this comes with a branch. The good news: the branch predictor is likely to get this almost always right (the branch should only rarely be triggered).
The branch could be avoided using Peter's idea of a constant number of terms in the product. The problem with that is that overflow/underflow may still occur within only a few terms, depending on the values.
You may be able to accelerate this by multiplying numbers as in your original solution and only converting to logarithms every certain number of multiplications (depending on the size of your initial numbers).
A different approach which would give better accuracy and performance than the logarithm method would be to compensate out-of-range exponents by a fixed amount, maintaining an exact logarithm of the cancelled excess. Like so:
const int EXP = 64;                // maximal/minimal exponent
const double BIG = pow(2, EXP);    // overflow threshold
const double SMALL = pow(2, -EXP); // underflow threshold
double product = 1;
int excess = 0; // number of times BIG has been divided out of product

for (int i = 0; i < n; i++)
{
    product *= A[i];
    while (product > BIG)
    {
        product *= SMALL;
        excess++;
    }
    while (product < SMALL)
    {
        product *= BIG;
        excess--;
    }
}

double mean = pow(product, 1.0/n) * pow(BIG, double(excess)/n);
All multiplications by BIG and SMALL are exact, and there are no calls to log (a transcendental, and therefore particularly imprecise, function).
There is a simple idea to reduce computation and also to prevent overflow: group the numbers together, say at least two at a time, take the log of each group, and sum those, as sketched below. For five numbers whose geometric mean is K:
log(abcde) = 5*log(K)
log(ab) + log(cde) = 5*log(K)
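A sketch of that grouping in C (function name mine; it halves the number of log calls and is safe as long as no single pairwise product overflows or underflows):

#include <math.h>
#include <stddef.h>

double geometric_mean_paired(const double *data, size_t n)
{
    double sum_log = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2)
        sum_log += log(data[i] * data[i + 1]); /* one log per pair */
    if (i < n)
        sum_log += log(data[i]);               /* odd element left over */
    return exp(sum_log / n);
}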
Summing logs to compute products stably is perfectly fine, and rather efficient (if this is not enough: there are ways to get vectorized logarithms with a few SSE operations -- there are also Intel MKL's vector operations).
To avoid overflow, a common technique is to divide every number by the maximum or minimum magnitude entry beforehand (or sum log differences from the log max or log min). You can also use buckets if the numbers vary a lot (e.g. sum the logs of small numbers and large numbers separately). Note that typically none of this is needed except for very large sets, since the log of a double is never huge (between about -700 and 700).
Also, you need to keep track of the signs separately.
Computing log x typically keeps the same number of significant digits as x, except when x is close to 1: you want to use std::log1p if you need to compute prod(1 + x_n) with small x_n.
Finally, if you have roundoff error problems when summing, you can use Kahan summation or variants.
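For reference, a minimal Kahan-summation sketch in C for the log accumulation (function name mine; the compensation variable carries the low-order bits a plain sum would discard):

#include <math.h>
#include <stddef.h>

double sum_logs_kahan(const double *data, size_t n)
{
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double y = log(data[i]) - c;
        double t = sum + y;   /* low-order bits of y are lost here... */
        c = (t - sum) - y;    /* ...and recovered into c for the next pass */
        sum = t;
    }
    return sum;
}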
Instead of using logarithms, which are very expensive, you can directly scale the results by powers of two.
double geometric_mean(std::vector<double> const& data) {
    double huge = scalbn(1, 512);
    double tiny = scalbn(1, -512);
    int scale = 0;
    double product = 1.0;
    for (auto x : data) {
        if (x >= huge) {
            x = scalbn(x, -512);
            scale++;
        } else if (x <= tiny) {
            x = scalbn(x, 512);
            scale--;
        }
        product *= x;
        if (product >= huge) {
            product = scalbn(product, -512);
            scale++;
        } else if (product <= tiny) {
            product = scalbn(product, 512);
            scale--;
        }
    }
    return exp2((512.0*scale + log2(product)) / data.size());
}

Approximation of arcsin in C

I've got a program that calculates the approximation of an arcsin value based on Taylor's series.
My friend and I have come up with an algorithm which has been able to return the almost "right" values, but I don't think we've done it very crisply. Take a look:
double next(double a, double x, int i); // forward declaration

double my_asin(double x)
{
    double a = 0;
    int i = 0;
    double sum = 0;
    a = x;
    for (i = 1; i < 23500; i++)
    {
        sum += a;
        a = next(a, x, i);
    }
    return sum;
}

double next(double a, double x, int i)
{
    return a*((my_pow(2*i-1, 2)) / ((2*i)*(2*i+1)*my_pow(x, 2)));
}
I checked if my_pow works correctly so there's no need for me to post it here as well. Basically I want the loop to end once the difference between the current and next term is more or equal to my EPSILON (0.00001), which is the precision I'm using when calculating a square root.
This is how I would like it to work:
while(my_abs(prev_term - next_term) >= EPSILON)
But the function double next is dependent on i, so I guess I'd have to increment it in the while statement too. Any ideas how I should go about doing this?
Example output for -1:
$ -1.5675516116e+00
Instead of:
$ -1.5707963268e+00
Thanks so much guys.
Issues with your code and question include:
Your image file showing the Taylor series for arcsin has two errors: there is a minus sign on the x^5 term instead of a plus sign, and the power of x is shown as x^n but should be x^(2n+1).
The x factor in the terms of the Taylor series for arcsin increases by x^2 in each term, but your formula a*((my_pow(2*i-1, 2)) / ((2*i)*(2*i+1)*my_pow(x, 2))) divides by x^2 in each term. This does not matter for the particular value -1 you ask about, but it will produce wrong results for other values, except 1.
You ask how to end the loop once the difference in terms is “more or equal to” your epsilon, but, for most values of x, you actually want less than (or, conversely, you want to continue, not end, while the difference is greater than or equal to, as you show in code).
The Taylor series is a poor way to evaluate functions because its error increases as you get farther from the point around which the series is centered. Most math library implementations of functions like this use a minimax series or something related to it.
Evaluating the series from low-order terms to high-order terms causes you to add larger values first, then smaller values later. Due to the nature of floating-point arithmetic, this means that accuracy from the smaller terms is lost, because it is “pushed out” of the width of the floating-point format by the larger values. This effect will limit how accurate any result can be.
Finally, to get directly to your question, the way you have structured the code, you directly update a, so you never have both the previous term and the next term at the same time. Instead, create another double b so that you have an object b for a previous term and an object a for the current term, as shown below.
Example:
double a = x, b, sum = a;
int i = 0;
do
{
    b = a;
    a = next(a, x, ++i);
    sum += a;
} while (fabs(b - a) > threshold); // threshold: e.g. the EPSILON from the question
Using the Taylor series for arcsin is extremely imprecise, as it converges very badly, and for a finite number of terms there will be relatively big differences from the true value. Also, using pow with integer exponents is neither very precise nor efficient.
However, computing it via arctan is OK:
arcsin(x) = arctan(x/sqrt(1-(x*x)));
since the Taylor series for arctan converges OK on the <0.0, 0.8> range, and all other parts of the range can be computed through it (using trigonometric identities). So here is my C++ implementation (from my arithmetics template):
T atan(const T &x) // = atan(x)
{
    bool _shift    = false;
    bool _invert   = false;
    bool _negative = false;
    T z, dz, x1, x2, a, b; int i;
    x1 = x; if (x1 < 0.0) { _negative = true; x1 = -x1; }
    if (x1 > 1.0) { _invert = true; x1 = 1.0/x1; }
    if (x1 > 0.7) { _shift = true; b = ::sqrt(3.0)/3.0; x1 = (x1-b)/(1.0+(x1*b)); }
    x2 = x1*x1;
    for (z = x1, a = x1, b = 1, i = 1; i < 1000; i++) // if x1 > 0.8 convergence is slow
    {
        a *= x2; b += 2; dz = a/b; z -= dz;
        a *= x2; b += 2; dz = a/b; z += dz;
        if (::abs(dz) < zero) break;
    }
    if (_shift)    z += pi/6.0;
    if (_invert)   z = 0.5*pi - z;
    if (_negative) z = -z;
    return z;
}

T asin(const T &x) // = asin(x)
{
    if (x <= -1.0) return -0.5*pi;
    if (x >= +1.0) return +0.5*pi;
    return ::atan(x/::sqrt(1.0-(x*x)));
}
Where T is any floating-point type (float, double, ...). As you can see, you need sqrt(x), pi = 3.141592653589793238462643383279502884197169399375105, zero = 1e-20, and the +,-,*,/ operations implemented. The zero constant is the target precision.
So just replace T with float/double and ignore the :: prefixes...
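As a quick check of the identity (not part of the original answer): with T = double and the constants above, asin(0.5) goes through atan(0.5/sqrt(0.75)) = atan(0.57735...), which is inside the fast-converging range, and should come out within zero of pi/6 ≈ 0.5235987755982988.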
so I guess I'd have to increment it in the while statement too
Yes, this might be a way. And what stops you?
int i = 0;
while (condition) {
    // do something
    i++;
}
Another way would be using the for condition:
for(i = 1; i < 23500 && my_abs(prev_term - next_term) >= EPSILON; i++)
Your formula is wrong. Here is the correct formula: http://scipp.ucsc.edu/~haber/ph116A/taylor11.pdf.
P.S. Also note that your formula and your series do not correspond to each other.
You can use a loop like this (assuming a starts at x as in your code; note the loop must keep iterating while the change is still larger than the tolerance, and i has to advance each pass):
double sum = 0, sum_prev;
int i = 0;
do {
    sum_prev = sum;
    sum += a;
    a = next(a, x, ++i);
} while (std::abs(sum_prev - sum) > 1e-15);
