Is it safe to cascade hypot()? - c

I would like to compute the norm (length) of three- and four-dimensional vectors. I'm using double-precision floating point numbers and want to be careful to avoid unnecessary overflow or underflow.
The C math library provides hypot(x,y) for computing the norm of two-dimensional vectors, being careful to avoid underflow/overflow in intermediate calculations.
My question: Is it safe to use hypot(x, hypot(y, z)) and hypot(hypot(w, x), hypot(y, z)) to compute the lengths of three- and four-dimensional vectors, respectively?

It's safe, but it's wasteful: you only need to compute sqrt() once, but when you cascade hypot(), you will call sqrt() for every call to hypot(). Ordinarily I might not be concerned about the performance, but this may also degrade the precision of the result. You could write your own:
#include <math.h>

double hypot3(double x, double y, double z) {
    return sqrt(x*x + y*y + z*z);
}
etc. This will be faster and more accurate. I don't think anyone would be confused when they see hypot3() in your code.
The standard library hypot() may have tricks to avoid overflow, but you may not be concerned about that. Ordinarily, hypot() is more accurate than sqrt(x*x + y*y). See e_hypot.c in the glibc source code.

It is (almost) safe to use hypot(x, hypot(y, z)) and hypot(hypot(w, x), hypot(y, z)) to compute the lengths of three- and four-dimensional vectors.
C does not strongly specify that hypot() must work for every double x, y that has a finite double answer; it only has the weasel words "without undue overflow or underflow".
Yet given that hypot(x, y) works, a reasonable hypot() implementation will handle hypot(hypot(w, x), hypot(y, z)) as needed. Only one increment (at the low end) / decrement (at the high end) of binary exponent range is lost with 4-D versus 2-D.
Concerning speed, precision, and range, profile your code against sqrtl((long double) w*w + (long double) x*x + (long double) y*y + (long double) z*z) as an alternative, but that seems necessary only for select coding goals.
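For illustration, here is a sketch of that long double alternative (this assumes long double is actually wider than double, which holds on x86 but not on every platform):
#include <math.h>

/* One square root, with the squares accumulated in long double */
double hypot4_ld(double w, double x, double y, double z)
{
    long double s = (long double)w * w + (long double)x * x
                  + (long double)y * y + (long double)z * z;
    return (double)sqrtl(s);
}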

I've done some experiments with this sort of thing. In particular I looked at a plain implementation, an implementation cascading hypot(), and (a C translation of the reference version of) the BLAS function DNRM2.
I found that as regards overflow and underflow, the BLAS and hypot implementations behaved the same (in my tests) and were far superior to the plain implementation. As regards time, for high-dimensional (hundreds of elements) vectors, the BLAS was about 6 times slower than the plain implementation, while the hypot version was 3 times slower than the BLAS. The time differences were a bit smaller for smaller dimensions.
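For context, the scaling approach used by the reference DNRM2 looks roughly like this (my own C transcription of the classic scale/ssq loop, not the actual BLAS source):
#include <math.h>

/* Euclidean norm of x[0..n-1] with running rescaling to avoid overflow/underflow */
double nrm2(int n, const double *x)
{
    double scale = 0.0, ssq = 1.0;
    for (int i = 0; i < n; i++) {
        if (x[i] != 0.0) {
            double ax = fabs(x[i]);
            if (scale < ax) {
                ssq = 1.0 + ssq * (scale / ax) * (scale / ax);
                scale = ax;    /* rescale the running sum of squares */
            } else {
                ssq += (ax / scale) * (ax / scale);
            }
        }
    }
    return scale * sqrt(ssq);
}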

Should code be unable to use hypot() or wider precision types, a slow method examines the exponents using frexp() and scales the arguments (hat tip @greggo).
#include <float.h>
#include <math.h>

double nibot_norm(double w, double x, double y, double z) {
    // Sort the values by magnitude (so that |w| <= |x| <= |y| <= |z|)
    if (fabs(x) < fabs(w)) return nibot_norm(x, w, y, z);
    if (fabs(y) < fabs(x)) return nibot_norm(w, y, x, z);
    if (fabs(z) < fabs(y)) return nibot_norm(w, x, z, y);
    if (z == 0.0) return 0.0; // all-zero case

    // Scale z to an exponent half-way between 1.0 and DBL_MAX/4,
    // and scale w, x, y by the same amount
    int maxi;
    frexp(DBL_MAX, &maxi);
    int zi;
    frexp(z, &zi);
    int pow2scale = (maxi / 2 - 2) - zi;
    // No precision loss expected so far,
    // except that w, x, y may become 0.0 if _far_ less than z
    w = ldexp(w, pow2scale);
    x = ldexp(x, pow2scale);
    y = ldexp(y, pow2scale);
    z = ldexp(z, pow2scale);

    // All finite values are in range of squaring, except for values
    // greatly insignificant to z (e.g. |z| > |x|*1e300)
    double norm = sqrt(((w * w + x * x) + y * y) + z * z);

    // Restore the scale
    return ldexp(norm, -pow2scale);
}
Test Code
#include <float.h>
#include <stdio.h>

#ifndef DBL_TRUE_MIN
#define DBL_TRUE_MIN (DBL_MIN*DBL_EPSILON)
#endif

void nibot_norm_test(double w, double x, double y, double z, double expect) {
    static int dig = DBL_DECIMAL_DIG - 1;
    printf(" w:%.*e x:%.*e y:%.*e z:%.*e\n", dig, w, dig, x, dig, y, dig, z);
    double norm = nibot_norm(w, x, y, z);
    printf("expect:%.*e\n", dig, expect);
    printf("actual:%.*e\n", dig, norm);
    if (expect != norm) puts("Different");
}

int main(void) {
    nibot_norm_test(0, 0, 0, 0, 0);
    nibot_norm_test(10 / 7., 4 / 7., 2 / 7., 1 / 7., 11 / 7.);
    nibot_norm_test(DBL_MAX, 0, 0, 0, DBL_MAX);
    nibot_norm_test(DBL_MAX / 2, DBL_MAX / 2, DBL_MAX / 2, DBL_MAX / 2, DBL_MAX);
    nibot_norm_test(DBL_TRUE_MIN, 0, 0, 0, DBL_TRUE_MIN);
    nibot_norm_test(DBL_TRUE_MIN, DBL_TRUE_MIN, DBL_TRUE_MIN,
                    DBL_TRUE_MIN, DBL_TRUE_MIN * 2);
    return 0;
}
Results
w:0.00000000000000000e+00 x:0.00000000000000000e+00 y:0.00000000000000000e+00 z:0.00000000000000000e+00
expect:0.00000000000000000e+00
actual:0.00000000000000000e+00
w:1.42857142857142860e+00 x:5.71428571428571397e-01 y:2.85714285714285698e-01 z:1.42857142857142849e-01
expect:1.57142857142857140e+00
actual:1.57142857142857140e+00
w:1.79769313486231571e+308 x:0.00000000000000000e+00 y:0.00000000000000000e+00 z:0.00000000000000000e+00
expect:1.79769313486231571e+308
actual:1.79769313486231571e+308
w:8.98846567431157854e+307 x:8.98846567431157854e+307 y:8.98846567431157854e+307 z:8.98846567431157854e+307
expect:1.79769313486231571e+308
actual:1.79769313486231571e+308
w:4.94065645841246544e-324 x:0.00000000000000000e+00 y:0.00000000000000000e+00 z:0.00000000000000000e+00
expect:4.94065645841246544e-324
actual:4.94065645841246544e-324
w:4.94065645841246544e-324 x:4.94065645841246544e-324 y:4.94065645841246544e-324 z:4.94065645841246544e-324
expect:9.88131291682493088e-324
actual:9.88131291682493088e-324

Related

Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>
double feval(double x) {
    /* Insert your code here */
    if (x > 1e299)
    {
        return x - 1;
    }
    if (x < 1e-6)
    {
        long double g;
        g = x;
        printf("x = %Lf\n", g);
        long double a;
        a = pow(x, 2);
        printf("x squared = %Lf\n", a);
        return sqrt(g*g + 1.) - 1.;
    }
    else
    {
        printf("x = %f\n", x);
        printf("Used third \n");
        return sqrt(pow(x, 2) + 1.) - 1;
    }
}

int main(void)
{
    double x;
    printf("Input: ");
    scanf("%lf", &x);
    double b;
    b = feval(x);
    printf("%f\n", b);
    return 0;
}
int main(void)
{
double x;
printf("Input: ");
scanf("%lf", &x);
double b;
b = feval(x);
printf("%f\n", b);
return 0;
}
For small inputs, you're getting truncation error when you do 1+x^2. If x=1e-7f, x*x will happily fit into a 32-bit floating point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating point representation), but x*x will be so much smaller than 1 that floating point precision will not be sufficient to represent 1+x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
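As a minimal sketch (function name mine), usable when |x| is small enough that the O(x^4) term is negligible:
/* sqrt(1 + x*x) - 1 for small |x|: the lowest-order Taylor term.
   The neglected term is about x^4/8, i.e. a relative error of about x*x/4. */
double f_small(double x)
{
    return 0.5 * x * x;
}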
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.
There are two issues here when implementing this in the naive way: overflow or underflow in the intermediate computation of x * x, and subtractive cancellation during the final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot (x, y) that performs the computation sqrt (x * x + y * y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fix issues with subtractive cancellation is to transform the computation algebraically such that it is transformed into multiplications and / or divisions.
Combining these two fixes leads to the following implementation for float arguments. It has an error of less than 3 ulps across all possible inputs, according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
    return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1) / (sqrt(x*x+1)+1)
              = (x*x+1-1) / (sqrt(x*x+1)+1)
              = x*x / (sqrt(x*x+1)+1)
The last formula can be used as an implementation. For very small x, sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be exactly 2), but we don't lose precision in evaluating it.
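As a sketch of that last formula for double inputs (function name mine; note that x*x can still overflow for very large |x|, which the hypot-based version above avoids):
#include <math.h>

/* sqrt(x*x + 1) - 1 rewritten to avoid subtractive cancellation */
double f_alg(double x)
{
    return (x * x) / (sqrt(x * x + 1.0) + 1.0);
}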
The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.

atan2 for two sinusoids of arbitrary phase shift?

I'm trying to implement an atan2-like function to map two input sinusoidal signals of arbitrary relative phase shift to a single output signal that linearly goes from 0 to 2π. atan2 normally assumes two signals with a 90 deg phase shift.
Given y0(x) = sin(x) and y1 = sin(x + phase), where phase is a fixed non-zero value, how can I implement a way to return x modulo 2π?
atan2 returns the angle of a 2-D vector, and it expects its two inputs to be the orthogonal (90°-apart) components of that vector, which your two signals are not. But no worries, it's actually very easy to reduce your problem to an atan2 that handles everything nicely.
Notice that calculating sin(x) and sin(x + phase) is the same as projecting a point (cos(x), sin(x)) onto the axes (0, 1) and (sin(phase), cos(phase)). This is the same as taking dot products with those axes, or transforming the coordinate system from the standard orthogonal basis into the skewed one. This suggests a simple solution: inverse the transformation to get the coordinates in the orthogonal basis and then use atan2.
Here's a code that does that:
double super_atan2(double x0, double x1, double a0, double a1) {
    double det = sin(a0 - a1);
    double u = (x1*sin(a0) - x0*sin(a1)) / det;
    double v = (x0*cos(a1) - x1*cos(a0)) / det;
    return atan2(v, u);
}

double duper_atan2(double y0, double y1, double phase) {
    const double tau = 6.28318530717958647692; // https://tauday.com/
    return super_atan2(y0, y1, tau/4, tau/4 - phase);
}
super_atan2 gets the angles of the two projection axes, duper_atan2 solves the problem exactly as you stated.
Also notice that the calculation of det is not strictly necessary. It is possible to replace it by fmod and copysign (we still need the correct sign of u and v).
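A quick sanity check (values chosen by me for illustration): with phase = 0.7 and x = 2.5, the functions above should recover x.
#include <math.h>
#include <stdio.h>

/* assumes super_atan2() and duper_atan2() from above are in scope */
int main(void)
{
    double x = 2.5, phase = 0.7;
    double r = duper_atan2(sin(x), sin(x + phase), phase);
    /* prints approximately 2.500000; atan2 returns angles in (-pi, pi],
       so inputs above pi come back shifted by -2*pi */
    printf("%f\n", r);
    return 0;
}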
Derivation: from y0 = sin(x) and y1 = sin(x + phase) = sin(x)cos(phase) + cos(x)sin(phase), we get sin(x) = y0 and cos(x) = (y1 - y0*cos(phase)) / sin(phase).
In code:
// assume phase != k * pi, for any integer k
double f (double y0, double y1, double phase)
{
    double u = (- y0 * cos(phase) + y1) / sin(phase);
    double v = y0;
    double x = atan2 (v, u);
    return (x < 0) ? (x + 2 * M_PI) : x;
}

Efficient implementation of natural logarithm (ln) and exponentiation

I'm looking for implementations of the log() and exp() functions provided in the C library <math.h>. I'm working with 8-bit microcontrollers (OKI 411 and 431). I need to calculate the Mean Kinetic Temperature. The requirement is that we should be able to calculate MKT as fast as possible and with as little code memory as possible. The compiler comes with log() and exp() functions in <math.h>, but calling either function and linking with the library causes the code size to increase by 5 kilobytes, which will not fit in one of the micros we work with (OKI 411), because our code has already consumed ~12K of the available ~15K code memory.
The implementation I'm looking for should not use any other C library functions (like pow(), sqrt() etc). This is because all library functions are packed in one library and even if one function is called, the linker will bring whole 5K library to code memory.
EDIT
The algorithm should be correct up to 3 decimal places.
Using a Taylor series is neither the simplest nor the fastest way of doing this. Most professional implementations use approximating polynomials. I'll show you how to generate one in Maple (a computer algebra program), using the Remez algorithm.
For 3 digits of accuracy execute the following commands in Maple:
with(numapprox):
Digits := 8
minimax(ln(x), x = 1 .. 2, 4, 1, 'maxerror')
maxerror
Its response is the following polynomial:
-1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x
With the maximal error of: 0.000061011436
We generated a polynomial which approximates ln(x), but only inside the [1..2] interval. Increasing the interval is not wise, because that would increase the maximal error even more. Instead of that, do the following decomposition: write x = a * 2^n with 1 <= a < 2, so that ln(x) = ln(a) + n*ln(2).
So first find the highest power of 2 which is still smaller than the number (see: What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?). That exponent is the integer part of the base-2 logarithm. Divide by that power of two, and the result lands in the [1..2] interval. At the end we will have to add n*ln(2) to get the final result.
An example implementation for numbers >= 1:
float ln(float y) {
    int log2;
    float divisor, x, result;

    log2 = msb((int)y); // See: https://stackoverflow.com/a/4970859/6630230
    divisor = (float)(1 << log2);
    x = y / divisor;    // normalized value between [1.0, 2.0]

    result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
    result += ((float)log2) * 0.69314718; // ln(2) = 0.69314718

    return result;
}
Although if you plan to use it only in the [1.0, 2.0] interval, then the function is like:
float ln(float x) {
    return -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
}
The Taylor series for e^x converges extremely quickly, and you can tune your implementation to the precision that you need. (http://en.wikipedia.org/wiki/Taylor_series)
The Taylor series for log is not as nice...
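To illustrate, a minimal Taylor-series exp() sketch (my own, not a drop-in for the library function); it is fine for modest |x|, and you would tune the cutoff to your precision target:
/* e^x as 1 + x + x^2/2! + ... ; negative x handled via 1/e^(-x) to avoid cancellation */
double exp_taylor(double x)
{
    if (x < 0.0) return 1.0 / exp_taylor(-x);
    double term = 1.0, sum = 1.0;
    for (int k = 1; k < 200 && term > 1e-17 * sum; k++) {
        term *= x / k;
        sum += term;
    }
    return sum;
}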
If you don't need floating-point math for anything else, you may compute an approximate fractional base-2 log pretty easily. Start by shifting your value left until it's 32768 or higher and store the number of times you did that in count. Then, repeat some number of times (depending upon your desired scale factor):
    n = (mult(n,n) + 32768u) >> 16; // If a function is available for 16x16->32 multiply
    count <<= 1;
    if (n < 32768) n *= 2; else count += 1;
If the above loop is repeated 8 times, then the log base 2 of the number will be count/256. If ten times, count/1024. If eleven, count/2048. Effectively, this function works by computing the integer power-of-two logarithm of n**(2^reps), but with intermediate values scaled to avoid overflow.
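Fleshing that idea out, here is a self-contained variant of the same squaring trick (my own formulation, not the exact loop above), returning log2(v) in Q8 fixed point, i.e. result/256:
#include <stdint.h>

/* Approximate log2 of a 16-bit integer v >= 1: return_value / 256.0 ~= log2(v) */
int32_t log2_q8(uint16_t v)
{
    int32_t ip = 0;                         /* integer part: index of the MSB */
    while ((v >> ip) >= 2) ip++;
    uint32_t m = (uint32_t)v << (15 - ip);  /* normalize: m = x * 32768 with 1 <= x < 2 */
    int32_t frac = 0;
    for (int i = 0; i < 8; i++) {           /* extract 8 fractional bits */
        m = (m * m + 16384u) >> 15;         /* x <- x*x, still scaled by 32768 */
        frac <<= 1;
        if (m >= 65536u) {                  /* x >= 2: halve it and record a 1 bit */
            m >>= 1;
            frac |= 1;
        }
    }
    return (ip << 8) | frac;
}
For example, log2_q8(3) returns 405, and 405/256 ≈ 1.582, versus log2(3) ≈ 1.585.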
Would a basic table with interpolation between values work? If the range of values is limited (which is likely for your case - I doubt temperature readings have a huge range) and high precision is not required, it may work. It should be easy to test on a normal machine.
Here is one of many topics on table representation of functions: Calculating vs. lookup tables for sine value performance?
Necromancing.
I had to implement logarithms on rational numbers.
This is how I did it:
According to Wikipedia, there is the Halley-Newton approximation method,
which can be used for very high precision.
Using Newton's method, the iteration simplifies to yn+1 = yn + 2 * (x - exp(yn)) / (x + exp(yn)) (see the implementation below), which has cubic convergence to ln(x), which is way better than what the Taylor series offers.
// Using Newton's method, the iteration simplifies to the implementation below,
// which has cubic convergence to ln(x).
public static double ln(double x, double epsilon)
{
    double yn = x - 1.0d; // using the first term of the Taylor series as the initial value
    double yn1 = yn;

    do
    {
        yn = yn1;
        yn1 = yn + 2 * (x - System.Math.Exp(yn)) / (x + System.Math.Exp(yn));
    } while (System.Math.Abs(yn - yn1) > epsilon);

    return yn1;
}
This is not C but C#; still, I'm sure anybody capable of programming in C will be able to deduce the C code from it.
Furthermore, since logn(x) = ln(x) / ln(n), you have therefore just implemented logN as well.
public static double log(double x, double n, double epsilon)
{
    return ln(x, epsilon) / ln(n, epsilon);
}
where epsilon (error) is the minimum precision.
Now as to speed, you're probably better off using the ln cast in hardware, but as I said, I used this as a base to implement logarithms on a rational-numbers class working with arbitrary precision.
Arbitrary precision might be more important than speed, under certain circumstances.
Then, use the logarithmic identities for rational numbers:
logB(x/y) = logB(x) - logB(y)
In addition to Crouching Kitten's answer, which gave me inspiration, you can build a pseudo-recursive (at most 1 self-call) logarithm to avoid using polynomials. In pseudo code:
ln(x) :=
  If (x <= 0)
    return NaN
  Else if (!(1 <= x < 2))
    return LN2 * b + ln(a)
  Else
    return taylor_expansion(x - 1)
This is pretty efficient and precise since on [1; 2) the taylor series converges A LOT faster, and we get such a number 1 <= a < 2 with the first call to ln if our input is positive but not in this range.
You can find b as the unbiased exponent from the data held in the float x, and a from the mantissa of the float x (a is exactly the same float as x, but with exponent bias 0 rather than b). LN2 should be kept as a macro in hexadecimal floating point notation, IMO. You can also use http://man7.org/linux/man-pages/man3/frexp.3.html for this.
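As a concrete C sketch of that splitting (my own formulation; note that instead of the plain Taylor series I use the equivalent artanh form ln(a) = 2*artanh((a-1)/(a+1)) on [1, 2), which converges much faster there):
#include <math.h>

#define LN2 0x1.62e42fefa39efp-1   /* ln(2) as a hexadecimal double literal */

double ln_split(double x)
{
    if (x <= 0.0) return NAN;            /* domain: positive inputs only */
    int b;
    double a = 2.0 * frexp(x, &b);       /* x = a * 2^(b-1) with 1 <= a < 2 */
    b -= 1;
    double t = (a - 1.0) / (a + 1.0);    /* t in [0, 1/3) */
    double t2 = t * t, term = t, sum = 0.0;
    for (int k = 1; term > 1e-17; k += 2) {
        sum += term / k;                 /* t + t^3/3 + t^5/5 + ... */
        term *= t2;
    }
    return b * LN2 + 2.0 * sum;
}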
Also, the trick
unsigned long tmp = *(ulong*)(&d);
for "memory-casting" double to unsigned long, rather than "value-casting", is very useful to know when dealing with floats memory-wise, as bitwise operators will cause warnings or errors depending on the compiler.
Possible computation of ln(x) and expo(x) in C without <math.h> :
static double expo(double n) {
    int a = 0, b = n > 0;
    double c = 1, d = 1, e = 1;
    for (b || (n = -n); e + .00001 < (e += (d *= n) / (c *= ++a)););
    // approximately 15 iterations
    return b ? e : 1 / e;
}

static double native_log_computation(const double n) {
    // Basic logarithm computation.
    static const double euler = 2.7182818284590452354;
    unsigned a = 0, d;
    double b, c, e, f;
    if (n > 0) {
        for (c = n < 1 ? 1 / n : n; (c /= euler) > 1; ++a);
        c = 1 / (c * euler - 1), c = c + c + 1, f = c * c, b = 0;
        for (d = 1, c /= 2; e = b, b += 1 / (d * c), b - e/* > 0.0000001 */;)
            d += 2, c *= f;
    } else b = (n == 0) / 0.;
    return n < 1 ? -(a + b) : a + b;
}

static inline double native_ln(const double n) {
    // Returns the natural logarithm (base e) of N.
    return native_log_computation(n);
}

static inline double native_log_base(const double n, const double base) {
    // Returns the logarithm (base b) of N.
    return native_log_computation(n) / native_log_computation(base);
}
Try it Online
Building off @Crouching Kitten's great natural log answer above, if you need it to be accurate for inputs < 1 you can add a simple scaling factor. Below is an example in C++ that I've used on microcontrollers. It has a scaling factor of 256 and is accurate for inputs down to 1/256 = ~0.004, and up to 2^32/256 = 16777216 (due to overflow of a uint32 variable).
It's interesting to note that even on an STM32F103 (ARM Cortex-M3) with no FPU, the float implementation below is significantly faster (e.g. 3x or better) than the 16-bit fixed-point implementation in libfixmath (that being said, this float implementation still takes a few thousand cycles, so it's still not ~fast~).
#include <float.h>
#include <stdint.h>

float TempSensor::Ln(float y)
{
    // Algo from: https://stackoverflow.com/a/18454010
    // Accurate between (1 / scaling factor) < y < (2^32 / scaling factor).
    // Read the comments below for more info on how to extend this range.

    float divisor, x, result;
    const float LN_2 = 0.69314718; // pre-calculated constant used in calculations
    uint32_t log2 = 0;

    // handle inputs that are zero or negative
    if (y <= 0)
    {
        return -FLT_MAX;
    }

    // Scaling factor. The polynomial below is accurate when the input y > 1;
    // using a scaling factor of 256 (aka 2^8) therefore extends this down to
    // 1/256 or ~0.004. Given the use of uint32_t, the input y must stay below
    // 2^24 or 16777216 (aka 2^(32-8)), otherwise uint_y below will overflow.
    // Increasing the scaling factor will reduce the lower accuracy bound and
    // also reduce the upper overflow bound. If you need the range to be wider,
    // consider changing uint_y to a uint64_t.
    const uint32_t SCALING_FACTOR = 256;
    const float LN_SCALING_FACTOR = 5.545177444; // natural log of the scaling factor, precalculated

    y = y * SCALING_FACTOR;

    uint32_t uint_y = (uint32_t)y;
    while (uint_y >>= 1) // Convert the number to an integer and find the location of the MSB.
    {                    // This is the integer portion of Log2(y). See: https://stackoverflow.com/a/4970859/6630230
        log2++;
    }

    divisor = (float)(1 << log2);
    x = y / divisor; // Find the remainder value between [1.0, 2.0], then approximate
                     // the natural log of this remainder with a polynomial.
    result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x; // approximates ln(x) on [1,2]

    // Using the log product rule Log(A) + Log(B) = Log(AB) and the base change rule
    // log_x(A) = log_y(A)/log_y(x), calculate all the components in base e and sum them:
    //   Ln(x_remainder) + (log_2(x_integer) * ln(2)) - ln(SCALING_FACTOR)
    result = result + ((float)log2) * LN_2 - LN_SCALING_FACTOR;

    return result;
}

What is a simple way to find real roots of a (cubic) polynomial?

this seems like an obvious question to me, but I couldn't find it anywhere on SO.
I have a cubic polynomial and I need to find real roots of the function. What is THE way of doing this?
I have found several closed form formulas for roots of a cubic function, but all of them use either complex numbers or lots of goniometric functions and I don't like them (and also don't know which one to choose).
I need something simple; faster is better; and I know that I will eventually need to solve polynomials of higher order, so having a numerical solver would maybe help too.
I know I could use some library to do the hard work for me, but lets say I want to do this as an exercise.
I'm coding in C, so no import magic_poly_solver, please.
Bonus question: How do I find only roots inside a given interval?
For a cubic polynomial there are closed form solutions, but they are not particularly well suited for numerical calculus.
I'd do the following for the cubic case: any cubic polynomial has at least one real root, you can find it easily with Newton's method. Then, you use deflation to get the remaining quadratic polynomial to solve, see my answer there for how to do this latter step correctly.
One word of caution: if the discriminant is close to zero, there will be a numerically multiple real root, and Newton's method will miserably fail. Moreover, since in the vicinity of the root the polynomial behaves like (x - x0)^2, you'll lose half your significant digits (since P(x) will be < epsilon as soon as x - x0 < sqrt(epsilon)). So you may want to rule this case out and use the closed form solution for it, or solve the derivative polynomial.
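For illustration, here is a minimal sketch of that first step (my own formulation: a Newton iteration safeguarded by a bisection bracket, so the divergence caveats above are contained); deflation to the remaining quadratic then proceeds as described. It assumes a != 0.
#include <math.h>

/* One real root of a*x^3 + b*x^2 + c*x + d (a != 0).  Afterwards deflate:
   with root r, the remaining quadratic is a*x^2 + (b + a*r)*x + (c + (b + a*r)*r). */
double cubic_real_root(double a, double b, double c, double d)
{
    if (a < 0) { a = -a; b = -b; c = -c; d = -d; }   /* make the leading coefficient positive */
    /* Cauchy bound: every real root lies in (-B, B), so P(-B) < 0 < P(B) */
    double B = 1.0 + fmax(fabs(b), fmax(fabs(c), fabs(d))) / a;
    double lo = -B, hi = B, x = 0.0;
    for (int i = 0; i < 200; i++) {
        double p  = ((a * x + b) * x + c) * x + d;    /* P(x), Horner form */
        double dp = (3.0 * a * x + 2.0 * b) * x + c;  /* P'(x) */
        if (p == 0.0) break;
        if (p > 0.0) hi = x; else lo = x;             /* keep a sign-change bracket */
        double xn = (dp != 0.0) ? x - p / dp : x;     /* Newton step */
        if (!(xn > lo && xn < hi)) xn = 0.5 * (lo + hi);  /* fall back to bisection */
        if (xn == x) break;
        x = xn;
    }
    return x;
}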
If you want to find roots in a given interval, check Sturm's theorem.
A more general (complex) algorithm for generic polynomial solving is Jenkins-Traub algorithm. This is clearly overkill here, but it works well on cubics. Usually, you use a third-party implementation.
Since you do C, using the GSL is surely your best bet.
Another generic method is to find the eigenvalues of the companion matrix with, e.g., a balanced QR decomposition or reduction to Householder form. This is the approach taken by GSL.
For solving cubic equations with simple C code, I have found the QBC solver by noted numerics expert professor William Kahan to be sufficiently robust, reasonably fast and reasonably accurate:
William Kahan, "To solve a real cubic equation." PAM-352, Center for Pure and Applied Mathematics, University of California, Berkely. November 10, 1986. (online, online)
This uses a derivative-based iterative method to find the real root, reduces to a quadratic equation based on that, finally uses a numerically robust quadratic equation solver to find the two remaining roots. Typically, the iterative solver requires about five to ten iterations to converge to the result. Both solvers can be enhanced for accuracy and performance by judicious use of fused multiply-add (FMA) operations, available in ISO C99 via the fma() standard math function.
Crucial to the accuracy of the quadratic solver is the computation of the discriminant. For this, I use the following code based on recent research:
/* Compute B*B - A*C, accurately
Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
"Further Analysis of Kahan's Algorithm for the Accurate Computation
of 2x2 Determinants". Mathematics of Computation, Vol. 82, No. 284,
Oct. 2013, pp. 2245-2264
https://www.ams.org/journals/mcom/2013-82-284/S0025-5718-2013-02679-8/S0025-5718-2013-02679-8.pdf
*/
double DISC (double A, double B, double C)
{
    double w = C * A;
    double e = fma (-C, A, w);
    double f = fma (B, B, -w);
    double r = f + e;
    return r;
}
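For context, here is one common way such a discriminant feeds a numerically robust quadratic solver. This is a generic sketch of mine (it assumes the quadratic is written A*x^2 + 2*B*x + C, the convention under which B*B - A*C is its discriminant), not the actual code of Kahan's QBC:
#include <math.h>

/* Real roots of A*x^2 + 2*B*x + C (A != 0), reusing DISC() from above.
   One root uses the larger-magnitude numerator; the other comes from the
   product of roots C/A, which avoids subtractive cancellation. */
int solve_quadratic(double A, double B, double C, double *r0, double *r1)
{
    double d = DISC (A, B, C);
    if (d < 0.0) return 0;              /* complex conjugate pair: not handled here */
    double q = -(B + copysign (sqrt (d), B));
    if (q == 0.0) {                     /* B == 0 and C == 0: double root at zero */
        *r0 = *r1 = 0.0;
        return 2;
    }
    *r0 = q / A;
    *r1 = C / q;
    return 2;
}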
Using double-precision arithmetic, Kahan's solver cannot always produce result accurate to double precision. One of the test cases provided in Kahan's paper illustrates why this is the case:
658x³ - 190125x² + 18311811x - 587898164
Using an arbitrary precision math library, we find that the roots of this cubic equation are as follows:
96.229639346592182_18...
96.357064825184152_07... ± i * 0.069749752043689625_43...
QBC using double-precision arithmetic computes the roots as
96.2296393 50445893
96.35706482 3257289 ± i * 0.0697497 48521837268
The reason for this is that the function evaluation around the real root suffers from errors as large as 60% in the computed function value, preventing the iterative solver from getting closer to the root. By changing the function and derivative evaluation to use double-double computation for intermediate computation (at hefty computational cost), we can address that issue.
#include <math.h>

/* Data type for double-double computation */
typedef struct {
    double l; // low / tail
    double h; // high / head
} dbldbl;

dbldbl make_dbldbl (double head, double tail);
double get_dbldbl_head (dbldbl a);
double get_dbldbl_tail (dbldbl a);
dbldbl add_dbldbl (dbldbl a, dbldbl b);
dbldbl mul_dbldbl (dbldbl a, dbldbl b);

void EVAL (double X, double A, double B, double C, double D,
           double * restrict Q, double * restrict Qprime,
           double * restrict B1, double * restrict C2)
{
#if USE_DBLDBL_EVAL
    dbldbl AA, BB, CC, DD, XX, AX, TT, UU;
    AA = make_dbldbl (A, 0);
    BB = make_dbldbl (B, 0);
    CC = make_dbldbl (C, 0);
    DD = make_dbldbl (D, 0);
    XX = make_dbldbl (X, 0);
    AX = mul_dbldbl (AA, XX);
    TT = add_dbldbl (AX, BB);
    *B1 = get_dbldbl_head (TT) + get_dbldbl_tail(TT);
    UU = add_dbldbl (mul_dbldbl (TT, XX), CC);
    *C2 = get_dbldbl_head (UU) + get_dbldbl_tail(UU);
    TT = add_dbldbl (mul_dbldbl (add_dbldbl (AX, TT), XX), UU);
    *Qprime = get_dbldbl_head (TT) + get_dbldbl_tail(TT);
    UU = add_dbldbl (mul_dbldbl (UU, XX), DD);
    *Q = get_dbldbl_head (UU) + get_dbldbl_tail(UU);
#else // USE_DBLDBL_EVAL
    *B1 = fma (A, X, B);
    *C2 = fma (*B1, X, C);
    *Qprime = fma (fma (A, X, *B1), X, *C2);
    *Q = fma (*C2, X, D);
#endif // USE_DBLDBL_EVAL
}
/* Construct new dbldbl number. |tail| must be <= 0.5 ulp of |head| */
dbldbl make_dbldbl (double head, double tail)
{
    dbldbl z;
    z.l = tail;
    z.h = head;
    return z;
}

/* Return the head of a double-double number */
double get_dbldbl_head (dbldbl a)
{
    return a.h;
}

/* Return the tail of a double-double number */
double get_dbldbl_tail (dbldbl a)
{
    return a.l;
}

/* Add two dbldbl numbers */
dbldbl add_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl z;
    double e, q, r, s, t, u;
    /* Andrew Thall, "Extended-Precision Floating-Point Numbers for GPU
       Computation." 2006. http://andrewthall.org/papers/df64_qf128.pdf
    */
    q = a.h + b.h;
    r = q - a.h;
    t = (a.h + (r - q)) + (b.h - r);
    s = a.l + b.l;
    r = s - a.l;
    u = (a.l + (r - s)) + (b.l - r);
    t = t + s;
    s = q + t;
    t = (q - s) + t;
    t = t + u;
    z.h = e = s + t;
    z.l = (s - e) + t;
    /* For result of zero or infinity, ensure that tail equals head */
    if (isinf (s)) {
        z.h = s;
        z.l = s;
    }
    if (z.h == 0) {
        z.l = z.h;
    }
    return z;
}

/* Multiply two dbldbl numbers */
dbldbl mul_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl z;
    double e, s, t;
    s = a.h * b.h;
    t = fma (a.h, b.h, -s);
    t = fma (a.l, b.l, t);
    t = fma (a.h, b.l, t);
    t = fma (a.l, b.h, t);
    z.h = e = s + t;
    z.l = (s - e) + t;
    /* For result of zero or infinity, ensure that tail equals head */
    if (isinf (s)) {
        z.h = s;
        z.l = s;
    }
    if (z.h == 0) {
        z.l = z.h;
    }
    return z;
}
The roots computed with the more accurate function and derivative evaluation are:
96.22963934659218 0
96.35706482518415 3 ± i * 0.06974975204 5672006
While the real parts are now accurate to within the limits of double precision, the imaginary parts are still off. The reason for this is that in this case the quadratic equation is sensitive to minute differences in the coefficients. A one-ulp error in either of them can cause differences of around 10^-11 in the imaginary part. This could be worked around by representing the coefficients to higher than double precision and using higher-precision computation in the quadratic solver.
If you don't want to use the closed form solutions (or expect polynomials of higher order), the most obvious method would be to calculate approximate roots by using Newton's method.
Unfortunately it's not possible to decide in advance which root you will get when iterating; that depends on the starting value.
Also see here.
See Solving quartics and cubics for graphics by D Herbison-Evans, published in Graphics Gems V.
/*******************************************************************************
* FindCubicRoots solves:
* coeff[3] * x^3 + coeff[2] * x^2 + coeff[1] * x + coeff[0] = 0
* returns:
* 3 - 3 real roots
* 1 - 1 real root (2 complex conjugate)
*******************************************************************************/
int FindCubicRoots(const FLOAT coeff[4], FLOAT x[3]);
http://www.realitypixels.com/turk/opensource/index.html#CubicRoots

Floating point linear interpolation

To do a linear interpolation between two variables a and b given a fraction f, I'm currently using this code:
float lerp(float a, float b, float f)
{
    return (a * (1.0 - f)) + (b * f);
}
I think there's probably a more efficient way of doing it. I'm using a microcontroller without an FPU, so floating point operations are done in software. They are reasonably fast, but it's still something like 100 cycles to add or multiply.
Any suggestions?
n.b. for the sake of clarity in the equation in the code above, we can omit specifying 1.0 as an explicit floating-point literal.
As Jason C points out in the comments, the version you posted is most likely the best choice, due to its superior precision near the edge cases:
float lerp(float a, float b, float f)
{
    return a * (1.0 - f) + (b * f);
}
If we disregard precision for a while, we can simplify the expression as follows:
    a(1 − f) + bf
  = a − af + bf
  = a + f(b − a)
Which means we could write it like this:
float lerp(float a, float b, float f)
{
    return a + f * (b - a);
}
In this version we've gotten rid of one multiplication, but lost some precision.
Presuming floating-point math is available, the OP's algorithm is a good one and is always superior to the alternative a + f * (b - a) due to precision loss when a and b significantly differ in magnitude.
For example:
// OP's algorithm
float lint1 (float a, float b, float f) {
return (a * (1.0f - f)) + (b * f);
}
// Algebraically simplified algorithm
float lint2 (float a, float b, float f) {
return a + f * (b - a);
}
In that example, presuming 32-bit floats lint1(1.0e20, 1.0, 1.0) will correctly return 1.0, whereas lint2 will incorrectly return 0.0.
The majority of precision loss is in the addition and subtraction operators when the operands differ significantly in magnitude. In the above case, the culprits are the subtraction in b - a, and the addition in a + f * (b - a). The OP's algorithm does not suffer from this due to the components being completely multiplied before addition.
For the a=1e20, b=1 case, here is an example of differing results. Test program:
#include <stdio.h>
#include <math.h>

float lint1 (float a, float b, float f) {
    return (a * (1.0f - f)) + (b * f);
}

float lint2 (float a, float b, float f) {
    return a + f * (b - a);
}

int main () {
    const float a = 1.0e20;
    const float b = 1.0;
    int n;
    for (n = 0; n <= 1024; ++ n) {
        float f = (float)n / 1024.0f;
        float p1 = lint1(a, b, f);
        float p2 = lint2(a, b, f);
        if (p1 != p2) {
            printf("%i %.6f %f %f %.6e\n", n, f, p1, p2, p2 - p1);
        }
    }
    return 0;
}
Output, slightly adjusted for formatting:
f lint1 lint2 lint2-lint1
0.828125 17187500894208393216 17187499794696765440 -1.099512e+12
0.890625 10937500768952909824 10937499669441282048 -1.099512e+12
0.914062 8593750447104196608 8593749897348382720 -5.497558e+11
0.945312 5468750384476454912 5468749834720641024 -5.497558e+11
0.957031 4296875223552098304 4296874948674191360 -2.748779e+11
0.972656 2734375192238227456 2734374917360320512 -2.748779e+11
0.978516 2148437611776049152 2148437474337095680 -1.374390e+11
0.986328 1367187596119113728 1367187458680160256 -1.374390e+11
0.989258 1074218805888024576 1074218737168547840 -6.871948e+10
0.993164 683593798059556864 683593729340080128 -6.871948e+10
1.000000 1 0 -1.000000e+00
If you are on a micro-controller without an FPU then floating point is going to be very expensive. Could easily be twenty times slower for a floating point operation. The fastest solution is to just do all the math using integers.
The number of places after the fixed binary point (http://blog.credland.net/2013/09/binary-fixed-point-explanation.html?q=fixed+binary+point) is: XY_TABLE_FRAC_BITS.
Here's a function I use:
inline uint16_t unsignedInterpolate(uint16_t a, uint16_t b, uint16_t position) {
    uint32_t r1;
    uint16_t r2;

    /*
     * Only one multiply, and one divide/shift right. Shame about having to
     * cast to long int and back again.
     */
    r1 = (uint32_t) position * (b-a);
    r2 = (r1 >> XY_TABLE_FRAC_BITS) + a;
    return r2;
}
With the function inlined it should be approx. 10-20 cycles.
If you've got a 32-bit micro-controller you'll be able to use bigger integers and get larger numbers or more accuracy without compromising performance. This function was used on a 16-bit system.
If you're coding for a microcontroller without floating-point operations, then it's better not to use floating-point numbers at all, and to use fixed-point arithmetic instead.
Since C++20 you can use std::lerp(), which is likely to be the best possible implementation for your target.
It is worth noting that the standard linear interpolation formulas f1(t)=a+t(b-a), f2(t)=b-(b-a)(1-t), and f3(t)=a(1-t)+bt are not guaranteed to be well-behaved when using floating point arithmetic.
Namely, if a != b, it is not guaranteed that f1(1.0) == b or that f2(0.0) == a, while for a == b, f3(t) is not guaranteed to equal a when 0 < t < 1.
This function has worked for me on processors that support IEEE754 floating point when I need the results to behave well and to hit the endpoints exactly (I use it with double precision, but float should work as well):
double lerp(double a, double b, double t)
{
    if (t <= 0.5)
        return a + (b - a) * t;
    else
        return b - (b - a) * (1.0 - t);
}
If you want the final result to be an integer, it might be faster to use integers for the input as well.
int lerp_int(int a, int b, float f)
{
    //float diff = (float)(b-a);
    //float frac = f*diff;
    //return a + (int)frac;
    return a + (int)(f * (float)(b-a));
}
This does two casts and one float multiply. If a cast is faster than a float add/subtract on your platform, and if an integer answer is useful to you, this might be a reasonable alternative.
