Floating point linear interpolation - C

To do a linear interpolation between two variables a and b given a fraction f, I'm currently using this code:
float lerp(float a, float b, float f)
{
    return (a * (1.0 - f)) + (b * f);
}
I think there's probably a more efficient way of doing it. I'm using a microcontroller without an FPU, so floating point operations are done in software. They are reasonably fast, but it's still something like 100 cycles to add or multiply.
Any suggestions?
N.B. For the sake of clarity, the equation in the code above writes 1.0 without the explicit float-literal suffix (1.0f).

As Jason C points out in the comments, the version you posted is most likely the best choice, due to its superior precision near the edge cases:
float lerp(float a, float b, float f)
{
    return a * (1.0 - f) + (b * f);
}
If we disregard precision for a moment, we can simplify the expression as follows:
    a(1 − f) + bf
  = a − af + bf
  = a + f(b − a)
Which means we could write it like this:
float lerp(float a, float b, float f)
{
    return a + f * (b - a);
}
In this version we've gotten rid of one multiplication, but lost some precision.

Presuming floating-point math is available, the OP's algorithm is a good one and is always superior to the alternative a + f * (b - a), which loses precision when a and b differ significantly in magnitude.
For example:
// OP's algorithm
float lint1 (float a, float b, float f) {
    return (a * (1.0f - f)) + (b * f);
}

// Algebraically simplified algorithm
float lint2 (float a, float b, float f) {
    return a + f * (b - a);
}
In that example, presuming 32-bit floats, lint1(1.0e20, 1.0, 1.0) will correctly return 1.0, whereas lint2 will incorrectly return 0.0.
The majority of precision loss is in the addition and subtraction operators when the operands differ significantly in magnitude. In the above case, the culprits are the subtraction in b - a, and the addition in a + f * (b - a). The OP's algorithm does not suffer from this due to the components being completely multiplied before addition.
For the a=1e20, b=1 case, here is an example of differing results. Test program:
#include <stdio.h>
#include <math.h>

float lint1 (float a, float b, float f) {
    return (a * (1.0f - f)) + (b * f);
}

float lint2 (float a, float b, float f) {
    return a + f * (b - a);
}

int main () {
    const float a = 1.0e20;
    const float b = 1.0;
    int n;
    for (n = 0; n <= 1024; ++ n) {
        float f = (float)n / 1024.0f;
        float p1 = lint1(a, b, f);
        float p2 = lint2(a, b, f);
        if (p1 != p2) {
            printf("%i %.6f %f %f %.6e\n", n, f, p1, p2, p2 - p1);
        }
    }
    return 0;
}
Output, slightly adjusted for formatting:
f lint1 lint2 lint2-lint1
0.828125 17187500894208393216 17187499794696765440 -1.099512e+12
0.890625 10937500768952909824 10937499669441282048 -1.099512e+12
0.914062 8593750447104196608 8593749897348382720 -5.497558e+11
0.945312 5468750384476454912 5468749834720641024 -5.497558e+11
0.957031 4296875223552098304 4296874948674191360 -2.748779e+11
0.972656 2734375192238227456 2734374917360320512 -2.748779e+11
0.978516 2148437611776049152 2148437474337095680 -1.374390e+11
0.986328 1367187596119113728 1367187458680160256 -1.374390e+11
0.989258 1074218805888024576 1074218737168547840 -6.871948e+10
0.993164 683593798059556864 683593729340080128 -6.871948e+10
1.000000 1 0 -1.000000e+00

If you are on a microcontroller without an FPU, then floating point is going to be very expensive: easily twenty times slower per floating-point operation. The fastest solution is to just do all the math using integers.
The number of places after the fixed binary point (see http://blog.credland.net/2013/09/binary-fixed-point-explanation.html?q=fixed+binary+point) is XY_TABLE_FRAC_BITS.
Here's a function I use:
inline uint16_t unsignedInterpolate(uint16_t a, uint16_t b, uint16_t position) {
    uint32_t r1;
    uint16_t r2;

    /*
     * Only one multiply, and one divide/shift right. Shame about having to
     * cast to long int and back again.
     */
    r1 = (uint32_t) position * (b - a);
    r2 = (r1 >> XY_TABLE_FRAC_BITS) + a;
    return r2;
}
With the function inlined it should be approx. 10-20 cycles.
If you've got a 32-bit micro-controller you'll be able to use bigger integers and get larger numbers or more accuracy without compromising performance. This function was used on a 16-bit system.
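For illustration, here is a self-contained usage sketch. The value XY_TABLE_FRAC_BITS = 8 is an assumption made only for this example; with it, position runs from 0 to 256 over the full interpolation range:

#include <stdint.h>
#include <stdio.h>

#define XY_TABLE_FRAC_BITS 8   /* assumed value, for this sketch only */

static inline uint16_t unsignedInterpolate(uint16_t a, uint16_t b,
                                           uint16_t position) {
    uint32_t r1 = (uint32_t) position * (b - a);
    return (uint16_t) ((r1 >> XY_TABLE_FRAC_BITS) + a);
}

int main(void) {
    /* Halfway between 100 and 200: position = 0.5 * 2^8 = 128 */
    printf("%u\n", (unsigned) unsignedInterpolate(100, 200, 128));  /* prints 150 */
    return 0;
}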

If you're coding for a microcontroller without floating-point operations, then it's better not to use floating-point numbers at all, and to use fixed-point arithmetic instead.

Since C++20 you can use std::lerp(), which is likely to be the best possible implementation for your target.

It is worth noting that the standard linear interpolation formulas f1(t) = a + t(b − a), f2(t) = b − (b − a)(1 − t), and f3(t) = a(1 − t) + bt are not guaranteed to be well-behaved when using floating-point arithmetic.
Namely, if a != b, it is not guaranteed that f1(1.0) == b or that f2(0.0) == a, while for a == b, f3(t) is not guaranteed to equal a when 0 < t < 1.
This function has worked for me on processors that support IEEE754 floating point when I need the results to behave well and to hit the endpoints exactly (I use it with double precision, but float should work as well):
double lerp(double a, double b, double t)
{
    if (t <= 0.5)
        return a + (b - a) * t;
    else
        return b - (b - a) * (1.0 - t);
}
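A quick self-contained check (just a sketch, reusing the a = 1e20, b = 1 pair from the earlier answer) that the endpoints are hit exactly:

#include <assert.h>
#include <stdio.h>

double lerp(double a, double b, double t)
{
    if (t <= 0.5)
        return a + (b - a) * t;
    else
        return b - (b - a) * (1.0 - t);
}

int main(void)
{
    double a = 1.0e20, b = 1.0;
    assert(lerp(a, b, 0.0) == a);   /* t = 0 yields a exactly */
    assert(lerp(a, b, 1.0) == b);   /* t = 1 yields b exactly */
    printf("endpoints are exact\n");
    return 0;
}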

If you want the final result to be an integer, it might be faster to use integers for the input as well.
int lerp_int(int a, int b, float f)
{
    //float diff = (float)(b-a);
    //float frac = f*diff;
    //return a + (int)frac;
    return a + (int)(f * (float)(b - a));
}
This does two casts and one float multiply. If a cast is faster than a float add/subtract on your platform, and if an integer answer is useful to you, this might be a reasonable alternative.

Related

C language, methods to perform simple calculations without using floating point numbers

Are there any methods, code snippets or libraries to perform simple calculations (multiplications, divisions, sums, subtractions) without using floating point numbers?
I code in C on an 8-bit MCU without a floating point unit, so floating-point calculations are very slow. I would like to convert all my floating-point calculations to integer calculations.
I can allow float calculations at initialization, for example to calculate integer coefficients from float coefficients.
Scale your integers by using one or two digits as the fractional part. For example, 345 can be interpreted as 3.45. All arithmetic is (almost) exactly the same.
Example:
#include <stdio.h>
#include <limits.h>

int mult(int a, int b)
{
    long long result = ((long long)a * b) / 100;
    if (result > INT_MAX || result < INT_MIN) { /* do something */ }
    return result;
}

int divide(int a, int b)
{
    return b ? ((long long)a * 100) / b : 0;
}

int main(void)
{
    int result = divide(500, 3000);
    printf("%d.%02d\n", result / 100, result % 100);  /* prints 0.16 */
}
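Note that addition and subtraction of two scaled values need no correction at all; only multiplication and division have to compensate for the scale factor as above. A minimal sketch:

/* Two-decimal fixed point: 3.45 + 1.20 is simply 345 + 120 = 465, i.e. 4.65 */
int add_fixed(int a, int b)
{
    return a + b;   /* no rescaling needed */
}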

Why is the function floor giving different results in this case?

In this example, the behaviour of floor differs and I do not understand why:
printf("floor(34000000.535 * 100 + 0.5) : %lf \n", floor(34000000.535 * 100 + 0.5));
printf("floor(33000000.535 * 100 + 0.5) : %lf \n", floor(33000000.535 * 100 + 0.5));
The output for this code is:
floor(34000000.535 * 100 + 0.5) : 3400000053.000000
floor(33000000.535 * 100 + 0.5) : 3300000054.000000
Why is the first result not equal to 3400000054.0, as we might expect?
double in C does not represent every possible number that can be expressed in text.
double can typically represent about 2^64 different numbers. Neither 34000000.535 nor 33000000.535 is in that set when double is encoded as a binary floating-point number. Instead, the closest representable number is used.
Text 34000000.535
closest double 34000000.534999996423...
Text 33000000.535
closest double 33000000.535000000149...
With double as a binary floating-point number, multiplying by a non-power-of-2 like 100.0 can introduce additional rounding differences. Yet in these cases it still results in products, one just above xxx.5 and the other just below.
Adding 0.5, a simple power of 2, does not incur rounding issues, as the value is not extreme compared to 3x00000053.5.
Printing the intermediate results to higher precision shows the typical step-by-step process.
#include <stdio.h>
#include <float.h>
#include <math.h>

void fma_test(double a, double b, double c) {
    int n = DBL_DIG + 3;
    printf("a b c        %.*e %.*e %.*e\n", n, a, n, b, n, c);
    printf("a*b          %.*e\n", n, a*b);
    printf("a*b+c        %.*e\n", n, a*b+c);
    printf("floor(a*b+c) %.*e\n", n, floor(a*b+c));
    puts("");
}

int main(void) {
    fma_test(34000000.535, 100, 0.5);
    fma_test(33000000.535, 100, 0.5);
}
Output
a b c        3.400000053499999642e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b          3.400000053499999523e+09
a*b+c        3.400000053999999523e+09
floor(a*b+c) 3.400000053000000000e+09

a b c        3.300000053500000015e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b          3.300000053500000000e+09
a*b+c        3.300000054000000000e+09
floor(a*b+c) 3.300000054000000000e+09
The issue is more complex than this simple answer suggests, as various platforms can 1) use higher-precision math like long double or 2) rarely, use a decimal floating-point double. So code's results may vary.
This question has already been answered here.
Basically, float numbers are just approximations. If we have a program like this:
float a = 0.2 + 0.3;
float b = 0.25 + 0.25;

if (a == b) {
    // might happen
}
if (a != b) {
    // also might happen
}
The only guaranteed thing is that a-b is relatively small.
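That is why, in practice, floats are compared against a tolerance instead of with ==. A minimal sketch; the threshold 1e-6f is an arbitrary assumption and must be chosen to suit the magnitudes you actually work with:

#include <math.h>
#include <stdbool.h>

/* Sketch: treat a and b as equal when they differ by less than an
   absolute tolerance. The 1e-6f threshold is an assumed example value. */
bool nearly_equal(float a, float b)
{
    return fabsf(a - b) < 1e-6f;
}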

How will you implement pow(a,b) in C ? condition follows --

without using multiplication or division operators.
You can use only add/subtract operators.
A pointless problem, but solvable with the properties of logarithms:
pow(a, b) = exp( b * log(a) )
          = exp( exp( log(b) + log(log(a)) ) )
Take care to ensure that your exponential and logarithm functions are using the same base.
Yes, I know how to use a slide rule. Learning that trick will change your perspective of logarithms.
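A minimal sketch of the trick in C (note the domain restriction: the double-log form requires a > 1 and b > 0 so that both logarithms are defined):

#include <math.h>

/* pow(a, b) via logarithms; the multiply in b * log(a) is itself
   removed by a second level of logs. Valid only for a > 1, b > 0. */
double pow_via_logs(double a, double b)
{
    return exp(exp(log(b) + log(log(a))));
}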
If they are integers, it's simple to turn pow (a, b) into b multiplications of a.
pow(a, b) = a * a * a * a ... ; // do this b times
And simple to turn a * a into additions
a * a = a + a + a + a + ... ; // do this a times
If you combine them, you can make pow.
First, make mult(int a, int b), then use it to make pow.
A recursive solution:
#include <stdio.h>

int multiplication(int a1, int b1)
{
    if (b1)
        return (a1 + multiplication(a1, b1 - 1));
    else
        return 0;
}

int pow(int a, int b)
{
    if (b)
        return multiplication(a, pow(a, b - 1));
    else
        return 1;
}

int main()
{
    printf("\n %d", pow(5, 4));
}
You've already gotten answers purely for FP and purely for integers. Here's one for a FP number raised to an integer power:
double power(double x, int y) {
    double z = 1.0;
    while (y > 0) {
        while (!(y & 1)) {
            y >>= 1;    /* halve the exponent */
            x *= x;     /* square the base */
        }
        --y;
        z = x * z;
    }
    return z;
}
At the moment this uses multiplication. You can implement multiplication using only bit shifts, a few bit comparisons, and addition. For integers it looks like this:
int mul(int x, int y) {
    int result = 0;
    while (y) {
        if (y & 1)
            result += x;   /* add shifted x for each set bit in y */
        x <<= 1;
        y >>= 1;
    }
    return result;
}
Floating point is pretty much the same, except you have to normalize your results -- i.e., in essence, a floating point number is 1) a significand expressed as a (usually fairly large) integer, and 2) a scale factor. If you want to produce normal IEEE floating point numbers a few parts get a bit ugly though -- for example, the scale factor is stored as a "bias" number instead of any of the usual 1's complement, 2's complement, etc., so working with it is clumsy (basically, each operation you subtract off the bias, do the operation, check for overflow, and (assuming it hasn't overflowed) add the bias back on again).
Doing the job without any kind of logical tests sounds (to me) like it probably wasn't really intended. For quite a few computer architecture classes, it's interesting to reduce a problem to primitive operations you can express directly in hardware (e.g., bit shifts, bitwise AND, OR, and NOT, etc.). The implementation shown above fits that reasonably well (if you want to get technical, an adder takes a few gates rather than being a single primitive, but addition is a built-in operation in hardware description languages like VHDL and Verilog anyway).

What is a simple way to find real roots of a (cubic) polynomial?

This seems like an obvious question to me, but I couldn't find it anywhere on SO.
I have a cubic polynomial and I need to find real roots of the function. What is THE way of doing this?
I have found several closed form formulas for roots of a cubic function, but all of them use either complex numbers or lots of goniometric functions and I don't like them (and also don't know which one to choose).
I need something simple; faster is better; and I know that I will eventually need to solve polynomials of higher order, so having a numerical solver would maybe help too.
I know I could use some library to do the hard work for me, but let's say I want to do this as an exercise.
I'm coding in C, so no import magic_poly_solver, please.
Bonus question: How do I find only roots inside a given interval?
For a cubic polynomial there are closed form solutions, but they are not particularly well suited for numerical calculus.
I'd do the following for the cubic case: any cubic polynomial has at least one real root, and you can find it easily with Newton's method. Then use deflation to get the remaining quadratic polynomial to solve; see my answer there for how to do this latter step correctly.
One word of caution: if the discriminant is close to zero, there will be a numerically multiple real root, and Newton's method will fail miserably. Moreover, since in the vicinity of such a root the polynomial behaves like (x - x0)^2, you'll lose half your significant digits (P(x) will be < epsilon as soon as x - x0 < sqrt(epsilon)). So you may want to rule this case out and use the closed form solution there, or solve the derivative polynomial.
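For the well-conditioned case, a minimal Newton sketch (an illustration only: it assumes a != 0 and a simple real root, with no safeguards against the failure modes just described):

#include <math.h>

/* Find one real root of a*x^3 + b*x^2 + c*x + d = 0 by Newton's method. */
double cubic_newton(double a, double b, double c, double d)
{
    double x = -b / (3.0 * a);   /* start at the inflection point */
    for (int i = 0; i < 100; ++i) {
        double p  = ((a * x + b) * x + c) * x + d;    /* P(x), Horner form */
        double dp = (3.0 * a * x + 2.0 * b) * x + c;  /* P'(x) */
        if (dp == 0.0) { x += 1.0; continue; }        /* step off a flat spot */
        double step = p / dp;
        x -= step;
        if (fabs(step) <= 1e-12 * (1.0 + fabs(x)))
            break;
    }
    return x;
}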
If you want to find roots in a given interval, check Sturm's theorem.
A more general (complex) algorithm for generic polynomial solving is the Jenkins-Traub algorithm. This is clearly overkill here, but it works well on cubics. Usually, you use a third-party implementation.
Since you do C, using the GSL is surely your best bet.
Another generic method is to find the eigenvalues of the companion matrix with eg. balanced QR decomposition, or reduction to Householder form. This is the approach taken by GSL.
For solving cubic equations with simple C code, I have found the QBC solver by noted numerics expert professor William Kahan to be sufficiently robust, reasonably fast and reasonably accurate:
William Kahan, "To solve a real cubic equation." PAM-352, Center for Pure and Applied Mathematics, University of California, Berkeley. November 10, 1986.
This uses a derivative-based iterative method to find the real root, reduces to a quadratic equation based on that, finally uses a numerically robust quadratic equation solver to find the two remaining roots. Typically, the iterative solver requires about five to ten iterations to converge to the result. Both solvers can be enhanced for accuracy and performance by judicious use of fused multiply-add (FMA) operations, available in ISO C99 via the fma() standard math function.
Crucial to the accuracy of the quadratic solver is the computation of the discriminant. For this, I use the following code based on recent research:
/* Compute B*B - A*C, accurately.

   Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
   "Further Analysis of Kahan's Algorithm for the Accurate Computation
   of 2x2 Determinants". Mathematics of Computation, Vol. 82, No. 284,
   Oct. 2013, pp. 2245-2264
   https://www.ams.org/journals/mcom/2013-82-284/S0025-5718-2013-02679-8/S0025-5718-2013-02679-8.pdf
*/
double DISC (double A, double B, double C)
{
    double w = C * A;
    double e = fma (-C, A, w);   /* rounding error of C*A, recovered via FMA */
    double f = fma (B, B, -w);   /* B*B - w in a single rounding */
    double r = f + e;
    return r;
}
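As a hypothetical usage sketch (not from Kahan's paper): for A·x² + 2·B·x + C = 0 the expression B*B - A*C is exactly the discriminant, and the larger-magnitude root should be computed first to avoid cancellation:

#include <math.h>

/* Sketch: real roots of A*x^2 + 2*B*x + C = 0. Assumes A != 0,
   DISC(A,B,C) >= 0, and q != 0; x1 follows from Vieta's formulas. */
void solve_quadratic (double A, double B, double C, double *x0, double *x1)
{
    double disc = DISC (A, B, C);                 /* B*B - A*C, accurately */
    double q = -(B + copysign (sqrt (disc), B));  /* no cancellation here */
    *x0 = q / A;   /* larger-magnitude root */
    *x1 = C / q;   /* product of the roots is C/A */
}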
Using double-precision arithmetic, Kahan's solver cannot always produce results accurate to double precision. One of the test cases provided in Kahan's paper illustrates why this is the case:
658x³ - 190125x² + 18311811x - 587898164
Using an arbitrary precision math library, we find that the roots of this cubic equation are as follows:
96.229639346592182_18...
96.357064825184152_07... ± i * 0.069749752043689625_43...
QBC using double-precision arithmetic computes the roots as
96.2296393 50445893
96.35706482 3257289 ± i * 0.0697497 48521837268
The reason for this is that the function evaluation around the real root suffers from errors as large as 60% in the computed function value, preventing the iterative solver from getting closer to the root. By changing the function and derivative evaluation to use double-double computation for intermediate computation (at hefty computational cost), we can address that issue.
/* Data type for double-double computation */
typedef struct {
    double l; // low / tail
    double h; // high / head
} dbldbl;

dbldbl make_dbldbl (double head, double tail);
double get_dbldbl_head (dbldbl a);
double get_dbldbl_tail (dbldbl a);
dbldbl add_dbldbl (dbldbl a, dbldbl b);
dbldbl mul_dbldbl (dbldbl a, dbldbl b);

void EVAL (double X, double A, double B, double C, double D,
           double * restrict Q, double * restrict Qprime,
           double * restrict B1, double * restrict C2)
{
#if USE_DBLDBL_EVAL
    dbldbl AA, BB, CC, DD, XX, AX, TT, UU;
    AA = make_dbldbl (A, 0);
    BB = make_dbldbl (B, 0);
    CC = make_dbldbl (C, 0);
    DD = make_dbldbl (D, 0);
    XX = make_dbldbl (X, 0);
    AX = mul_dbldbl (AA, XX);
    TT = add_dbldbl (AX, BB);
    *B1 = get_dbldbl_head (TT) + get_dbldbl_tail (TT);
    UU = add_dbldbl (mul_dbldbl (TT, XX), CC);
    *C2 = get_dbldbl_head (UU) + get_dbldbl_tail (UU);
    TT = add_dbldbl (mul_dbldbl (add_dbldbl (AX, TT), XX), UU);
    *Qprime = get_dbldbl_head (TT) + get_dbldbl_tail (TT);
    UU = add_dbldbl (mul_dbldbl (UU, XX), DD);
    *Q = get_dbldbl_head (UU) + get_dbldbl_tail (UU);
#else // USE_DBLDBL_EVAL
    *B1 = fma (A, X, B);
    *C2 = fma (*B1, X, C);
    *Qprime = fma (fma (A, X, *B1), X, *C2);
    *Q = fma (*C2, X, D);
#endif // USE_DBLDBL_EVAL
}
/* Construct new dbldbl number. |tail| must be <= 0.5 ulp of |head| */
dbldbl make_dbldbl (double head, double tail)
{
    dbldbl z;
    z.l = tail;
    z.h = head;
    return z;
}

/* Return the head of a double-double number */
double get_dbldbl_head (dbldbl a)
{
    return a.h;
}

/* Return the tail of a double-double number */
double get_dbldbl_tail (dbldbl a)
{
    return a.l;
}

/* Add two dbldbl numbers */
dbldbl add_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl z;
    double e, q, r, s, t, u;
    /* Andrew Thall, "Extended-Precision Floating-Point Numbers for GPU
       Computation." 2006. http://andrewthall.org/papers/df64_qf128.pdf
    */
    q = a.h + b.h;
    r = q - a.h;
    t = (a.h + (r - q)) + (b.h - r);
    s = a.l + b.l;
    r = s - a.l;
    u = (a.l + (r - s)) + (b.l - r);
    t = t + s;
    s = q + t;
    t = (q - s) + t;
    t = t + u;
    z.h = e = s + t;
    z.l = (s - e) + t;
    /* For result of zero or infinity, ensure that tail equals head */
    if (isinf (s)) {
        z.h = s;
        z.l = s;
    }
    if (z.h == 0) {
        z.l = z.h;
    }
    return z;
}

/* Multiply two dbldbl numbers */
dbldbl mul_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl z;
    double e, s, t;
    s = a.h * b.h;
    t = fma (a.h, b.h, -s);
    t = fma (a.l, b.l, t);
    t = fma (a.h, b.l, t);
    t = fma (a.l, b.h, t);
    z.h = e = s + t;
    z.l = (s - e) + t;
    /* For result of zero or infinity, ensure that tail equals head */
    if (isinf (s)) {
        z.h = s;
        z.l = s;
    }
    if (z.h == 0) {
        z.l = z.h;
    }
    return z;
}
The roots computed with the more accurate function and derivative evaluation are:
96.22963934659218 0
96.35706482518415 3 ± i * 0.06974975204 5672006
While the real parts are now accurate to within the limits of double precision, the imaginary parts are still off. The reason is that in this case the quadratic equation is sensitive to minute differences in the coefficients. A one-ulp error in either of them can cause differences of around 10⁻¹¹ in the imaginary part. This could be worked around by representing the coefficients to higher than double precision and using higher-precision computation in the quadratic solver.
If you don't want to use the closed form solutions (or expect polynomials of higher order), the most obvious method is to calculate approximate roots using Newton's method.
Unfortunately, it's hard to predict which root the iteration will converge to; that depends on the starting value.
Also see here.
See Solving quartics and cubics for graphics by D Herbison-Evans, published in Graphics Gems V.
/*******************************************************************************
* FindCubicRoots solves:
* coeff[3] * x^3 + coeff[2] * x^2 + coeff[1] * x + coeff[0] = 0
* returns:
* 3 - 3 real roots
* 1 - 1 real root (2 complex conjugate)
*******************************************************************************/
int FindCubicRoots(const FLOAT coeff[4], FLOAT x[3]);
http://www.realitypixels.com/turk/opensource/index.html#CubicRoots

How can I get the best accurate result?

Given:
unsigned int a, b, c, d;
I want:
d = a * b / c;
and (a * b) may overflow; also (b / c) may equal zero and lose accuracy.
Maybe a cast to 64 bits would get things to work, but I want to know the best way to get the most accurate result in d.
Is there any good solution?
I would either:
Cast to 64 bits, if that will work for your ranges of a, b, and c.
Use an infinite precision library like GMP
Cast to a float or double and back, if you find those results acceptable.
For best accuracy/precision you'll want to do your multiplies before your divides. As you imply, you'll want to use something with twice as many bits as an int:
int64_t d = (int64_t) a * (int64_t) b;
d /= c;
You don't need both casts, but they arguably make it a bit clearer.
Note that if c is small enough, then d can still be bigger than an int. That may or may not be an issue for you. If you're sure it isn't you can cast down to an int at the end.
For your problem as stated, I'd do d = (long long)a * b / c;
No sense in going to float when you only need more bits. No need to redeclare or cast everything. Casting a is enough to promote b and c to larger size in the expression.
Use a float or double: in floating-point arithmetic, division by zero is allowed, and the result will be a positive or negative infinity.
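A short demonstration, assuming IEEE 754 semantics (where dividing a nonzero value by zero is well defined):

#include <stdio.h>

int main(void)
{
    double x = 1.0, y = 0.0;
    printf("%f\n", x / y);    /* prints "inf" on IEEE 754 platforms */
    printf("%f\n", -x / y);   /* prints "-inf" */
    return 0;
}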
You can always do an explicit check for overflow on a * b:
long long e = (long long) a * (long long) b;

if (e <= INT_MAX) {
    d = e / c;
} else {
    d = a * (b / c);
}
Of course this only works for non-negative a, b, c. If they can be negative you'll also have to check against INT_MIN.
[Update] You could also check which of a and b is larger and thus loses less precision when divided by c:
if (a >= b) {
    d = a / c * b;
} else {
    d = a * (b / c);
}
Why not use a float or double? A float (on intel chips) is a 32-bit floating-point number, so you wouldn't necessarily need 64 bits for the operation?
I'd do something along the lines of the following:
if (c) {
    long long t = (long long)a * b;  /* keep the full 64-bit product */
    d = t / c;
}
else {
    // some error code because div by 0 is not allowed
}
