In various contexts, for example for the argument reduction for mathematical functions, one needs to compute (a - K) / (a + K), where a is a positive variable argument and K is a constant. In many cases, K is a power of two, which is the use case relevant to my work. I am looking for efficient ways to compute this quotient more accurately than can be accomplished with the straightforward division. Hardware support for fused multiply-add (FMA) can be assumed, as this operation is provided by all major CPU and GPU architectures at this time, and is available in C/C++ via the functionsfma() and fmaf().
For ease of exploration, I am experimenting with float arithmetic. Since I plan to port the approach to double arithmetic as well, no operations using higher than the native precision of both argument and result may be used. My best solution so far is:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 1 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q, -2.0f*K, m);
e = fmaf (q, -m, t);
q = fmaf (r, e, q);
For arguments a in the interval [K/2, 4.23*K], code above computes the quotient almost correctly rounded for all inputs (maximum error is exceedingly close to 0.5 ulps), provided that K is a power of 2, and there is no overflow or underflow in intermediate results. For K not a power of two, this code is still more accurate than the naive algorithm based on division. In terms of performance, this code can be faster than the naive approach on platforms where the floating-point reciprocal can be computed faster than the floating-point division.
I make the following observation when K = 2n: When the upper bound of the work interval increases to 8*K, 16*K, ... maximum error increases gradually and starts to slowly approximate the maximum error of the naive computation from below. Unfortunately, the same does not appear to be true for the lower bound of the interval. If the lower bound drops to 0.25*K, the maximum error of the improved method above equals the maximum error of the naive method.
Is there a method to compute q = (a - K) / (a + K) that can achieve smaller maximum error (measured in ulp vs the mathematical result) compared to both the naive method and the above code sequence, over a wider interval, in particular for intervals whose lower bound is less than 0.5*K? Efficiency is important, but a few more operations than are used in the above code can likely be tolerated.
In one answer below, it was pointed out that I could enhance accuracy by returning the quotient as an unevaluated sum of two operands, that is, as a head-tail pair q:qlo, i.e. similar to the well-known double-float and double-double formats. In my code above, this would mean changing the last line to qlo = r * e.
This approach is certainly useful, and I had already contemplated its use for an extended-precision logarithm for use in pow(). But it doesn't fundamentally help with the desired widening of the interval on which the enhanced computation provides more accurate quotients. In a particular case I am looking at, I would like to use K=2 (for single precision) or K=4 (for double precision) to keep the primary approximation interval narrow, and the interval for a is roughly [0,28]. The practical problem I am facing is that for arguments < 0.25*K the accuracy of the improved division is not substantially better than with the naive method.
If a is large compared to K, then (a-K)/(a+K) = 1 - 2K / (a + K) will give a good approximation. If a is small compared to K, then 2a / (a + K) - 1 will give a good approximation. If K/2 ≤ a ≤ 2K, then a-K is an exact operation, so doing the division will give a decent result.
One possibility is to track error of m and p into m1 and p1 with classical Dekker/Schewchuk:
m=a-k;
k0=a-m;
a0=k0+m;
k1=k0-k;
a1=a-a0;
m1=a1+k1;
p=a+k;
k0=p-a;
a0=p-k0;
k1=k-k0;
a1=a-a0;
p1=a1+k1;
Then, correct the naive division:
q=m/p;
r0=fmaf(p,-q,m);
r1=fmaf(p1,-q,m1);
r=r0+r1;
q1=r/p;
q=q+q1;
That'll cost you 2 divisions, but should be near half ulp if I didn't screw up.
But these divisions can be replaced by multiplications with inverse of p without any problem, since the first incorrectly rounded division will be compensated by remainder r, and second incorrectly rounded division does not really matter (the last bits of correction q1 won't change anything).
I don't really have an answer (proper floating point error analyses are very tedious) but a few observations:
Fast reciprocal instructions (such as RCPSS) are not as accurate as division, so you may see a reduction in accuracy if using these.
m is computed exactly if a ∈ [0.5×Kb, 21+n×Kb), where Kb is the power of 2 below K (or K itself if K is a power of 2), and n is the number of trailing zeros in the significand of K (i.e. if K is a power of 2, then n=23).
This is similar to a simplified form of the div2 algorithm from Dekker (1971): to expand the range (particularly the lower bound), you'll probably have to incorporate more correction terms from this (i.e. store m as the sum of 2 floats, or use a double).
Since my goal is to merely widen the interval on which accurate results are achieved, rather than to find a solution that works for all possible values of a, making use of double-float arithmetic for all intermediate computation seems too costly.
Thinking some more about the problem, it is clear that the computation of the remainder of the division, e in the code from my question, is the crucial part of achieving more accurate result. Mathematically, the remainder is (a-K) - q * (a+K). In my code, I simply used m to represent (a-K) and represented (a+k) as m + 2*K, as this delivers numerically superior results to the straightforward representation.
With relatively small additional computational cost, (a+K) can be represented as a double-float, that is, a head-tail pair p:plo, which leads to the following modified version of my original code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 2 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mx = fmaxf (a, K);
mn = fminf (a, K);
plo = (mx - p) + mn;
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
q = fmaf (r, e, q);
Testing shows that this delivers nearly correctly rounded results for a in [K/2, 224*K), allowing for a substantial increase to the upper bound of the interval on which accurate results are achieved.
Widening the interval at the lower end requires the more accurate representation of (a-K). We can compute this as a double-float head-tail pair m:mlo, which leads to the following code variant:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 3 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
plo = (a < K) ? ((K - p) + a) : ((a - p) + K);
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
e = e + mlo;
q = fmaf (r, e, q);
Exhaustive testing hows that this delivers nearly correctly rounded results for a in the interval [K/224, K*224). Unfortunately, this comes at a cost of ten additional operations compared to the code in my question, which is a steep price to pay to get the maximum error from around 1.625 ulps with the naive computation down to near 0.5 ulp.
As in my original code from the question, one can express (a+K) in terms of (a-K), thus eliminating the computation of the tail of p, plo. This approach results in the following code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 4 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -2.0f*K, m);
t = fmaf (q, -m, t);
e = fmaf (q - 1.0f, -mlo, t);
q = fmaf (r, e, q);
This turns out to be advantageous if the main focus is decreasing the lower limit of the interval, which is my particular focus as explained in the question. Exhaustive testing of the single-precision case shows that when K=2n nearly correctly rounded results are produced for values of a in the interval [K/224, 4.23*K]. With a total of 14 or 15 operations (depending on whether an architecture supports full predication or just conditional moves), this requires seven to eight more operations than my original code.
Lastly, one might base the residual computation directly on the original variable a to avoid the error inherent in the computation of m and p. This leads to the following code that, for K = 2n, computes nearly correctly rounded results for a in the interval [K/224, K/3):
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 5 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q + 1.0f, -K, a);
e = fmaf (q, -a, t);
q = fmaf (r, e, q);
If you can relax the API to return another variable that models the error, then the solution becomes much simpler:
float foo(float a, float k, float *res)
{
float ret=(a-k)/(a+k);
*res = fmaf(-ret,a+k,a-k)/(a+k);
return ret;
}
This solution only handles truncation error of division, but does not handle the loss of precision of a+k and a-k.
To handle those errors, I think I need to use double precision, or bithack to use fixed point.
Test code is updated to artificially generate non zero least significant bits
in the input
test code
https://ideone.com/bHxAg8
The problem is the addition in (a + K). Any loss of precision in (a + K) is magnified by the division. The problem isn't the division itself.
If the exponents of a and K are the same (almost) no precision is lost, and if the absolute difference between the exponents is greater than the significand size then either (a + K) == a (if a has larger magnitude) or (a + K) == K (if K has larger magnitude).
There is no way to prevent this. Increasing the significand size (e.g. using 80-bit "extended double" on 80x86) only helps widen the "accurate result range" slightly. To understand why, consider smallest + largest (where smallest is the smallest positive denormal a 32-bit floating point number can be). In this case (for 32-bit floats) you'd need a significand size of about 260 bits for the result to avoid precision loss completely. Doing (e.g.) temp = 1/(a + K); result = a * temp - K / temp; won't help much either because you've still got exactly the same (a + K) problem (but it would avoid a similar problem in (a - K)). Also you can't do result = anything / p + anything_error/p_error because division doesn't work like that.
There are only 3 alternatives I can think of to get close to 0.5 ulps for all possible positive values of a that can fit in 32-bit floating point. None are likely to be acceptable.
The first alternative involves pre-computing a lookup table (using "big real number" maths) for every value of a, which (with some tricks) ends up being about 2 GiB for 32-bit floating point (and completely insane for 64-bit floating point). Of course if the range of possible values of a is smaller than "any positive value that can fit in a 32-bit float" the size of the lookup table would be reduced.
The second alternative is to use something else ("big real number") for the calculation at run-time (and convert to/from 32-bit floating point).
The third alternative involves, "something" (I don't know what it's called, but it's expensive). Set the rounding mode to "round to positive infinity" and calculate temp1 = (a + K); if(a < K) temp2 = (a - K); then switch to "round to negative infinity" and calculate if(a >= K) temp2 = (a - K); lower_bound = temp2 / temp1;. Next do a_lower = a and decrease a_lower by the smallest amount possible and repeat the "lower_bound" calculation, and keep doing that until you get a different value for lower_bound, then revert back to the previous value of a_lower. After that you do essentially the same (but opposite rounding modes, and incrementing not decrementing) to determine upper_bound and a_upper (starting with the original value of a). Finally, interpolate, like a_range = a_upper - a_lower; result = upper_bound * (a_upper - a) / a_range + lower_bound * (a - a_lower) / a_range;. Note that you will want to calculate an initial upper and lower bound and skip all of this if they're equal. Also be warned that this is all "in theory, completely untested" and I probably borked it somewhere.
Mainly what I'm saying is that (in my opinion) you should give up and accept that there's nothing that you can do to get close to 0.5 ulp. Sorry.. :)
Related
The inverse hyperbolic function asinh() is closely related to the natural logarithm. I am trying to determine the most accurate way to compute asinh() from the C99 standard math function log1p(). For ease of experimentation, I am limiting myself to IEEE-754 single-precision computation right now, that is I am looking at asinhf() and log1pf(). I intend to re-use the exact same algorithm for double precision computation, i.e. asinh() and log1p(), later.
My primary goal is to minimize ulp error, the secondary goal is to minimize the number of incorrectly rounded results, under the constraint that the improved code would at most be minimally slower than the versions posted below. Any incremental improvement to accuracy, say 0.2 ulp, would be welcome. Adding a couple of FMAs (fused multiply-adds) would be fine, on the other hand I am hoping someone could identify a solution which employs a fast rsqrtf() (reciprocal square root).
The resulting C99 code should lend itself to vectorization, possibly by some minor straightforward transformations. All intermediate computation must occur at the precision of the function argument and result, as any switch to higher precision may have a severe negative performance impact. The code must work correctly both with IEEE-754 denormal support and in FTZ (flush to zero) mode.
So far, I have identified the following two candidate implementations. Note that the code may be easily transformed into a branchless vectorizable version with a single call to log1pf(), but I have not done so at this stage to avoid unnecessary obfuscation.
/* for a >= 0, asinh(a) = log (a + sqrt (a*a+1))
= log1p (a + (sqrt (a*a+1) - 1))
= log1p (a + sqrt1pm1 (a*a))
= log1p (a + (a*a / (1 + sqrt(a*a + 1))))
= log1p (a + a * (a / (1 + sqrt(a*a + 1))))
= log1p (fma (a / (1 + sqrt(a*a + 1)), a, a)
= log1p (fma (1 / (1/a + sqrt(1/a*a + 1)), a, a)
*/
float my_asinhf (float a)
{
float fa, t;
fa = fabsf (a);
#if !USE_RECIPROCAL
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
t = log1pf (t);
}
#else // USE_RECIPROCAL
if (fa > 0x1.0p126f) { // prevent underflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = 1.0f / fa;
t = fmaf (1.0f / (t + sqrtf (fmaf (t, t, 1.0f))), fa, fa);
t = log1pf (t);
}
#endif // USE_RECIPROCAL
return copysignf (t, a); // restore sign
}
With a particular log1pf() implementation that is accurate to < 0.6 ulps, I am observing the following error statistics when testing exhaustively across all 232 possible IEEE-754 single-precision inputs. When USE_RECIPROCAL = 0, the maximum error is 1.49486 ulp, and there are 353,587,822 incorrectly rounded results. With USE_RECIPROCAL = 1, the maximum error is 1.50805 ulp, and there are only 77,569,390 incorrectly rounded results.
In terms of performance, the variant USE_RECIPROCAL = 0 will be faster if reciprocals and full divisions take roughly the same amount of time, but the variant USE_RECIPROCAL = 1 could be faster if very fast reciprocal support is available.
Answers can assume that all basic arithmetic, including FMA (fused multiply-add) is correctly rounded according to IEEE-754 round-to-nearest-or-even mode. In addition, faster, nearly correctly rounded, versions of reciprocal and rsqrtf() may be available, where "nearly correctly rounded" means the maximum ulp error will be limited to something like 0.53 ulps and the overwhelming majority of results, say > 95%, are correctly rounded. Basic arithmetic with directed roundings may be available at no additional cost to performance.
Firstly, you may want to look into the accuracy and speed of your log1pf function: these can vary a bit between libms (I've found the OS X math functions to be fast, the glibc ones to be slower but typically correctly rounded).
Openlibm, based on the BSD libm, which in turn is based on Sun's fdlibm, use multiple approaches by range, but the main bit is the relation:
t = x*x;
w = log1pf(fabsf(x)+t/(one+sqrtf(one+t)));
You may also want to try compiling with the -fno-math-errno option, which disables the old System V error codes for sqrt (IEEE-754 exceptions will still work).
After various additional experiments, I have convinced myself that a simple argument transformation that does not use higher precision than the argument and result cannot achieve a tighter error bound than the one achieved by the first variant in the code I posted.
Since my question is about minimizing the error for the argument transformation which is incurred in addition to the error in log1pf() itself, the most straightforward approach to use for experimentation is to utilize a correctly rounded implementation of that logarithm function. Note that a correctly-rounded implementation is highly unlikely to exist in the context of a high-performance environment. According to the works of J.-M. Muller et. al., to produce accurate single-precision results, x86 extended precision computation should be sufficient, for example
float accurate_log1pf (float a)
{
float res;
__asm fldln2;
__asm fld dword ptr [a];
__asm fyl2xp1;
__asm fst dword ptr [res];
__asm fcompp;
return res;
}
An implementation of asinhf() using the first variant from my question then looks as follows:
float my_asinhf (float a)
{
float fa, s, t;
fa = fabsf (a);
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
t = accurate_log1pf (t);
}
return copysignf (t, a); // restore sign
}
Testing with all 232 IEEE-754 single-precision operands shows that the maximum error of 1.49486070 ulp occurs at ±0x1.ff5022p-9 and there are 353,521,140 incorrectly rounded results. What happens if the entire argument transformation uses double-precision arithmetic? The code changes to
float my_asinhf (float a)
{
float fa, s, t;
fa = fabsf (a);
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
double tt = fa;
tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
t = (float)tt;
t = accurate_log1pf (t);
}
return copysignf (t, a); // restore sign
}
However, the error bound does not improve with this change! The maximum error of 1.49486070 ulp still occurs at ±0x1.ff5022p-9 and there are now 350,971,046 incorrectly rounded results, slightly fewer than before. The issue seems to be that a float operand cannot convey enough information to log1pf() to produce more accurate results. A similar problem occurs when computing sinf() and cosf(). If the reduced argument, represented as a correctly rounded float operand, is passed to the core polynomials, the resulting error in sinf() and cosf() is just a tad under 1.5 ulp, just as we are observing here with my_asinhf().
One solution is to compute the transformed argument to higher than single precision, for example as a double-float operand pair (a useful brief overwiew of double-float techniques can be found in this paper by Andrew Thall). In this case, we can use the additional information to perform linear interpolation on the result, based on the knowledge that the derivative of the logarithm is the reciprocal. This gives us:
float my_asinhf (float a)
{
float fa, s, t;
fa = fabsf (a);
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
double tt = fa;
tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
t = (float)tt; // "head" of double-float
s = (float)(tt - (double)t); // "tail" of double-float
t = fmaf (s, 1.0f / (1.0f + t), accurate_log1pf (t)); // interpolate
}
return copysignf (t, a); // restore sign
}
Exhaustive test of this version indicates that the maximum error has been reduced to 0.99999948 ulp, it occurs at ±0x1.deeea0p-22. There are 349,653,534 incorrectly rounded results. A faithfully-rounded implementation of asinhf() has been achieved.
Unfortunately, the practical utility of this result is limited. Depending on HW platform, the throughput of arithmetic operations on double may only be 1/2 to 1/32 of the throughput of float operations. The double-precision computation can be replaced with double-float computation, but this would incur very significant cost as well. Lastly, my approach here was to use the single-precision implementation as a proving ground for subsequent double-precision work, and many hardware platforms (certainly all the ones I am interested in) do not offer hardware support for a numeric format with higher precision than IEEE-754 binary64 (double precision). Therefore any solution should not require higher-precision arithmetic in intermediate computation.
Since all the troublesome arguments in the case of asinhf() are small in magnitude, one could [partially?] address the accuracy issue by using a polynomial minimax approximation for the region around the origin. As this would create another code branch, it would likely make vectorization more difficult.
Splines (the piecewise cubic polynomial form) can be written as:
s = x - x[k]
y = y[k] + a[k]*s + b[k]*s*s + c[k]*s*s*s
where x[k] < x < x[k+1], the curve passes through each (x[k], y[k]) point, and a,b,c are arrays of coefficients describing the slope and shape. This all works fine in floating point, and there are plenty of ways to calculate a,b,c for different kinds of splines. However...
How can this be approximated in integer arithmetic?
One of the tricky parts is that any approximation should, ideally, be continuous, in other words using x=x[k+1] and the coefficients from the k-th segment, the result should be y[k+1] except for rounding errors. In other words, for a straight segment, y[k+1] == y[k] + a[k]*(x[k+1] - x[k]), and curvy segments only deviate from this in the middle but not at either end. This is guaranteed by construction in the case of floating point, but even a small coefficient change from rounding can throw it off quite a bit.
Another tricky part is that, in general, the magnitude of the higher-order coefficients is much smaller - but not always, esp. not at sharp "corners". It may still make sense to scale them up by the typical size of s to the power of whatever order they are, so they are not rounded of to zero as integers, but that would seem to trade off resolution in curvature for max possible corner sharpness.
First try at an integer version:
y = y[k] + (a[k] + (b[k] + c[k]*s)*s)*s
Then use integer multiply (intended for 16bit values, 32bit arithmetic):
#define q (1<<16)
#define mult(x, y) ((x * y) / q)
y = y[k] + mult(mult(mult(c[k], s) + b[k], s) + a[k], s)
This looks good in theory, but I'm not sure it's the best possible approach, or how to tell systematically what the best possible approach is.
I have came across this problem many time but I am unable to solve it. There would occur some cases or the other which will wrong answer or otherwise the program I write will be too slow. Formally I am talking about calculating
nCk mod p where p is a prime n is a large number, and 1<=k<=n.
What have I tried:
I know the recursive formulation of factorial and then modelling it as a dynamic programming problem, but I feel that it is slow. The recursive formulation is (nCk) + (nCk-1) = (n+1Ck). I took care of the modulus while storing values in array to avoid overflows but I am not sure that just doing a mod p on the result will avoid all overflows as it may happen that one needs to remove.
To compute nCr, there's a simple algorithm based on the rule nCr = (n - 1)C(r - 1) * n / r:
def nCr(n,r):
if r == 0:
return 1
return n * nCr(n - 1, r - 1) // r
Now in modulo arithmetic we don't quite have division, but we have modulo inverses which (when modding by a prime) are just as good
def nCrModP(n, r, p):
if r == 0:
return 1
return n * nCrModP(n - 1, r - 1) * modinv(r, p) % p
Here's one implementation of modinv on rosettacode
Not sure what you mean by "storing values in array", but I assume they array serves as a lookup table while running to avoid redundant calculations to speed things up. This should take care of the speed problem. Regarding the overflows - you can perform the modulo operation at any stage of computation and repeat it as much as you want - the result will be correct.
First, let's work with the case where p is relatively small.
Take the base-p expansions of n and k: write n = n_0 + n_1 p + n_2 p^2 + ... + n_m p^m and k = k_0 + k_1 p + ... + k_m p^m where each n_i and each k_i is at least 0 but less than p. A theorem (which I think is due to Edouard Lucas) states that C(n,k) = C(n_0, k_0) * C(n_1, k_1) * ... * C(n_m, k_m). This reduces to taking a mod-p product of numbers in the "n is relatively small" case below.
Second, if n is relatively small, you can just compute binomial coefficients using dynamic programming on the formula C(n,k) = C(n-1,k-1) + C(n-1,k), reducing mod p at each step. Or do something more clever.
Third, if k is relatively small (and less than p), you should be able to compute n!/(k!(n-k)!) mod p by computing n!/(n-k)! as n * (n-1) * ... * (n-k+1), reducing modulo p after each product, then multiplying by the modular inverses of each number between 1 and k.
I am using the following smoothing function to smooth speed readings from GPS.
void smoothing_init()
{
k = 0;
kalman_init(0.0625, 32, 1.3833094, 0);
}
void kalman_init(double _q, double _r, double _p, double intial_value)
{
q = _q;
r = _r;
p = _p;
x = intial_value;
}
double smoothing_add_sample(double measurement)
{
p = p + q;
k = p / (p + r);
x = x + k * (measurement - x);
p = (1 - k) * p;
return x;
}
However, sometimes this gives me smoothing values 700(the normal range is 0-150) and then going down. I guess it happens when I initialize routine with 0 but immediately receiving reading above 0 (for example 40, 50).
How can I tweak those functions to naturally prevent such spikes, but still be able to smooth data.
The Kalman filter is an estimator, which minimizes the variance (p in your code) of a system state (x) of a linear system. The variance p can be interpreted as a measure of confidence in the value of x.
For each time step k the filter performs two steps:
The prediction step propagates the estimate one time step ahead (in your case according to x[k+1] = x[k] + w[k] where w[k] is zero-mean Gaussian noise with variance q). It means here that the state variance is increased by q.
The filtering step incorporates measurement information (according to measurement[k] = x[k] + v[k] where v[k] is zero-mean Gaussian noise with variance r).
For classic estimation problems p is initialized with a very large value (little confidence in the initial value x). Over time p decreases to a value somewhere around q. Note the p is independent of the measurement, so it only depends on q,r and p[0].
So for the tweaking:
Initialize with a large p, i.e. p/r>>0; the initialization value of x does not really matter.
The quotient r/q adjusts how strong the smoothing is (r/q=0 then there's no smoothing at all).
To choose the magnitude of r,q, you can assume Gaussian noise: Then you have a 95% probability the true value lies in the confidence interval of [x-2*sqrt(p), x+2*sqrt(p)].
I'm looking for implementation of log() and exp() functions provided in C library <math.h>. I'm working with 8 bit microcontrollers (OKI 411 and 431). I need to calculate Mean Kinetic Temperature. The requirement is that we should be able to calculate MKT as fast as possible and with as little code memory as possible. The compiler comes with log() and exp() functions in <math.h>. But calling either function and linking with the library causes the code size to increase by 5 Kilobytes, which will not fit in one of the micro we work with (OKI 411), because our code already consumed ~12K of available ~15K code memory.
The implementation I'm looking for should not use any other C library functions (like pow(), sqrt() etc). This is because all library functions are packed in one library and even if one function is called, the linker will bring whole 5K library to code memory.
EDIT
The algorithm should be correct up to 3 decimal places.
Using Taylor series is not the simplest neither the fastest way of doing this. Most professional implementations are using approximating polynomials. I'll show you how to generate one in Maple (it is a computer algebra program), using the Remez algorithm.
For 3 digits of accuracy execute the following commands in Maple:
with(numapprox):
Digits := 8
minimax(ln(x), x = 1 .. 2, 4, 1, 'maxerror')
maxerror
Its response is the following polynomial:
-1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x
With the maximal error of: 0.000061011436
We generated a polynomial which approximates the ln(x), but only inside the [1..2] interval. Increasing the interval is not wise, because that would increase the maximal error even more. Instead of that, do the following decomposition:
So first find the highest power of 2, which is still smaller than the number (See: What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?). That number is actually the base-2 logarithm. Divide with that value, then the result gets into the 1..2 interval. At the end we will have to add n*ln(2) to get the final result.
An example implementation for numbers >= 1:
float ln(float y) {
int log2;
float divisor, x, result;
log2 = msb((int)y); // See: https://stackoverflow.com/a/4970859/6630230
divisor = (float)(1 << log2);
x = y / divisor; // normalized value between [1.0, 2.0]
result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
result += ((float)log2) * 0.69314718; // ln(2) = 0.69314718
return result;
}
Although if you plan to use it only in the [1.0, 2.0] interval, then the function is like:
float ln(float x) {
return -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
}
The Taylor series for e^x converges extremely quickly, and you can tune your implementation to the precision that you need. (http://en.wikipedia.org/wiki/Taylor_series)
The Taylor series for log is not as nice...
If you don't need floating-point math for anything else, you may compute an approximate fractional base-2 log pretty easily. Start by shifting your value left until it's 32768 or higher and store the number of times you did that in count. Then, repeat some number of times (depending upon your desired scale factor):
n = (mult(n,n) + 32768u) >> 16; // If a function is available for 16x16->32 multiply
count<<=1;
if (n < 32768) n*=2; else count+=1;
If the above loop is repeated 8 times, then the log base 2 of the number will be count/256. If ten times, count/1024. If eleven, count/2048. Effectively, this function works by computing the integer power-of-two logarithm of n**(2^reps), but with intermediate values scaled to avoid overflow.
Would basic table with interpolation between values approach work? If ranges of values are limited (which is likely for your case - I doubt temperature readings have huge range) and high precisions is not required it may work. Should be easy to test on normal machine.
Here is one of many topics on table representation of functions: Calculating vs. lookup tables for sine value performance?
Necromancing.
I had to implement logarithms on rational numbers.
This is how I did it:
Occording to Wikipedia, there is the Halley-Newton approximation method
which can be used for very-high precision.
Using Newton's method, the iteration simplifies to (implementation), which has cubic convergence to ln(x), which is way better than what the Taylor-Series offers.
// Using Newton's method, the iteration simplifies to (implementation)
// which has cubic convergence to ln(x).
public static double ln(double x, double epsilon)
{
double yn = x - 1.0d; // using the first term of the taylor series as initial-value
double yn1 = yn;
do
{
yn = yn1;
yn1 = yn + 2 * (x - System.Math.Exp(yn)) / (x + System.Math.Exp(yn));
} while (System.Math.Abs(yn - yn1) > epsilon);
return yn1;
}
This is not C, but C#, but I'm sure anybody capable to program in C will be able to deduce the C-Code from that.
Furthermore, since
logn(x) = ln(x)/ln(n).
You have therefore just implemented logN as well.
public static double log(double x, double n, double epsilon)
{
return ln(x, epsilon) / ln(n, epsilon);
}
where epsilon (error) is the minimum precision.
Now as to speed, you're probably better of using the ln-cast-in-hardware, but as I said, I used this as a base to implement logarithms on a rational numbers class working with arbitrary precision.
Arbitrary precision might be more important than speed, under certain circumstances.
Then, use the logarithmic identities for rational numbers:
logB(x/y) = logB(x) - logB(y)
In addition to Crouching Kitten's answer which gave me inspiration, you can build a pseudo-recursive (at most 1 self-call) logarithm to avoid using polynomials. In pseudo code
ln(x) :=
If (x <= 0)
return NaN
Else if (!(1 <= x < 2))
return LN2 * b + ln(a)
Else
return taylor_expansion(x - 1)
This is pretty efficient and precise since on [1; 2) the taylor series converges A LOT faster, and we get such a number 1 <= a < 2 with the first call to ln if our input is positive but not in this range.
You can find 'b' as your unbiased exponent from the data held in the float x, and 'a' from the mantissa of the float x (a is exactly the same float as x, but now with exponent biased_0 rather than exponent biased_b). LN2 should be kept as a macro in hexadecimal floating point notation IMO. You can also use http://man7.org/linux/man-pages/man3/frexp.3.html for this.
Also, the trick
unsigned long tmp = *(ulong*)(&d);
for "memory-casting" double to unsigned long, rather than "value-casting", is very useful to know when dealing with floats memory-wise, as bitwise operators will cause warnings or errors depending on the compiler.
Possible computation of ln(x) and expo(x) in C without <math.h> :
static double expo(double n) {
int a = 0, b = n > 0;
double c = 1, d = 1, e = 1;
for (b || (n = -n); e + .00001 < (e += (d *= n) / (c *= ++a)););
// approximately 15 iterations
return b ? e : 1 / e;
}
static double native_log_computation(const double n) {
// Basic logarithm computation.
static const double euler = 2.7182818284590452354 ;
unsigned a = 0, d;
double b, c, e, f;
if (n > 0) {
for (c = n < 1 ? 1 / n : n; (c /= euler) > 1; ++a);
c = 1 / (c * euler - 1), c = c + c + 1, f = c * c, b = 0;
for (d = 1, c /= 2; e = b, b += 1 / (d * c), b - e/* > 0.0000001 */;)
d += 2, c *= f;
} else b = (n == 0) / 0.;
return n < 1 ? -(a + b) : a + b;
}
static inline double native_ln(const double n) {
// Returns the natural logarithm (base e) of N.
return native_log_computation(n) ;
}
static inline double native_log_base(const double n, const double base) {
// Returns the logarithm (base b) of N.
return native_log_computation(n) / native_log_computation(base) ;
}
Try it Online
Building off #Crouching Kitten's great natural log answer above, if you need it to be accurate for inputs <1 you can add a simple scaling factor. Below is an example in C++ that i've used in microcontrollers. It has a scaling factor of 256 and it's accurate to inputs down to 1/256 = ~0.04, and up to 2^32/256 = 16777215 (due to overflow of a uint32 variable).
It's interesting to note that even on an STMF103 Arm M3 with no FPU, the float implementation below is significantly faster (eg 3x or better) than the 16 bit fixed-point implementation in libfixmath (that being said, this float implementation still takes a few thousand cycles so it's still not ~fast~)
#include <float.h>
float TempSensor::Ln(float y)
{
// Algo from: https://stackoverflow.com/a/18454010
// Accurate between (1 / scaling factor) < y < (2^32 / scaling factor). Read comments below for more info on how to extend this range
float divisor, x, result;
const float LN_2 = 0.69314718; //pre calculated constant used in calculations
uint32_t log2 = 0;
//handle if input is less than zero
if (y <= 0)
{
return -FLT_MAX;
}
//scaling factor. The polynomial below is accurate when the input y>1, therefore using a scaling factor of 256 (aka 2^8) extends this to 1/256 or ~0.04. Given use of uint32_t, the input y must stay below 2^24 or 16777216 (aka 2^(32-8)), otherwise uint_y used below will overflow. Increasing the scaing factor will reduce the lower accuracy bound and also reduce the upper overflow bound. If you need the range to be wider, consider changing uint_y to a uint64_t
const uint32_t SCALING_FACTOR = 256;
const float LN_SCALING_FACTOR = 5.545177444; //this is the natural log of the scaling factor and needs to be precalculated
y = y * SCALING_FACTOR;
uint32_t uint_y = (uint32_t)y;
while (uint_y >>= 1) // Convert the number to an integer and then find the location of the MSB. This is the integer portion of Log2(y). See: https://stackoverflow.com/a/4970859/6630230
{
log2++;
}
divisor = (float)(1 << log2);
x = y / divisor; // FInd the remainder value between [1.0, 2.0] then calculate the natural log of this remainder using a polynomial approximation
result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x; //This polynomial approximates ln(x) between [1,2]
result = result + ((float)log2) * LN_2 - LN_SCALING_FACTOR; // Using the log product rule Log(A) + Log(B) = Log(AB) and the log base change rule log_x(A) = log_y(A)/Log_y(x), calculate all the components in base e and then sum them: = Ln(x_remainder) + (log_2(x_integer) * ln(2)) - ln(SCALING_FACTOR)
return result;
}