Fixing a Kalman smoothing function in C

I am using the following smoothing function to smooth speed readings from GPS.
// q, r, p, x and k are globals: process noise variance, measurement noise
// variance, estimation error variance, state estimate and Kalman gain.
double q, r, p, x, k;

void kalman_init(double _q, double _r, double _p, double initial_value);

void smoothing_init()
{
    k = 0;
    kalman_init(0.0625, 32, 1.3833094, 0);
}

void kalman_init(double _q, double _r, double _p, double initial_value)
{
    q = _q;
    r = _r;
    p = _p;
    x = initial_value;
}

double smoothing_add_sample(double measurement)
{
    p = p + q;                     // prediction: grow the error variance
    k = p / (p + r);               // Kalman gain
    x = x + k * (measurement - x); // update the estimate with the measurement
    p = (1 - k) * p;               // update the error variance
    return x;
}
However, this sometimes gives me smoothed values around 700 (the normal range is 0-150) which then decay back down. I guess it happens when I initialize the routine with 0 but immediately receive readings above 0 (for example 40 or 50).
How can I tweak these functions to naturally prevent such spikes but still be able to smooth the data?

The Kalman filter is an estimator which minimizes the variance (p in your code) of the estimate of a linear system's state (x). The variance p can be interpreted as a measure of confidence in the value of x.
For each time step k the filter performs two steps:
The prediction step propagates the estimate one time step ahead (in your case according to x[k+1] = x[k] + w[k] where w[k] is zero-mean Gaussian noise with variance q). It means here that the state variance is increased by q.
The filtering step incorporates measurement information (according to measurement[k] = x[k] + v[k] where v[k] is zero-mean Gaussian noise with variance r).
For classic estimation problems p is initialized with a very large value (little confidence in the initial value x). Over time p decreases to a value somewhere around q. Note that p is independent of the measurements, so it only depends on q, r and p[0].
So for the tweaking:
Initialize with a large p, i.e. p/r >> 1; the initialization value of x does not really matter then (see the sketch after these points).
The quotient r/q adjusts how strong the smoothing is (with r/q = 0 there is no smoothing at all).
To choose the magnitude of r,q, you can assume Gaussian noise: Then you have a 95% probability the true value lies in the confidence interval of [x-2*sqrt(p), x+2*sqrt(p)].
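A minimal sketch of these tweaks, reusing the question's functions (the initial p of 1000 is an illustrative assumption; anything with p/r >> 1 will do, and q and r are kept from the question):

void smoothing_init()
{
    k = 0;
    /* Large initial p: little confidence in x, so the first sample pulls the
       estimate almost all the way to the measurement and the start-up spike
       disappears. The ratio r/q still controls the smoothing strength. */
    kalman_init(0.0625, 32, 1000.0, 0);
}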

Related

Endless sine generation in C

I am working on a project which incorporates computing a sine wave as input for a control loop.
The sine wave has a frequency of 280 Hz, and the control loop runs every 30 µs and everything is written in C for an Arm Cortex-M7.
At the moment we are simply doing:
double time;

void control_loop() {
    time += 30e-6;
    double sine = sin(2 * M_PI * 280 * time);
    ...
}
Two problems/questions arise:
When running for a long time, time becomes bigger. Suddenly there is a point where the computation time for the sine function increases drastically (see image). Why is this? How are these functions usually implemented? Is there a way to circumvent this (without noticeable precision loss) as speed is a huge factor for us? We are using sin from math.h (Arm GCC).
How can I deal with time in general? When running for a long time, the variable time will inevitably reach the limits of double precision. Even using a counter time = counter++ * 30e-6; only improves this, but it does not solve it. As I am certainly not the first person who wants to generate a sine wave for a long time, there must be some ideas/papers/... on how to implement this fast and precise.
Instead of calculating sine as a function of time, maintain a sine/cosine pair and advance it through complex number multiplication. This doesn't require any trigonometric functions or lookup tables; only four multiplies and an occasional re-normalization:
// Note: C requires constant expressions for static initializers, so depending
// on the compiler, dx and dy may have to be computed once at startup instead.
static const double a = 2 * M_PI * 280 * 30e-6;
static const double dx = cos(a);
static const double dy = sin(a);

double x = 1, y = 0; // complex x + iy
int counter = 0;

void control_loop() {
    double xx = dx*x - dy*y;
    double yy = dx*y + dy*x;
    x = xx, y = yy;
    // renormalize once in a while, based on
    // https://www.gamedev.net/forums/topic.asp?topic_id=278849
    if((counter++ & 0xff) == 0) {
        double d = 1 - (x*x + y*y - 1)/2;
        x *= d, y *= d;
    }
    double sine = y; // this is your sine
}
The frequency can be adjusted, if needed, by recomputing dx, dy.
Additionally, all the operations here can be done, rather easily, in fixed point.
Rationality
As #user3386109 points out below (+1), 280 * 30e-6 = 21 / 2500 is a rational number, so the sine should loop around after exactly 2500 samples. We can combine this method with theirs by resetting our generator (x=1, y=0) every 2500 iterations (or 5000, or 10000, etc...). This would eliminate the need for renormalization, as well as get rid of any long-term phase inaccuracies. (A sketch of the combination follows below.)
(Technically any floating point number is a dyadic rational. However 280 * 30e-6 doesn't have an exact representation in binary. Yet, by resetting the generator as suggested, we'll get an exactly periodic sine as intended.)
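A minimal sketch of that combination (function and variable names are illustrative; dx and dy are computed at startup rather than as static initializers):

#include <math.h>

static double dx, dy;            // rotation per 30 µs step: cos(a), sin(a)
static double x = 1.0, y = 0.0;  // current cos/sin pair
static int n = 0;                // samples since the last reset

void sine_init(void)
{
    const double a = 2 * M_PI * 280 * 30e-6;
    dx = cos(a);
    dy = sin(a);
}

double next_sine(void)
{
    double xx = dx * x - dy * y;
    double yy = dx * y + dy * x;
    x = xx, y = yy;
    if (++n == 2500) {           // exactly 21 cycles: the phase is 0 again
        x = 1.0; y = 0.0; n = 0; // hard reset: no renormalization, no drift
    }
    return y;
}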
Explanation
Some requested an explanation down in the comments of why this works. The simplest explanation is to use the angle sum trigonometric identities:
xx = cos((n+1)*a) = cos(n*a)*cos(a) - sin(n*a)*sin(a) = x*dx - y*dy
yy = sin((n+1)*a) = sin(n*a)*cos(a) + cos(n*a)*sin(a) = y*dx + x*dy
and the correctness follows by induction.
This is essentially de Moivre's formula if we view those sine/cosine pairs as complex numbers, in accordance with Euler's formula.
A more insightful way might be to look at it geometrically. Complex multiplication by exp(ia) is equivalent to rotation by a radians. Therefore, by repeatedly multiplying by dx + idy = exp(ia), we incrementally rotate our starting point 1 + 0i along the unit circle. The y coordinate, according to Euler's formula again, is the sine of the current phase.
Normalization
While the phase continues to advance with each iteration, the magnitude (aka norm) of x + iy drifts away from 1 due to round-off errors. However, we're interested in generating a sine of amplitude 1, thus we need to normalize x + iy to compensate for numeric drift. The straightforward way is, of course, to divide it by its own norm:
double d = 1/sqrt(x*x + y*y);
x *= d, y *= d;
This requires a calculation of a reciprocal square root. Even though we normalize only once every X iterations, it'd still be cool to avoid it. Fortunately |x + iy| is already close to 1, thus we only need a slight correction to keep it at bay. Expanding the expression for d around 1 (first order Taylor approximation), we get the formula that's in the code:
d = 1 - (x*x + y*y - 1)/2
TODO: to fully understand the validity of this approximation one needs to prove that it compensates for round-off errors faster than they accumulate -- and thus get a bound on how often it needs to be applied.
The function can be rewritten as
double n;

void control_loop() {
    n += 1;
    double sine = sin(2 * M_PI * 280 * 30e-6 * n);
    ...
}
That does exactly the same thing as the code in the question, with exactly the same problems. But it can now be simplified:
280 * 30e-6 = 280 * 30 / 1000000 = 21 / 2500 = 8.4e-3
Which means that when n reaches 2500, you've output exactly 21 cycles of the sine wave. Which means that you can set n back to 0.
The resulting code is:
int n;

void control_loop() {
    n += 1;
    if (n == 2500)
        n = 0;
    double sine = sin(2 * M_PI * 8.4e-3 * n);
    ...
}
As long as your code can run for 21 cycles without problems, it'll run forever without problems.
I'm rather shocked at the existing answers. The first problem you detect is easily solved, and the next problem magically disappears when you solve the first problem.
You need a basic understanding of math to see how it works. Recall that sin(x + 2*pi) is just sin(x), mathematically. The large increase in computation time you see happens when your sin() implementation switches to another (slower) algorithm for large arguments, and you really want to avoid that.
Remember that float has only about 6 significant decimal digits. 100000.0f*M_PI+x uses those 6 digits for 100000.0f*M_PI, so there's nothing left for x.
So, the easiest solution is to keep track of the phase x yourself. At t=0 you initialize x to 0.0f. Every 30 µs, you increment x += 2 * M_PI * 280 * 30e-6;. The time does not appear in this formula! Finally, if x > 2*M_PI, you decrement x -= 2*M_PI; (since sin(x) == sin(x - 2*pi)).
You now have an x that stays nicely in the range 0 to 6.2832, where sin is fast and the 6 digits of precision are all useful.
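A minimal sketch of this phase-accumulator approach (double is used to match the question's code; names are illustrative):

#include <math.h>

static double phase = 0.0;                          // current angle, kept in [0, 2*pi)
static const double step = 2 * M_PI * 280 * 30e-6;  // phase advance per 30 µs tick

void control_loop(void) {
    double sine = sin(phase);   // the argument stays small, so sin() stays on its fast path
    phase += step;
    if (phase >= 2 * M_PI)      // sin(x) == sin(x - 2*pi)
        phase -= 2 * M_PI;
    /* ... use sine ... */
}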
How to generate a lovely sine.
The DAC is 12 bits, so you have only 4096 levels. It makes no sense to send more than 4096 samples per period. In real life you will need far fewer samples to generate a good quality waveform.
Create a C file with the lookup table (using your PC) and redirect the program's output to a file (https://helpdeskgeek.com/how-to/redirect-output-from-command-line-to-text-file/).
#include <stdio.h>
#include <math.h>

#define STEP ((2*M_PI) / 4096.0)

int main(void)
{
    double alpha = 0;
    printf("#include <stdint.h>\nconst uint16_t sine[4096] = {\n");
    for(int x = 0; x < 4096 / 16; x++)
    {
        for(int y = 0; y < 16; y++)
        {
            printf("%d, ", (int)(4095 * (sin(alpha) + 1.0) / 2.0));
            alpha += STEP;
        }
        printf("\n");
    }
    printf("};\n");
}
https://godbolt.org/z/e899d98oW
Configure the timer to trigger the overflow 4096*280 = 1146880 times per second. Set the timer to generate the DAC trigger event. For a 180 MHz timer clock this will not be exact and the frequency will be 279.906449045 Hz. If you need better precision, change the number of samples to match your timer frequency and/or change the timer clock frequency (H7 timers can run up to 480 MHz).
Configure DAC to use DMA and transfer the value from the lookup table created in the step 1 to the DAC on the trigger event.
Enjoy the beautiful sine wave on your oscilloscope. Note that your microcontroller core will not be loaded at all; you will have it for other tasks. If you want to change the period, simply reconfigure the timer. You can do it as many times per second as you wish. To reconfigure the timer use the timer DMA burst mode, which will reload the PSC & ARR registers on the update event automatically without disturbing the generated waveform.
I know it is advanced STM32 programming and it will require register level programming. I use it to generate complex waveforms in our devices.
It is the correct way of doing it. No control loops, no calculations, no core load.
I'd like to address the embedded programming issues in your code directly - #0___________'s answer is the correct way to do this on a microcontroller and I won't retread the same ground.
Variables representing time should never be floating point. If your increment is not a power of two, errors will always accumulate. Even if it is, eventually your increment will become smaller than the smallest representable increment at that magnitude and the timer will stop advancing. Always use integers for time. You can pick an integer size big enough to ignore rollover: an unsigned 32 bit integer representing milliseconds will take 50 days to roll over, while an unsigned 64 bit integer will take over 500 million years.
Generating any periodic signal where you do not care about the signal's phase does not require a time variable. Instead, you can keep an internal counter which resets to 0 at the end of a period. (When you use DMA with a look-up table, that's exactly what you're doing - the counter is the DMA controller's next-read pointer.)
Whenever you use a transcendental function such as sine in a microcontroller, your first thought should be "can I use a look-up table for this?" You don't have access to the luxury of a modern operating system optimally shuffling your load around on a 4 GHz+ multi-core processor. You're often dealing with a single thread that will stall waiting for your 200 MHz microcontroller to bring the FPU out of standby and perform the approximation algorithm. There is a significant cost to transcendental functions. There's a cost to LUTs too, but if you're hitting the function constantly, there's a good chance you'll like the tradeoffs of the LUT a lot better.
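A hypothetical sketch combining these three points (SINE_LUT and its length are assumed to come from a table generated offline, e.g. like the one shown earlier):

#include <stdint.h>

#define SAMPLES_PER_PERIOD 2500u          // 21 cycles of 280 Hz at 30 µs per tick

extern const int16_t SINE_LUT[SAMPLES_PER_PERIOD];

static uint32_t ticks;                    // monotonically increasing integer time
static uint32_t idx;                      // position within the current period

void control_loop(void)
{
    ticks++;                              // integer time never loses resolution
    int16_t sine = SINE_LUT[idx];
    if (++idx == SAMPLES_PER_PERIOD)      // phase does not matter, so just wrap
        idx = 0;
    /* ... use sine ... */
}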
As noted in some of the comments, the time value is continually growing with time. This poses two problems:
The sin function likely has to perform a modulus internally to get the internal value into a supported range.
The resolution of time will become worse and worse as the value increases, due to adding on higher digits.
Making the following changes should improve the performance:
double time;

void control_loop() {
    time += 30.0e-6;
    if((1.0/280.0) < time)
    {
        time -= 1.0/280.0;
    }
    double sine = sin(2 * M_PI * 280 * time);
    ...
}
Note that once this change is made, time no longer represents the total elapsed time; it is only the offset within the current 280 Hz period.
Use a look-up table. Your comment in the discussion with Eugene Sh.:
A small deviation from the sine frequency (like 280.1Hz) would be ok.
In that case, with a control interval of 30 µs, if you have a table of 119 samples that you repeat over and over, you will get a sine wave of 280.112 Hz. Since you have a 12-bit DAC, you only need 119 * 2 = 238 bytes to store this if you would output it directly to the DAC. If you use it as input for further calculations like you mention in the comments, you can store it as float or double as desired. On an MCU with embedded static RAM, it only takes a few cycles at most to load from memory.
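A hypothetical sketch of such a 119-sample table (names and the DAC-code scaling are illustrative assumptions):

#include <math.h>
#include <stdint.h>

#define N_SAMPLES 119   // 119 samples at 30 µs each: one period of ~280.112 Hz

static uint16_t sine_table[N_SAMPLES];   // 12-bit DAC codes, 238 bytes total
static unsigned idx;

void table_init(void)
{
    for (int i = 0; i < N_SAMPLES; i++)
        sine_table[i] = (uint16_t)(4095 * (sin(2 * M_PI * i / N_SAMPLES) + 1.0) / 2.0);
}

uint16_t next_sample(void)
{
    uint16_t v = sine_table[idx];
    if (++idx == N_SAMPLES)
        idx = 0;
    return v;
}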
If you have a few kilobytes of memory available, you can eliminate this problem completely with a lookup table.
With a sampling period of 30 µs, 2500 samples will have a total duration of 75 ms. This is exactly equal to the duration of 21 cycles at 280 Hz.
I haven't tested or compiled the following code, but it should at least demonstrate the approach:
#include <math.h>
#include <stdlib.h>

double sin2500() {
    static double *table = NULL;
    static int n = 2499;
    if (!table) {
        table = malloc(2500 * sizeof(double));
        for (int i = 0; i < 2500; i++) table[i] = sin(2 * M_PI * 280 * i * 30e-06);
    }
    n = (n + 1) % 2500;
    return table[n];
}
How about a variant of others' modulo-based concept:
int t = 0;
int divisor = 1000000;

void control_loop() {
    t += 30 * 280;
    if (t > divisor) t -= divisor;
    double sine = sin(2 * M_PI * t / (double)divisor);
    ...
}
It calculates the modulo in integer arithmetic, so it introduces no roundoff errors.
There is an alternative approach to calculating a series of values of sine (and cosine) for angles that increase by some very small amount. It essentially devolves down to calculating the X and Y coordinates of a circle, and then dividing the Y value by some constant to produce the sine, and dividing the X value by the same constant to produce the cosine.
If you are content to generate a "very round ellipse", you can use the following hack, which is attributed to Marvin Minsky in the 1960s. It's much faster than calculating sines and cosines, although it introduces a very small error into the series. Here is an extract from the HAKMEM document, Item 149, where the Minsky circle algorithm is outlined.
ITEM 149 (Minsky): CIRCLE ALGORITHM
Here is an elegant way to draw almost circles on a point-plotting display:
NEW X = OLD X - epsilon * OLD Y
NEW Y = OLD Y + epsilon * NEW(!) X
This makes a very round ellipse centered at the origin with its size determined by the initial point. epsilon determines the angular velocity of the circulating point, and slightly affects the eccentricity. If epsilon is a power of 2, then we don't even need multiplication, let alone square roots, sines, and cosines! The "circle" will be perfectly stable because the points soon become periodic.
The circle algorithm was invented by mistake when I tried to save one register in a display hack! Ben Gurley had an amazing display hack using only about six or seven instructions, and it was a great wonder. But it was basically line-oriented. It occurred to me that it would be exciting to have curves, and I was trying to get a curve display hack with minimal instructions.
Here is a link to the hakmem: http://inwap.com/pdp10/hbaker/hakmem/hacks.html
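For reference, a minimal C sketch of the quoted recurrence (the starting point and epsilon are illustrative; a power-of-two epsilon lets fixed-point code replace the multiplications with shifts):

static double cx = 1.0, cy = 0.0;    // the initial point sets the size of the "circle"
static const double eps = 1.0 / 64;  // angular step per iteration; power of two as suggested

void minsky_step(void)
{
    cx = cx - eps * cy;   // NEW X = OLD X - epsilon * OLD Y
    cy = cy + eps * cx;   // NEW Y = OLD Y + epsilon * NEW(!) X
}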
I think it would be possible to use a modulo because sin() is periodic.
Then you don’t have to worry about the problems.
double time = 0;
long unsigned int timesteps = 0;
double sine;

void control_loop()
{
    timesteps++;
    time += 30e-6;
    if( time > 1 )
    {
        time -= 1;
    }
    sine = sin( 2 * M_PI * 280 * time );
    ...
}
Fascinating thread. Minsky's algorithm mentioned in Walter Mitty's answer reminded me of a method for drawing circles that was published in Electronics & Wireless World and that I kept. (Credit: https://www.electronicsworld.co.uk/magazines/). I'm attaching it here for interest.
However, for my own similar projects (for audio synthesis) I use a lookup table, with enough points that linear interpolation is accurate enough (do the math(s)!)

Simple integration that depends on floating point equality

I have the following very-crude integration calculator:
// definite integrate on one variable
// using basic trapezoid approach
float integrate(float start, float end, float step, float (*func)(float x))
{
if (start >= (end-step))
return 0;
else {
float x = start; // make it a bit more math-like
float segment = step * (func(x) + func(x+step))/2;
return segment + integrate(x+step, end, step, func);
}
}
And an example usage:
#include <stdio.h>

static float square(float x) {return x*x;}

int main(void)
{
    // Integral of x^2 from 0->2 should be ~ 2.67
    float start=0.0, end=2.0, step=0.01;
    float answer = integrate(start, end, step, square);
    printf("The integral from %.2f to %.2f for X^2 = %.2f\n", start, end, answer );
}
$ run
The integral from 0.00 to 2.00 for X^2 = 2.67
What happens if the comparison start >= (end-step) doesn't work as expected? For example, if it evaluates something to 2.99997 instead of 3 and so does one extra loop iteration (or one too few). Is there a way to prevent that, or do most math-type calculators just work in decimals or some extension to the 'normal' floating points?
If you are given step, one way to write a loop (and you should use a loop for this, not recursion) is:
float x;
for (float i = 0; (x = start + i*step) < end - step/2; ++i)
…
Some points about this:
We keep an integer count with i. As long as there are a reasonable number of steps, there will be no floating-point rounding error in this. (We could make i an int, but float can count integer values perfectly well, and using float avoids an int-to-float conversion in i*step.)
Instead of incrementing x (or start as it is passed by recursion) repeatedly, we recalculate it each time as start + i*step. This has only two possible rounding errors, in the multiplication and in the addition, so it avoids accumulating errors over repeated additions.
We use end - step/2 as the threshold. This allows us to catch the desired endpoint even if the calculated x drifts as far as step/2 away from end. And that is about the best we can do, because if it drifts farther than half a step away from the ideally spaced points, we cannot tell whether it has drifted +step/2 from end-step or -step/2 from end.
This presumes that step is an integer division of end-start, or pretty close to it, so that there are a whole number of steps in the loop. If it is not, the loop should be redesigned a bit to stop one step earlier and then calculate a step of partial width at the end.
At the beginning, I mentioned being given step. An alternative is you might be given a number of steps to use, and then the step width would be calculated from that. In that case, we would use an integer number of steps to control the loop. The loop termination condition would not involve floating-point rounding at all. We could calculate x as (float) i / NumberOfSteps * (end-start) + start.
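A hypothetical sketch of that alternative, where the caller supplies the number of steps instead of a step width (the name integrate_n is illustrative):

float integrate_n(float start, float end, int num_steps, float (*func)(float))
{
    float sum = 0.0f;
    for (int i = 0; i < num_steps; i++) {
        // Recompute both edges from i so rounding errors do not accumulate.
        float x0 = (float) i       / num_steps * (end - start) + start;
        float x1 = (float) (i + 1) / num_steps * (end - start) + start;
        sum += (x1 - x0) * (func(x0) + func(x1)) / 2;   // trapezoid on [x0, x1]
    }
    return sum;
}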
Two improvements can be made easily.
Using recursion is a bad idea. Each additional call creates a new stack frame. For a sufficiently large number of steps, you will trigger a Stack Overflow. Use a loop instead.
Normally, you would avoid the rounding problem by using start, end and n, the number of steps. The location of the kth interval would be at start + k * (end - start) / n;
So you could rewrite your function as
float integrate(float start, float end, int n, float (*func)(float x))
{
    float next = start;
    float sum = 0.0f;
    for(int k = 0; k < n; k++) {
        float x = next;
        // right edge of the k-th interval, recomputed from k to avoid accumulating error
        next = start + (k + 1) * (end - start) / n;
        sum += 0.5f * (next - x) * (func(x) + func(next));
    }
    return sum;
}

Efficiently computing (a - K) / (a + K) with improved accuracy

In various contexts, for example for the argument reduction for mathematical functions, one needs to compute (a - K) / (a + K), where a is a positive variable argument and K is a constant. In many cases, K is a power of two, which is the use case relevant to my work. I am looking for efficient ways to compute this quotient more accurately than can be accomplished with the straightforward division. Hardware support for fused multiply-add (FMA) can be assumed, as this operation is provided by all major CPU and GPU architectures at this time, and is available in C/C++ via the functions fma() and fmaf().
For ease of exploration, I am experimenting with float arithmetic. Since I plan to port the approach to double arithmetic as well, no operations using higher than the native precision of both argument and result may be used. My best solution so far is:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 1 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q, -2.0f*K, m);
e = fmaf (q, -m, t);
q = fmaf (r, e, q);
For arguments a in the interval [K/2, 4.23*K], the code above computes the quotient almost correctly rounded for all inputs (maximum error is exceedingly close to 0.5 ulps), provided that K is a power of 2, and there is no overflow or underflow in intermediate results. For K not a power of two, this code is still more accurate than the naive algorithm based on division. In terms of performance, this code can be faster than the naive approach on platforms where the floating-point reciprocal can be computed faster than the floating-point division.
I make the following observation when K = 2^n: When the upper bound of the work interval increases to 8*K, 16*K, ..., the maximum error increases gradually and slowly approaches the maximum error of the naive computation from below. Unfortunately, the same does not appear to be true for the lower bound of the interval. If the lower bound drops to 0.25*K, the maximum error of the improved method above equals the maximum error of the naive method.
Is there a method to compute q = (a - K) / (a + K) that can achieve smaller maximum error (measured in ulp vs the mathematical result) compared to both the naive method and the above code sequence, over a wider interval, in particular for intervals whose lower bound is less than 0.5*K? Efficiency is important, but a few more operations than are used in the above code can likely be tolerated.
In one answer below, it was pointed out that I could enhance accuracy by returning the quotient as an unevaluated sum of two operands, that is, as a head-tail pair q:qlo, i.e. similar to the well-known double-float and double-double formats. In my code above, this would mean changing the last line to qlo = r * e.
This approach is certainly useful, and I had already contemplated its use for an extended-precision logarithm for use in pow(). But it doesn't fundamentally help with the desired widening of the interval on which the enhanced computation provides more accurate quotients. In a particular case I am looking at, I would like to use K=2 (for single precision) or K=4 (for double precision) to keep the primary approximation interval narrow, and the interval for a is roughly [0,28]. The practical problem I am facing is that for arguments < 0.25*K the accuracy of the improved division is not substantially better than with the naive method.
If a is large compared to K, then (a-K)/(a+K) = 1 - 2K/(a+K) will give a good approximation. If a is small compared to K, then 2a/(a+K) - 1 will give a good approximation. If K/2 ≤ a ≤ 2K, then a-K is an exact operation (Sterbenz lemma), so doing the division will give a decent result.
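A hypothetical sketch of that case split (the crossover thresholds 2*K and K/2 simply follow the answer; whether the result meets the accuracy goal over the whole range would still need testing):

float ratio(float a, float K)
{
    if (a >= 2.0f * K)                     // a large compared to K
        return 1.0f - 2.0f * K / (a + K);
    if (a <= 0.5f * K)                     // a small compared to K
        return 2.0f * a / (a + K) - 1.0f;
    return (a - K) / (a + K);              // K/2 <= a <= 2K: a - K is exact here
}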
One possibility is to track the error of m and p into m1 and p1 with classical Dekker/Shewchuk:
m = a - k;
k0 = a - m;
a0 = k0 + m;
k1 = k0 - k;
a1 = a - a0;
m1 = a1 + k1;   // rounding error of m = a - k

p = a + k;
k0 = p - a;
a0 = p - k0;
k1 = k - k0;
a1 = a - a0;
p1 = a1 + k1;   // rounding error of p = a + k
Then, correct the naive division:
q = m / p;
r0 = fmaf(p, -q, m);     // residual against the heads
r1 = fmaf(p1, -q, m1);   // residual against the tails
r = r0 + r1;
q1 = r / p;              // correction term
q = q + q1;
That'll cost you 2 divisions, but should be near half ulp if I didn't screw up.
But these divisions can be replaced by multiplications with the inverse of p without any problem, since the first incorrectly rounded division will be compensated by the remainder r, and the second incorrectly rounded division does not really matter (the last bits of the correction q1 won't change anything).
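A sketch of that substitution (untested; rcp is an illustrative name for the reciprocal, everything else reuses the variables above):

rcp = 1.0f / p;
q   = m * rcp;                 // approximate quotient
r0  = fmaf(p,  -q, m);         // residual against the heads
r1  = fmaf(p1, -q, m1);        // residual against the tails
q1  = (r0 + r1) * rcp;         // correction
q   = q + q1;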
I don't really have an answer (proper floating point error analyses are very tedious) but a few observations:
Fast reciprocal instructions (such as RCPSS) are not as accurate as division, so you may see a reduction in accuracy if using these.
m is computed exactly if a ∈ [0.5*Kb, 2^(1+n)*Kb), where Kb is the power of 2 below K (or K itself if K is a power of 2), and n is the number of trailing zeros in the significand of K (i.e. if K is a power of 2, then n = 23).
This is similar to a simplified form of the div2 algorithm from Dekker (1971): to expand the range (particularly the lower bound), you'll probably have to incorporate more correction terms from this (i.e. store m as the sum of 2 floats, or use a double).
Since my goal is to merely widen the interval on which accurate results are achieved, rather than to find a solution that works for all possible values of a, making use of double-float arithmetic for all intermediate computation seems too costly.
Thinking some more about the problem, it is clear that the computation of the remainder of the division, e in the code from my question, is the crucial part of achieving a more accurate result. Mathematically, the remainder is (a-K) - q * (a+K). In my code, I simply used m to represent (a-K) and represented (a+K) as m + 2*K, as this delivers numerically superior results to the straightforward representation.
With relatively small additional computational cost, (a+K) can be represented as a double-float, that is, a head-tail pair p:plo, which leads to the following modified version of my original code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 2 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mx = fmaxf (a, K);
mn = fminf (a, K);
plo = (mx - p) + mn;
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
q = fmaf (r, e, q);
Testing shows that this delivers nearly correctly rounded results for a in [K/2, 2^24*K), allowing for a substantial increase to the upper bound of the interval on which accurate results are achieved.
Widening the interval at the lower end requires the more accurate representation of (a-K). We can compute this as a double-float head-tail pair m:mlo, which leads to the following code variant:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 3 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
plo = (a < K) ? ((K - p) + a) : ((a - p) + K);
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
e = e + mlo;
q = fmaf (r, e, q);
Exhaustive testing shows that this delivers nearly correctly rounded results for a in the interval [K/2^24, K*2^24). Unfortunately, this comes at a cost of ten additional operations compared to the code in my question, which is a steep price to pay to get the maximum error from around 1.625 ulps with the naive computation down to near 0.5 ulp.
As in my original code from the question, one can express (a+K) in terms of (a-K), thus eliminating the computation of the tail of p, plo. This approach results in the following code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 4 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -2.0f*K, m);
t = fmaf (q, -m, t);
e = fmaf (q - 1.0f, -mlo, t);
q = fmaf (r, e, q);
This turns out to be advantageous if the main focus is decreasing the lower limit of the interval, which is my particular focus as explained in the question. Exhaustive testing of the single-precision case shows that when K = 2^n nearly correctly rounded results are produced for values of a in the interval [K/2^24, 4.23*K]. With a total of 14 or 15 operations (depending on whether an architecture supports full predication or just conditional moves), this requires seven to eight more operations than my original code.
Lastly, one might base the residual computation directly on the original variable a to avoid the error inherent in the computation of m and p. This leads to the following code that, for K = 2^n, computes nearly correctly rounded results for a in the interval [K/2^24, K/3):
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 5 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q + 1.0f, -K, a);
e = fmaf (q, -a, t);
q = fmaf (r, e, q);
If you can relax the API to return another variable that models the error, then the solution becomes much simpler:
float foo(float a, float k, float *res)
{
    float ret = (a-k)/(a+k);
    *res = fmaf(-ret, a+k, a-k)/(a+k);
    return ret;
}
This solution only handles the truncation error of the division; it does not handle the loss of precision in a+k and a-k.
To handle those errors, I think I need to use double precision, or a bit hack to use fixed point.
The test code is updated to artificially generate non-zero least significant bits in the input.
Test code: https://ideone.com/bHxAg8
The problem is the addition in (a + K). Any loss of precision in (a + K) is magnified by the division. The problem isn't the division itself.
If the exponents of a and K are the same (almost) no precision is lost, and if the absolute difference between the exponents is greater than the significand size then either (a + K) == a (if a has larger magnitude) or (a + K) == K (if K has larger magnitude).
There is no way to prevent this. Increasing the significand size (e.g. using 80-bit "extended double" on 80x86) only helps widen the "accurate result range" slightly. To understand why, consider smallest + largest (where smallest is the smallest positive denormal a 32-bit floating point number can be). In this case (for 32-bit floats) you'd need a significand size of about 260 bits for the result to avoid precision loss completely. Doing (e.g.) temp = 1/(a + K); result = a * temp - K * temp; won't help much either because you've still got exactly the same (a + K) problem (but it would avoid a similar problem in (a - K)). Also you can't do result = anything / p + anything_error/p_error because division doesn't work like that.
There are only 3 alternatives I can think of to get close to 0.5 ulps for all possible positive values of a that can fit in 32-bit floating point. None are likely to be acceptable.
The first alternative involves pre-computing a lookup table (using "big real number" maths) for every value of a, which (with some tricks) ends up being about 2 GiB for 32-bit floating point (and completely insane for 64-bit floating point). Of course if the range of possible values of a is smaller than "any positive value that can fit in a 32-bit float" the size of the lookup table would be reduced.
The second alternative is to use something else ("big real number") for the calculation at run-time (and convert to/from 32-bit floating point).
The third alternative involves "something" (I don't know what it's called, but it's expensive). Set the rounding mode to "round to positive infinity" and calculate temp1 = (a + K); if(a < K) temp2 = (a - K); then switch to "round to negative infinity" and calculate if(a >= K) temp2 = (a - K); lower_bound = temp2 / temp1;. Next do a_lower = a and decrease a_lower by the smallest amount possible and repeat the "lower_bound" calculation, and keep doing that until you get a different value for lower_bound, then revert back to the previous value of a_lower. After that you do essentially the same (but with opposite rounding modes, and incrementing not decrementing) to determine upper_bound and a_upper (starting with the original value of a). Finally, interpolate, like a_range = a_upper - a_lower; result = upper_bound * (a_upper - a) / a_range + lower_bound * (a - a_lower) / a_range;. Note that you will want to calculate an initial upper and lower bound and skip all of this if they're equal. Also be warned that this is all "in theory, completely untested" and I probably borked it somewhere.
Mainly what I'm saying is that (in my opinion) you should give up and accept that there's nothing that you can do to get close to 0.5 ulp. Sorry.. :)

Calculate maclaurin series for sin using C

I wrote code for calculating sin using its Maclaurin series, and it works, but when I try to calculate it for large x values and try to offset that by giving a large order N (the length of the sum), it eventually overflows and doesn't give me correct results. This is the code, and I would like to know whether there is an additional way to optimize it so it works for large x values too (it already works great for small x values and really big N values).
Here is the code:
long double calcMaclaurinPolynom(double x, int N){
    long double result = 0;
    long double atzeretCounter = 2;
    int sign = 1;
    long double fraction = x;
    for (int i = 0; i <= N; i++)
    {
        result += sign*fraction;
        sign = sign*(-1);
        fraction = fraction*((x*x) / ((atzeretCounter)*(atzeretCounter + 1)));
        atzeretCounter += 2;
    }
    return result;
}
The major issue is using the series outside the range where it converges well.
As OP's comment "converted x to radX = (x*PI)/180" indicates that the OP is starting with degrees rather than radians, the OP is in luck. The first step in finding my_sin(x) is range reduction. When starting with degrees, the reduction is exact. So reduce the range before converting to radians.
long double calcMaclaurinPolynom(double x /* degrees */, int N){
    // Reduce to range -360 to 360
    // This reduction is exact, no round-off error
    x = fmod(x, 360);

    // Reduce to range -180 to 180
    if (x >= 180) {
        x -= 180;
        x = -x;
    } else if (x <= -180) {
        x += 180;
        x = -x;
    }

    // Reduce to range -90 to 90
    if (x >= 90) {
        x = 180 - x;
    } else if (x <= -90) {
        x = -180 - x;
    }

    // Now convert to radians.
    x = x*PI/180;

    // continue with regular code
Alternative, if using C11, use remquo(). Search SO for sample code.
As #user3386109 commented above, no need to "convert back to degrees".
[Edit]
With typical summation series, summing the least significant terms first improves the precision of the answer. With OP's code this can be done with
for (int i = N; i >= 0; i--)
Alternatively, rather than iterating a fixed number of times, loop until the term has no significance to the sum. The following uses recursion to sum the least significant terms first. With range reduction in the -90 to 90 range, the number of iterations is not excessive.
#include <math.h>

static double sin_d_helper(double term, double xx, unsigned i) {
    if (1.0 + term == 1.0)
        return term;
    return term - sin_d_helper(term * xx / ((i + 1) * (i + 2)), xx, i + 2);
}

double sin_d(double x_degrees) {
    // range reduction and d --> r conversion from above
    double x_radians = ...
    return x_radians * sin_d_helper(1.0, x_radians * x_radians, 1);
}
You can avoid the sign variable by incorporating it into the fraction update as in (-x*x).
With your algorithm you do not have problems with integer overflow in the factorials.
As soon as x*x < (2*k)*(2*k+1), the error (assuming exact evaluation) is bounded by abs(fraction), i.e., the size of the next term in the series.
For large x the biggest source of error is truncation, resp. the floating-point errors that are magnified via cancellation between the terms of the alternating series. For k about x/2, the terms around the k-th term have the biggest size and have to be offset by other big terms.
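A hypothetical sketch of the OP's loop with the sign folded into the term update as suggested above (the behaviour is otherwise unchanged):

long double calcMaclaurinPolynomNoSign(double x, int N)
{
    long double result = 0;
    long double term = x;                  // current term, sign included
    long double k = 2;
    for (int i = 0; i <= N; i++) {
        result += term;
        term *= (-x * x) / (k * (k + 1));  // the minus sign alternates the terms
        k += 2;
    }
    return result;
}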
Halving-and-Squaring
One easy method to deal with large x without using the value of pi is to employ the trigonometric theorems where
sin(2*x)=2*sin(x)*cos(x)
cos(2*x)=2*cos(x)^2-1=cos(x)^2-sin(x)^2
and first reduce x by halving, simultaneously evaluating the Maclaurin series for sin(x/2^n) and cos(x/2^n) and then employ trigonometric squaring (literal squaring as complex numbers cos(x)+i*sin(x)) to recover the values for the original argument.
cos(x/2^(n-1)) = cos(x/2^n)^2-sin(x/2^n)^2
sin(x/2^(n-1)) = 2*sin(x/2^n)*cos(x/2^n)
then
cos(x/2^(n-2)) = cos(x/2^(n-1))^2-sin(x/2^(n-1))^2
sin(x/2^(n-2)) = 2*sin(x/2^(n-1))*cos(x/2^(n-1))
etc.
See https://stackoverflow.com/a/22791396/3088138 for the simultaneous computation of sin and cos values, then encapsulate it with
def CosSinForLargerX(x, n):
    k = 0
    while abs(x) > 1:
        k += 1; x /= 2
    c, s = getCosSin(x, n)
    r2 = 1   # so the k == 0 case needs no normalization
    for i in range(k):
        s2 = s*s; c2 = c*c; r2 = s2 + c2
        s = 2*c*s
        c = c2 - s2
    return c/r2, s/r2

Finding the sample at the beginning of a period of a compound periodic signal

I have a signal made up of the sum of a number of sine waves. These are spaced at 100 Hz, with the lowest component frequency at 200 Hz (200 Hz, 300 Hz, etc.). All component sine waves begin at the same point with phase = 0. In my DSP software, where I am going to multiply this signal by several other signals, I need to find a point at which all of the original signal's component signals are again at phase = 0.
If I were only using one sine wave, I could simply look for a change in sign from negative to positive. However, if the signal has, say, components at 200 Hz and 300 Hz, there are three zero-crossings where the sign changes from negative to positive, but only one that represents the beginning of the period, and this number increases with more component waves. I do have control over the amplitudes of each component frequency during an initial startup sequence. If these waves were strictly harmonic (200 Hz, 400 Hz, 800 Hz, etc.), I could simply remove all but the lowest frequency, find the beginning of its period, and use this as my zero-sample. However, I don't have this bandwidth. Can anyone provide an alternative approach?
Edit:
(I have clarified and integrated this edit into body of question.)
Edit 2:
This graphic should demonstrate the issue. The frequencies of the two components here are n and 3n/2. Without filtering out all but the lowest frequency, or taking an FFT as proposed by #hotpaw, an algorithm that only looks for zero-crossings where the sign changes from negative to positive will land on one of three, and I must find the first of those three (this is the one point at which each component signal is at phase = 0). I realise that taking an FFT will work, but I'm dealing with very limited processing power and wondering if there's a simpler approach.
Look at the derivative of the signal!
Your signal is a sum of sines (sorry, I'm not sure how to format formulas properly)
S = sum(a_n * sin(k_n * t)) ... over all n
a_n is the positive amplitude and k_n the positive frequency. The derivative (that you can compute easily numerically) of the signal is
dS/dt = sum(a_n * k_n * cos(k_n * t)) ... over all n
At t=0 (what you're looking for), the derivative has its maximum since all cosine terms are one at the same time.
Some addition:
For the practical implementation you need to consider that the derivative may be noisy, so some kind of simple first-order filtering could be necessary.
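A minimal sketch of that idea on sampled data (the buffer, sample interval, and filter constant are illustrative assumptions):

// Return the index where the (lightly filtered) numerical derivative peaks,
// i.e. where all the cosine terms line up and the compound signal is at phase 0.
int find_phase_zero(const float *sig, int n, float dt)
{
    float best = 0.0f, d_filt = 0.0f;
    int best_i = 1;
    for (int i = 1; i < n; i++) {
        float d = (sig[i] - sig[i - 1]) / dt;   // first-difference derivative
        d_filt = 0.9f * d_filt + 0.1f * d;      // simple first-order smoothing
        if (i == 1 || d_filt > best) {
            best = d_filt;
            best_i = i;
        }
    }
    return best_i;
}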
I assume that all the sine waves are exact harmonics of some fundamental frequency, all have a phase of zero with respect to the same reference point at some point in time, and that this is the point in time you wish to find.
You can use an FFT with an aperture length that is an exact multiple of the period of your fundamental frequency (100 Hz). If there is zero noise, you can use 1 period. Estimate the phase with respect to some reference point (FFT aperture start or center) of all the sinusoids using the FFT. Then use the phase of the lowest frequency sinusoid that shows up as significant in the FFT to calculate all its zero crossings in your target time range. Compare with the nearest zero crossing of all the other sinusoids (using the FFT phase to estimate their phases), and find the low frequency zero crossing with the total least squared error of offsets from all the nearest zero crossings of all the other frequencies.
You can go back to the time domain to confirm the least squares estimated crossing as an actual zero crossing and/or to remove some of the numerical noise.
I would go for a first or second order lowpass filter to remove the component frequencies. The difference between 100 Hz and the "noise" makes quite a wide gap. Start with a low frequency that cancels all noise and increase until you are satisfied with the signal.
After that you have your signal and can watch for the sign change.
Second order implementation:
#include <math.h>

static float a1 = 0;
static float a2 = 0;
static float b1 = 0;
static float b2 = 0;
static float y = 0;
static float y_old = 0;
static float u_old = 0;

void
init_lp_filter(float cutoff_freq, float sample_time)
{
    float wc = cutoff_freq;
    float h = sample_time;
    float epsilon = 1.0f/sqrt(2.0f);
    float omega = wc * sqrt(0.5f);
    float alpha = exp(-epsilon*wc*h);
    float beta = cos(omega*h);
    float gamma = sin(omega*h);
    b1 = 1.0f - alpha * (beta + epsilon * wc * gamma / omega);
    b2 = alpha * alpha + alpha * (epsilon * wc * gamma / omega - beta);
    a1 = -2.0f * alpha * beta;
    a2 = alpha * alpha;
}

float
getOutput() {
    return y;
}

void
update_filter(float input)
{
    float tmp = y;
    y = b1 * input + b2 * u_old - a1 * y - a2 * y_old;
    y_old = tmp;
    u_old = input;
}
As the filtered output depends only on old values, the filtered output can be used directly at the beginning of a cycle. The filter can then be updated at the end of the periodic cycle with a new sample of the measurement. Do note that if you have any output that may affect the signal (i.e. actuators on a physical process), you must sample the signal before generating any output.
Good luck!
