float sinx(float x)
{
    static const float a[] = {-.1666666664, .0083333315, -.0001984090, .0000027526, -.0000000239};
    float xsq = x*x;
    float temp = x*(1 + a[0]*xsq + a[1]*xsq*xsq + a[2]*xsq*xsq*xsq + a[3]*xsq*xsq*xsq*xsq + a[4]*xsq*xsq*xsq*xsq*xsq);
    return temp;
}
How are those constants calculated? How to calculate cos and tan using this method?
Can I extend this to get more precision? I guess I need to add more constants?
Plot of the error of the "fast" sine described above against a Taylor polynomial of equal degree.
Nearly all answers at the time of this writing refer to the Taylor expansion of the sine function, but if the author of the function were serious, he would not use Taylor coefficients. Taylor coefficients tend to produce a polynomial approximation that's better than necessary near zero and increasingly bad away from zero. The goal is usually to obtain an approximation that is uniformly good on a range such as -π/2…π/2. For a polynomial approximation, this can be obtained by applying the Remez algorithm. A down-to-earth explanation is this post.
The polynomial coefficients obtained by that method are close to the Taylor coefficients, since both polynomials approximate the same function, but the Remez polynomial may be more precise for the same number of operations, or involve fewer operations for the same (uniform) quality of approximation.
I cannot tell just by looking at them whether the coefficients in your question are exactly the Taylor coefficients or the slightly different coefficients produced by the Remez algorithm, but the latter are probably what should have been used even if they weren't.
Lastly, whoever wrote (1 + a[0]*xsq + a[1]*xsq*xsq + a[2]*xsq*xsq*xsq + a[3]*xsq*xsq*xsq*xsq + a[4]*xsq*xsq*xsq*xsq*xsq) needs to read up on better polynomial evaluation schemes, for instance Horner's:
1 + xsq*(a[0] + xsq*(a[1] + xsq*(a[2] + xsq*(a[3] + xsq*a[4])))) uses N multiplications instead of N²/2.
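For instance, the sinx from the question could be rewritten this way (a minimal sketch; the name sinx_horner is mine, the coefficients are unchanged from the question):

float sinx_horner(float x)
{
    static const float a[] = {-.1666666664f, .0083333315f, -.0001984090f, .0000027526f, -.0000000239f};
    float xsq = x * x;
    return x * (1 + xsq * (a[0] + xsq * (a[1] + xsq * (a[2] + xsq * (a[3] + xsq * a[4])))));
}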
They are -1/6, 1/120, -1/5040 .. and so on.
Or rather: -1/3!, 1/5!, -1/7!, 1/9!... etc
Look at the Taylor series for sin x: sin x = x - x^3/3! + x^5/5! - x^7/7! + ...
The series for cos x is right below it: cos x = 1 - x^2/2! + x^4/4! - x^6/6! + ...
For cos x, as seen from the series above, the constants are -1/2!, 1/4!, -1/6!, 1/8!...
tan x is slightly different; its series has no such simple factorial pattern (tan x = x + x^3/3 + 2x^5/15 + 17x^7/315 + ...), which is why it is usually computed as sin x / cos x (see the sketch after the cosx code below).
So to adjust this for cosx:
float cosx(float x)
{
    static const float a[] = {-.5, .0416666667, -.0013888889, .0000248016, -.0000002756};
    float xsq = x*x;
    float temp = (1 + a[0]*xsq + a[1]*xsq*xsq + a[2]*xsq*xsq*xsq + a[3]*xsq*xsq*xsq*xsq + a[4]*xsq*xsq*xsq*xsq*xsq);
    return temp;
}
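For tan, the simplest route with these two routines is the ratio sin/cos; a minimal sketch (the name tanx is mine, and it makes no attempt to handle the poles where the cosine is zero):

float tanx(float x)
{
    return sinx(x) / cosx(x); // undefined where cosx(x) is 0
}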
The coefficients are identical to those given in the Handbook of Mathematical Functions, ed. Abramowitz and Stegun (1964), page 76, and are attributed to Carlson and Goldstein, Rational approximations of functions, Los Alamos Scientific Laboratory (1955).
The first can be found in http://www.jonsson.eu/resources/hmf/pdfwrite_600dpi/hmf_600dpi_page_76.pdf.
And the second at http://www.osti.gov/bridge/servlets/purl/4374577-0deJO9/4374577.pdf (page 37).
Regarding your third question, "Can I extend this to get more precision?", http://lol.zoy.org/wiki/doc/maths/remez has a downloadable C++ implementation of the Remez algorithm; it provides (unchecked by me) the coefficients for the 6th-order polynomial for sin:
error: 3.9e-14
9.99999999999624e-1
-1.66666666660981e-1
8.33333330841468e-3
-1.98412650240363e-4
2.75568408741356e-6
-2.50266363478673e-8
1.53659375573646e-10
Of course, you would need to change from float to double to realize any improvement. And this may also answer your second question, regarding cos and tan.
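As a rough illustration of how those coefficients might be used in double precision (my own arrangement, assuming the listed values are the coefficients of x, x^3, ..., x^13 and that the fit targets a range like -pi/2..pi/2, as a Remez fit for sin normally would):

double sin_remez(double x)
{
    static const double c[] = {
         9.99999999999624e-1, -1.66666666660981e-1,  8.33333330841468e-3,
        -1.98412650240363e-4,  2.75568408741356e-6, -2.50266363478673e-8,
         1.53659375573646e-10
    };
    double x2 = x * x;
    // Horner evaluation of the odd polynomial c[0]*x + c[1]*x^3 + ... + c[6]*x^13
    return x * (c[0] + x2 * (c[1] + x2 * (c[2] + x2 * (c[3] + x2 * (c[4] + x2 * (c[5] + x2 * c[6]))))));
}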
Also, I see in the comments that a fixed-point answer is required in the end. I implemented a 32-bit fixed-point version in 8031-assembler about 26 years ago; I'll try digging it up to see whether it has anything useful in it.
Update: If you are stuck with 32-bit doubles, then the only way I can see for you to increase the accuracy by a "digit or two" is to forget floating-point and use fixed-point. Surprisingly, google doesn't seem to turn up anything. The following code provides proof-of-concept, run on a standard Linux machine:
#include <stdio.h>
#include <math.h>
#include <stdint.h>
// multiply two 32-bit fixed-point fractions (no rounding)
#define MUL32(a, b) ((uint64_t)(a) * (b) >> 32)
// sin32: Fixed-point sin calculation for first octant, coefficients from
// Handbook for Computing Elementary Functions, by Lyusternik et al, p. 89.
// input: 0 to 0xFFFFFFFF, giving fraction of octant 0 to PI/8, relative to 2**32
// output: 0 to 0.7071, relative to 2**32
static uint32_t sin32(uint32_t x) {  // x in 1st octant, = radians/PI*8*2**32
    uint32_t y, x2 = MUL32(x, x);    // x2 = x * x
    y = 0x000259EB;                  // a7 = 0.000 035 877 1
    y = 0x00A32D1E - MUL32(x2, y);   // a5 = 0.002 489 871 8
    y = 0x14ABBA77 - MUL32(x2, y);   // a3 = 0.080 745 367 2
    y = 0xC90FDA73u - MUL32(x2, y);  // a1 = 0.785 398 152 4
    return MUL32(x, y);
}
int main(void) {
    int i;
    for (i = 0; i < 45; i += 2) {    // 0 to 44 degrees
        const double two32 = 1LL << 32;
        const double radians = i * M_PI / 180;
        const uint32_t octant = i / 45. * two32;  // fraction of 1st octant
        printf("%2d %+.10f %+.10f %+.10f %+.0f\n", i,
               sin(radians) - sin32(octant) / two32,
               sin(radians) - sinf(radians),
               sin(radians) - (float)sin(radians),
               sin(radians) * two32 - sin32(octant));
    }
    return 0;
}
The coefficients are from the Handbook for Computing Elementary Functions, by Lyusternik et al., p. 89.
The only reason I chose this particular function is that it has one less term than your original series.
The results are:
0 +0.0000000000 +0.0000000000 +0.0000000000 +0
2 +0.0000000007 +0.0000000003 +0.0000000012 +3
4 +0.0000000010 +0.0000000005 +0.0000000031 +4
6 +0.0000000012 -0.0000000029 -0.0000000011 +5
8 +0.0000000014 +0.0000000011 -0.0000000044 +6
10 +0.0000000014 +0.0000000050 -0.0000000009 +6
12 +0.0000000011 -0.0000000057 +0.0000000057 +5
14 +0.0000000006 -0.0000000018 -0.0000000061 +3
16 -0.0000000000 +0.0000000021 -0.0000000026 -0
18 -0.0000000005 -0.0000000083 -0.0000000082 -2
20 -0.0000000009 +0.0000000095 -0.0000000107 -4
22 -0.0000000010 -0.0000000007 +0.0000000139 -4
24 -0.0000000009 -0.0000000106 +0.0000000010 -4
26 -0.0000000005 +0.0000000065 -0.0000000049 -2
28 -0.0000000001 -0.0000000032 -0.0000000110 -0
30 +0.0000000005 -0.0000000126 -0.0000000000 +2
32 +0.0000000010 +0.0000000037 -0.0000000025 +4
34 +0.0000000015 +0.0000000193 +0.0000000076 +7
36 +0.0000000013 -0.0000000141 +0.0000000083 +6
38 +0.0000000007 +0.0000000011 -0.0000000266 +3
40 -0.0000000005 +0.0000000156 -0.0000000256 -2
42 -0.0000000009 -0.0000000152 -0.0000000170 -4
44 -0.0000000005 -0.0000000011 -0.0000000282 -2
Thus we see that this fixed-point calculation is about ten times more accurate than sinf() or (float)sin(), and is correct to 29 bits. Using rounding rather than truncation in MUL32() made only a marginal improvement.
That function is calculating the value of sin using its Taylor expansion, sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...,
and those constants are the various -1/3!, 1/5! and so on (see e.g. here for Taylor series of other functions).
Now, the Taylor expansion for sin(x) converges to the exact value for every x if you take every term of the series, but AFAIK there are faster and more precise methods to determine the trigonometric functions in software.
Also, many processors provide such functions implemented directly in the processor (e.g. on x86 there are ready-made opcodes for them), so often there's no need to bother with this stuff.
cos(x) = sqrt(1 - sin^2(x))   (valid only where cos(x) >= 0; negate the root in the other quadrants)
tan(x) = sin(x)/cos(x)
sin(x) = x - x^3/3! + x^5/5! - ... + (-1)^k*x^(2k+1)/(2k+1)! + ... , k = 0, 1, 2, ...
The series is infinite; any implementation has to truncate it somewhere.
I want to know whether the program defined below can return 1 assuming:
IEEE754 floating point arithmetics
no overflow (neither in max/x nor in f*x)
no nan or inf (obviously)
0 < x and 0 < n < 32
no unsafe math optimization
int canfail(int n, double x) {
    double max = 1ULL << n; // 2^n
    double f = max / x;
    return f * x > max;
}
In my opinion, it should sometimes return 1, as roundToNearest(max / x) can in general be greater than max/x.
I'm able to find numbers for the opposite case, where f * x < max, but I have no examples of input that show f * x > max, and I have no idea how to find one. Can somebody help?
EDIT:
I know the value of x is in a range between 10^(-6) and 10^6 (that still leaves far too many possible double values), but I know I will not have to deal with overflow, underflow or subnormal numbers!
In addition, I just realized that because max is a power of two and we don't deal with overflow, the solution will be the same by fixing max=1 as it is exactly the same computation, but shifted.
Therefore, the problem corresponds to finding a positive, normal double value x such that (1/x) * x > 1.0!
I made a little program to try to find a solution:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h>
#include <omp.h>
int main( void ) {
    #pragma omp parallel
    {
        unsigned short int xsubi[3] = {
            omp_get_thread_num(),
            omp_get_thread_num(),
            omp_get_thread_num()
        };
        #pragma omp for
        for(int64_t i=0; i<INT64_MAX; i++) {
            double x = fmod(nrand48(xsubi), 1048576.0);
            if(x<0.000001)
                continue;
            double f = 1.0 / x;
            if(f * x > 1.0) {
                printf("found !!! x=%.30f\n", x);
                fflush(stdout);
            }
        }
    }
    return 1;
}
If you change the sign of the comparison, you will find some value quickly. However, it seems to run forever with f * x > 1.0
In the absence of underflow or overflow, the exponents are irrelevant; if M/x*x > M, then (M/p) / (x/q) * (x/q) > (M/p) for any powers of two p and q. So let's consider 2^52 ≤ x < 2^53 and M = 2^105. We can eliminate x = 2^52 since this yields exact floating-point arithmetic, so 2^52 < x < 2^53.
Division of 2^105 by x yields integer quotient q and integer remainder r, with 2^52 ≤ q < 2^53, 0 < r < x, and 2^105 = q•x + r.
In order for M/x*x to exceed M, both the division and the multiplication must round up. Since the division rounds up, x/2 ≤ r.
With rounding up, the result of the floating-point division of 2^105 by x yields q+1. Then the exact (not rounded) multiplication yields (q+1)•x = q•x + x = q•x + r + x − r = 2^105 + x − r. Since x/2 ≤ r, we have x − r ≤ x/2, so rounding this exact result rounds down, yielding 2^105. (The "<" case always rounds down, and the "=" case rounds down because 2^105 has the low even bit.)
Therefore, for powers of two M and all arithmetic within exponent bounds, M/x*x > M never occurs with round-to-nearest-ties-to-even.
Multiplication by a power of two is just a scaling of the exponent; it does not change the problem, so it's the same as finding x such that (1/x) * x > 1.
One solution is brute force search.
For the same reasons, we can limit the search for such an x to the interval [1.0, 2.0).
A better approach is to analyze error bounds without brute force.
Let's denote by ix the floating-point number nearest to 1/x.
Considering x and ix as exact fractions, we can write the integer division: 1 = ix * x + r where r is the remainder
(these are all fractions with denominators that are powers of 2, so we have to multiply the whole equation by an appropriate power of 2 to really have an integer division).
In other words, ix = 1/x - r/x, where -r/x is the rounding error of inversion.
When we multiply the inverse approximation by x, the exact value is ix*x = 1 - r.
We know that the floating point result will be rounded to the nearest float to that exact value.
So, assuming the default rounding mode of nearest, ties to even, the question asked is whether -r can exceed 0.5 ulp.
The short answer is never!
Suppose |r| > 0.5 ulp; then the rounding error -r/x would exceed half an ulp of the exact result 1/x.
That is not a rigorous argument, because the exact result is not a floating-point number and does not have an ulp, but you get the idea...
I might come back with a correct proof if I have time, but my bet is that you can find it already done, possibly on SO.
EDIT
Why can you find (1/x) * x < 1?
Simply because 1.0 is at a binade boundary, so below 1 we would have to prove that r < 0.25 ulp, which we cannot...
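For the opposite direction (the case the question says is easy to hit), a quick brute-force check over small integers finds candidates almost immediately; a minimal sketch, assuming IEEE-754 double arithmetic with round-to-nearest:

#include <stdio.h>

int main(void)
{
    for (int i = 1; i <= 1000; i++) {
        double x = i;
        double f = 1.0 / x;          // rounded reciprocal
        if (f * x < 1.0) {           // opposite case: product lands just below 1
            printf("x = %d gives (1/x)*x = %.17g\n", i, f * x);
            break;
        }
    }
    return 0;
}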
canfail(1, pow(2, 1023) * (2 - pow(2, -51))) will return 1.
I'm working on a project for our school and we are required to create a program that computes the approximation of the Taylor expansion series of sin x and cos x, using only <stdio.h> and without user-defined functions other than int main(), for all angles from -180 to 180 in increments of +5. The following is my code:
#include <stdio.h>
#define PI 3.141592653589
#define NUMBER_OF_TERMS 10
int
main()
{
    int cosctr, sinctr;
    double ctr, radi;
    double cosaccu, costerm, sinaccu, sinterm;
    for (ctr = -180; ctr < 185; ctr = ctr + 5) {
        radi = ctr * PI/180.0;
        cosctr = 1;
        cosaccu = 1;
        costerm = 1;
        sinctr = 2;
        sinaccu = radi;
        sinterm = radi;
        while (cosctr <= 2*NUMBER_OF_TERMS) {
            costerm = costerm*(-1)*(radi*radi)/(cosctr*(cosctr + 1));
            cosaccu = cosaccu + costerm;
            cosctr += 2;
        }
        do {
            sinterm = sinterm*(-1)*(radi*radi)/(sinctr*(sinctr + 1));
            sinaccu = sinaccu + sinterm;
            sinctr += 2;
        } while (sinctr <= 2*NUMBER_OF_TERMS);
        printf("%.2lf %.12lf %.12lf %.12lf\n", ctr, radi, cosaccu, sinaccu);
    }
    return 0;
}
The code above is accurate for a 15-term expansion approximation. However, if I change NUMBER_OF_TERMS to, for example, 5 or 10, the approximation is flawed.
Any suggestions?
Let me clarify: I need to obtain an approximation of 5 terms, 10 terms, and 15 terms. I cannot use any other library other than <stdio.h>. I cannot use any other functions outside of int main() (I apologize for the vagueness of my explanation before).
Please answer with the included corrected code.
The key to a high-precision yet simple calculation of sind(degrees) and cosd(degrees) is to reduce the range of the degree argument to 0 to 90 first (or even 0 to 45), using the usual trigonometric identities on degrees first.
Reductions:
angle = fmod(angle, 360) // reduce to (-360..360), or use a = a - (int)(a/360)*360
sin(x) = -sin(-x) // reduce to [0..360)
cos(x) = cos(-x) // reduce to [0..360)
sin(x) = -sin(x-180) // reduce to [0..180)
cos(x) = -cos(x-180) // reduce to [0..180)
sin(x) = cos(90-x) // reduce to [0..90)
Further reductions:
For [45-90) use sin(x) = cos(90-x) // reduce to [0..45)
then convert to radians and use Taylor series expansion.
Example
Note: Since the code is dealing with double, typically 17 digits of precision, there is no need to use a coarse PI approximation.
// #define PI 3.141592653589
#define PI 3.1415926535897932384626433832795
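The original example code is not reproduced here, so the following is only a sketch of what such a degree-based routine might look like (the name sind and the five-term count are illustrative assumptions; it sticks to <stdio.h> as the question requires, and it applies the degree reductions before switching to radians):

#include <stdio.h>

#define PI 3.1415926535897932384626433832795
#define NUMBER_OF_TERMS 5

static double sind(double degrees)
{
    int negate = 0;
    double x, term, sum;
    int k;
    degrees = degrees - (int)(degrees / 360) * 360;            // reduce to (-360..360)
    if (degrees < 0) { degrees = -degrees; negate = !negate; } // sin(-x) = -sin(x)
    if (degrees >= 180) { degrees -= 180; negate = !negate; }  // sin(x) = -sin(x-180)
    if (degrees > 90) degrees = 180 - degrees;                 // sin(x) = sin(180-x)
    // (a further reduction to [0..45] via sin(x) = cos(90-x) is possible; omitted here)
    x = degrees * PI / 180.0;                                  // now convert to radians
    term = x;
    sum = x;
    for (k = 2; k <= 2 * NUMBER_OF_TERMS; k += 2) {            // Taylor series for sin
        term = -term * x * x / (k * (k + 1));
        sum += term;
    }
    return negate ? -sum : sum;
}

int main(void)
{
    double a;
    for (a = -180; a <= 180; a += 5)
        printf("%7.2f %.12f\n", a, sind(a));
    return 0;
}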
I tried your code; it works fine for me, in that it does what it looks like it's designed to do. Here's a comparison between your code's output for the cosine at 5 and 10 terms and the same approximation as calculated by Mathematica. They agree to better than 10^-12, i.e. to your outputted precision.
The only problem I see with your code is that, with the way you designed your loops, you're actually taking into account NUMBER_OF_TERMS + 1 terms if you count the first term in the expansion (i.e. the constant term for the cosine, the linear term for the sine). You start with this first term, and then your loop adds another NUMBER_OF_TERMS terms. If that is not by design, you're actually approximating the functions with higher precision than you are expecting.
By its very definition, a Taylor series is the sum of an infinite series of terms.
Thus a finite Taylor expansion is only an approximation of the true result: as the number of terms increases, the accuracy of the approximation improves.
If there are enough terms, the approximation error at some point becomes unnoticeable. However, if you lower the number of terms, the approximation error increases and can be detected.
In your case, the approximation error is below the detection threshold for NUMBER_OF_TERMS = 15, but becomes noticeable when NUMBER_OF_TERMS is 10 or less.
The Taylor expansions of sin(x) and cos(x) take longer to converge as x increases. But since these are periodic functions, you don't actually need to bother expanding the series for values outside the range 0-90°.
For values of x outside this range, use the following identities:
sin(x) = -sin(x+180°) = -sin(-x) = sin(180°-x)
cos(x) = -cos(x+180°) = cos(-x) = -cos(180°-x)
For example, sin(175°) = sin(5°), cos(-120°) = -cos(60°)
I figured it out with help from another user.
It turns out I was calculating terms + 1 terms, making the answer more accurate than intended. After 15 terms the changes are past the 12th decimal point, and therefore did not show in the results.
#include <stdio.h>
#define PI 3.141592653589
#define NUMBER_OF_TERMS 10 // 5 and 15 work as well
int
main()
{
    int cosctr, sinctr;
    double ctr, radi;
    double cosaccu, costerm, sinaccu, sinterm; // accu will be final answer, term will be added to accu
    for (ctr = -180; ctr < 185; ctr += 5) { // for loop; ctr initialized at -180 and incremented by 5 to produce degrees
        radi = ctr * PI/180.0; // calculation for radians (assigned to radi)
        cosctr = 1; // initialization for cos counter; must be included in loop to allow correct calculations of cos
        cosaccu = 1; // first term is 1
        costerm = 1; // base term, to be multiplied with termcalc formula
        sinctr = 2; // initialization for sin counter; must be included in loop to allow correct calculations of sin
        sinaccu = radi; // first term is x, or degrees in radians (radi)
        sinterm = radi; // base term for sin
        // cos calculation
        while (cosctr < 2*NUMBER_OF_TERMS-1) { // accuracy check, 2* since increments of 2; NOTE: actual values are (1, 3, 5,...)
            costerm = costerm*(-1)*(radi*radi)/(cosctr*(cosctr + 1)); // TERMCALC FORMULA; multiplying previous term with formula creates next term
            cosaccu = cosaccu + costerm; // addition of new term to previous sum; dependent on accuracy (NUMBER_OF_TERMS)
            cosctr += 2;
        }
        do { // sin calculation; identical to cos, albeit with substituted vars
            sinterm = sinterm*(-1)*(radi*radi)/(sinctr*(sinctr + 1));
            sinaccu = sinaccu + sinterm;
            sinctr += 2;
        } while (sinctr < 2*NUMBER_OF_TERMS-1); // accuracy check, 2* since increments of 2; NOTE: actual values are (2, 4, 6,...)
        printf("%.2lf\t%.12lf\t%.12lf\t%.12lf\n", ctr, radi, cosaccu, sinaccu); // final display; \t used for convenience
    }
    return 0; // finally!!!
}
I made a function to compute a fixed-point approximation of atan2(y, x). The problem is that of the ~83 cycles it takes to run the whole function, 70 cycles (compiling with gcc 4.9.1 mingw-w64 -O3 on an AMD FX-6100) are taken entirely by a simple 64-bit integer division! And sadly none of the terms of that division are constant. Can I speed up the division itself? Is there any way I can remove it?
I think I need this division because, since I approximate atan2(y, x) with a 1D lookup table, I need to normalise the distance of the point represented by x,y to something like a unit circle or unit square (I chose a unit 'diamond', which is a unit square rotated by 45°, which gives a pretty even precision across the positive quadrant). So the division finds (|y|-|x|) / (|y|+|x|). Note that the divisor is 32 bits while the numerator is a 32-bit number shifted left by 29 bits so that the result of the division has 29 fractional bits.
Any ideas? I can't think of anything to improve this (and I can't figure out why it takes 70 cycles just for a division). Here's the full function for reference:
int32_t fpatan2(int32_t y, int32_t x) // does the equivalent of atan2(y, x)/2pi, y and x are integers, not fixed point
{
    #include "fpatan.h" // includes the atan LUT as generated by tablegen.exe, the entry bit precision (prec), LUT size power (lutsp) and how many max bits |b-a| takes (abdp)
    const uint32_t outfmt = 32; // final output format in s0.outfmt
    const uint32_t ofs=30-outfmt, ds=29, ish=ds-lutsp, ip=30-prec, tp=30+abdp-prec, tmask = (1<<ish)-1, tbd=(ish-tp); // ds is the division shift, the shift for the index, bit precision of the interpolation, the mask, the precision for t and how to shift from p to t
    const uint32_t halfof = 1UL<<(outfmt-1); // represents 0.5 in the output format, which since it is in turns means half a circle
    const uint32_t pds=ds-lutsp; // division shift and post-division shift
    uint32_t lutind, p, t, d;
    int32_t a, b, xa, ya, xs, ys, div, r;
    xs = x >> 31; // equivalent of fabs()
    xa = (x^xs) - xs;
    ys = y >> 31;
    ya = (y^ys) - ys;
    d = ya+xa;
    if (d==0) // if both y and x are 0 then they add up to 0 and we must return 0
        return 0;
    // the following does 0.5 * (1. - (y-x) / (y+x))
    // (y+x) is u1.31, (y-x) is s0.31, div is in s1.29
    div = ((int64_t) (ya-xa)<<ds) / d; // '/d' normalises distance to the unit diamond, immediate result of division is always <= +/-1^ds
    p = ((1UL<<ds) - div) >> 1; // before shift the format is s2.29. position in u1.29
    lutind = p >> ish; // index for the LUT
    t = (p & tmask) >> tbd; // interpolator between two LUT entries
    a = fpatan_lut[lutind];
    b = fpatan_lut[lutind+1];
    r = (((b-a) * (int32_t) t) >> abdp) + (a<<ip); // linear interpolation of a and b by t in s0.32 format
    // Quadrants
    if (xs) // if x was negative
        r = halfof - r; // r = 0.5 - r
    r = (r^ys) - ys; // if y was negative then r is negated
    return r;
}
Unfortunately a 70-cycle latency is typical for a 64-bit integer division on x86 CPUs. Floating-point division typically has about half the latency or less. The increased cost comes from the fact that modern CPUs only have dividers in their floating-point execution units (they're very expensive in terms of silicon area), so the integers need to be converted to floating point and back again. So just substituting a floating-point division in place of the integer one isn't likely to help. You'll need to refactor your code to use floating point instead to take advantage of faster floating-point division.
If you're able to refactor your code you might also be able to benefit from the approximate floating-point reciprocal instruction RCPSS, if you don't need an exact answer. It has a latency of around 5 cycles.
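A hedged sketch of that idea with intrinsics (the helper name and the Newton-Raphson refinement step are my additions, not part of the answer; RCPSS alone gives only about 12 bits of precision):

#include <xmmintrin.h> // SSE intrinsics

static inline float approx_recip(float d)
{
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(d))); // r ~ 1/d, ~12-bit estimate
    r = r * (2.0f - d * r);                             // one Newton-Raphson step, ~23 bits
    return r;
}

// usage: q = n / d becomes q = n * approx_recip(d);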
Based on @Iwillnotexist Idonotexist's suggestion to use lzcnt, reciprocal approximation and multiplication, I implemented a division function that runs in about 23.3 cycles and with a pretty great precision of 1 part in 19 million using a 1.5 kB LUT; e.g. one of the worst cases is 1428769848 / 1080138864, for which you might get 1.3227648959 instead of 1.3227649663.
I figured out an interesting technique while researching this. I was really struggling to think of something that could be fast and precise enough, as not even a quadratic approximation of 1/x in [0.5, 1.0) combined with an interpolated difference LUT would do. Then I had the idea of doing it the other way around, so I made a lookup table that contains the quadratic coefficients that fit the curve on a short segment representing 1/128th of the [0.5, 1.0) curve, which gives a very small error. Using the 7 most significant bits of what represents x in the [0.5, 1.0) range as a LUT index, I directly get the coefficients that work best for the segment that x falls into.
Here's the full code with the lookup tables ffo_lut.h and fpdiv.h:
#include "ffo_lut.h"
static INLINE int32_t log2_ffo32(uint32_t x) // returns the number of bits up to the most significant set bit so that 2^return > x >= 2^(return-1)
{
int32_t y;
y = x>>21; if (y) return ffo_lut[y]+21;
y = x>>10; if (y) return ffo_lut[y]+10;
return ffo_lut[x];
}
// Usage note: for fixed point inputs make outfmt = desired format + format of x - format of y
// The caller must make sure not to divide by 0. Division by 0 causes a crash by negative index table lookup
static INLINE int64_t fpdiv(int32_t y, int32_t x, int32_t outfmt) // ~23.3 cycles, max error (by division) 53.39e-9
{
#include "fpdiv.h" // includes the quadratic coefficients LUT (1.5 kB) as generated by tablegen.exe, the format (prec=27) and LUT size power (lutsp)
const int32_t *c;
int32_t xa, xs, p, sh;
uint32_t expon, frx, lutind;
const uint32_t ish = prec-lutsp-1, cfs = 31-prec, half = 1L<<(prec-1); // the shift for the index, the shift for 31-bit xa, the value of 0.5
int64_t out;
int64_t c0, c1, c2;
// turn x into xa (|x|) and sign of x (xs)
xs = x >> 31;
xa = (x^xs) - xs;
// decompose |x| into frx * 2^expon
expon = log2_ffo32(xa);
frx = (xa << (31-expon)) >> cfs; // the fractional part is now in 0.27 format
// lookup the 3 quadratic coefficients for c2*x^2 + c1*x + c0 then compute the result
lutind = (frx - half) >> ish; // range becomes [0, 2^26 - 1], in other words 0.26, then >> (26-lutsp) so the index is lutsp bits
lutind *= 3; // 3 entries for each index
c = &fpdiv_lut[lutind]; // c points to the correct c0, c1, c2
c0 = c[0]; c1 = c[1]; c2 = c[2];
p = (int64_t) frx * frx >> prec; // x^2
p = c2 * p >> prec; // c2 * x^2
p += c1 * frx >> prec; // + c1 * x
p += c0; // + c0, p = (1.0 , 2.0] in 2.27 format
// apply the necessary bit shifts and reapplies the original sign of x to make final result
sh = expon + prec - outfmt; // calculates the final needed shift
out = (int64_t) y * p; // format is s31 + 1.27 = s32.27
if (sh >= 0)
out >>= sh;
else
out <<= -sh;
out = (out^xs) - xs; // if x was negative then out is negated
return out;
}
I think ~23.3 cycles is about as good as it's gonna get for what it does, but if you have any ideas to shave a few cycles off please let me know.
As for the fpatan2() question the solution would be to replace this line:
div = ((int64_t) (ya-xa)<<ds) / d;
with that line:
div = fpdiv(ya-xa, d, ds);
Your time-hog instruction:
div = ((int64_t) (ya-xa)<<ds) / d;
exposes at least two issues. The first is that you shadow the built-in div function; that is a minor point and may never be observed. The second is that, according to C language rules, both operands are first converted to a common type, which is int64_t, and then the division for this type is expanded into a CPU instruction which divides a 128-bit dividend by a 64-bit divisor(!). An extract from the assembly of a cut-down version of your function:
21: 48 89 c2 mov %rax,%rdx
24: 48 c1 fa 3f sar $0x3f,%rdx ## this is sign bit extension
28: 48 f7 fe idiv %rsi
Yep, this division requires about 70 cycles and can't be optimized (well, it really can, but e.g. a reverse-divisor approach requires a multiplication with a 192-bit product). But if you are sure this division can be done with a 64-bit dividend and a 32-bit divisor and it won't overflow (the quotient will fit into 32 bits) (I agree, because ya-xa is always smaller in absolute value than ya+xa), it can be sped up using an explicit assembly request:
uint64_t tmp_num = ((int64_t) (ya-xa)) << ds;
asm("idivl %[d]" :
    [a] "=a" (div1) :
    "[a]" (tmp_num), "d" (tmp_num >> 32), [d] "q" (d) :
    "cc");
this is quick&dirty and shall be carefully verified, but I hope the idea is understood. The resulting assembly now looks like:
18: 48 98 cltq
1a: 48 c1 e0 1d shl $0x1d,%rax
1e: 48 89 c2 mov %rax,%rdx
21: 48 c1 ea 20 shr $0x20,%rdx
27: f7 f9 idiv %ecx
This seems to be a huge advance because a 64/32 division requires up to 25 clock cycles on the Core family, according to the Intel optimization manual, instead of the 70 you see for a 128/64 division.
More minor improvements can be made; e.g. the shifts can be done more economically in parallel:
uint32_t diff = ya - xa;
uint32_t lowpart = diff << 29;
uint32_t highpart = diff >> 3;
asm("idivl %[d]" :
    [a] "=a" (div1) :
    "[a]" (lowpart), "d" (highpart), [d] "q" (d) :
    "cc");
which results in:
18: 89 d0 mov %edx,%eax
1a: c1 e0 1d shl $0x1d,%eax
1d: c1 ea 03 shr $0x3,%edx
22: f7 f9 idiv %ecx
but this is a minor fix compared to the division-related one.
To conclude, I really doubt this routine is worth implementing in C. The latter is quite uneconomical for this kind of integer arithmetic, requiring useless widenings and losses of the high part. The whole routine is worth moving to assembler.
Given an fpatan() implementation, you could simply implement fpatan2() in terms of that.
Assuming constants defined for pi and pi/2:
int32_t fpatan2( int32_t y, int32_t x)
{
    fixed theta ;
    if( x == 0 )
    {
        theta = y > 0 ? fixed_half_pi : -fixed_half_pi ;
    }
    else
    {
        theta = fpatan( y / x ) ;
        if( x < 0 )
        {
            theta += ( y < 0 ) ? -fixed_pi : fixed_pi ;
        }
    }
    return theta ;
}
Note that fixed library implementations are easy to get very wrong. You might take a look at Optimizing Math-Intensive Applications with Fixed-Point Arithmetic. The use of C++ in the library under discussion makes the code much simpler, in most cases you can just replace the float or double keyword with fixed. It does not however have an atan2() implementation, the code above is adapted from my implementation for that library.
I'm looking for implementations of the log() and exp() functions provided in the C library <math.h>. I'm working with 8-bit microcontrollers (OKI 411 and 431). I need to calculate the Mean Kinetic Temperature. The requirement is that we should be able to calculate MKT as fast as possible and with as little code memory as possible. The compiler comes with log() and exp() functions in <math.h>. But calling either function and linking with the library causes the code size to increase by 5 kilobytes, which will not fit in one of the micros we work with (OKI 411), because our code has already consumed ~12K of the available ~15K code memory.
The implementation I'm looking for should not use any other C library functions (like pow(), sqrt() etc.). This is because all library functions are packed into one library, and even if one function is called, the linker will bring the whole 5K library into code memory.
EDIT
The algorithm should be correct up to 3 decimal places.
Using a Taylor series is neither the simplest nor the fastest way of doing this. Most professional implementations use approximating polynomials. I'll show you how to generate one in Maple (a computer algebra program), using the Remez algorithm.
For 3 digits of accuracy execute the following commands in Maple:
with(numapprox):
Digits := 8
minimax(ln(x), x = 1 .. 2, 4, 1, 'maxerror')
maxerror
Its response is the following polynomial:
-1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x
With the maximal error of: 0.000061011436
We generated a polynomial which approximates ln(x), but only inside the [1..2] interval. Increasing the interval is not wise, because that would increase the maximal error even more. Instead of that, do the following decomposition: write x = 2^n * y with y in [1..2), so that ln(x) = n*ln(2) + ln(y).
So first find the highest power of 2 which is still smaller than the number (see: What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?). That exponent is actually the integer part of the base-2 logarithm. Divide by that power of two, and the result gets into the 1..2 interval. At the end we will have to add n*ln(2) to get the final result.
An example implementation for numbers >= 1:
float ln(float y) {
    int log2;
    float divisor, x, result;
    log2 = msb((int)y); // See: https://stackoverflow.com/a/4970859/6630230
    divisor = (float)(1 << log2);
    x = y / divisor;    // normalized value between [1.0, 2.0]
    result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
    result += ((float)log2) * 0.69314718; // ln(2) = 0.69314718
    return result;
}
Although if you plan to use it only in the [1.0, 2.0] interval, then the function is like:
float ln(float x) {
    return -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
}
The Taylor series for e^x converges extremely quickly, and you can tune your implementation to the precision that you need. (http://en.wikipedia.org/wiki/Taylor_series)
The Taylor series for log is not as nice...
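For the exp() side, a minimal sketch of the straight Taylor sum (the name my_exp and the fixed 20-term cutoff are illustrative assumptions; accuracy degrades for large |x| unless the argument is range-reduced first):

#include <stdio.h>

static double my_exp(double x)
{
    double term = 1.0, sum = 1.0;
    int n;
    for (n = 1; n < 20; n++) {   // e^x = 1 + x + x^2/2! + x^3/3! + ...
        term *= x / n;           // builds x^n/n! incrementally
        sum += term;
    }
    return sum;
}

int main(void)
{
    printf("%.9f\n", my_exp(1.0)); // ~2.718281828
    return 0;
}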
If you don't need floating-point math for anything else, you may compute an approximate fractional base-2 log pretty easily. Start by shifting your value left until it's 32768 or higher and store the number of times you did that in count. Then, repeat some number of times (depending upon your desired scale factor):
n = (mult(n,n) + 32768u) >> 16; // If a function is available for 16x16->32 multiply
count<<=1;
if (n < 32768) n*=2; else count+=1;
If the above loop is repeated 8 times, then the log base 2 of the number will be count/256. If ten times, count/1024. If eleven, count/2048. Effectively, this function works by computing the integer power-of-two logarithm of n**(2^reps), but with intermediate values scaled to avoid overflow.
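Here is a small self-contained variant of that repeated-squaring idea (my own reconstruction, with 8 passes and a result scaled by 256; the function name and the handling of the integer part are assumptions, not the answer's original code):

#include <stdint.h>
#include <stdio.h>

static int32_t log2_times_256(uint32_t v)   // approximate log2(v), scaled by 256
{
    int32_t int_part = 0, frac = 0;
    uint32_t n;
    int i;
    if (v == 0) return INT32_MIN;                   // log2(0) is undefined
    while (v >= 65536u) { v >>= 1; int_part++; }    // bring v into [1, 65536)
    n = v;
    while (n < 32768u) { n <<= 1; int_part--; }     // normalize to [32768, 65536)
    // now log2(original v) = int_part + 15 + log2(n/32768), with the last term in [0, 1)
    for (i = 0; i < 8; i++) {                       // extract 8 fractional bits
        n = (n * n + 32768u) >> 16;                 // square the normalized fraction
        frac <<= 1;
        if (n < 32768u) n <<= 1;                    // fraction < 0.5: bit is 0, rescale
        else            frac |= 1;                  // fraction >= 0.5: bit is 1
    }
    return (int_part + 15) * 256 + frac;
}

int main(void)
{
    printf("%f\n", log2_times_256(1000) / 256.0);   // ~9.96 (exact log2(1000) = 9.9658)
    return 0;
}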
Would a basic table with interpolation between values work? If the range of values is limited (which is likely for your case - I doubt temperature readings have a huge range) and high precision is not required, it may work. It should be easy to test on a normal machine.
Here is one of many topics on table representation of functions: Calculating vs. lookup tables for sine value performance?
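As a rough feel for the table idea, here is a tiny sketch for ln(x) on a limited range (the 1.0..4.0 range, the 0.25 step and the names are illustrative assumptions; the table holds ln() of the grid points and the lookup interpolates linearly between them):

#include <stdio.h>

static const float ln_table[] = {
    0.000000f, 0.223144f, 0.405465f, 0.559616f, 0.693147f,  // ln(1.00)..ln(2.00)
    0.810930f, 0.916291f, 1.011601f, 1.098612f,              // ln(2.25)..ln(3.00)
    1.178655f, 1.252763f, 1.321756f, 1.386294f               // ln(3.25)..ln(4.00)
};

static float ln_lookup(float x)          // valid for 1.0 <= x <= 4.0
{
    float pos = (x - 1.0f) / 0.25f;      // position in table units
    int   i   = (int)pos;                // lower grid point
    float t   = pos - i;                 // fraction between the two grid points
    if (i >= 12) return ln_table[12];    // clamp at the upper end
    return ln_table[i] + t * (ln_table[i + 1] - ln_table[i]);
}

int main(void)
{
    printf("%f (exact 0.916291)\n", ln_lookup(2.5f));
    printf("%f (exact 1.029619)\n", ln_lookup(2.8f));
    return 0;
}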
Necromancing.
I had to implement logarithms on rational numbers.
This is how I did it:
According to Wikipedia, there is the Halley-Newton approximation method,
which can be used for very high precision.
Using Newton's method, the iteration simplifies to (implementation), which has cubic convergence to ln(x), which is way better than what the Taylor series offers.
// Using Newton's method, the iteration simplifies to (implementation)
// which has cubic convergence to ln(x).
public static double ln(double x, double epsilon)
{
    double yn = x - 1.0d; // using the first term of the taylor series as initial-value
    double yn1 = yn;
    do
    {
        yn = yn1;
        yn1 = yn + 2 * (x - System.Math.Exp(yn)) / (x + System.Math.Exp(yn));
    } while (System.Math.Abs(yn - yn1) > epsilon);
    return yn1;
}
This is not C, but C#, but I'm sure anybody capable of programming in C will be able to deduce the C code from it.
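A rough C transcription (mine, not the original author's), using math.h only for exp() and fabs():

#include <math.h>
#include <stdio.h>

static double ln_newton(double x, double epsilon)
{
    double yn = x - 1.0;   // first term of the Taylor series as the initial value
    double yn1 = yn;
    do {
        yn = yn1;
        yn1 = yn + 2.0 * (x - exp(yn)) / (x + exp(yn));
    } while (fabs(yn - yn1) > epsilon);
    return yn1;
}

int main(void)
{
    printf("%.12f\n", ln_newton(10.0, 1e-12)); // ~2.302585092994
    return 0;
}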
Furthermore, since
logn(x) = ln(x)/ln(n).
You have therefore just implemented logN as well.
public static double log(double x, double n, double epsilon)
{
    return ln(x, epsilon) / ln(n, epsilon);
}
where epsilon (error) is the minimum precision.
Now as to speed, you're probably better off using the ln cast in hardware, but as I said, I used this as a base to implement logarithms on a rational-numbers class working with arbitrary precision.
Arbitrary precision might be more important than speed, under certain circumstances.
Then, use the logarithmic identities for rational numbers:
logB(x/y) = logB(x) - logB(y)
In addition to Crouching Kitten's answer, which gave me inspiration, you can build a pseudo-recursive (at most one self-call) logarithm to avoid using polynomials. In pseudo code:
ln(x) :=
  If (x <= 0)
    return NaN
  Else if (!(1 <= x < 2))
    return LN2 * b + ln(a)
  Else
    return taylor_expansion(x - 1)
This is pretty efficient and precise since on [1; 2) the taylor series converges A LOT faster, and we get such a number 1 <= a < 2 with the first call to ln if our input is positive but not in this range.
You can find 'b' as your unbiased exponent from the data held in the float x, and 'a' from the mantissa of the float x (a is exactly the same float as x, but now with exponent biased_0 rather than exponent biased_b). LN2 should be kept as a macro in hexadecimal floating point notation IMO. You can also use http://man7.org/linux/man-pages/man3/frexp.3.html for this.
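A sketch of that scheme in C using frexp() (my own reconstruction; instead of the raw Taylor series in (a - 1) it uses the equivalent, faster-converging series 2*(z + z^3/3 + z^5/5 + ...) with z = (a-1)/(a+1), and the names and term count are illustrative):

#include <math.h>
#include <stdio.h>

#define LN2 0x1.62e42fefa39efp-1   // ln(2) as a hex-float macro

static double ln_1_to_2(double a)  // ln(a) for 1 <= a < 2
{
    double z = (a - 1.0) / (a + 1.0);   // z in [0, 1/3) on this range
    double z2 = z * z, term = z, sum = 0.0;
    int n;
    for (n = 1; n <= 29; n += 2) {      // 15 odd-power terms are ample here
        sum += term / n;
        term *= z2;
    }
    return 2.0 * sum;
}

static double my_ln(double x)
{
    int b;
    double a;
    if (x <= 0.0) return NAN;           // domain guard
    a = 2.0 * frexp(x, &b);             // x = a * 2^(b-1), with 1 <= a < 2
    return LN2 * (b - 1) + ln_1_to_2(a);
}

int main(void)
{
    printf("%.12f\n", my_ln(10.0));     // ~2.302585092994
    return 0;
}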
Also, the trick
unsigned long tmp = *(ulong*)(&d);
for "memory-casting" double to unsigned long, rather than "value-casting", is very useful to know when dealing with floats memory-wise, as bitwise operators will cause warnings or errors depending on the compiler.
Possible computation of ln(x) and expo(x) in C without <math.h> :
static double expo(double n) {
    int a = 0, b = n > 0;
    double c = 1, d = 1, e = 1;
    for (b || (n = -n); e + .00001 < (e += (d *= n) / (c *= ++a)););
    // approximately 15 iterations
    return b ? e : 1 / e;
}

static double native_log_computation(const double n) {
    // Basic logarithm computation.
    static const double euler = 2.7182818284590452354;
    unsigned a = 0, d;
    double b, c, e, f;
    if (n > 0) {
        for (c = n < 1 ? 1 / n : n; (c /= euler) > 1; ++a);
        c = 1 / (c * euler - 1), c = c + c + 1, f = c * c, b = 0;
        for (d = 1, c /= 2; e = b, b += 1 / (d * c), b - e/* > 0.0000001 */;)
            d += 2, c *= f;
    } else b = (n == 0) / 0.;
    return n < 1 ? -(a + b) : a + b;
}

static inline double native_ln(const double n) {
    // Returns the natural logarithm (base e) of N.
    return native_log_computation(n) ;
}

static inline double native_log_base(const double n, const double base) {
    // Returns the logarithm (base b) of N.
    return native_log_computation(n) / native_log_computation(base) ;
}
Try it Online
Building off @Crouching Kitten's great natural log answer above, if you need it to be accurate for inputs < 1 you can add a simple scaling factor. Below is an example in C++ that I've used in microcontrollers. It has a scaling factor of 256, making it accurate for inputs down to 1/256 = ~0.004, and up to 2^32/256 = 16777216 (due to overflow of a uint32 variable).
It's interesting to note that even on an STM32F103 Arm M3 with no FPU, the float implementation below is significantly faster (e.g. 3x or better) than the 16-bit fixed-point implementation in libfixmath (that being said, this float implementation still takes a few thousand cycles, so it's still not ~fast~).
#include <float.h>
float TempSensor::Ln(float y)
{
    // Algo from: https://stackoverflow.com/a/18454010
    // Accurate between (1 / scaling factor) < y < (2^32 / scaling factor). Read comments below for more info on how to extend this range
    float divisor, x, result;
    const float LN_2 = 0.69314718; // pre-calculated constant used in calculations
    uint32_t log2 = 0;
    // handle if input is less than or equal to zero
    if (y <= 0)
    {
        return -FLT_MAX;
    }
    // scaling factor. The polynomial below is accurate when the input y > 1, therefore using a scaling factor of 256 (aka 2^8) extends this to 1/256 or ~0.004. Given use of uint32_t, the input y must stay below 2^24 or 16777216 (aka 2^(32-8)), otherwise uint_y used below will overflow. Increasing the scaling factor will reduce the lower accuracy bound and also reduce the upper overflow bound. If you need the range to be wider, consider changing uint_y to a uint64_t
    const uint32_t SCALING_FACTOR = 256;
    const float LN_SCALING_FACTOR = 5.545177444; // this is the natural log of the scaling factor and needs to be precalculated
    y = y * SCALING_FACTOR;
    uint32_t uint_y = (uint32_t)y;
    while (uint_y >>= 1) // Convert the number to an integer and then find the location of the MSB. This is the integer portion of Log2(y). See: https://stackoverflow.com/a/4970859/6630230
    {
        log2++;
    }
    divisor = (float)(1 << log2);
    x = y / divisor; // Find the remainder value between [1.0, 2.0] then calculate the natural log of this remainder using a polynomial approximation
    result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x; // This polynomial approximates ln(x) between [1,2]
    result = result + ((float)log2) * LN_2 - LN_SCALING_FACTOR; // Using the log product rule Log(A) + Log(B) = Log(AB) and the log base change rule log_x(A) = log_y(A)/Log_y(x), calculate all the components in base e and then sum them: = Ln(x_remainder) + (log_2(x_integer) * ln(2)) - ln(SCALING_FACTOR)
    return result;
}