Newton-Raphson in C

I have implemented the Newton-Raphson root-finding algorithm in C. I want to print out the most accurate approximation of the root possible without going into NaN land. My strategy for this is while (!isnan(x0)) { dostuff(); } but this continues to print out the result multiple times. Ideally I would like to set up a tolerance so that the iteration stops when the difference between the previous and the current x-intercept approximation is less than some range, 0.000001 in my case. I have a possible implementation below. When I input 2.999 it takes only one step, but when I input 3.0 it takes 20 steps; this seems incorrect to me.
(When I input 3.0)
λ newton_raphson 3
2.500000
2.250000
2.125000
2.062500
2.031250
2.015625
2.007812
2.003906
2.001953
2.000977
2.000488
2.000244
2.000122
2.000061
2.000031
2.000015
2.000008
2.000004
2.000002
2.000001
Took 20 operation(s) to approximate a proper root of 2.000002
within a range of 0.000001
(When I input 2.999)
λ newton_raphson 2.999
Took 1 operation(s) to approximate a proper root of 2.000000
within a range of 0.000001
My code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define RANGE 0.000001
double absolute(double number)
{
if (number < 0) return -number;
else return number;
}
double newton_raphson(double (*func)(double), double (*derivative)(double), double x0){
int count;
double temp;
count = 0;
while (!isnan(x0)) {
temp = x0;
x0 = (x0 - (func(x0)/derivative(x0)));
if (!isnan(x0))
printf("%f\n", x0);
count++;
if (absolute(temp - x0) < RANGE && count > 1)
break;
}
printf("Took %d operation(s) to approximate a proper root of %6f\nwithin a range of 0.000001\n", count, temp);
return x0;
}
/* (x-2)^2 */
double func(double x){ return pow(x-2.0, 2.0); }
/* 2x-4 */
double derivative(double x){ return 2.0*x - 4.0; }
int main(int argc, char ** argv)
{
double x0 = trunc(atof(argv[1]));
double (*funcPtr)(double) = &func; /* this is a user defined function */
double (*derivativePtr)(double) = &derivative; /* this is the derivative of that function */
double result = newton_raphson(funcPtr, derivativePtr, x0);
return 0;
}

You call trunc(x0) which turns 2.999 into 2.0. Naturally, when you start at the right answer, no iteration is needed! In other words, although you intended to use 2.999 as your starting value, you actually used 2.0.
Simply remove the call to trunc().
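For what it's worth, the corrected line in main() is just this (a hedged restatement of the fix, not new code from the question):
double x0 = atof(argv[1]); /* no trunc(): keep 2.999 as the actual starting value */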

Worth pointing out: taking 20 steps to converge is not actually anomalous; because you are converging to a multiple root, the convergence is only linear instead of the typical quadratic convergence that Newton-Raphson gives in the general case. You can see this in the fact that your error is halved with each iteration (with the usual quadratic convergence, you would get twice as many correct digits on each iteration, and converge much, much faster).
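As an aside (my own illustration, not part of the original answer): when the multiplicity m of the root is known, multiplying the Newton step by m restores fast convergence. For f(x) = (x-2)^2 the multiplicity is 2, and the modified update lands exactly on the root in one step:
/* hedged sketch: modified Newton step for a root of known multiplicity m = 2 */
x0 = x0 - 2.0 * func(x0) / derivative(x0); /* x - 2*(x-2)^2 / (2*(x-2)) == 2 exactly */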


C program, can't for the life of me figure it out

Ok, so I have to add an iterative function to my code that loads a bunch of numbers that eventually reach the sqrt of the number that is input by the user, all by using a while loop. The problem is, the number does not go into the function, and it loops indefinitely because the condition never becomes false. Any help?
#include <stdio.h>
#include <math.h>
int main(void)
{
double in, out, var, new_guess, old_guess;
printf("Enter a number: ");
scanf("%lf", &in);
while(fabs(in - sqrt(old_guess)) >= 1e-5) {
new_guess = (old_guess + (in / old_guess)) / 2.0;
printf("%11.5lf", old_guess);
}
printf("Estimated square root of %11.5lf: %11.5lf\n", in, new_guess);
return 0;
}
Once you get all your syntax issues resolved, you will still never get the desired result because the math in your predictor/corrector method will never converge. Specifically, fabs(in - sqrt(old_guess)) will always be >= 1e-5 as in will always be greater than the sqrt of old_guess.
Further, if you are using a predictor/corrector method to compute the square root of a number, it rather defeats the purpose to use sqrt in the iteration. If you were going to use the sqrt function to find the answer, you could simply do:
double answer = sqrt (in); /* problem solved */
The purpose of an iterative method is to converge on a solution by using either a rate or an average difference to repeatedly refine your guess until it satisfies some condition, like an error tolerance between repeated terms (which is what you appear to be attempting here).
To iteratively find the square root of a number using the method you are attempting to use, you first find the next lower or higher perfect square of the number entered by the user. A simple brute force of starting at 1 and incrementing x until x * x is no longer less than in is fine.
You then divide the input by the perfect square to predict the answer, and then take the average of the input divided by the predicted answer plus the predicted answer to correct for error between the terms (and repeat until your error tolerance is reached).
Note that you should also include an iteration limit to protect against an endless loop if your solution does not converge for some reason.
Putting it all together, you could do something similar to:
#include <stdio.h>
#include <math.h>
#define ILIM 64 /* max iteration limit */
#define TOL 1e-5 /* tolerance */
int main(void)
{
double in, n = 0, new_guess, old_guess, root = 1;
printf ("Enter a number: ");
if (scanf ("%lf", &in) != 1) {
fprintf (stderr, "error: invalid input.\n");
return 1;
}
while (root * root < in) /* find next larger perfect square */
root++;
/* compute initial old/new_guess */
old_guess = (in / root + root) / 2.0;
new_guess = (in / old_guess + old_guess) / 2.0;
/* compare old/new_guess, repeat until limit or tolerance met */
while (n++ < ILIM && fabs (new_guess - old_guess) >= TOL) {
old_guess = new_guess;
new_guess = (in / old_guess + old_guess) / 2.0;
}
printf ("Estimated square root of %.5f: %.5f\n", in, new_guess);
printf ("Actual : %.5f\n", sqrt (in));
return 0;
}
(note: sqrt is only used to provide a comparison with your iterative solution)
Example Use/Output
$ ./bin/sqrthelp
Enter a number: 9
Estimated square root of 9.00000: 3.00000
Actual : 3.00000
$ ./bin/sqrthelp
Enter a number: 9.6
Estimated square root of 9.60000: 3.09839
Actual : 3.09839
$ ./bin/sqrthelp
Enter a number: 10
Estimated square root of 10.00000: 3.16228
Actual : 3.16228
$ ./bin/sqrthelp
Enter a number: 24
Estimated square root of 24.00000: 4.89898
Actual : 4.89898
$ ./bin/sqrthelp
Enter a number: 25
Estimated square root of 25.00000: 5.00000
Actual : 5.00000
$ ./bin/sqrthelp
Enter a number: 30
Estimated square root of 30.00000: 5.47723
Actual : 5.47723

I do *not* want correct rounding for function exp

The GCC implementation of the C mathematical library on Debian systems apparently has an (IEEE 754-2008)-compliant implementation of the function exp, implying that rounding is always correct:
(from Wikipedia) The IEEE floating point standard guarantees that add, subtract, multiply, divide, fused multiply–add, square root, and floating point remainder will give the correctly rounded result of the infinite precision operation. No such guarantee was given in the 1985 standard for more complex functions and they are typically only accurate to within the last bit at best. However, the 2008 standard guarantees that conforming implementations will give correctly rounded results which respect the active rounding mode; implementation of the functions, however, is optional.
It turns out that I am encountering a case where this feature is actually a hindrance, because the exact result of the exp function is often almost exactly halfway between two consecutive double values (1), and the implementation then carries out a great many further computations, losing up to a factor of 400 (!) in speed: this was actually the explanation of my (ill-asked :-S) Question #43530011.
(1) More precisely, this happens when the argument of exp turns out to be of the form (2k + 1) × 2^-53 with k a rather small integer (like 242, for instance). In particular, the computations involved in pow (1. + x, 0.5) tend to call exp with such an argument when x is of the order of magnitude of 2^-44.
Since implementations of correct rounding can be so time-consuming in certain circumstances, I guess that the developers will also have devised a way to get a slightly less precise result (say, only up to 0.6 ULP or something like this) in a time which is (roughly) bounded for every value of the argument in a given range… (2)
… But how to do this??
(2) What I mean is that I just do not want some exceptional values of the argument, like (2k + 1) × 2^-53, to be much more time-consuming than most values of the same order of magnitude; but of course I do not mind if some exceptional values of the argument go much faster, or if large arguments (in absolute value) need a larger computation time.
Here is a minimal program showing the phenomenon:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
int main (void)
{
int i;
double a, c;
c = 0;
clock_t start = clock ();
for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
{
a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
c += exp (a); // Just to be sure that the compiler will actually perform the computation of exp (a).
}
clock_t stop = clock ();
printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
printf ("Clock time spent: %d\n", stop - start);
return 0;
}
Now after gcc -std=c99 program53.c -lm -o program53:
$ ./program53
1.000000e+06
Clock time spent: 13470008
$ ./program53
1.000000e+06
Clock time spent: 13292721
$ ./program53
1.000000e+06
Clock time spent: 13201616
On the other hand, with program52 and program54 (got by replacing 0x20000000000000 by resp. 0x10000000000000 and 0x40000000000000):
$ ./program52
1.000000e+06
Clock time spent: 83594
$ ./program52
1.000000e+06
Clock time spent: 69095
$ ./program52
1.000000e+06
Clock time spent: 54694
$ ./program54
1.000000e+06
Clock time spent: 86151
$ ./program54
1.000000e+06
Clock time spent: 74209
$ ./program54
1.000000e+06
Clock time spent: 78612
Beware, the phenomenon is implementation-dependent! Apparently, among the common implementations, only those of the Debian systems (including Ubuntu) show this phenomenon.
P.S.: I hope that my question is not a duplicate: I searched for a similar question thoroughly without success, but maybe I did not use the relevant keywords… :-/
To answer the general question on why the library functions are required to give correctly rounded results:
Floating-point is hard, and often counterintuitive. Not every programmer has read what they should have. When libraries used to allow some slightly inaccurate rounding, people complained about the precision of the library function when their inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people cannot shift the blame to them.
In many cases, specific knowledge about floating point algorithms can produce considerable improvements to accuracy and/or performance, like in the testcase:
Taking the exp() of numbers very close to 0 in floating-point numbers is problematic, since the result is a number close to 1 while all the precision is in the difference to one, so most significant digits are lost. It is more precise (and significantly faster in this testcase) to compute exp(x) - 1 through the C math library function expm1(x). If the exp() itself is really needed, it is still much faster to do expm1(x) + 1.
A similar concern exists for computing log(1 + x), for which there is the function log1p(x).
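As a small self-contained illustration of that precision issue (my own example, not from the question): for tiny x, 1 + x rounds to exactly 1.0 in double, so log(1 + x) returns 0, while log1p(x) keeps the information.
#include <math.h>
#include <stdio.h>
int main (void)
{
double x = 1e-18; /* 1 + x rounds to exactly 1.0 in double */
printf ("log(1+x) = %.3e\n", log (1.0 + x)); /* prints 0.000e+00 */
printf ("log1p(x) = %.3e\n", log1p (x)); /* prints 1.000e-18 */
return 0;
}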
A quick fix that speeds up the provided testcase:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
int main (void)
{
int i;
double a, c;
c = 0;
clock_t start = clock ();
for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
{
a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
c += expm1 (a) + 1; // replace exp() with expm1() + 1
}
clock_t stop = clock ();
printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
printf ("Clock time spent: %d\n", stop - start);
return 0;
}
For this case, the timings on my machine are thus:
Original code
1.000000e+06
Clock time spent: 21543338
Modified code
1.000000e+06
Clock time spent: 55076
Programmers with advanced knowledge about the accompanying trade-offs may sometimes consider using approximate results where the precision is not critical.
For an experienced programmer it may be possible to write an approximate implementation of a slow function using methods like Newton-Raphson, Taylor or Maclaurin polynomials, deliberately inexactly rounded specialty functions from libraries like Intel's MKL or AMD's ACML, relaxing the floating-point standard compliance of the compiler, reducing precision to IEEE 754 binary32 (float), or a combination of these.
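As one hedged illustration of the last option (dropping to binary32), the accumulation in the testcase could call expf instead of exp; whether the accuracy loss (roughly 1e-7 relative) is acceptable, and whether it actually avoids the slow correctly-rounded path, is implementation-dependent:
c += (double) expf ((float) a); /* single-precision exp: not correctly rounded to double */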
Note that a better description of the problem would enable a better answer.
Regarding your comment on @EOF's answer, the "write your own" remark from @NominalAnimal seems simple enough here, even trivial, as follows.
Your original code above seems to have a max possible argument for exp() of a=(1+2*0x400)/0x2000...=4.55e-13 (that should really be 2*0x3FF, and I'm counting 13 zeroes after your 0x2000... which makes it 2x16^13). So that 4.55e-13 max argument is very, very small.
And then the trivial Taylor expansion is exp(a)=1+a+(a^2)/2+(a^3)/6+... which already gives you all of double's precision for such small arguments. Now, you'll have to discard the 1 part, as explained above, and then that just reduces to expm1(a)=a*(1.+a*(1.+a/3.)/2.). And that should go pretty darn quick! Just make sure a stays small. If it gets a little bigger, just add the next term, a^4/24 (you see how to do that?).
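In case it helps, here is a hedged sketch of the Horner form with that next a^4/24 term folded in:
#define taylorm1_4(a) ((a)*(1. + (a)*(1. + (a)*(1. + (a)/4.)/3.)/2.)) /* a + a^2/2 + a^3/6 + a^4/24 */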
>>EDIT<<
I modified the OP's test program as follows to test a little more stuff (discussion follows code)
/* https://stackoverflow.com/questions/44346371/
i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 16 /*denominator will be (multiplier)xBASE^EXPON*/
#define EXPON 13
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/
int main (int argc, char *argv[]) {
int N = (argc>1?atoi(argv[1]):1e6),
multiplier = (argc>2?atoi(argv[2]):2),
isexp = (argc>3?atoi(argv[3]):1); /* flags to turn on/off exp() */
int isexpm1 = 1; /* and expm1() for timing tests*/
int i, n=0;
double denom = ((double)multiplier)*pow((double)BASE,(double)EXPON);
double a, c=0.0, cm1=0.0, tm1=0.0;
clock_t start = clock();
n=0; c=cm1=tm1=0.0;
/* --- to smooth random fluctuations, do the same type of computation
a large number of (N) times with different values --- */
for (i=0; i<N; i++) {
n++;
a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
significant digits, and its last non-zero
digit is at (fixed-point) position 53. */
if ( isexp ) c += exp(a); /* turn this off to time expm1() alone */
if ( isexpm1 ) { /* you can turn this off to time exp() alone, */
cm1 += expm1(a); /* but difference is negligible */
tm1 += taylorm1(a); }
} /* --- end-of-for(i) --- */
int nticks = (int)(clock()-start);
printf ("N=%d, denom=%dx%d^%d, Clock time: %d (%.2f secs)\n",
n, multiplier,BASE,EXPON,
nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
c,c-(double)n,cm1,tm1);
return 0;
} /* --- end-of-function main() --- */
Compile and run it as test to reproduce OP's 0x2000... scenario, or run it with (up to three) optional args test #trials multiplier timeexp where #trials defaults to the OP's 1000000, and multiplier defaults to 2 for the OP's 2x16^13 (change it to 4, etc., for her other tests). For the last arg, timeexp, enter a 0 to do only the expm1() (and my unnecessary Taylor-like) calculation. The point of that is to show that the bad-timing cases displayed by the OP disappear with expm1(), which takes "no time at all" regardless of multiplier.
So default runs, test and test 1000000 4, produce (okay, I called the program rounding)...
bash-4.3$ ./rounding
N=1000000, denom=2x16^13, Clock time: 11155070 (11.16 secs)
c=1.00000000000000023283e+06,
c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 4
N=1000000, denom=4x16^13, Clock time: 200211 (0.20 secs)
c=1.00000000000000011642e+06,
c-n=1.164153e-10, cm1=5.680083e-08, tm1=5.680083e-08
So the first thing you'll note is that the OP's c-n using exp() differs substantially from both cm1==tm1 using expm1() and my taylor approx. If you reduce N they come into agreement, as follows...
N=10, denom=2x16^13, Clock time: 941 (0.00 secs)
c=1.00000000000007140954e+01,
c-n=7.140954e-13, cm1=7.127632e-13, tm1=7.127632e-13
bash-4.3$ ./rounding 100
N=100, denom=2x16^13, Clock time: 5506 (0.01 secs)
c=1.00000000000010103918e+02,
c-n=1.010392e-11, cm1=1.008393e-11, tm1=1.008393e-11
bash-4.3$ ./rounding 1000
N=1000, denom=2x16^13, Clock time: 44196 (0.04 secs)
c=1.00000000000011345946e+03,
c-n=1.134595e-10, cm1=1.140730e-10, tm1=1.140730e-10
bash-4.3$ ./rounding 10000
N=10000, denom=2x16^13, Clock time: 227215 (0.23 secs)
c=1.00000000000002328306e+04,
c-n=2.328306e-10, cm1=1.131288e-09, tm1=1.131288e-09
bash-4.3$ ./rounding 100000
N=100000, denom=2x16^13, Clock time: 1206348 (1.21 secs)
c=1.00000000000000232831e+05,
c-n=2.328306e-10, cm1=1.133611e-08, tm1=1.133611e-08
And as far as timing of exp() versus expm1() is concerned, see for yourself...
bash-4.3$ ./rounding 1000000 2
N=1000000, denom=2x16^13, Clock time: 11168388 (11.17 secs)
c=1.00000000000000023283e+06,
c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 2 0
N=1000000, denom=2x16^13, Clock time: 24064 (0.02 secs)
c=0.00000000000000000000e+00,
c-n=-1.000000e+06, cm1=1.136017e-07, tm1=1.136017e-07
Question: you'll note that once the exp() calculation reaches N=10000 trials, its sum remains constant regardless of larger N. Not sure why that would be happening.
>>__SECOND EDIT__<<
Okay, @EOF, "you made me look" with your "hierarchical accumulation" comment. And that indeed works to bring the exp() sum closer (much closer) to the (presumably correct) expm1() sum. The modified code is immediately below, followed by a discussion. But one discussion note here: recall multiplier from above. That's gone, and in its place is expon, so that the denominator is now 2^expon where the default is 53, matching the OP's default (and I believe better matching how she was thinking about it). Okay, and here's the code...
/* https://stackoverflow.com/questions/44346371/
i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 2 /*denominator=2^EXPON, 2^53=2x16^13 default */
#define EXPON 53
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/
int main (int argc, char *argv[]) {
int N = (argc>1?atoi(argv[1]):1e6),
expon = (argc>2?atoi(argv[2]):EXPON),
isexp = (argc>3?atoi(argv[3]):1), /* flags to turn on/off exp() */
ncparts = (argc>4?atoi(argv[4]):1), /* #partial sums for c */
binsize = (argc>5?atoi(argv[5]):10);/* #doubles to sum in each bin */
int isexpm1 = 1; /* and expm1() for timing tests*/
int i, n=0;
double denom = pow((double)BASE,(double)expon);
double a, c=0.0, cm1=0.0, tm1=0.0;
double csums[10], cbins[10][65537]; /* c partial sums and heirarchy */
int nbins[10], ibin=0; /* start at lowest level */
clock_t start = clock();
n=0; c=cm1=tm1=0.0;
if ( ncparts > 65536 ) ncparts=65536; /* array size check */
if ( ncparts > 1 ) for(i=0;i<ncparts;i++) cbins[0][i]=0.0; /*init bin#0*/
/* --- to smooth random fluctuations, do the same type of computation
a large number of (N) times with different values --- */
for (i=0; i<N; i++) {
n++;
a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
significant digits, and its last non-zero
digit is at (fixed-point) position 53. */
if ( isexp ) { /* turn this off to time expm1() alone */
double expa = exp(a); /* exp(a) */
c += expa; /* just accumulate in a single "bin" */
if ( ncparts > 1 ) cbins[0][n%ncparts] += expa; } /* accum in ncparts */
if ( isexpm1 ) { /* you can turn this off to time exp() alone, */
cm1 += expm1(a); /* but difference is negligible */
tm1 += taylorm1(a); }
} /* --- end-of-for(i) --- */
int nticks = (int)(clock()-start);
if ( ncparts > 1 ) { /* need to sum the partial-sum bins */
nbins[ibin=0] = ncparts; /* lowest-level has everything */
while ( nbins[ibin] > binsize ) { /* need another heirarchy level */
if ( ibin >= 9 ) break; /* no more bins */
ibin++; /* next available heirarchy bin level */
nbins[ibin] = (nbins[ibin-1]+(binsize-1))/binsize; /*#bins this level*/
for(i=0;i<nbins[ibin];i++) cbins[ibin][i]=0.0; /* init bins */
for(i=0;i<nbins[ibin-1];i++) {
cbins[ibin][(i+1)%nbins[ibin]] += cbins[ibin-1][i]; /*accum in nbins*/
csums[ibin-1] += cbins[ibin-1][i]; } /* accumulate in "one bin" */
} /* --- end-of-while(nprevbins>binsize) --- */
for(i=0;i<nbins[ibin];i++) csums[ibin] += cbins[ibin][i]; /*highest level*/
} /* --- end-of-if(ncparts>1) --- */
printf ("N=%d, denom=%d^%d, Clock time: %d (%.2f secs)\n", n, BASE,expon,
nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
c,c-(double)n,cm1,tm1);
if ( ncparts > 1 ) { printf("\t binsize=%d...\n",binsize);
for (i=0;i<=ibin;i++) /* display heirarchy */
printf("\t level#%d: #bins=%5d, c-n=%e\n",
i,nbins[i],csums[i]-(double)n); }
return 0;
} /* --- end-of-function main() --- */
Okay, and now you can notice two additional command-line args following the old timeexp. The first is ncparts, the initial number of bins into which the entire #trials will be distributed. So at the lowest level of the hierarchy, each bin should (modulo bugs:) have the sum of #trials/ncparts doubles. The argument after that is binsize, which will be the number of doubles summed in each bin at every successive level, until the last level has fewer (or equally many) #bins than binsize. So here's an example dividing 1000000 trials into 50000 bins, meaning 20 doubles/bin at the lowest level, and 5 doubles/bin thereafter...
bash-4.3$ ./rounding 1000000 53 1 50000 5
N=1000000, denom=2^53, Clock time: 11129803 (11.13 secs)
c=1.00000000000000465661e+06,
c-n=4.656613e-09, cm1=1.136017e-07, tm1=1.136017e-07
binsize=5...
level#0: #bins=50000, c-n=4.656613e-09
level#1: #bins=10002, c-n=1.734588e-08
level#2: #bins= 2002, c-n=7.974450e-08
level#3: #bins= 402, c-n=1.059379e-07
level#4: #bins= 82, c-n=1.133885e-07
level#5: #bins= 18, c-n=1.136214e-07
level#6: #bins= 5, c-n=1.138542e-07
Note how the c-n for exp() converges pretty nicely towards the expm1() value. But note how it's best at level#5, and isn't converging uniformly at all. And note if you break the #trials into only 5000 initial bins, you get just as good a result,
bash-4.3$ ./rounding 1000000 53 1 5000 5
N=1000000, denom=2^53, Clock time: 11165924 (11.17 secs)
c=1.00000000000003527384e+06,
c-n=3.527384e-08, cm1=1.136017e-07, tm1=1.136017e-07
binsize=5...
level#0: #bins= 5000, c-n=3.527384e-08
level#1: #bins= 1002, c-n=1.164153e-07
level#2: #bins= 202, c-n=1.158332e-07
level#3: #bins= 42, c-n=1.136214e-07
level#4: #bins= 10, c-n=1.137378e-07
level#5: #bins= 4, c-n=1.136214e-07
In fact, playing with ncparts and binsize doesn't seem to show much sensitivity, and it's not always "more is better" (i.e., less for binsize) either. So I'm not sure exactly what's going on. Could be a bug (or two), or could be yet another question for @EOF...???
>>EDIT -- example showing pair addition "binary tree" hierarchy<<
Example below added as per @EOF's comment
(Note: re-copy the preceding code. I had to edit the nbins[ibin] calculation for each next level to nbins[ibin]=(nbins[ibin-1]+(binsize-1))/binsize; from nbins[ibin]=(nbins[ibin-1]+2*binsize)/binsize; which was "too conservative" to create the ...16,8,4,2 sequence)
bash-4.3$ ./rounding 1024 53 1 512 2
N=1024, denom=2^53, Clock time: 36750 (0.04 secs)
c=1.02400000000011573320e+03,
c-n=1.157332e-10, cm1=1.164226e-10, tm1=1.164226e-10
binsize=2...
level#0: #bins= 512, c-n=1.159606e-10
level#1: #bins= 256, c-n=1.166427e-10
level#2: #bins= 128, c-n=1.166427e-10
level#3: #bins= 64, c-n=1.161879e-10
level#4: #bins= 32, c-n=1.166427e-10
level#5: #bins= 16, c-n=1.166427e-10
level#6: #bins= 8, c-n=1.166427e-10
level#7: #bins= 4, c-n=1.166427e-10
level#8: #bins= 2, c-n=1.164153e-10
>>EDIT -- to show @EOF's elegant solution in comment below<<
"Pair addition" can be elegantly accomplished recursively, as per @EOF's comment below, which I'm reproducing here. (Note case 0/1 at end-of-recursion to handle n even/odd.)
/* Quoting from EOF's comment...
What I (EOF) proposed is effectively a binary tree of additions:
a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
Like this: Add adjacent pairs of elements, this produces
a new sequence of n/2 elements.
Recurse until only one element is left.
(Note that this will require n/2 elements of storage,
rather than a fixed number of bins like your implementation) */
double trecu(double *vals, double sum, int n) {
int midn = n/2;
switch (n) {
case 0: break;
case 1: sum += *vals; break;
default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
return(sum);
}
This is an "answer"/followup to EOF's preceding comments re his trecu() algorithm and code for his "binary tree summation" suggestion. "Prerequisites" before reading this are reading that discussion. It would be nice to collect all that in one organized place, but I haven't done that yet...
...What I did do was build EOF's trecu() into the test program from the preceding answer that I'd written by modifying the OP's original test program. But then I found that trecu() generated exactly (and I mean exactly) the same answer as the "plain sum" c using exp(), not the sum cm1 using expm1() that we'd expected from a more accurate binary tree summation.
But that test program's a bit (maybe two bits:) "convoluted" (or, as EOF said, "unreadable"), so I wrote a separate smaller test program, given below (with example runs and discussion below that), to separately test/exercise trecu(). Moreover, I also wrote function bintreesum() into the code below, which abstracts/encapsulates the iterative code for binary tree summation that I'd embedded into the preceding test program. In that preceding case, my iterative code indeed came close to the cm1 answer, which is why I'd expected EOF's recursive trecu() to do the same. Long-and-short of it is that, below, same thing happens -- bintreesum() remains close to correct answer, while trecu() gets further away, exactly reproducing the "plain sum".
What we're summing below is just sum(i), i=1...n, which is just the well-known n(n+1)/2. But that's not quite right -- to reproduce the OP's problem, the summand is not i alone but rather 1+i*10^e, where e (negative, -10 by default) can be given on the command line. So for, say, n=5, you don't get 15 but rather 5.000...00015, or for n=6 you get 6.000...00021, etc. And to avoid a long, long format, I printf() sum-n to remove that integer part. Okay??? So here's the code...
/* Quoting from EOF's comment...
What I (EOF) proposed is effectively a binary tree of additions:
a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
Like this: Add adjacent pairs of elements, this produces
a new sequence of n/2 elements.
Recurse until only one element is left. */
#include <stdio.h>
#include <stdlib.h>
double trecu(double *vals, double sum, int n) {
int midn = n/2;
switch (n) {
case 0: break;
case 1: sum += *vals; break;
default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
return(sum);
} /* --- end-of-function trecu() --- */
double bintreesum(double *vals, int n, int binsize) {
double binsum = 0.0;
int nbin0 = (n+(binsize-1))/binsize,
nbin1 = (nbin0+(binsize-1))/binsize,
nbins[2] = { nbin0, nbin1 };
double *vbins[2] = {
(double *)malloc(nbin0*sizeof(double)),
(double *)malloc(nbin1*sizeof(double)) },
*vbin0=vbins[0], *vbin1=vbins[1];
int ibin=0, i;
for ( i=0; i<nbin0; i++ ) vbin0[i] = 0.0;
for ( i=0; i<n; i++ ) vbin0[i%nbin0] += vals[i];
while ( nbins[ibin] > 1 ) {
int jbin = 1-ibin; /* other bin, 0<-->1 */
nbins[jbin] = (nbins[ibin]+(binsize-1))/binsize;
for ( i=0; i<nbins[jbin]; i++ ) vbins[jbin][i] = 0.0;
for ( i=0; i<nbins[ibin]; i++ )
vbins[jbin][i%nbins[jbin]] += vbins[ibin][i];
ibin = jbin; /* swap bins for next pass */
} /* --- end-of-while(nbins[ibin]>0) --- */
binsum = vbins[ibin][0];
free((void *)vbins[0]); free((void *)vbins[1]);
return ( binsum );
} /* --- end-of-function bintreesum() --- */
#if defined(TESTTRECU)
#include <math.h>
#define MAXN (2000000)
int main(int argc, char *argv[]) {
int N = (argc>1? atoi(argv[1]) : 1000000 ),
e = (argc>2? atoi(argv[2]) : -10 ),
binsize = (argc>3? atoi(argv[3]) : 2 );
double tens = pow(10.0,(double)e);
double *vals = (double *)malloc(sizeof(double)*MAXN),
sum = 0.0;
double trecu(), bintreesum();
int i;
if ( N > MAXN ) N=MAXN;
for ( i=0; i<N; i++ ) vals[i] = 1.0 + tens*(double)(i+1);
for ( i=0; i<N; i++ ) sum += vals[i];
printf(" N=%d, Sum_i=1^N {1.0 + i*%.1e} - N = %.8e,\n"
"\t plain_sum-N = %.8e,\n"
"\t trecu-N = %.8e,\n"
"\t bintreesum-N = %.8e \n",
N, tens, tens*((double)N)*((double)(N+1))/2.0,
sum-(double)N,
trecu(vals,0.0,N)-(double)N,
bintreesum(vals,N,binsize)-(double)N );
} /* --- end-of-function main() --- */
#endif
So if you save that as trecu.c, then compile it as cc -DTESTTRECU trecu.c -lm -o trecu. And then run with zero to three optional command-line args as trecu #trials e binsize. Defaults are #trials=1000000 (like the OP's program), e=-10, and binsize=2 (for my bintreesum() function to do a binary-tree sum rather than larger-size bins).
And here are some test results illustrating the problem described above,
bash-4.3$ ./trecu
N=1000000, Sum_i=1^N {1.0 + i*1.0e-10} - N = 5.00000500e+01,
plain_sum-N = 5.00000500e+01,
trecu-N = 5.00000500e+01,
bintreesum-N = 5.00000500e+01
bash-4.3$ ./trecu 1000000 -15
N=1000000, Sum_i=1^N {1.0 + i*1.0e-15} - N = 5.00000500e-04,
plain_sum-N = 5.01087168e-04,
trecu-N = 5.01087168e-04,
bintreesum-N = 5.00000548e-04
bash-4.3$
bash-4.3$ ./trecu 1000000 -16
N=1000000, Sum_i=1^N {1.0 + i*1.0e-16} - N = 5.00000500e-05,
plain_sum-N = 6.67552231e-05,
trecu-N = 6.67552231e-05,
bintreesum-N = 5.00001479e-05
bash-4.3$
bash-4.3$ ./trecu 1000000 -17
N=1000000, Sum_i=1^N {1.0 + i*1.0e-17} - N = 5.00000500e-06,
plain_sum-N = 0.00000000e+00,
trecu-N = 0.00000000e+00,
bintreesum-N = 4.99992166e-06
So you can see that for the default run, e=-10, everybody's doing everything right. That is, the top line that says "Sum" just does the n(n+1)/2 thing, so presumably displays the right answer. And everybody below that agrees for the default e=-10 test case. But for the e=-15 and e=-16 cases below that, trecu() exactly agrees with the plain_sum, while bintreesum stays pretty close to the right answer. And finally, for e=-17, plain_sum and trecu() have "disappeared", while bintreesum()'s still hanging in there pretty well.
So trecu()'s correctly doing the sum all right, but its recursion's apparently not doing that "binary tree" type of thing that my more straightforward iterative bintreesum()'s apparently doing correctly. And that indeed demonstrates that EOF's suggestion for "binary tree summation" realizes quite an improvement over the plain_sum for these 1+epsilon kind of cases. So we'd really like to see his trecu() recursion work!!! When I originally looked at it, I thought it did work. But that double-recursion (is there a special name for that?) in his default: case is apparently more confusing (at least to me:) than I thought. Like I said, it is doing the sum, but not the "binary tree" thing.
Okay, so who'd like to take on the challenge and explain what's going on in that trecu() recursion? And, maybe more importantly, fix it so it does what's intended. Thanks.
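For comparison, here is a minimal sketch (mine, not EOF's code) of what a genuinely pairwise, binary-tree recursion looks like: each call returns the sum of its own half, and the two half-sums are combined only at the end. trecu(), by contrast, threads a running sum through the recursion, which is why it degenerates into the plain left-to-right sum.
/* hedged sketch of a true pairwise summation */
double pairsum(double *vals, int n) {
if (n <= 0) return 0.0;
if (n == 1) return vals[0];
int midn = n/2;
return pairsum(vals, midn) + pairsum(vals + midn, n - midn); }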

gsl Error in infinite integration interval. bad integrand behavior found. How to fix it?

I'm getting the following error message after trying to do a numerical integration on an infinite interval [0,inf) using GSL in C.
gsl: qags.c:553: ERROR: bad integrand behavior found in the integration interval
Default GSL error handler invoked.
Command terminated by signal 6
Here is the function I'm integrating
double dI2dmu(double x, void * parametros){
double *p,Ep,mu,M,T;
p=(double *) parametros;
M=p[0];
T=p[1];
mu=p[2];
Ep=sqrt(x*x+M*M);
double fplus= -((exp((Ep - mu)/T)/(pow(1 + exp((Ep - mu)/T),2)*T) - exp((Ep + \
mu)/T)/(pow(1 + exp((Ep + mu)/T),2)*T))*pow(x,2))/(2.*Ep*pow(PI,2));
return fplus;
}
And the code for the integration procedure
params[0]=0.007683; //M
params[1]=0.284000;// T
params[2]=0.1; //mu
gsl_function dI2mu_u;
dI2mu_u.function = &dI2dmu;
dI2mu_u.params = &params;
gsl_integration_qagiu (&dI2mu_u, 0, 0, 1e-7, 100000,
w, &resultTest2, &error1Test2);
The function looks like this (plot omitted), which, to my eyes, is very well behaved. So, instead of performing an infinite integration, I perform the integration up to an upper limit that I consider reasonable, like in:
gsl_function G;
G.function = &dI2dmu;
G.params = &params;
gsl_integration_qags (&G, 0, 1e2*A, 0, 1e-7, 100000,
w, &result1, &error1);
Getting a result that agrees with the result of Mathematica for infinite integration
result definite up to 10*A = 0.005065263943958745
result up to infinity = nan
Mathematica result up to infinity = 0.005065260000000000
But the GSL infinite integral keeps being "nan". Any ideas? Thanks in advance for the help.
As @yonatan zuleta ochoa points out correctly, the problem is in exp(t)/pow(exp(t)+1,2). exp(t) can overflow an IEEE 754 double (DBL_MAX) for values of t as low as nextafter(log(DBL_MAX), INFINITY), which is ~7.09783e2.
When exp(t) == INFINITY,
exp(t)/pow(exp(t)+1,2) == ∞/pow(∞+1,2) == ∞/∞ == NAN
Yonatan's proposed solution is to use logarithms, which can be done as follows:
exp(t)/pow(exp(t)+1,2) == exp(log(exp(t)) - log(pow(exp(t)+1,2)))
== exp(t - 2*log(exp(t)+1))
== exp(t - 2*log1p(exp(t))) //<math.h> function avoiding loss of precision for log(exp(t)+1)) if exp(t) << 1.0
This is an entirely reasonable approach, avoiding NAN up to very high values of t. However, in your code, t == (Ep ± mu)/T can be INFINITY if abs(T) < 1.0 for values of x close to DBL_MAX, even if x is not infinity. In this case, the subtraction t - 2*log1p(exp(t)) turns into ∞ - ∞, which is NAN again.
A different approach is to replace exp(x)/pow(exp(x)+1,2) with 1.0/(pow(exp(x)+1,2)*pow(exp(x), -1)) by dividing both denominator and numerator by exp(x) (which is not zero for any finite x). This simplifies to 1.0/(exp(x)+exp(-x)+2.0).
Here is an implementation of the function avoiding NAN for values of x up to and including DBL_MAX:
static double auxfun4(double a, double b, double c, double d)
{
return 1.0/(a*b+2.0+c*d);
}
double dI2dmu(double x, void * parametros)
{
double *p = (double *) parametros;
double invT = 1.0/p[1];
double Ep = hypot(x, p[0]);
double muexp = exp(p[2]*invT);
double Epexp = exp(Ep*invT);
double muinv = 1.0/muexp;
double Epinv = 1.0/Epexp;
double subterm = auxfun4(Epexp, muinv, Epinv, muexp);
subterm -= auxfun4(Epexp, muexp, Epinv, muinv);
double fminus = subterm*(x/Ep)*invT*(0.5/(M_PI*M_PI))*x;;
return -fminus;
}
This implementation also uses hypot(x,M), rather than sqrt(x*x + M*M), and avoids calculating x*x by rearranging the order of multiplications/divisions to group x/Ep together. Since hypot(x,M) will be abs(x) for abs(x) >> abs(M), the term x/Ep approaches 1.0 for large x.
I think the problem here is that, unlike Mathematica, C does not use arbitrary precision in computing. Then, at some point when Exp[Ep] is calculated, the numerical computation overflows.
Now, GSL uses the transformation x = (1-t)/t to map the infinite interval onto (0,1].
So, for very small t it is possible to get nan results, since the behavior of your function tends to indeterminate forms (0/0 or inf/inf, etc.) for extreme values.
Maybe if you write out the terms
Exp[ ( Ep(x) - \Mu)/T ] / { 1 + Exp[( Ep(x) - \Mu )/T] }^2
using A/B = Exp[ Ln A - Ln B], you could get a better numerical behavior.
I will try it, and if I have nice results, then I'll tell you.
The solution
As I said before, you must take care the problems arising with indeterminate forms. So, lets write out the problematic terms using the logarithmic version:
double dIdmu(double x, void * parametros){
double *p,Ep,mu,M,T;
p=(double *) parametros;
M=p[0];
T=p[1];
mu=p[2];
Ep=sqrt(x*x+M*M);
double fplus= - ( exp( (Ep - mu)/T -2.0*log(1.0 + exp((Ep - mu)/T) ) ) - exp( (Ep + mu)/T -2.0*log(1.0 + exp((Ep + mu)/T) ) ) ) * pow(x,2) / (2.* T * Ep*pow(M_PI,2));
return fplus;
}
and with this main function
int main()
{
double params[3];
double resultTest2, error1Test2;
gsl_integration_workspace * w
= gsl_integration_workspace_alloc (10000);
params[0]=0.007683; //M
params[1]=0.284000;// T
params[2]=0.1; //mu
gsl_function dI2mu_u;
dI2mu_u.function = &dIdmu;
dI2mu_u.params = &params;
gsl_integration_qagiu (&dI2mu_u, 0.0, 1e-7, 1e-7, 10000, w, &resultTest2, &error1Test2);
printf("%e\n", resultTest2);
gsl_integration_workspace_free ( w);
return 0;
}
you get the answer:
-5.065288e-03.
I am curious... This is how I define the function in Mathematica
So comparing the answers:
GSL -5.065288e-03
Mathematica -0.005065287633739702

CORDIC Arcsine implementation fails

I have recently implemented a library of CORDIC functions to reduce the required computational power (my project is based on a PowerPC and is extremely strict in its execution time specifications). The language is ANSI-C.
The other functions (sin/cos/atan) work within accuracy limits both in 32 and in 64 bit implementations.
Unfortunately, the asin() function fails systematically for certain inputs.
For testing purposes I have implemented an .h file to be used in a Simulink S-Function. (This is only for my convenience, you can compile the following as a standalone .exe with minimal changes)
Note: I have forced 32 iterations because I am working in 32 bit precision and the maximum possible accuracy is required.
Cordic.h:
#include <stdio.h>
#include <stdlib.h>
#define FLOAT32 float
#define INT32 signed long int
#define BIT_XOR ^
#define CORDIC_1K_32 0x26DD3B6A
#define MUL_32 1073741824.0F /*needed to scale float -> int*/
#define INV_MUL_32 9.313225746E-10F /*needed to scale int -> float*/
INT32 CORDIC_CTAB_32 [] = {0x3243f6a8, 0x1dac6705, 0x0fadbafc, 0x07f56ea6, 0x03feab76, 0x01ffd55b, 0x00fffaaa, 0x007fff55,
0x003fffea, 0x001ffffd, 0x000fffff, 0x0007ffff, 0x0003ffff, 0x0001ffff, 0x0000ffff, 0x00007fff,
0x00003fff, 0x00001fff, 0x00000fff, 0x000007ff, 0x000003ff, 0x000001ff, 0x000000ff, 0x0000007f,
0x0000003f, 0x0000001f, 0x0000000f, 0x00000008, 0x00000004, 0x00000002, 0x00000001, 0x00000000};
/* CORDIC Arcsine Core: vectoring mode */
INT32 CORDIC_asin(INT32 arc_in)
{
INT32 k;
INT32 d;
INT32 tx;
INT32 ty;
INT32 x;
INT32 y;
INT32 z;
x=CORDIC_1K_32;
y=0;
z=0;
for (k=0; k<32; ++k)
{
d = (arc_in - y)>>(31);
tx = x - (((y>>k) BIT_XOR d) - d);
ty = y + (((x>>k) BIT_XOR d) - d);
z += ((CORDIC_CTAB_32[k] BIT_XOR d) - d);
x = tx;
y = ty;
}
return z;
}
/* Wrapper function for scaling in-out of cordic core*/
FLOAT32 asin_wrap(FLOAT32 arc)
{
return ((FLOAT32)(CORDIC_asin((INT32)(arc*MUL_32))*INV_MUL_32));
}
This can be called in a manner similar to:
#include "Cordic.h"
#include "math.h"
void main()
{
y1 = asin_wrap(value_32); /*my implementation*/
y2 = asinf(value_32); /*standard math.h for comparison*/
}
The results are as shown:
Top left shows the [-1;1] input over 2000 steps (0.001 increments), bottom left the output of my function, bottom right the standard output and top right the difference of the two outputs.
It is immediate to see that the error is not within 32 bit accuracy.
I have analysed the steps performed (and the intermediate results) by my code and it seems to me that at a certain point the value of y is "close enough" to the initial value of arc_in and what could be related to a bit-shift causes the solution to diverge.
My questions:
I am at a loss, is this error inherent in the CORDIC implementation or have I made a mistake in the implementation? I was expecting the decrease of accuracy near the extremes, but those spikes in the middle are quite unexpected. (the most notable ones are just beyond +/- 0.6, but even removed these there are more at smaller values, albeit not as pronounced)
If it is something part of the CORDIC implementation, are there known workarounds?
EDIT:
Since some comment mention it, yes, I tested the definition of INT32, even writing
#define INT32 int32_T
does not change the results by the slightest amount.
The computation time on the target hardware has been measured by hundreds of repetitions of blocks of 10,000 iterations of the function with random input in the validity range. The observed mean results (for one call of the function) are as follows:
math.h asinf() 100.00 microseconds
CORDIC asin() 5.15 microseconds
(apparently the previous test had been faulty, a new cross-test has obtained no better than an average of 100 microseconds across the validity range)
I apparently found a better implementation. It can be downloaded in a MATLAB version here and in C here. I will analyse its inner workings further and report later.
To review a few things mentioned in the comments:
The given code outputs values identical to another CORDIC implementation. This includes the stated inaccuracies.
The largest error is as you approach arcsin(1).
The second largest error is that the values of arcsin(0.60726) to arcsin(0.68514) all return 0.754805.
There are some vague references to inaccuracies in the CORDIC method for some functions including arcsin. The given solution is to perform "double-iterations" although I have been unable to get this to work (all values give a large amount of error).
The alternate CORDIC implemention has a comment /* |a| < 0.98 */ in the arcsin() implementation which would seem to reinforce that there is known inaccuracies close to 1.
As a rough comparison of a few different methods consider the following results (all tests performed on a desktop Windows 7 computer using MSVC++ 2010, benchmarks timed using 10M iterations over the arcsin() range 0-1):
Question CORDIC Code: 1050 ms, 0.008 avg error, 0.173 max error
Alternate CORDIC Code (ref): 2600 ms, 0.008 avg error, 0.173 max error
atan() CORDIC Code: 2900 ms, 0.21 avg error, 0.28 max error
CORDIC Using Double-Iterations: 4700 ms, 0.26 avg error, 0.917 max error (???)
Math Built-in asin(): 200 ms, 0 avg error, 0 max error
Rational Approximation (ref): 250 ms, 0.21 avg error, 0.26 max error
Linear Table Lookup (see below) 100 ms, 0.000001 avg error, 0.00003 max error
Taylor Series (7th power, ref): 300 ms, 0.01 avg error, 0.16 max error
These results are on a desktop so how relevant they would be for an embedded system is a good question. If in doubt, profiling/benchmarking on the relevant system would be advised. Most solutions tested don't have very good accuracy over the range (0-1) and all but one are actually slower than the built-in asin() function.
The linear table lookup code is posted below and is my usual method for any expensive mathematical function when speed is desired over accuracy. It simply uses a 1024 element table with linear interpolation. It seems to be both the fastest and most accurate of all methods tested, although the built-in asin() is not much slower really (test it!). It can easily be adjusted for more or less accuracy by changing the size of the table.
// Please test this code before using in anything important!
const size_t ASIN_TABLE_SIZE = 1024;
double asin_table[ASIN_TABLE_SIZE];
int init_asin_table (void)
{
for (size_t i = 0; i < ASIN_TABLE_SIZE; ++i)
{
float f = (float) i / ASIN_TABLE_SIZE;
asin_table[i] = asin(f);
}
return 0;
}
double asin_lookup (double a) /* renamed so it does not collide with the asin_table[] array */
{
static int s_Init = init_asin_table(); // Call automatically the first time or call it manually
double sign = 1.0;
if (a < 0)
{
a = -a;
sign = -1.0;
}
if (a > 1) return 0;
double fi = a * ASIN_TABLE_SIZE;
double decimal = fi - (int)fi;
size_t i = fi;
if (i >= ASIN_TABLE_SIZE-1) return sign * 3.14159265359/2;
return sign * ((1.0 - decimal)*asin_table[i] + decimal*asin_table[i+1]);
}
The "single rotate" arcsine goes badly wrong when the argument is just greater than the initial value of 'x', where that is the magical scaling factor -- 1/An ~= 0.607252935 ~= 0x26DD3B6A.
This is because, for all arguments > 0, the first step always has y = 0 < arg, so d = +1, which sets y = 1/An, and leaves x = 1/An. Looking at the second step:
if arg <= 1/An, then d = -1, and the steps which follow converge to a good answer
if arg > 1/An, then d = +1, and this step moves further away from the right answer, and for a range of values a little bigger than 1/An, the subsequent steps all have d = -1, but are unable to correct the result :-(
I found:
arg = 0.607 (ie 0x26D91687), relative error 7.139E-09 -- OK
arg = 0.608 (ie 0x26E978D5), relative error 1.550E-01 -- APPALLING !!
arg = 0.685 (ie 0x2BD70A3D), relative error 2.667E-04 -- BAD !!
arg = 0.686 (ie 0x2BE76C8B), relative error 1.232E-09 -- OK, again
The descriptions of the method warn about abs(arg) >= 0.98 (or so), and I found that somewhere after 0.986 the process fails to converge and the relative error jumps to ~5E-02 and hits 1E-01 (!!) at arg=1 :-(
As you did, I also found that for 0.303 < arg < 0.313 the relative error jumps to ~3E-02, and reduces slowly until things return to normal. (In this case step 2 overshoots so far that the remaining steps cannot correct it.)
So... the single rotate CORDIC for arcsine looks rubbish to me :-(
Added later... when I looked even closer at the single rotate CORDIC, I found many more small regions where the relative error is BAD...
...so I would not touch this as a method at all... it's not just rubbish, it's useless.
BTW: I thoroughly recommend "Software Manual for the Elementary Functions", William Cody and William Waite, Prentice-Hall, 1980. The methods for calculating the functions are not so interesting any more (but there is a thorough, practical discussion of the relevant range-reductions required). However, for each function they give a good test procedure.
The additional source I linked at the end of the question apparently contains the solution.
The proposed code can be reduced to the following:
#define M_PI_2_32 1.57079632F
#define SQRT2_2 7.071067811865476e-001F /* sin(45°) = cos(45°) = sqrt(2)/2 */
FLOAT32 angles[] = {
7.8539816339744830962E-01F, 4.6364760900080611621E-01F, 2.4497866312686415417E-01F, 1.2435499454676143503E-01F,
6.2418809995957348474E-02F, 3.1239833430268276254E-02F, 1.5623728620476830803E-02F, 7.8123410601011112965E-03F,
3.9062301319669718276E-03F, 1.9531225164788186851E-03F, 9.7656218955931943040E-04F, 4.8828121119489827547E-04F,
2.4414062014936176402E-04F, 1.2207031189367020424E-04F, 6.1035156174208775022E-05F, 3.0517578115526096862E-05F,
1.5258789061315762107E-05F, 7.6293945311019702634E-06F, 3.8146972656064962829E-06F, 1.9073486328101870354E-06F,
9.5367431640596087942E-07F, 4.7683715820308885993E-07F, 2.3841857910155798249E-07F, 1.1920928955078068531E-07F,
5.9604644775390554414E-08F, 2.9802322387695303677E-08F, 1.4901161193847655147E-08F, 7.4505805969238279871E-09F,
3.7252902984619140453E-09F, 1.8626451492309570291E-09F, 9.3132257461547851536E-10F, 4.6566128730773925778E-10F};
FLOAT32 arcsin_cordic(FLOAT32 t)
{
INT32 i;
INT32 j;
INT32 flip;
FLOAT32 poweroftwo;
FLOAT32 sigma;
FLOAT32 sign_or;
FLOAT32 theta;
FLOAT32 x1;
FLOAT32 x2;
FLOAT32 y1;
FLOAT32 y2;
flip = 0;
theta = 0.0F;
x1 = 1.0F;
y1 = 0.0F;
poweroftwo = 1.0F;
/* If the angle is small, use the small angle approximation */
if ((t >= -0.002F) && (t <= 0.002F))
{
return t;
}
if (t >= 0.0F)
{
sign_or = 1.0F;
}
else
{
sign_or = -1.0F;
}
/* The inv_sqrt() is the famous Fast Inverse Square Root from the Quake 3 engine
here used with 3 (!!) Newton iterations */
if ((t >= SQRT2_2) || (t <= -SQRT2_2))
{
t = 1.0F/inv_sqrt(1-t*t);
flip = 1;
}
if (t>=0.0F)
{
sign_or = 1.0F;
}
else
{
sign_or = -1.0F;
}
for ( j = 0; j < 32; j++ )
{
if (y1 > t)
{
sigma = -1.0F;
}
else
{
sigma = 1.0F;
}
/* Here a double iteration is done */
x2 = x1 - (sigma * poweroftwo * y1);
y2 = (sigma * poweroftwo * x1) + y1;
x1 = x2 - (sigma * poweroftwo * y2);
y1 = (sigma * poweroftwo * x2) + y2;
theta += 2.0F * sigma * angles[j];
t *= (1.0F + poweroftwo * poweroftwo);
poweroftwo *= 0.5F;
}
/* Remove bias */
theta -= sign_or*4.85E-8F;
if (flip)
{
theta = sign_or*(M_PI_2_32-theta);
}
return theta;
}
The following is to be noted:
It is a "Double-Iteration" CORDIC implementation.
The angles table thus differs in construction from the old table.
And the computation is done in floating point notation, this will cause a major increase in computation time on the target hardware.
A small bias is present in the output, removed via the theta -= sign_or*4.85E-8F; passage.
The following picture shows the absolute (left) and relative errors (right) of the old implementation (top) vs the implementation contained in this answer (bottom).
The relative error is obtained only by dividing the CORDIC output with the output of the built-in math.h implementation. It is plotted around 1 and not 0 for this reason.
The peak relative error (when not dividing by zero) is 1.0728836e-006.
The average relative error is 2.0253509e-007 (almost in accordance to 32 bit accuracy).
For convergence of the iterative process it is necessary that any "wrong" i-th iteration can be "corrected" in the subsequent (i+1)-th, (i+2)-th, (i+3)-th, etc. iterations. Or, in other words, at least half of the "wrong" i-th iteration must be correctable in the next (i+1)-th iteration.
For atan(1/2^i) this condition is satisfied, i.e.:
atan(1/2^(i+1)) > 1/2*atan(1/2^i)
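A quick numerical spot-check of this condition (my own sketch, not from the pages linked below):
#include <math.h>
#include <stdio.h>
int main (void)
{
/* verify atan(1/2^(i+1)) > 0.5*atan(1/2^i) for the first few CORDIC angles */
int i;
for (i = 0; i < 8; i++) {
double lhs = atan(1.0 / (1 << (i + 1)));
double rhs = 0.5 * atan(1.0 / (1 << i));
printf("i=%d lhs=%.9f rhs=%.9f %s\n", i, lhs, rhs, lhs > rhs ? "OK" : "FAIL");
}
return 0;
}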
Read more at
http://cordic-bibliography.blogspot.com/p/double-iterations-in-cordic.html
and:
http://baykov.de/CORDIC1972.htm
(note I'm the author of those pages)

Calculating the Power spectral density

I am trying to get the PSD of a real data set by making use of the fftw3 library.
To test it, I wrote the small program shown below, which generates a signal that follows a sinusoidal function
#include <stdio.h>
#include <math.h>
#define PI 3.14
int main (){
double value= 0.0;
float frequency = 5;
int i = 0 ;
double time = 0.0;
FILE* outputFile = NULL;
outputFile = fopen("sinvalues","wb+");
if(outputFile==NULL){
printf(" couldn't open the file \n");
return -1;
}
for (i = 0; i<=5000;i++){
value = sin(2*PI*frequency*time);
fwrite(&value,sizeof(double),1,outputFile);
time += (1.0/frequency);
}
fclose(outputFile);
return 0;
}
Now I'm reading the output file of the above program and trying to calculate its PSD as shown below
#include <stdio.h>
#include <fftw3.h>
#include <complex.h>
#include <stdlib.h>
#include <math.h>
#define PI 3.14
int main (){
FILE* inp = NULL;
FILE* oup = NULL;
double* value;// = 0.0;
double* result;
double spectr = 0.0 ;
int windowsSize =512;
double power_spectrum = 0.0;
fftw_plan plan;
int index=0,i ,k;
double multiplier =0.0;
inp = fopen("1","rb");
oup = fopen("psd","wb+");
value=(double*)malloc(sizeof(double)*windowsSize);
result = (double*)malloc(sizeof(double)*(windowsSize)); // what is the length that I have to choose here ?
plan =fftw_plan_r2r_1d(windowsSize,value,result,FFTW_R2HC,FFTW_ESTIMATE);
while(!feof(inp)){
index =fread(value,sizeof(double),windowsSize,inp);
// zero padding
if( index != windowsSize){
for(i=index;i<windowsSize;i++){
value[i] = 0.0;
}
}
// windowing Hann
for (i=0; i<windowsSize; i++){
multiplier = 0.5*(1-cos(2*PI*i/(windowsSize-1)));
value[i] *= multiplier;
}
fftw_execute(plan);
for(i = 0;i<(windowsSize/2 +1) ;i++){ //why only tell the half size of the window
power_spectrum = result[i]*result[i] +result[windowsSize/2 +1 -i]*result[windowsSize/2 +1 -i];
printf("%lf \t\t\t %d \n",power_spectrum,i);
fprintf(oup," %lf \n ",power_spectrum);
}
}
fclose(oup);
fclose(inp);
return 0;
}
I am not sure about the correctness of the way I am doing this, but below are the results I have obtained:
Can anyone help me trace the errors in the above approach?
Thanks in advance
UPDATE
After Hartmut's answer I've edited the code but still got the same result:
and the input data look like:
UPDATE
After increasing the sample frequency and using a window size of 2048, here is what I've got:
UPDATE
After using the ADD-ON, here is how the result looks using the window:
You combine the wrong output values to power spectrum lines. There are windowsSize / 2 + 1 real values at the beginning of result and windowsSize / 2 - 1 imaginary values at the end in reverse order. This is because the imaginary components of the first (0Hz) and last (Nyquist frequency) spectral lines are 0.
int spectrum_lines = windowsSize / 2 + 1;
power_spectrum = (double *)malloc( sizeof(double) * spectrum_lines );
power_spectrum[0] = result[0] * result[0];
for ( i = 1 ; i < windowsSize / 2 ; i++ )
power_spectrum[i] = result[i]*result[i] + result[windowsSize-i]*result[windowsSize-i];
power_spectrum[i] = result[i] * result[i];
And there is a minor mistake: You should apply the window function only to the input signal and not to the zero-padding part.
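One hedged way to read that advice (my sketch, not code from the answer): build the Hann window over the number of samples actually read (index), so the taper spans the data rather than the zero-padded tail:
if (index > 1) { /* window only the samples actually read */
for (i = 0; i < index; i++) {
multiplier = 0.5*(1 - cos(2*PI*i/(index - 1)));
value[i] *= multiplier;
}
}
/* value[index .. windowsSize-1] are already zero; no window is needed there */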
ADD-ON:
Your test program generates 5001 samples of a sinusoidal signal and then you read and analyse the first 512 samples of this signal. The result of this is that you analyse only a fraction of a period. Due to the hard cut-off of the signal it contains a wide spectrum of energy with almost unpredictable energy levels, because you do not even use PI precisely but only 3.14, which is not precise enough to do any predictable calculation.
You need to guarantee that an integer number of periods fits exactly into your analysis window of 512 samples. Therefore, you should change your test signal creation program to have exactly numberOfPeriods periods in the test signal (e.g. numberOfPeriods=1 means that one period of the sinusoid is exactly 512 samples long, 2 => 256, 3 => 512/3, 4 => 128, ...). This way, you are able to generate energy at a specific spectral line. Keep in mind that windowSize must have the same value in both programs because different sizes make this effort useless.
#define PI 3.141592653589793 // This has to be absolutely exact!
int windowSize = 512; // Total number of created samples in the test signal
int numberOfPeriods = 64; // Total number of sinoid periods in the test signal
for ( n = 0 ; n < windowSize ; ++n ) {
value = sin( (2 * PI * numberOfPeriods * n) / windowSize );
fwrite( &value, sizeof(double), 1, outputFile );
}
Some remarks to your expected output function.
Your input is a function with pure real values.
The result of a DFT has complex values.
So you have to declare the variable out not as double but as fftw_complex *out.
In general the number of dft input values is the same as the number of output values.
However, the output spectrum of a dft contains the complex amplitudes for positive
frequencies as well as for negative frequencies.
In the special case for pure real input, the amplitudes of the positive frequencies are
conjugated complex values of the amplitudes of the negative frequencies.
For that, only the frequencies of the positive spectrum are calculated,
which means that the number of the complex output values is the half of
the number of real input values.
If your input is a simple sinewave, the spectrum contains only a single frequency component.
This is true for 10, 100, 1000 or even more input samples.
All other values are zero. So it doesn't make any sense to work with a huge number of input values.
If the input data set contains a single period, the complex output value is
contained in out[1].
If the input data set contains M complete periods (in your case 5), the result is stored in out[M], here out[5].
I did some modifications on your code. To make some facts more clear.
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <complex.h>
#include "fftw3.h"
int performDFT(int nbrOfInputSamples, char *fileName)
{
int nbrOfOutputSamples;
double *in;
fftw_complex *out;
fftw_plan p;
// In the case of pure real input data,
// the output values of the positive frequencies and the negative frequencies
// are conjugated complex values.
// This means, that there no need for calculating both.
// If you have the complex values for the positive frequencies,
// you can calculate the values of the negative frequencies just by
// changing the sign of the value's imaginary part
// So the number of complex output values ( amplitudes of frequency components)
// are the half of the number of the real input values ( amplitutes in time domain):
nbrOfOutputSamples = ceil(nbrOfInputSamples/2.0);
// Create a plan for a 1D DFT with real input and complex output
in = (double*) fftw_malloc(sizeof(double) * nbrOfInputSamples);
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * nbrOfOutputSamples);
p = fftw_plan_dft_r2c_1d(nbrOfInputSamples, in, out, FFTW_ESTIMATE);
// Read data from input file to input array
FILE* inputFile = NULL;
inputFile = fopen(fileName,"r");
if(inputFile==NULL){
fprintf(stdout,"couldn't open the file %s\n", fileName);
return -1;
}
double value;
int idx = 0;
while(idx < nbrOfInputSamples && fscanf(inputFile, "%lf", &value) == 1){
in[idx++] = value; // stop at the array capacity and on read failure instead of testing feof()
}
fclose(inputFile);
// Perform the dft
fftw_execute(p);
// Print output results
char outputFileName[] = "dftvalues.txt";
FILE* outputFile = NULL;
outputFile = fopen(outputFileName,"w+");
if(outputFile==NULL){
fprintf(stdout,"couldn't open the file %s\n", outputFileName);
return -1;
}
double realVal;
double imagVal;
double powVal;
double absVal;
fprintf(stdout, " Frequency Real Imag Abs Power\n");
for (idx=0; idx<nbrOfOutputSamples; idx++) {
realVal = out[idx][0]/nbrOfInputSamples; // Indeed nbrOfInputSamples is correct!
imagVal = out[idx][1]/nbrOfInputSamples; // Indeed nbrOfInputSamples is correct!
powVal = 2*(realVal*realVal + imagVal*imagVal);
absVal = sqrt(powVal/2);
if (idx == 0) {
powVal /=2;
}
fprintf(outputFile, "%10i %10.4lf %10.4lf %10.4lf %10.4lf\n", idx, realVal, imagVal, absVal, powVal);
fprintf(stdout, "%10i %10.4lf %10.4lf %10.4lf %10.4lf\n", idx, realVal, imagVal, absVal, powVal);
// The total signal power of a frequency is the sum of the power of the posive and the negative frequency line.
// Because only the positive spectrum is calculated, the power is multiplied by two.
// However, there is only one single line in the spectrum for DC.
// This means, the DC value must not be doubled.
}
fclose(outputFile);
// Clean up
fftw_destroy_plan(p);
fftw_free(in); fftw_free(out);
return 0;
}
int main(int argc, const char * argv[]) {
// Set basic parameters
float timeIntervall = 1.0; // in seconds
int nbrOfSamples = 50; // number of Samples per time intervall, so the unit is S/s
double timeStep = timeIntervall/nbrOfSamples; // in seconds
float frequency = 5; // frequency in Hz
// The period time of the signal is 1/5Hz = 0.2s
// The number of samples per period is: nbrOfSamples/frequency = (50S/s)/5Hz = 10S
// The number of periods per time intervall is: frequency*timeIntervall = 5Hz*1.0s = (5/s)*1.0s = 5
// Open file for writing signal values
char fileName[] = "sinvalues.txt";
FILE* outputFile = NULL;
outputFile = fopen(fileName,"w+");
if(outputFile==NULL){
fprintf(stdout,"couldn't open the file %s\n", fileName);
return -1;
}
// Calculate signal values and write them to file
double time;
double value;
double dcValue = 0.2;
int idx = 0;
fprintf(stdout, " SampleNbr Signal value\n");
for (time = 0; time<=timeIntervall; time += timeStep){
value = sin(2*M_PI*frequency*time) + dcValue;
fprintf(outputFile, "%lf\n",value);
fprintf(stdout, "%10i %15.5f\n",idx++, value);
}
fclose(outputFile);
performDFT(nbrOfSamples, fileName);
return 0;
}
If the input of a dft is pure real, the output is complex in any case.
So you have to use the plan r2c (RealToComplex).
If the signal is sin(2*pi*f*t), starting at t=0, the spectrum contains a single frequency line
at f, which is pure imaginary.
If the sine has an offset in phase, like sin(2*pi*f*t+phi), the single line's value is complex.
If your sampling frequency is fs, the range of the output spectrum is -fs/2 ... +fs/2.
The real parts of the positive and negative frequencies are the same.
The imaginary parts of the positive and negative frequencies have opposite signs.
This is called conjugated complex.
If you have the complex values of the positive spectrum you can calculate the values of the
negative spectrum by changing the sign of the imaginary parts.
For this reason there is no need to compute both the positive and the negative spectrum.
One sideband holds all information.
Therefore the number of output samples in the plan r2c is half the number of input samples plus one.
To get the power of a frequency, you have to consider the positive frequency as well
as the negative frequency. However, the plan r2c delivers only the right positive half
of the spectrum. So you have to double the power of the positive side to get the total power.
By the way, the documentation of the fftw3 package describes the usage of plans quite well.
You should invest the time to go over the manual.
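As a condensed, hedged sketch of what the above boils down to (my own summary, not code taken from the manual):
int N = 512; /* number of real input samples */
int nout = N/2 + 1; /* r2c output bins: DC ... Nyquist */
double *in = fftw_malloc(sizeof(double) * N);
fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * nout);
/* ... fill in[0..N-1] with the signal ... */
fftw_plan p = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);
fftw_execute(p);
for (int k = 0; k < nout; k++) {
double re = out[k][0] / N, im = out[k][1] / N;
double power = re*re + im*im;
if (k != 0 && !(N % 2 == 0 && k == N/2)) /* double every bin except DC and Nyquist */
power *= 2.0; /* accounts for the matching negative-frequency bin */
}
fftw_destroy_plan(p);
fftw_free(in); fftw_free(out);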
I'm not sure what your question is. Your results seem reasonable, with the information provided.
As you must know, the PSD is the Fourier transform of the autocorrelation function. With sine wave inputs, your AC function will be periodic, therefore the PSD will have tones, like you've plotted.
My 'answer' is really some thought starters on debugging. It would be easier for all involved if we could post equations. You probably know that there's a signal processing section on SE these days.
First, you should give us a plot of your AC function. The inverse FT of the PSD you've shown will be a linear combination of periodic tones.
Second, try removing the window, just make it a box or skip the step if you can.
Third, try replacing the DFT with the FFT (I only skimmed the fftw3 library docs, maybe this is an option).
Lastly, try inputting white noise. You can use a Bernoulli dist, or just a Gaussian dist. The AC will be a delta function, although the sample AC will not. This should give you a (sample) white PSD distribution.
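A minimal sketch of generating Gaussian white noise for such a test (Box-Muller transform; my own example, not from the original answer):
#include <math.h>
#include <stdlib.h>
/* returns one sample of zero-mean, unit-variance Gaussian noise (Box-Muller) */
double gauss_noise(void)
{
double u1 = (rand() + 1.0) / (RAND_MAX + 2.0); /* in (0,1), avoids log(0) */
double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}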
I hope these suggestions help.
