CORDIC arcsine implementation fails in C
I have recently implemented a library of CORDIC functions to reduce the required computational power (my project is based on a PowerPC and is extremely strict in its execution time specifications). The language is ANSI-C.
The other functions (sin/cos/atan) work within accuracy limits both in 32 and in 64 bit implementations.
Unfortunately, the asin() function fails systematically for certain inputs.
For testing purposes I have implemented a .h file to be used in a Simulink S-function. (This is only for my convenience; you can compile the following as a standalone executable with minimal changes.)
Note: I have forced 32 iterations because I am working in 32-bit precision and the maximum possible accuracy is required.
Cordic.h:
#include <stdio.h>
#include <stdlib.h>
#define FLOAT32 float
#define INT32 signed long int
#define BIT_XOR ^
#define CORDIC_1K_32 0x26DD3B6A
#define MUL_32 1073741824.0F /*needed to scale float -> int*/
#define INV_MUL_32 9.313225746E-10F /*needed to scale int -> float*/
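/* Lookup table of atan(2^-k) angles, stored in Q1.30 (i.e. scaled by MUL_32) */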
INT32 CORDIC_CTAB_32 [] = {0x3243f6a8, 0x1dac6705, 0x0fadbafc, 0x07f56ea6, 0x03feab76, 0x01ffd55b, 0x00fffaaa, 0x007fff55,
0x003fffea, 0x001ffffd, 0x000fffff, 0x0007ffff, 0x0003ffff, 0x0001ffff, 0x0000ffff, 0x00007fff,
0x00003fff, 0x00001fff, 0x00000fff, 0x000007ff, 0x000003ff, 0x000001ff, 0x000000ff, 0x0000007f,
0x0000003f, 0x0000001f, 0x0000000f, 0x00000008, 0x00000004, 0x00000002, 0x00000001, 0x00000000};
/* CORDIC Arcsine Core: vectoring mode */
INT32 CORDIC_asin(INT32 arc_in)
{
    INT32 k;
    INT32 d;
    INT32 tx;
    INT32 ty;
    INT32 x;
    INT32 y;
    INT32 z;

    x = CORDIC_1K_32;
    y = 0;
    z = 0;
    for (k = 0; k < 32; ++k)
    {
        /* d = 0 while y <= arc_in, else -1; (v BIT_XOR d) - d negates v when d == -1 */
        d = (arc_in - y) >> 31;
        tx = x - (((y >> k) BIT_XOR d) - d);
        ty = y + (((x >> k) BIT_XOR d) - d);
        z += ((CORDIC_CTAB_32[k] BIT_XOR d) - d);
        x = tx;
        y = ty;
    }
    return z;
}
/* Wrapper function for scaling in and out of the CORDIC core */
FLOAT32 asin_wrap(FLOAT32 arc)
{
    return ((FLOAT32)(CORDIC_asin((INT32)(arc * MUL_32)) * INV_MUL_32));
}
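For reference, the constants above follow from the standard CORDIC definitions: each table entry is atan(2^-k), and CORDIC_1K_32 is 1/An, the inverse of the CORDIC gain An = prod sqrt(1 + 2^-2k), all scaled by MUL_32 into Q1.30. A small generator sketch (truncation toward zero reproduces the listed values):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double An = 1.0; /* running CORDIC gain */
    int k;
    for (k = 0; k < 32; ++k)
    {
        An *= sqrt(1.0 + pow(2.0, -2.0 * k));
        /* table entry: atan(2^-k) in Q1.30 */
        printf("0x%08lx\n", (unsigned long)(atan(pow(2.0, -k)) * 1073741824.0));
    }
    /* inverse gain in Q1.30 */
    printf("CORDIC_1K_32 = 0x%08lx\n", (unsigned long)(1073741824.0 / An));
    return 0;
}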
This can be called in a manner similar to:
#include "Cordic.h"
#include "math.h"
void main()
{
y1 = asin_wrap(value_32); /*my implementation*/
y2 = asinf(value_32); /*standard math.h for comparison*/
}
The results are as shown:
Top left shows the [-1;1] input over 2000 steps (0.001 increments), bottom left the output of my function, bottom right the standard output and top right the difference of the two outputs.
It is immediately apparent that the error is not within 32-bit accuracy.
I have analysed the steps (and intermediate results) performed by my code, and it seems that at a certain point the value of y gets "close enough" to the initial value of arc_in, and something that appears related to a bit shift causes the solution to diverge.
My questions:
I am at a loss: is this error inherent in the CORDIC method, or have I made a mistake in the implementation? I was expecting decreased accuracy near the extremes, but those spikes in the middle are quite unexpected. (The most notable ones are just beyond +/- 0.6, but even with these removed there are more at smaller values, albeit not as pronounced.)
If it is inherent to the CORDIC implementation, are there known workarounds?
EDIT:
Since some comments mention it: yes, I have tested the definition of INT32; even writing
#define INT32 int32_T
does not change the results in the slightest.
The computation time on the target hardware has been measured over hundreds of repetitions of blocks of 10,000 iterations of the function, with random input in the validity range. The observed mean results (for one call of the function) are as follows:
math.h asinf() 100.00 microseconds
CORDIC asin() 5.15 microseconds
(Apparently the previous test had been faulty; a new cross-test obtained no better than an average of 100 microseconds across the validity range.)
I have apparently found a better implementation. It can be downloaded in a MATLAB version here and in C here. I will analyse its inner workings further and report later.
To review a few things mentioned in the comments:
The given code outputs values identical to another CORDIC implementation. This includes the stated inaccuracies.
The largest error is as you approach arcsin(1).
The second largest error is that the values of arcsin(0.60726) to arcsin(0.68514) all return 0.754805.
There are some vague references to inaccuracies in the CORDIC method for some functions including arcsin. The given solution is to perform "double-iterations" although I have been unable to get this to work (all values give a large amount of error).
The alternate CORDIC implementation has a comment /* |a| < 0.98 */ in the arcsin() implementation, which would seem to confirm that there are known inaccuracies close to 1.
As a rough comparison of a few different methods, consider the following results (all tests performed on a desktop Windows 7 computer using MSVC++ 2010; benchmarks timed using 10M iterations over the arcsin() range 0-1):
Question CORDIC Code: 1050 ms, 0.008 avg error, 0.173 max error
Alternate CORDIC Code (ref): 2600 ms, 0.008 avg error, 0.173 max error
atan() CORDIC Code: 2900 ms, 0.21 avg error, 0.28 max error
CORDIC Using Double-Iterations: 4700 ms, 0.26 avg error, 0.917 max error (???)
Math Built-in asin(): 200 ms, 0 avg error, 0 max error
Rational Approximation (ref): 250 ms, 0.21 avg error, 0.26 max error
Linear Table Lookup (see below): 100 ms, 0.000001 avg error, 0.00003 max error
Taylor Series (7th power, ref): 300 ms, 0.01 avg error, 0.16 max error
These results are from a desktop, so how relevant they would be for an embedded system is a good question. If in doubt, profiling/benchmarking on the relevant system is advised. Most of the solutions tested don't have very good accuracy over the range (0-1), and all but one are actually slower than the built-in asin() function.
The linear table lookup code is posted below and is my usual method for any expensive mathematical function when speed is desired over accuracy. It simply uses a 1024-element table with linear interpolation. It seems to be both the fastest and most accurate of all methods tested, although the built-in asin() is not much slower really (test it!). It can easily be adjusted for more or less accuracy by changing the size of the table.
// Please test this code before using in anything important!
#include <math.h>   // asin()
#include <stddef.h> // size_t

#define ASIN_TABLE_SIZE 1024
double asin_table[ASIN_TABLE_SIZE];

int init_asin_table(void)
{
    for (size_t i = 0; i < ASIN_TABLE_SIZE; ++i)
    {
        double f = (double) i / ASIN_TABLE_SIZE;
        asin_table[i] = asin(f);
    }
    return 0;
}

// Renamed from asin_table() so it does not collide with the array above
double asin_lookup(double a)
{
    static int s_Init = 0;
    if (!s_Init) // initialise the table automatically on the first call, or call init_asin_table() manually
    {
        s_Init = 1;
        init_asin_table();
    }
    double sign = 1.0;
    if (a < 0)
    {
        a = -a;
        sign = -1.0;
    }
    if (a > 1) return 0;
    double fi = a * ASIN_TABLE_SIZE;
    double decimal = fi - (int) fi;
    size_t i = (size_t) fi;
    if (i >= ASIN_TABLE_SIZE - 1) return sign * 3.14159265359 / 2;
    return sign * ((1.0 - decimal) * asin_table[i] + decimal * asin_table[i + 1]);
}
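A minimal usage sketch (using the renamed asin_lookup() above; the input value is only illustrative):

#include <stdio.h>

int main(void)
{
    printf("asin(0.5) ~ %f\n", asin_lookup(0.5)); /* expect ~0.523599 */
    return 0;
}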
The "single rotate" arcsine goes badly wrong when the argument is just greater than the initial value of 'x', where that is the magical scaling factor -- 1/An ~= 0.607252935 ~= 0x26DD3B6A.
This is because, for all arguments > 0, the first step always has y = 0 < arg, so d = +1, which sets y = 1/An, and leaves x = 1/An. Looking at the second step:
if arg <= 1/An, then d = -1, and the steps which follow converge to a good answer
if arg > 1/An, then d = +1, and this step moves further away from the right answer, and for a range of values a little bigger than 1/An, the subsequent steps all have d = -1, but are unable to correct the result :-(
I found:
arg = 0.607 (ie 0x26D91687), relative error 7.139E-09 -- OK
arg = 0.608 (ie 0x26E978D5), relative error 1.550E-01 -- APPALLING !!
arg = 0.685 (ie 0x2BD70A3D), relative error 2.667E-04 -- BAD !!
arg = 0.686 (ie 0x2BE76C8B), relative error 1.232E-09 -- OK, again
The descriptions of the method warn about abs(arg) >= 0.98 (or so), and I found that somewhere after 0.986 the process fails to converge and the relative error jumps to ~5E-02 and hits 1E-01 (!!) at arg=1 :-(
As you did, I also found that for 0.303 < arg < 0.313 the relative error jumps to ~3E-02, and reduces slowly until things return to normal. (In this case step 2 overshoots so far that the remaining steps cannot correct it.)
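For reference, these figures can be reproduced with a scan along the following lines (a sketch; it assumes asin_wrap() from the question is in scope, and the step size is illustrative):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a;
    /* sweep the problematic region just above 1/An */
    for (a = 0.60F; a <= 0.70F; a += 0.001F)
    {
        float ref = asinf(a);
        printf("arg = %.3f  relative error %.3E\n",
               a, fabsf((asin_wrap(a) - ref) / ref));
    }
    return 0;
}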
So... the single rotate CORDIC for arcsine looks rubbish to me :-(
Added later... when I looked even closer at the single rotate CORDIC, I found many more small regions where the relative error is BAD...
...so I would not touch this as a method at all... it's not just rubbish, it's useless.
BTW: I thoroughly recommend "Software Manual for the Elementary Functions", William Cody and William Waite, Prentice-Hall, 1980. The methods for calculating the functions are not so interesting any more (but there is a thorough, practical discussion of the relevant range-reductions required). However, for each function they give a good test procedure.
The additional source I linked at the end of the question apparently contains the solution.
The proposed code can be reduced to the following:
#define M_PI_2_32 1.57079632F
#define SQRT2_2 7.071067811865476e-001F /* sin(45°) = cos(45°) = sqrt(2)/2 */
FLOAT32 angles[] = {
7.8539816339744830962E-01F, 4.6364760900080611621E-01F, 2.4497866312686415417E-01F, 1.2435499454676143503E-01F,
6.2418809995957348474E-02F, 3.1239833430268276254E-02F, 1.5623728620476830803E-02F, 7.8123410601011112965E-03F,
3.9062301319669718276E-03F, 1.9531225164788186851E-03F, 9.7656218955931943040E-04F, 4.8828121119489827547E-04F,
2.4414062014936176402E-04F, 1.2207031189367020424E-04F, 6.1035156174208775022E-05F, 3.0517578115526096862E-05F,
1.5258789061315762107E-05F, 7.6293945311019702634E-06F, 3.8146972656064962829E-06F, 1.9073486328101870354E-06F,
9.5367431640596087942E-07F, 4.7683715820308885993E-07F, 2.3841857910155798249E-07F, 1.1920928955078068531E-07F,
5.9604644775390554414E-08F, 2.9802322387695303677E-08F, 1.4901161193847655147E-08F, 7.4505805969238279871E-09F,
3.7252902984619140453E-09F, 1.8626451492309570291E-09F, 9.3132257461547851536E-10F, 4.6566128730773925778E-10F};
FLOAT32 arcsin_cordic(FLOAT32 t)
{
    INT32 j;
    INT32 flip;
    FLOAT32 poweroftwo;
    FLOAT32 sigma;
    FLOAT32 sign_or;
    FLOAT32 theta;
    FLOAT32 x1;
    FLOAT32 x2;
    FLOAT32 y1;
    FLOAT32 y2;

    flip = 0;
    theta = 0.0F;
    x1 = 1.0F;
    y1 = 0.0F;
    poweroftwo = 1.0F;

    /* If the angle is small, use the small angle approximation */
    if ((t >= -0.002F) && (t <= 0.002F))
    {
        return t;
    }

    /* Capture the sign before any range reduction */
    if (t >= 0.0F)
    {
        sign_or = 1.0F;
    }
    else
    {
        sign_or = -1.0F;
    }

    /* Range reduction for |t| > sqrt(2)/2: asin(t) = sign*(pi/2 - asin(sqrt(1 - t^2))).
       The inv_sqrt() is the famous Fast Inverse Square Root from the Quake 3 engine,
       here used with 3 (!!) Newton iterations. */
    if ((t >= SQRT2_2) || (t <= -SQRT2_2))
    {
        t = 1.0F/inv_sqrt(1.0F - t*t);
        flip = 1;
    }

    for (j = 0; j < 32; j++)
    {
        if (y1 > t)
        {
            sigma = -1.0F;
        }
        else
        {
            sigma = 1.0F;
        }

        /* Here a double iteration is done */
        x2 = x1 - (sigma * poweroftwo * y1);
        y2 = (sigma * poweroftwo * x1) + y1;
        x1 = x2 - (sigma * poweroftwo * y2);
        y1 = (sigma * poweroftwo * x2) + y2;
        theta += 2.0F * sigma * angles[j];

        /* Compensate the CORDIC gain of the double iteration */
        t *= (1.0F + poweroftwo * poweroftwo);
        poweroftwo *= 0.5F;
    }

    /* Remove bias */
    theta -= sign_or*4.85E-8F;

    if (flip)
    {
        theta = sign_or*(M_PI_2_32 - theta);
    }
    return theta;
}
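Note that inv_sqrt() is not defined in the excerpt. A minimal sketch consistent with the comment above (the well-known Quake III fast inverse square root, with the three Newton iterations the comment mentions) could look like:

FLOAT32 inv_sqrt(FLOAT32 x)
{
    union { FLOAT32 f; INT32 i; } u; /* assumes 32-bit INT32 aliasing a 32-bit float */
    FLOAT32 xhalf = 0.5F * x;
    u.f = x;
    u.i = 0x5F3759DF - (u.i >> 1);          /* magic-constant first guess */
    u.f = u.f * (1.5F - xhalf * u.f * u.f); /* Newton iteration 1 */
    u.f = u.f * (1.5F - xhalf * u.f * u.f); /* Newton iteration 2 */
    u.f = u.f * (1.5F - xhalf * u.f * u.f); /* Newton iteration 3 */
    return u.f;
}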
The following is to be noted:
It is a "Double-Iteration" CORDIC implementation.
The angles table thus differs in construction from the old table.
The computation is done in floating point, which will cause a major increase in computation time on the target hardware.
A small bias is present in the output, removed via the theta -= sign_or*4.85E-8F; statement.
The following picture shows the absolute (left) and relative errors (right) of the old implementation (top) vs the implementation contained in this answer (bottom).
The relative error is obtained simply by dividing the CORDIC output by the output of the built-in math.h implementation; for this reason it is plotted around 1 rather than 0.
The peak relative error (when not dividing by zero) is 1.0728836e-006.
The average relative error is 2.0253509e-007 (close to 32-bit accuracy).
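For reference, a sketch of the kind of sweep used to gather such statistics (step size and endpoints are illustrative; it assumes arcsin_cordic() above is in scope):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0, peak = 0.0;
    long n = 0;
    float t;
    for (t = -0.999F; t <= 0.999F; t += 0.001F)
    {
        float ref = asinf(t);
        if (ref != 0.0F) /* skip the zero crossing to avoid dividing by zero */
        {
            double rel = fabs((double)arcsin_cordic(t) / ref - 1.0);
            sum += rel;
            if (rel > peak)
            {
                peak = rel;
            }
            ++n;
        }
    }
    printf("average relative error %e, peak %e\n", sum / n, peak);
    return 0;
}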
For the iterative process to converge, it is necessary that any "wrong" i-th iteration can be "corrected" by the subsequent (i+1)-th, (i+2)-th, (i+3)-th, etc. iterations. In other words, at least half of the "wrong" i-th iteration must be correctable in the next (i+1)-th iteration.
For atan(1/2^i) this condition is satisfied, i.e.:
atan(1/2^(i+1)) > 1/2*atan(1/2^i)
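This inequality is easy to check numerically; a minimal sketch:

#include <math.h>
#include <stdio.h>

int main(void)
{
    int i;
    for (i = 0; i < 10; ++i)
    {
        double lhs = atan(pow(2.0, -(i + 1))); /* atan(1/2^(i+1)) */
        double rhs = 0.5 * atan(pow(2.0, -i)); /* atan(1/2^i)/2   */
        printf("i = %d: %.9f > %.9f -> %s\n", i, lhs, rhs, (lhs > rhs) ? "holds" : "fails");
    }
    return 0;
}

(The corresponding condition is what fails for the arcsine iteration, hence the double iterations described above.)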
Read more at
http://cordic-bibliography.blogspot.com/p/double-iterations-in-cordic.html
and:
http://baykov.de/CORDIC1972.htm
(note I'm the author of those pages)