DSP libraries - RFFT - strange results - c

Recently I've been trying to do FFT calculations on my STM32F4-Discovery evaluation board then send it to PC. I have looked into my problem - I think that I'm doing something wrong with FFT functions provided by manufacturer.
I'm using CMSIS-DSP libraries.
For now I've have been generating samples with code (if that works correct I'll do sampling by microphone).
I'm using arm_rfft_fast_f32 as my data are going to be floats in the future, but results I get in my output array are insane (I think) - I'm getting frequencies below 0.
number_of_samples = 512; (l_probek in code)
dt = 1/freq/number_of_samples
Here is my code
float32_t buffer_input[l_probek];
uint16_t i;
uint8_t mode;
float32_t dt;
float32_t freq;
bool DoFlag = false;
bool UBFlag = false;
uint32_t rozmiar = 4*l_probek;
union
{
float32_t f[l_probek];
uint8_t b[4*l_probek];
}data_out;
union
{
float32_t f[l_probek];
uint8_t b[4*l_probek];
}data_mag;
union
{
float32_t f;
uint8_t b[4];
}czest_rozdz;
/* Pointers ------------------------------------------------------------------*/
arm_rfft_fast_instance_f32 S;
arm_cfft_radix4_instance_f32 S_CFFT;
uint16_t output;
/* ---------------------------------------------------------------------------*/
int main(void)
{
freq = 5000;
dt = 0.000000390625;
_GPIO();
_LED();
_NVIC();
_EXTI(0);
arm_rfft_fast_init_f32(&S, l_probek);
GPIO_SetBits(GPIOD, LED_Green);
mode = 2;
//----------------- Infinite loop
while (1)
{
if(true)//(UBFlag == true)
for(i=0; i<l_probek; ++i)
{
buffer_input[i] = (float32_t) 15*sin(2*PI*freq*i*dt);
}
//Obliczanie FFT
arm_rfft_fast_f32(&S, buffer_input, data_out.f, 0);
//Obliczanie modulow
arm_cmplx_mag_f32(data_out.f, data_mag.f, l_probek);
USART_putdata(USART1, data_out.b, data_mag.b, rozmiar);
//USART_putdata(USART1, czest_rozdz.b, data_mag.b, rozmiar);
GPIO_ToggleBits(GPIOD, LED_Orange);
//mode++;
//UBFlag = false;
}
}
}

I'm using arm_rfft_fast_f32 as my data are going to be floats in the future, but results I get in my output array are insane (I think) - I'm getting frequencies below 0.
The arm_rfft_fast_f32 function does not return frequencies, but rather complex-valued coefficients computed using the Fast Fourier Transform (FFT). It is thus perfectly reasonable for those coefficients to be negative. More specifically, the expected coefficients for your single-cycle sin test tone input with an amplitude of 15 would be:
0.0, 0.0; // special case packing real-valued X[0] and X[N/2]
0.0, -3840.0; // X[1]
0.0, 0.0; // X[2]
0.0, 0.0; // X[3]
...
0.0, 0.0; // X[255]
Note that as indicated in the documentation the first two outputs correspond to the purely real coefficients X[0] and X[N/2] (you should be particularly careful about this special case in your subsequent call to arm_cmplx_mag_f32; see last point below).
The frequency of each of those frequency components are given by k*fs/N, where N is the number of samples (in your case l_probek) and fs = 1/dt is the sampling rate (in your case freq*l_probek):
X[0] -> 0*freq*l_probek/l_probek = 0
X[1] -> 1*freq*l_probek/l_probek = freq = 5000
X[2] -> 2*freq*l_probek/l_probek = 2*freq = 10000
X[3] -> 3*freq*l_probek/l_probek = 2*freq = 15000
...
Finally, due to the special packing of the first two values, you need to be careful when computing the N/2+1 magnitudes:
// General case for the magnitudes
arm_cmplx_mag_f32(data_out.f+2, data_mag.f+1, l_probek/2 - 1);
// Handle special cases
data_mag.f[0] = data_out.f[0];
data_mag.f[l_probek/2] = data_out.f[1];

As a follow-up to the above answer, which is awesome, some further clarifications which took me an age to figure out.
The frequency bins are centered on the target frequency, so for instance in the example above X[0] represents -2500Hz to 2500Hz, centered on zero, X[1] is 2500Hz to 7500Hz centered on 5000Hz and so on
It's common to interpolate frequencies within the bin by looking at the energy of the adjacent bins (see https://dspguru.com/dsp/howtos/how-to-interpolate-fft-peak/) if you do this you will need to make sure that your magnitude array is large enough for the bins + Nyquist and that the bin above Nyquist is 0, but note many interpolation techniques require the complex values (e.q. Quinn, Jacobson) so make sure you interpolate before finding the magnitudes.
The special case code above works because there is no complex component of the DC and Nyquist values and thus the magnitude is simply the real part
There is a bug in the code above however - although the imaginary parts of the DC and Nyquist components is always zero, the real part could still be negative, so you need to take the absolute value to get the magnitude:
// Handle special cases
data_mag.f[0] = fabs(data_out.f[0]);
data_mag.f[l_probek/2] = fabs(data_out.f[1]);

Related

Inverse FFT with CMSIS is wrong

I am attempting to perform an FFT on a signal and use the resulting data to retrieve the original samples via an IFFT. I am using the CMSIS DSP library on an STM32 with a M3.
My issue is understanding the scaling that occurs with the FFT, and also how to get a correct IFFT. Currently the IFFT results in a similar wave as the input, but points are scaled anywhere between 120x-140x of the original. Is this simply the result of precision errors of q15? Am I too scale the IFFT results by 7 bits? My code is below
The documentation also mentions "For the RIFFT, the source buffer must at least have length fftLenReal + 2. The last two elements must be equal to what would be generated by the RFFT: (pSrc[0] - pSrc[1]) >> 1 and 0". What is this for? Applying these operations to FFT_SIZE2 - 2, and FFT_SIZE2 - 1 respectively did not change the results of the IFFT at all.
//128 point FFT
#define FFT_SIZE 128
arm_rfft_instance_q15 fft_instance;
arm_rfft_instance_q15 ifft_instance;
//time domain signal buffers
float32_t sinetbl_in[FFT_SIZE];
float32_t sinetbl_out[FFT_SIZE];
//a copy for comparison after RFFT since function modifies input buffer
volatile q15_t fft_in_buf_cpy[FFT_SIZE];
q15_t fft_in_buf[FFT_SIZE];
//output for FFT, RFFT provides real and complex data organized as re[0], im[0], re[1], im[1]
q15_t fft_out_buf[FFT_SIZE*2];
q15_t fft_out_buf_mag[FFT_SIZE*2];
//inverse fft buffer result
q15_t ifft_out_buf[FFT_SIZE];
//generate 1kHz sinewave with a sample frequency of 8kHz for 128 samples, amplitude is 1
for(int i = 0; i < FFT_SIZE; ++i){
sinetbl_in[i] = arm_sin_f32(2*3.14*1000 *i/8000);
sinetbl_out[i] = 0;
}
//convert buffer to q15 (not enough flash to use f32 fft functions)
arm_float_to_q15(sinetbl_in, fft_in_buf, FFT_SIZE);
memcpy(fft_in_buf_cpy, fft_in_buf, FFT_SIZE*2);
//perform RFFT
arm_rfft_init_q15(&fft_instance, FFT_SIZE, 0, 1);
arm_rfft_q15(&fft_instance, fft_in_buf, fft_out_buf);
//calculate magnitude, skip 1st real and img numbers as they are DC and both real
arm_cmplx_mag_q15(fft_out_buf + 2, fft_out_buf_mag + 1, FFT_SIZE/2-1);
//weird operations described by documentation, does not change results
//fft_out_buf[FFT_SIZE*2 - 2] = (fft_out_buf[0] - fft_out_buf[1]) >> 1;
//fft_out_buf[FFT_SIZE*2 - 1] = 0;
//perform inverse FFT
arm_rfft_init_q15(&ifft_instance, FFT_SIZE, 1, 1);
arm_rfft_q15(&ifft_instance, fft_out_buf, ifft_out_buf);
//closest approximation to get to original scaling
//arm_shift_q15(ifft_out_buf, 7, ifft_out_buf, FFT_SIZE);
//convert back to float for comparison with input
arm_q15_to_float(ifft_out_buf, sinetbl_out, FFT_SIZE);
I feel like I answered my own question with the precision comment, but I'd like to be sure. Am I doing this FFT stuff right?
Thanks in advance
As Cris pointed out some libraries skip the normalization process. CMSIS DSP is one of those libraries as it is intended to be fast. For CMSIS, depending on the FFT size you must left shift your data a certain amount to get back to the original range. In my case with a FFT size of 128 and also the magnitude calculation, it was 7 as I originally surmised.

Not possible to do CFFT Frequency binning with CMSIS on STM32?

At the moment I am attempting to implement a program for finding 3 frequencies (xs = 30.1 kHz, ys = 28.3 kHz and zs = 25.9 kHz) through the use of the CMSIS pack on the STM32F411RE board. I cannot get the Complex FFT (CFFT) and complex magnitude working correctly.
In accordance with the freqeuncy bins I generate an array containing these frequencies, so that I can manually lookup which index bins the signals xs, ys and zs are on. I then use this index to look at the 3 fft outcomes (Xfft, Yfft, Zfft) to find the outcomes for these signals, but they dont match up.
I use the following order of functions:
DMA ADC Buffer: HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef* hadc)
Freqeuncy bins in binfreqs
Change ADC input to float Xfft
CFFT_F32: arm_cfft_f32(&arm_cfft_sR_f32_len1024, Xfft, 0, 0);
Complex Mag: arm_cmplx_mag_f32(Xfft, Xdsp, fftLen);
// ADC Stuff done via DMA, working correctly
int main(void)
{
HAL_Init();
SystemClock_Config();
MX_GPIO_Init();
MX_DMA_Init();
MX_ADC1_Init();
MX_USART2_UART_Init();
HAL_ADC_Start_DMA(&hadc1, adc_buffer, bufferLen); // adc_buffer needs to be an uint32_t
while (1)
{
/**
* Generate the frequencies
*/
for (int binfreqs = 0; binfreqs < fftLen; binfreqs++) // Generates the frequency bins to relate the amplitude to an actual value, rather than a bin frequency value
{
fftFreq[binfreqs] = binfreqs * binSize;
}
/*
* Find the amplitudes associated with the 3 emitter frequencies and store in an array for each axis. By default these arrays are generated with signal strength 0
* and with frequency index at 0: because of system limits these will indicate invalid values, as system range is from 10 - 60 kHz.
*/
volatile int32_t X_mag[3][4] = // x axis values: [index][frequency][signal_strength][phase]
{
{0, Xfreq, 0, 0}, // For x-freq index [0][0], frequency [0][1] associated with 1st biggest amplitude [0][2], phase [0][3]
{0, Yfreq, 0, 0}, // Ditto for y-freq
{0, Zfreq, 0, 0} // Ditto for z-freq
};
/*
* Finds the index in fftFreq corresponding to respectively X, Y and Z emitter frequencies
*/
for(int binSearch = 0; binSearch < fftLen; binSearch++)
{
if(fftFreq[binSearch] == Xfreq) // Find index for X emitter frequency
{
X_mag[0][0] = binSearch;
}
if(fftFreq[binSearch] == Yfreq) // Find index for Y emitter frequency
{
X_mag[1][0] = binSearch;
}
if(fftFreq[binSearch] == Zfreq) // Find index for Z emitter frequency
{
X_mag[2][0] = binSearch;
}
}
Signal processing
/* Signal processing algorithms --------------------------------------------------
*
* Only to be run once with fresh data from the buffer, [do not run continuous] or position / orientation data will be repeated.
* So only run once when conversionPaused
*/
if(conversionPaused)
{
/*
* Convert signal to voltage (12-bit, 4096)
*/
for (int floatVals = 0; floatVals < fftLen; floatVals++)
{
Xfft[floatVals] = (float) Xin[floatVals]; * 3.6 / 4096
}
/*
* Fourier transform
*/
arm_cfft_f32(&arm_cfft_sR_f32_len1024, Xfft, 0, 0); // Calculate complex fourier transform of x time signal, processing occurs in place
for (int fix_fft = 0 ; fix_fft < half_fftLen ; fix_fft++)
{
Xfft[fix_fft] = 2 * Xfft[fix_fft] / fftLen;
Xfft[fix_fft + half_fftLen] = 0;
}
/*
* Amplitude calculation
*/
arm_cmplx_mag_f32(Xfft, Xdsp, fftLen); // Calculate the magnitude of the fourier transform for x axis
/*
* Finds all signal strengths for allocated frequency indexes
*/
for(int strength_index = 0; strength_index < 3; strength_index++) // Loops through xyz frequencies for all 3 magnetometer axis
{
int x_temp_index = X_mag[strength_index][0]; // temp int necessary to store the strength, otherwise infinite loop?
X_mag[strength_index][2] = Xfft[x_temp_index]; // = Xfft[2*x_temp_index];
}
conversionPaused = 0;
}
} // While() end
} // Main() end
I do not know how I am to calculate the frequency bins for this combination of cfft and complex magnitude, as I would expect the even indexes of the array to hold the real values and the odd indexes of the array to hold the imaginary phase values. I reference some 1 2 3 examples but could not make out what I am doing wrong with my code.
However as per the images when applying an input signal of 30.1 kHz neither the 301 bin index or the 602 bin index holds the corresponding output expected?
301 bin index
602 bin index
EDIT:
I have since tried to implement the arm_cfft_f32 example given here. This latter example is completely broken as the external 10 kHz dataset is no longer included by default and trying to include it is not possible, as the program behaves poorly and keeps erroring about a return data type that is not even present in the first place. Thus I cannot use the example program given for this: it also appears to be 4 years out of date, so that is not surprising.
The arm_max_f32() function also proved not fruitful as it keeps homing in on the noise generated at bin 0 via using an analog generated signal. Manually setting this bin 0 to equal 0 then upsets the algorithm which starts pointing to random values that are not even the largest value present in the system.
Even when going manually through the CFFT data and magnitude it appears as if they are not working correctly. There are random noise values all over the spectrum parts, whilst the oscilloscope confirms that large outcomes should only be present at 0 Hz and the selected signal generator frequency (thus corresponding to a frequency bin).
Using CMSIS is extremely frustrating for me because of the little documentation and examples available, which is then further reduced by most of it simply not working (without major modification).

CORDIC Arcsine implementation fails

I have recently implemented a library of CORDIC functions to reduce the required computational power (my project is based on a PowerPC and is extremely strict in its execution time specifications). The language is ANSI-C.
The other functions (sin/cos/atan) work within accuracy limits both in 32 and in 64 bit implementations.
Unfortunately, the asin() function fails systematically for certain inputs.
For testing purposes I have implemented an .h file to be used in a simulink S-Function. (This is only for my convenience, you can compile the following as a standalone .exe with minimal changes)
Note: I have forced 32 iterations because I am working in 32 bit precision and the maximum possible accuracy is required.
Cordic.h:
#include <stdio.h>
#include <stdlib.h>
#define FLOAT32 float
#define INT32 signed long int
#define BIT_XOR ^
#define CORDIC_1K_32 0x26DD3B6A
#define MUL_32 1073741824.0F /*needed to scale float -> int*/
#define INV_MUL_32 9.313225746E-10F /*needed to scale int -> float*/
INT32 CORDIC_CTAB_32 [] = {0x3243f6a8, 0x1dac6705, 0x0fadbafc, 0x07f56ea6, 0x03feab76, 0x01ffd55b, 0x00fffaaa, 0x007fff55,
0x003fffea, 0x001ffffd, 0x000fffff, 0x0007ffff, 0x0003ffff, 0x0001ffff, 0x0000ffff, 0x00007fff,
0x00003fff, 0x00001fff, 0x00000fff, 0x000007ff, 0x000003ff, 0x000001ff, 0x000000ff, 0x0000007f,
0x0000003f, 0x0000001f, 0x0000000f, 0x00000008, 0x00000004, 0x00000002, 0x00000001, 0x00000000};
/* CORDIC Arcsine Core: vectoring mode */
INT32 CORDIC_asin(INT32 arc_in)
{
INT32 k;
INT32 d;
INT32 tx;
INT32 ty;
INT32 x;
INT32 y;
INT32 z;
x=CORDIC_1K_32;
y=0;
z=0;
for (k=0; k<32; ++k)
{
d = (arc_in - y)>>(31);
tx = x - (((y>>k) BIT_XOR d) - d);
ty = y + (((x>>k) BIT_XOR d) - d);
z += ((CORDIC_CTAB_32[k] BIT_XOR d) - d);
x = tx;
y = ty;
}
return z;
}
/* Wrapper function for scaling in-out of cordic core*/
FLOAT32 asin_wrap(FLOAT32 arc)
{
return ((FLOAT32)(CORDIC_asin((INT32)(arc*MUL_32))*INV_MUL_32));
}
This can be called in a manner similar to:
#include "Cordic.h"
#include "math.h"
void main()
{
y1 = asin_wrap(value_32); /*my implementation*/
y2 = asinf(value_32); /*standard math.h for comparison*/
}
The results are as shown:
Top left shows the [-1;1] input over 2000 steps (0.001 increments), bottom left the output of my function, bottom right the standard output and top right the difference of the two outputs.
It is immediate to see that the error is not within 32 bit accuracy.
I have analysed the steps performed (and the intermediate results) by my code and it seems to me that at a certain point the value of y is "close enough" to the initial value of arc_in and what could be related to a bit-shift causes the solution to diverge.
My questions:
I am at a loss, is this error inherent in the CORDIC implementation or have I made a mistake in the implementation? I was expecting the decrease of accuracy near the extremes, but those spikes in the middle are quite unexpected. (the most notable ones are just beyond +/- 0.6, but even removed these there are more at smaller values, albeit not as pronounced)
If it is something part of the CORDIC implementation, are there known workarounds?
EDIT:
Since some comment mention it, yes, I tested the definition of INT32, even writing
#define INT32 int32_T
does not change the results by the slightest amount.
The computation time on the target hardware has been measured by hundreds of repetitions of block of 10.000 iterations of the function with random input in the validity range. The observed mean results (for one call of the function) are as follows:
math.h asinf() 100.00 microseconds
CORDIC asin() 5.15 microseconds
(apparently the previous test had been faulty, a new cross-test has obtained no better than an average of 100 microseconds across the validity range)
I apparently found a better implementation. It can be downloaded in matlab version here and in C here. I will analyse more its inner workings and report later.
To review a few things mentioned in the comments:
The given code outputs values identical to another CORDIC implementation. This includes the stated inaccuracies.
The largest error is as you approach arcsin(1).
The second largest error is that the values of arcsin(0.60726) to arcsin(0.68514) all return 0.754805.
There are some vague references to inaccuracies in the CORDIC method for some functions including arcsin. The given solution is to perform "double-iterations" although I have been unable to get this to work (all values give a large amount of error).
The alternate CORDIC implemention has a comment /* |a| < 0.98 */ in the arcsin() implementation which would seem to reinforce that there is known inaccuracies close to 1.
As a rough comparison of a few different methods consider the following results (all tests performed on a desktop, Windows7 computer using MSVC++ 2010, benchmarks timed using 10M iterations over the arcsin() range 0-1):
Question CORDIC Code: 1050 ms, 0.008 avg error, 0.173 max error
Alternate CORDIC Code (ref): 2600 ms, 0.008 avg error, 0.173 max error
atan() CORDIC Code: 2900 ms, 0.21 avg error, 0.28 max error
CORDIC Using Double-Iterations: 4700 ms, 0.26 avg error, 0.917 max error (???)
Math Built-in asin(): 200 ms, 0 avg error, 0 max error
Rational Approximation (ref): 250 ms, 0.21 avg error, 0.26 max error
Linear Table Lookup (see below) 100 ms, 0.000001 avg error, 0.00003 max error
Taylor Series (7th power, ref): 300 ms, 0.01 avg error, 0.16 max error
These results are on a desktop so how relevant they would be for an embedded system is a good question. If in doubt, profiling/benchmarking on the relevant system would be advised. Most solutions tested don't have very good accuracy over the range (0-1) and all but one are actually slower than the built-in asin() function.
The linear table lookup code is posted below and is my usual method for any expensive mathematical function when speed is desired over accuracy. It simply uses a 1024 element table with linear interpolation. It seems to be both the fastest and most accurate of all methods tested, although the built-in asin() is not much slower really (test it!). It can easily be adjusted for more or less accuracy by changing the size of the table.
// Please test this code before using in anything important!
const size_t ASIN_TABLE_SIZE = 1024;
double asin_table[ASIN_TABLE_SIZE];
int init_asin_table (void)
{
for (size_t i = 0; i < ASIN_TABLE_SIZE; ++i)
{
float f = (float) i / ASIN_TABLE_SIZE;
asin_table[i] = asin(f);
}
return 0;
}
double asin_table (double a)
{
static int s_Init = init_asin_table(); // Call automatically the first time or call it manually
double sign = 1.0;
if (a < 0)
{
a = -a;
sign = -1.0;
}
if (a > 1) return 0;
double fi = a * ASIN_TABLE_SIZE;
double decimal = fi - (int)fi;
size_t i = fi;
if (i >= ASIN_TABLE_SIZE-1) return Sign * 3.14159265359/2;
return Sign * ((1.0 - decimal)*asin_table[i] + decimal*asin_table[i+1]);
}
The "single rotate" arcsine goes badly wrong when the argument is just greater than the initial value of 'x', where that is the magical scaling factor -- 1/An ~= 0.607252935 ~= 0x26DD3B6A.
This is because, for all arguments > 0, the first step always has y = 0 < arg, so d = +1, which sets y = 1/An, and leaves x = 1/An. Looking at the second step:
if arg <= 1/An, then d = -1, and the steps which follow converge to a good answer
if arg > 1/An, then d = +1, and this step moves further away from the right answer, and for a range of values a little bigger than 1/An, the subsequent steps all have d = -1, but are unable to correct the result :-(
I found:
arg = 0.607 (ie 0x26D91687), relative error 7.139E-09 -- OK
arg = 0.608 (ie 0x26E978D5), relative error 1.550E-01 -- APALLING !!
arg = 0.685 (ie 0x2BD70A3D), relative error 2.667E-04 -- BAD !!
arg = 0.686 (ie 0x2BE76C8B), relative error 1.232E-09 -- OK, again
The descriptions of the method warn about abs(arg) >= 0.98 (or so), and I found that somewhere after 0.986 the process fails to converge and the relative error jumps to ~5E-02 and hits 1E-01 (!!) at arg=1 :-(
As you did, I also found that for 0.303 < arg < 0.313 the relative error jumps to ~3E-02, and reduces slowly until things return to normal. (In this case step 2 overshoots so far that the remaining steps cannot correct it.)
So... the single rotate CORDIC for arcsine looks rubbish to me :-(
Added later... when I looked even closer at the single rotate CORDIC, I found many more small regions where the relative error is BAD...
...so I would not touch this as a method at all... it's not just rubbish, it's useless.
BTW: I thoroughly recommend "Software Manual for the Elementary Functions", William Cody and William Waite, Prentice-Hall, 1980. The methods for calculating the functions are not so interesting any more (but there is a thorough, practical discussion of the relevant range-reductions required). However, for each function they give a good test procedure.
The additional source I linked at the end of the question apparently contains the solution.
The proposed code can be reduced to the following:
#define M_PI_2_32 1.57079632F
#define SQRT2_2 7.071067811865476e-001F /* sin(45°) = cos(45°) = sqrt(2)/2 */
FLOAT32 angles[] = {
7.8539816339744830962E-01F, 4.6364760900080611621E-01F, 2.4497866312686415417E-01F, 1.2435499454676143503E-01F,
6.2418809995957348474E-02F, 3.1239833430268276254E-02F, 1.5623728620476830803E-02F, 7.8123410601011112965E-03F,
3.9062301319669718276E-03F, 1.9531225164788186851E-03F, 9.7656218955931943040E-04F, 4.8828121119489827547E-04F,
2.4414062014936176402E-04F, 1.2207031189367020424E-04F, 6.1035156174208775022E-05F, 3.0517578115526096862E-05F,
1.5258789061315762107E-05F, 7.6293945311019702634E-06F, 3.8146972656064962829E-06F, 1.9073486328101870354E-06F,
9.5367431640596087942E-07F, 4.7683715820308885993E-07F, 2.3841857910155798249E-07F, 1.1920928955078068531E-07F,
5.9604644775390554414E-08F, 2.9802322387695303677E-08F, 1.4901161193847655147E-08F, 7.4505805969238279871E-09F,
3.7252902984619140453E-09F, 1.8626451492309570291E-09F, 9.3132257461547851536E-10F, 4.6566128730773925778E-10F};
FLOAT32 arcsin_cordic(FLOAT32 t)
{
INT32 i;
INT32 j;
INT32 flip;
FLOAT32 poweroftwo;
FLOAT32 sigma;
FLOAT32 sign_or;
FLOAT32 theta;
FLOAT32 x1;
FLOAT32 x2;
FLOAT32 y1;
FLOAT32 y2;
flip = 0;
theta = 0.0F;
x1 = 1.0F;
y1 = 0.0F;
poweroftwo = 1.0F;
/* If the angle is small, use the small angle approximation */
if ((t >= -0.002F) && (t <= 0.002F))
{
return t;
}
if (t >= 0.0F)
{
sign_or = 1.0F;
}
else
{
sign_or = -1.0F;
}
/* The inv_sqrt() is the famous Fast Inverse Square Root from the Quake 3 engine
here used with 3 (!!) Newton iterations */
if ((t >= SQRT2_2) || (t <= -SQRT2_2))
{
t = 1.0F/inv_sqrt(1-t*t);
flip = 1;
}
if (t>=0.0F)
{
sign_or = 1.0F;
}
else
{
sign_or = -1.0F;
}
for ( j = 0; j < 32; j++ )
{
if (y1 > t)
{
sigma = -1.0F;
}
else
{
sigma = 1.0F;
}
/* Here a double iteration is done */
x2 = x1 - (sigma * poweroftwo * y1);
y2 = (sigma * poweroftwo * x1) + y1;
x1 = x2 - (sigma * poweroftwo * y2);
y1 = (sigma * poweroftwo * x2) + y2;
theta += 2.0F * sigma * angles[j];
t *= (1.0F + poweroftwo * poweroftwo);
poweroftwo *= 0.5F;
}
/* Remove bias */
theta -= sign_or*4.85E-8F;
if (flip)
{
theta = sign_or*(M_PI_2_32-theta);
}
return theta;
}
The following is to be noted:
It is a "Double-Iteration" CORDIC implementation.
The angles table thus differs in construction from the old table.
And the computation is done in floating point notation, this will cause a major increase in computation time on the target hardware.
A small bias is present in the output, removed via the theta -= sign_or*4.85E-8F; passage.
The following picture shows the absolute (left) and relative errors (right) of the old implementation (top) vs the implementation contained in this answer (bottom).
The relative error is obtained only by dividing the CORDIC output with the output of the built-in math.h implementation. It is plotted around 1 and not 0 for this reason.
The peak relative error (when not dividing by zero) is 1.0728836e-006.
The average relative error is 2.0253509e-007 (almost in accordance to 32 bit accuracy).
For convergence of iterative process it is necessary that any "wrong" i-th
iteration could be "corrected" in the subsequent (i+1)-th, (i+2)-th, (i+3)-th,
etc. etc. iterations. Or, in other words, at least a half of the "wrong"
i-th iteration could be corrected in the next (i+1)-th iteration.
For atan(1/2^i) this condition is satisfied, i.e.:
atan(1/2^(i+1)) > 1/2*atan(1/2^i)
Read more at
http://cordic-bibliography.blogspot.com/p/double-iterations-in-cordic.html
and:
http://baykov.de/CORDIC1972.htm
(note I'm the author of those pages)

Calculating the Power spectral density

I am trying to get the PSD of a real data set by making use of fftw3 library
To test I wrote a small program as shown below ,that generates the a signal which follows sinusoidal function
#include <stdio.h>
#include <math.h>
#define PI 3.14
int main (){
double value= 0.0;
float frequency = 5;
int i = 0 ;
double time = 0.0;
FILE* outputFile = NULL;
outputFile = fopen("sinvalues","wb+");
if(outputFile==NULL){
printf(" couldn't open the file \n");
return -1;
}
for (i = 0; i<=5000;i++){
value = sin(2*PI*frequency*zeit);
fwrite(&value,sizeof(double),1,outputFile);
zeit += (1.0/frequency);
}
fclose(outputFile);
return 0;
}
Now I'm reading the output file of above program and trying to calculate its PSD like as shown below
#include <stdio.h>
#include <fftw3.h>
#include <complex.h>
#include <stdlib.h>
#include <math.h>
#define PI 3.14
int main (){
FILE* inp = NULL;
FILE* oup = NULL;
double* value;// = 0.0;
double* result;
double spectr = 0.0 ;
int windowsSize =512;
double power_spectrum = 0.0;
fftw_plan plan;
int index=0,i ,k;
double multiplier =0.0;
inp = fopen("1","rb");
oup = fopen("psd","wb+");
value=(double*)malloc(sizeof(double)*windowsSize);
result = (double*)malloc(sizeof(double)*(windowsSize)); // what is the length that I have to choose here ?
plan =fftw_plan_r2r_1d(windowsSize,value,result,FFTW_R2HC,FFTW_ESTIMATE);
while(!feof(inp)){
index =fread(value,sizeof(double),windowsSize,inp);
// zero padding
if( index != windowsSize){
for(i=index;i<windowsSize;i++){
value[i] = 0.0;
}
}
// windowing Hann
for (i=0; i<windowsSize; i++){
multiplier = 0.5*(1-cos(2*PI*i/(windowsSize-1)));
value[i] *= multiplier;
}
fftw_execute(plan);
for(i = 0;i<(windowsSize/2 +1) ;i++){ //why only tell the half size of the window
power_spectrum = result[i]*result[i] +result[windowsSize/2 +1 -i]*result[windowsSize/2 +1 -i];
printf("%lf \t\t\t %d \n",power_spectrum,i);
fprintf(oup," %lf \n ",power_spectrum);
}
}
fclose(oup);
fclose(inp);
return 0;
}
Iam not sure about the correctness of the way I am doing this, but below are the results i have obtained:
Can any one help me in tracing the errors of the above approach
Thanks in advance
*UPDATE
after hartmut answer I'vve edited the code but still got the same result :
and the input data look like :
UPDATE
after increasing the sample frequencyand a windows size of 2048 here is what I've got :
UPDATE
after using the ADD-ON here how the result looks like using the window :
You combine the wrong output values to power spectrum lines. There are windowsSize / 2 + 1 real values at the beginning of result and windowsSize / 2 - 1 imaginary values at the end in reverse order. This is because the imaginary components of the first (0Hz) and last (Nyquist frequency) spectral lines are 0.
int spectrum_lines = windowsSize / 2 + 1;
power_spectrum = (double *)malloc( sizeof(double) * spectrum_lines );
power_spectrum[0] = result[0] * result[0];
for ( i = 1 ; i < windowsSize / 2 ; i++ )
power_spectrum[i] = result[i]*result[i] + result[windowsSize-i]*result[windowsSize-i];
power_spectrum[i] = result[i] * result[i];
And there is a minor mistake: You should apply the window function only to the input signal and not to the zero-padding part.
ADD-ON:
Your test program generates 5001 samples of a sinusoid signal and then you read and analyse the first 512 samples of this signal. The result of this is that you analyse only a fraction of a period. Due to the hard cut-off of the signal it contains a wide spectrum of energy with almost unpredictable energy levels, because you not even use PI but only 3.41 which is not precise enough to do any predictable calculation.
You need to guarantee that an integer number of periods is exactly fitting into your analysis window of 512 samples. Therefore, you should change this in your test signal creation program to have exactly numberOfPeriods periods in your test signal (e.g. numberOfPeriods=1 means that one period of the sinoid has a period of exactly 512 samples, 2 => 256, 3 => 512/3, 4 => 128, ...). This way, you are able to generate energy at a specific spectral line. Keep in mind that windowSize must have the same value in both programs because different sizes make this effort useless.
#define PI 3.141592653589793 // This has to be absolutely exact!
int windowSize = 512; // Total number of created samples in the test signal
int numberOfPeriods = 64; // Total number of sinoid periods in the test signal
for ( n = 0 ; n < windowSize ; ++n ) {
value = sin( (2 * PI * numberOfPeriods * n) / windowSize );
fwrite( &value, sizeof(double), 1, outputFile );
}
Some remarks to your expected output function.
Your input is a function with pure real values.
The result of a DFT has complex values.
So you have to declare the variable out not as double but as fftw_complex *out.
In general the number of dft input values is the same as the number of output values.
However, the output spectrum of a dft contains the complex amplitudes for positive
frequencies as well as for negative frequencies.
In the special case for pure real input, the amplitudes of the positive frequencies are
conjugated complex values of the amplitudes of the negative frequencies.
For that, only the frequencies of the positive spectrum are calculated,
which means that the number of the complex output values is the half of
the number of real input values.
If your input is a simple sinewave, the spectrum contains only a single frequency component.
This is true for 10, 100, 1000 or even more input samples.
All other values are zero. So it doesn't make any sense to work with a huge number of input values.
If the input data set contains a single period, the complex output value is
contained in out[1].
If the If the input data set contains M complete periods, in your case 5,
so the result is stored in out[5]
I did some modifications on your code. To make some facts more clear.
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <complex.h>
#include "fftw3.h"
int performDFT(int nbrOfInputSamples, char *fileName)
{
int nbrOfOutputSamples;
double *in;
fftw_complex *out;
fftw_plan p;
// In the case of pure real input data,
// the output values of the positive frequencies and the negative frequencies
// are conjugated complex values.
// This means, that there no need for calculating both.
// If you have the complex values for the positive frequencies,
// you can calculate the values of the negative frequencies just by
// changing the sign of the value's imaginary part
// So the number of complex output values ( amplitudes of frequency components)
// are the half of the number of the real input values ( amplitutes in time domain):
nbrOfOutputSamples = ceil(nbrOfInputSamples/2.0);
// Create a plan for a 1D DFT with real input and complex output
in = (double*) fftw_malloc(sizeof(double) * nbrOfInputSamples);
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * nbrOfOutputSamples);
p = fftw_plan_dft_r2c_1d(nbrOfInputSamples, in, out, FFTW_ESTIMATE);
// Read data from input file to input array
FILE* inputFile = NULL;
inputFile = fopen(fileName,"r");
if(inputFile==NULL){
fprintf(stdout,"couldn't open the file %s\n", fileName);
return -1;
}
double value;
int idx = 0;
while(!feof(inputFile)){
fscanf(inputFile, "%lf", &value);
in[idx++] = value;
}
fclose(inputFile);
// Perform the dft
fftw_execute(p);
// Print output results
char outputFileName[] = "dftvalues.txt";
FILE* outputFile = NULL;
outputFile = fopen(outputFileName,"w+");
if(outputFile==NULL){
fprintf(stdout,"couldn't open the file %s\n", outputFileName);
return -1;
}
double realVal;
double imagVal;
double powVal;
double absVal;
fprintf(stdout, " Frequency Real Imag Abs Power\n");
for (idx=0; idx<nbrOfOutputSamples; idx++) {
realVal = out[idx][0]/nbrOfInputSamples; // Ideed nbrOfInputSamples is correct!
imagVal = out[idx][1]/nbrOfInputSamples; // Ideed nbrOfInputSamples is correct!
powVal = 2*(realVal*realVal + imagVal*imagVal);
absVal = sqrt(powVal/2);
if (idx == 0) {
powVal /=2;
}
fprintf(outputFile, "%10i %10.4lf %10.4lf %10.4lf %10.4lf\n", idx, realVal, imagVal, absVal, powVal);
fprintf(stdout, "%10i %10.4lf %10.4lf %10.4lf %10.4lf\n", idx, realVal, imagVal, absVal, powVal);
// The total signal power of a frequency is the sum of the power of the posive and the negative frequency line.
// Because only the positive spectrum is calculated, the power is multiplied by two.
// However, there is only one single line in the prectrum for DC.
// This means, the DC value must not be doubled.
}
fclose(outputFile);
// Clean up
fftw_destroy_plan(p);
fftw_free(in); fftw_free(out);
return 0;
}
int main(int argc, const char * argv[]) {
// Set basic parameters
float timeIntervall = 1.0; // in seconds
int nbrOfSamples = 50; // number of Samples per time intervall, so the unit is S/s
double timeStep = timeIntervall/nbrOfSamples; // in seconds
float frequency = 5; // frequency in Hz
// The period time of the signal is 1/5Hz = 0.2s
// The number of samples per period is: nbrOfSamples/frequency = (50S/s)/5Hz = 10S
// The number of periods per time intervall is: frequency*timeIntervall = 5Hz*1.0s = (5/s)*1.0s = 5
// Open file for writing signal values
char fileName[] = "sinvalues.txt";
FILE* outputFile = NULL;
outputFile = fopen(fileName,"w+");
if(outputFile==NULL){
fprintf(stdout,"couldn't open the file %s\n", fileName);
return -1;
}
// Calculate signal values and write them to file
double time;
double value;
double dcValue = 0.2;
int idx = 0;
fprintf(stdout, " SampleNbr Signal value\n");
for (time = 0; time<=timeIntervall; time += timeStep){
value = sin(2*M_PI*frequency*time) + dcValue;
fprintf(outputFile, "%lf\n",value);
fprintf(stdout, "%10i %15.5f\n",idx++, value);
}
fclose(outputFile);
performDFT(nbrOfSamples, fileName);
return 0;
}
If the input of a dft is pure real, the output is complex in any case.
So you have to use the plan r2c (RealToComplex).
If the signal is sin(2*pi*f*t), starting at t=0, the spectrum contains a single frequency line
at f, which is pure imaginary.
If the sign has an offset in phase, like sin(2*pi*f*t+phi) the single line's value is complex.
If your sampling frequency is fs, the range of the output spectrum is -fs/2 ... +fs/2.
The real parts of the positive and negative frequencies are the same.
The imaginary parts of the positive and negative frequencies have opposite signs.
This is called conjugated complex.
If you have the complex values of the positive spectrum you can calculate the values of the
negative spectrum by changing the sign of the imaginary parts.
For this reason there is no need to compute both, the positive and the negative sprectrum.
One sideband holds all information.
Therefore the number of output samples in the plan r2c is the half+1 of the number
of input samples.
To get the power of a frequency, you have to consider the positive frequency as well
as the negative frequency. However, the plan r2c delivers only the right positive half
of the spectrum. So you have to double the power of the positive side to get the total power.
By the way, the documentation of the fftw3 package describes the usage of plans quite well.
You should invest the time to go over the manual.
I'm not sure what your question is. Your results seem reasonable, with the information provided.
As you must know, the PSD is the Fourier transform of the autocorrelation function. With sine wave inputs, your AC function will be periodic, therefore the PSD will have tones, like you've plotted.
My 'answer' is really some thought starters on debugging. It would be easier for all involved if we could post equations. You probably know that there's a signal processing section on SE these days.
First, you should give us a plot of your AC function. The inverse FT of the PSD you've shown will be a linear combination of periodic tones.
Second, try removing the window, just make it a box or skip the step if you can.
Third, try replacing the DFT with the FFT (I only skimmed the fftw3 library docs, maybe this is an option).
Lastly, trying inputting white noise. You can use a Bernoulli dist, or just a Gaussian dist. The AC will be a delta function, although the sample AC will not. This should give you a (sample) white PSD distribution.
I hope these suggestions help.

Calculate (x exponent 0.19029) with low memory using lookup table?

I'm writing a C program for a PIC micro-controller which needs to do a very specific exponential function. I need to calculate the following:
A = k . (1 - (p/p0)^0.19029)
k and p0 are constant, so it's all pretty simple apart from finding x^0.19029
(p/p0) ratio would always be in the range 0-1.
It works well if I add in math.h and use the power function, except that uses up all of the available 16 kB of program memory. Talk about bloatware! (Rest of program without power function = ~20% flash memory usage; add math.h and power function, =100%).
I'd like the program to do some other things as well. I was wondering if I can write a special case implementation for x^0.19029, maybe involving iteration and some kind of lookup table.
My idea is to generate a look-up table for the function x^0.19029, with perhaps 10-100 values of x in the range 0-1. The code would find a close match, then (somehow) iteratively refine it by re-scaling the lookup table values. However, this is where I get lost because my tiny brain can't visualise the maths involved.
Could this approach work?
Alternatively, I've looked at using Exp(x) and Ln(x), which can be implemented with a Taylor expansion. b^x can the be found with:
b^x = (e^(ln b))^x = e^(x.ln(b))
(See: Wikipedia - Powers via Logarithms)
This looks a bit tricky and complicated to me, though. Am I likely to get the implementation smaller then the compiler's math library, and can I simplify it for my special case (i.e. base = 0-1, exponent always 0.19029)?
Note that RAM usage is OK at the moment, but I've run low on Flash (used for code storage). Speed is not critical. Somebody has already suggested that I use a bigger micro with more flash memory, but that sounds like profligate wastefulness!
[EDIT] I was being lazy when I said "(p/p0) ratio would always be in the range 0-1". Actually it will never reach 0, and I did some calculations last night and decided that in fact a range of 0.3 - 1 would be quite adequate! This mean that some of the simpler solutions below should be suitable. Also, the "k" in the above is 44330, and I'd like the error in the final result to be less than 0.1. I guess that means an error in the (p/p0)^0.19029 needs to be less than 1/443300 or 2.256e-6
Use splines. The relevant part of the function is shown in the figure below. It varies approximately like the 5th root, so the problematic zone is close to p / p0 = 0. There is mathematical theory how to optimally place the knots of splines to minimize the error (see Carl de Boor: A Practical Guide to Splines). Usually one constructs the spline in B form ahead of time (using toolboxes such as Matlab's spline toolbox - also written by C. de Boor), then converts to Piecewise Polynomial representation for fast evaluation.
In C. de Boor, PGS, the function g(x) = sqrt(x + 1) is actually taken as an example (Chapter 12, Example II). This is exactly what you need here. The book comes back to this case a few times, since it is admittedly a hard problem for any interpolation scheme due to the infinite derivatives at x = -1. All software from PGS is available for free as PPPACK in netlib, and most of it is also part of SLATEC (also from netlib).
Edit (Removed)
(Multiplying by x once does not significantly help, since it only regularizes the first derivative, while all other derivatives at x = 0 are still infinite.)
Edit 2
My feeling is that optimally constructed splines (following de Boor) will be best (and fastest) for relatively low accuracy requirements. If the accuracy requirements are high (say 1e-8), one may be forced to get back to the algorithms that mathematicians have been researching for centuries. At this point, it may be best to simply download the sources of glibc and copy (provided GPL is acceptable) whatever is in
glibc-2.19/sysdeps/ieee754/dbl-64/e_pow.c
Since we don't have to include the whole math.h, there shouldn't be a problem with memory, but we will only marginally profit from having a fixed exponent.
Edit 3
Here is an adapted version of e_pow.c from netlib, as found by #Joni. This seems to be the grandfather of glibc's more modern implementation mentioned above. The old version has two advantages: (1) It is public domain, and (2) it uses a limited number of constants, which is beneficial if memory is a tight resource (glibc's version defines over 10000 lines of constants!). The following is completely standalone code, which calculates x^0.19029 for 0 <= x <= 1 to double precision (I tested it against Python's power function and found that at most 2 bits differed):
#define __LITTLE_ENDIAN
#ifdef __LITTLE_ENDIAN
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
#else
#define __HI(x) *(int*)&x
#define __LO(x) *(1+(int*)&x)
#endif
static const double
bp[] = {1.0, 1.5,},
dp_h[] = { 0.0, 5.84962487220764160156e-01,}, /* 0x3FE2B803, 0x40000000 */
dp_l[] = { 0.0, 1.35003920212974897128e-08,}, /* 0x3E4CFDEB, 0x43CFD006 */
zero = 0.0,
one = 1.0,
two = 2.0,
two53 = 9007199254740992.0, /* 0x43400000, 0x00000000 */
/* poly coefs for (3/2)*(log(x)-2s-2/3*s**3 */
L1 = 5.99999999999994648725e-01, /* 0x3FE33333, 0x33333303 */
L2 = 4.28571428578550184252e-01, /* 0x3FDB6DB6, 0xDB6FABFF */
L3 = 3.33333329818377432918e-01, /* 0x3FD55555, 0x518F264D */
L4 = 2.72728123808534006489e-01, /* 0x3FD17460, 0xA91D4101 */
L5 = 2.30660745775561754067e-01, /* 0x3FCD864A, 0x93C9DB65 */
L6 = 2.06975017800338417784e-01, /* 0x3FCA7E28, 0x4A454EEF */
P1 = 1.66666666666666019037e-01, /* 0x3FC55555, 0x5555553E */
P2 = -2.77777777770155933842e-03, /* 0xBF66C16C, 0x16BEBD93 */
P3 = 6.61375632143793436117e-05, /* 0x3F11566A, 0xAF25DE2C */
P4 = -1.65339022054652515390e-06, /* 0xBEBBBD41, 0xC5D26BF1 */
P5 = 4.13813679705723846039e-08, /* 0x3E663769, 0x72BEA4D0 */
lg2 = 6.93147180559945286227e-01, /* 0x3FE62E42, 0xFEFA39EF */
lg2_h = 6.93147182464599609375e-01, /* 0x3FE62E43, 0x00000000 */
lg2_l = -1.90465429995776804525e-09, /* 0xBE205C61, 0x0CA86C39 */
ovt = 8.0085662595372944372e-0017, /* -(1024-log2(ovfl+.5ulp)) */
cp = 9.61796693925975554329e-01, /* 0x3FEEC709, 0xDC3A03FD =2/(3ln2) */
cp_h = 9.61796700954437255859e-01, /* 0x3FEEC709, 0xE0000000 =(float)cp */
cp_l = -7.02846165095275826516e-09, /* 0xBE3E2FE0, 0x145B01F5 =tail of cp_h*/
ivln2 = 1.44269504088896338700e+00, /* 0x3FF71547, 0x652B82FE =1/ln2 */
ivln2_h = 1.44269502162933349609e+00, /* 0x3FF71547, 0x60000000 =24b 1/ln2*/
ivln2_l = 1.92596299112661746887e-08; /* 0x3E54AE0B, 0xF85DDF44 =1/ln2 tail*/
double pow0p19029(double x)
{
double y = 0.19029e+00;
double z,ax,z_h,z_l,p_h,p_l;
double y1,t1,t2,r,s,t,u,v,w;
int i,j,k,n;
int hx,hy,ix,iy;
unsigned lx,ly;
hx = __HI(x); lx = __LO(x);
hy = __HI(y); ly = __LO(y);
ix = hx&0x7fffffff; iy = hy&0x7fffffff;
ax = x;
/* special value of x */
if(lx==0) {
if(ix==0x7ff00000||ix==0||ix==0x3ff00000){
z = ax; /*x is +-0,+-inf,+-1*/
return z;
}
}
s = one; /* s (sign of result -ve**odd) = -1 else = 1 */
double ss,s2,s_h,s_l,t_h,t_l;
n = ((ix)>>20)-0x3ff;
j = ix&0x000fffff;
/* determine interval */
ix = j|0x3ff00000; /* normalize ix */
if(j<=0x3988E) k=0; /* |x|<sqrt(3/2) */
else if(j<0xBB67A) k=1; /* |x|<sqrt(3) */
else {k=0;n+=1;ix -= 0x00100000;}
__HI(ax) = ix;
/* compute ss = s_h+s_l = (x-1)/(x+1) or (x-1.5)/(x+1.5) */
u = ax-bp[k]; /* bp[0]=1.0, bp[1]=1.5 */
v = one/(ax+bp[k]);
ss = u*v;
s_h = ss;
__LO(s_h) = 0;
/* t_h=ax+bp[k] High */
t_h = zero;
__HI(t_h)=((ix>>1)|0x20000000)+0x00080000+(k<<18);
t_l = ax - (t_h-bp[k]);
s_l = v*((u-s_h*t_h)-s_h*t_l);
/* compute log(ax) */
s2 = ss*ss;
r = s2*s2*(L1+s2*(L2+s2*(L3+s2*(L4+s2*(L5+s2*L6)))));
r += s_l*(s_h+ss);
s2 = s_h*s_h;
t_h = 3.0+s2+r;
__LO(t_h) = 0;
t_l = r-((t_h-3.0)-s2);
/* u+v = ss*(1+...) */
u = s_h*t_h;
v = s_l*t_h+t_l*ss;
/* 2/(3log2)*(ss+...) */
p_h = u+v;
__LO(p_h) = 0;
p_l = v-(p_h-u);
z_h = cp_h*p_h; /* cp_h+cp_l = 2/(3*log2) */
z_l = cp_l*p_h+p_l*cp+dp_l[k];
/* log2(ax) = (ss+..)*2/(3*log2) = n + dp_h + z_h + z_l */
t = (double)n;
t1 = (((z_h+z_l)+dp_h[k])+t);
__LO(t1) = 0;
t2 = z_l-(((t1-t)-dp_h[k])-z_h);
/* split up y into y1+y2 and compute (y1+y2)*(t1+t2) */
y1 = y;
__LO(y1) = 0;
p_l = (y-y1)*t1+y*t2;
p_h = y1*t1;
z = p_l+p_h;
j = __HI(z);
i = __LO(z);
/*
* compute 2**(p_h+p_l)
*/
i = j&0x7fffffff;
k = (i>>20)-0x3ff;
n = 0;
if(i>0x3fe00000) { /* if |z| > 0.5, set n = [z+0.5] */
n = j+(0x00100000>>(k+1));
k = ((n&0x7fffffff)>>20)-0x3ff; /* new k for n */
t = zero;
__HI(t) = (n&~(0x000fffff>>k));
n = ((n&0x000fffff)|0x00100000)>>(20-k);
if(j<0) n = -n;
p_h -= t;
}
t = p_l+p_h;
__LO(t) = 0;
u = t*lg2_h;
v = (p_l-(t-p_h))*lg2+t*lg2_l;
z = u+v;
w = v-(z-u);
t = z*z;
t1 = z - t*(P1+t*(P2+t*(P3+t*(P4+t*P5))));
r = (z*t1)/(t1-two)-(w+z*w);
z = one-(r-z);
__HI(z) += (n<<20);
return s*z;
}
Clearly, 50+ years of research have gone into this, so it's probably very hard to do any better. (One has to appreciate that there are 0 loops, only 2 divisions, and only 6 if statements in the whole algorithm!) The reason for this is, again, the behavior at x = 0, where all derivatives diverge, which makes it extremely hard to keep the error under control: I once had a spline representation with 18 knots that was good up to x = 1e-4, with absolute and relative errors < 5e-4 everywhere, but going to x = 1e-5 ruined everything again.
So, unless the requirement to go arbitrarily close to zero is relaxed, I recommend using the adapted version of e_pow.c given above.
Edit 4
Now that we know that the domain 0.3 <= x <= 1 is sufficient, and that we have very low accuracy requirements, Edit 3 is clearly overkill. As #MvG has demonstrated, the function is so well behaved that a polynomial of degree 7 is sufficient to satisfy the accuracy requirements, which can be considered a single spline segment. #MvG's solution minimizes the integral error, which already looks very good.
The question arises as to how much better we can still do? It would be interesting to find the polynomial of a given degree that minimizes the maximum error in the interval of interest. The answer is the minimax
polynomial, which can be found using Remez' algorithm, which is implemented in the Boost library. I like #MvG's idea to clamp the value at x = 1 to 1, which I will do as well. Here is minimax.cpp:
#include <ostream>
#define TARG_PREC 64
#define WORK_PREC (TARG_PREC*2)
#include <boost/multiprecision/cpp_dec_float.hpp>
typedef boost::multiprecision::number<boost::multiprecision::cpp_dec_float<WORK_PREC> > dtype;
using boost::math::pow;
#include <boost/math/tools/remez.hpp>
boost::shared_ptr<boost::math::tools::remez_minimax<dtype> > p_remez;
dtype f(const dtype& x) {
static const dtype one(1), y(0.19029);
return one - pow(one - x, y);
}
void out(const char *descr, const dtype& x, const char *sep="") {
std::cout << descr << boost::math::tools::real_cast<double>(x) << sep << std::endl;
}
int main() {
dtype a(0), b(0.7); // range to optimise over
bool rel_error(false), pin(true);
int orderN(7), orderD(0), skew(0), brake(50);
int prec = 2 + (TARG_PREC * 3010LL)/10000;
std::cout << std::scientific << std::setprecision(prec);
p_remez.reset(new boost::math::tools::remez_minimax<dtype>(
&f, orderN, orderD, a, b, pin, rel_error, skew, WORK_PREC));
out("Max error in interpolated form: ", p_remez->max_error());
p_remez->set_brake(brake);
unsigned i, count(50);
for (i = 0; i < count; ++i) {
std::cout << "Stepping..." << std::endl;
dtype r = p_remez->iterate();
out("Maximum Deviation Found: ", p_remez->max_error());
out("Expected Error Term: ", p_remez->error_term());
out("Maximum Relative Change in Control Points: ", r);
}
boost::math::tools::polynomial<dtype> n = p_remez->numerator();
for(i = n.size(); i--; ) {
out("", n[i], ",");
}
}
Since all parts of boost that we use are header-only, simply build with:
c++ -O3 -I<path/to/boost/headers> minimax.cpp -o minimax
We finally get the coefficients, which are after multiplication by 44330:
24538.3409, -42811.1497, 34300.7501, -11284.1276, 4564.5847, 3186.7541, 8442.5236, 0.
The following error plot demonstrates that this is really the best possible degree-7 polynomial approximation, since all extrema are of equal magnitude (0.06659):
Should the requirements ever change (while still keeping well away from 0!), the C++ program above can be simply adapted to spit out the new optimal polynomial approximation.
Instead of a lookup table, I'd use a polynomial approximation:
1 - x0.19029 ≈ - 1073365.91783x15 + 8354695.40833x14 - 29422576.6529x13 + 61993794.537x12 - 87079891.4988x11 + 86005723.842x10 - 61389954.7459x9 + 32053170.1149x8 - 12253383.4372x7 + 3399819.97536x6 - 672003.142815x5 + 91817.6782072x4 - 8299.75873768x3 + 469.530204564x2 - 16.6572179869x + 0.722044145701
Or in code:
double f(double x) {
double fx;
fx = - 1073365.91783;
fx = fx*x + 8354695.40833;
fx = fx*x - 29422576.6529;
fx = fx*x + 61993794.537;
fx = fx*x - 87079891.4988;
fx = fx*x + 86005723.842;
fx = fx*x - 61389954.7459;
fx = fx*x + 32053170.1149;
fx = fx*x - 12253383.4372;
fx = fx*x + 3399819.97536;
fx = fx*x - 672003.142815;
fx = fx*x + 91817.6782072;
fx = fx*x - 8299.75873768;
fx = fx*x + 469.530204564;
fx = fx*x - 16.6572179869;
fx = fx*x + 0.722044145701;
return fx;
}
I computed this in sage using the least squares approach:
f(x) = 1-x^(19029/100000) # your function
d = 16 # number of terms, i.e. degree + 1
A = matrix(d, d, lambda r, c: integrate(x^r*x^c, (x, 0, 1)))
b = vector([integrate(x^r*f(x), (x, 0, 1)) for r in range(d)])
A.solve_right(b).change_ring(RDF)
Here is a plot of the error this will entail:
Blue is the error from my 16 term polynomial, while red is the error you'd get from piecewise linear interpolation with 16 equidistant values. As you can see, both errors are quite small for most parts of the range, but will become really huge close to x=0. I actually clipped the plot there. If you can somehow narrow the range of possible values, you could use that as the domain for the integration, and obtain an even better fit for the relevant range. At the cost of worse fit outside, of course. You could also increase the number of terms to obtain a closer fit, although that might also lead to higher oscillations.
I guess you can also combine this approach with the one Stefan posted: use his to split the domain into several parts, then use mine to find a close low degree polynomial for each part.
Update
Since you updated the specification of your question, with regard to both the domain and the error, here is a minimal solution to fit those requirements:
44330(1 - x0.19029) ≈ + 23024.9160933(1-x)7 - 39408.6473636(1-x)6 + 31379.9086193(1-x)5 - 10098.7031260(1-x)4 + 4339.44098317(1-x)3 + 3202.85705860(1-x)2 + 8442.42528906(1-x)
double f(double x) {
double fx, x1 = 1. - x;
fx = + 23024.9160933;
fx = fx*x1 - 39408.6473636;
fx = fx*x1 + 31379.9086193;
fx = fx*x1 - 10098.7031260;
fx = fx*x1 + 4339.44098317;
fx = fx*x1 + 3202.85705860;
fx = fx*x1 + 8442.42528906;
fx = fx*x1;
return fx;
}
I integrated x from 0.293 to 1 or equivalently 1 - x from 0 to 0.707 to keep the worst oscillations outside the relevant domain. I also omitted the constant term, to ensure an exact result at x=1. The maximal error for the range [0.3, 1] now occurs at x=0.3260 and amounts to 0.0972 < 0.1. Here is an error plot, which of course has bigger absolute errors than the one above due to the scale factor k=44330 which has been included here.
I can also state that the first three derivatives of the function will have constant sign over the range in question, so the function is monotonic, convex, and in general pretty well-behaved.
Not meant to answer the question, but it illustrates the Road Not To Go, and thus may be helpful:
This quick-and-dirty C code calculates pow(i, 0.19029) for 0.000 to 1.000 in steps of 0.01. The first half displays the error, in percents, when stored as 1/65536ths (as that theoretically provides slightly over 4 decimals of precision). The second half shows both interpolated and calculated values in steps of 0.001, and the difference between these two.
It kind of looks okay if you read from the bottom up, all 100s and 99.99s there, but about the first 20 values from 0.001 to 0.020 are worthless.
#include <stdio.h>
#include <math.h>
float powers[102];
int main (void)
{
int i, as_int;
double as_real, low, high, delta, approx, calcd, diff;
printf ("calculating and storing:\n");
for (i=0; i<=101; i++)
{
as_real = pow(i/100.0, 0.19029);
as_int = (int)round(65536*as_real);
powers[i] = as_real;
diff = 100*as_real/(as_int/65536.0);
printf ("%.5f %.5f %.5f ~ %.3f\n", i/100.0, as_real, as_int/65536.0, diff);
}
printf ("\n");
printf ("-- interpolating in 1/10ths:\n");
for (i=0; i<1000; i++)
{
as_real = i/1000.0;
low = powers[i/10];
high = powers[1+i/10];
delta = (high-low)/10.0;
approx = low + (i%10)*delta;
calcd = pow(as_real, 0.19029);
diff = 100.0*approx/calcd;
printf ("%.5f ~ %.5f = %.5f +/- %.5f%%\n", as_real, approx, calcd, diff);
}
return 0;
}
You can find a complete, correct standalone implementation of pow in fdlibm. It's about 200 lines of code, about half of which deal with special cases. If you remove the code that deals with special cases you're not interested in I doubt you'll have problems including it in your program.
LutzL's answer is a really good one: Calculate your power as (x^1.52232)^(1/8), computing the inner power by spline interpolation or another method. The eighth root deals with the pathological non-differentiable behavior near zero. I took the liberty of mocking up an implementation this way. The below, however, only does a linear interpolation to do x^1.52232, and you'd need to get the full coefficients using your favorite numerical mathematics tools. You'll adding scarcely 40 lines of code to get your needed power, plus however many knots you choose to use for your spline, as dicated by your required accuracy.
Don't be scared by the #include <math.h>; it's just for benchmarking the code.
#include <stdio.h>
#include <math.h>
double my_sqrt(double x) {
/* Newton's method for a square root. */
int i = 0;
double res = 1.0;
if (x > 0) {
for (i = 0; i < 10; i++) {
res = 0.5 * (res + x / res);
}
} else {
res = 0.0;
}
return res;
}
double my_152232(double x) {
/* Cubic spline interpolation for x ** 1.52232. */
int i = 0;
double res = 0.0;
/* coefs[i] will give the cubic polynomial coefficients between x =
i and x = i+1. Out of laziness, the below numbers give only a
linear interpolation. You'll need to do some work and research
to get the spline coefficients. */
double coefs[3][4] = {{0.0, 1.0, 0.0, 0.0},
{-0.872526, 1.872526, 0.0, 0.0},
{-2.032706, 2.452616, 0.0, 0.0}};
if ((x >= 0) && (x < 3.0)) {
i = (int) x;
/* Horner's method cubic. */
res = (((coefs[i][3] * x + coefs[i][2]) * x) + coefs[i][1] * x)
+ coefs[i][0];
} else if (x >= 3.0) {
/* Scaled x ** 1.5 once you go off the spline. */
res = 1.024824 * my_sqrt(x * x * x);
}
return res;
}
double my_019029(double x) {
return my_sqrt(my_sqrt(my_sqrt(my_152232(x))));
}
int main() {
int i;
double x = 0.0;
for (i = 0; i < 1000; i++) {
x = 1e-2 * i;
printf("%f %f %f \n", x, my_019029(x), pow(x, 0.19029));
}
return 0;
}
EDIT: If you're just interested in a small region like [0,1], even simpler is to peel off one sqrt(x) and compute x^1.02232, which is quite well behaved, using a Taylor series:
double my_152232(double x) {
double part_050000 = my_sqrt(x);
double part_102232 = 1.02232 * x + 0.0114091 * x * x - 3.718147e-3 * x * x * x;
return part_102232 * part_050000;
}
This gets you within 1% of the exact power for approximately [0.1,6], though getting the singularity exactly right is always a challenge. Even so, this three-term Taylor series gets you within 2.3% for x = 0.001.

Resources