I am trying to get the PSD of a real data set using the fftw3 library.
To test this, I wrote the small program shown below, which generates a signal that follows a sinusoidal function:
#include <stdio.h>
#include <math.h>
#define PI 3.14
int main (){
    double value= 0.0;
    float frequency = 5;
    int i = 0 ;
    double time = 0.0;
    FILE* outputFile = NULL;
    outputFile = fopen("sinvalues","wb+");
    if (outputFile == NULL) {
        printf(" couldn't open the file \n");
        return -1;
    }
    for (i = 0; i<=5000;i++){
        value = sin(2*PI*frequency*time);
        fwrite(&value, sizeof(double), 1, outputFile); /* store the sample as a raw double */
        time += (1.0/frequency);
    }
    fclose(outputFile);
    return 0;
}
Now I read the output file of the above program and try to calculate its PSD as shown below:
#include <stdio.h>
#include <fftw3.h>
#include <complex.h>
#include <stdlib.h>
#include <math.h>
#define PI 3.14
int main (){
    FILE* inp = NULL;
    FILE* oup = NULL;
    double* value;// = 0.0;
    double* result;
    double spectr = 0.0 ;
    int windowsSize =512;
    double power_spectrum = 0.0;
    fftw_plan plan;
    int index=0,i ,k;
    double multiplier =0.0;
    inp = fopen("1","rb");
    oup = fopen("psd","wb+");
    value = (double*)malloc(sizeof(double)*windowsSize);
    result = (double*)malloc(sizeof(double)*(windowsSize)); // what is the length that I have to choose here ?
    plan =fftw_plan_r2r_1d(windowsSize,value,result,FFTW_R2HC,FFTW_ESTIMATE);
    index =fread(value,sizeof(double),windowsSize,inp);
    // zero padding
    if( index != windowsSize){
        for (i=index; i<windowsSize; i++){
            value[i] = 0.0;
        }
    }
    // windowing Hann
    for (i=0; i<windowsSize; i++){
        multiplier = 0.5*(1-cos(2*PI*i/(windowsSize-1)));
        value[i] *= multiplier;
    }
    fftw_execute(plan);
    for(i = 0;i<(windowsSize/2 +1) ;i++){ //why only tell the half size of the window
        power_spectrum = result[i]*result[i] +result[windowsSize/2 +1 -i]*result[windowsSize/2 +1 -i];
        printf("%lf \t\t\t %d \n",power_spectrum,i);
        fprintf(oup," %lf \n ",power_spectrum);
    }
    fftw_destroy_plan(plan);
    free(value);
    free(result);
    fclose(oup);
    fclose(inp);
    return 0;
}
I am not sure whether my approach is correct, but below are the results I have obtained:
Can anyone help me trace the errors in the above approach?
Thanks in advance.
After hartmut's answer I edited the code but still got the same result:
and the input data look like this:
After increasing the sample frequency and using a window size of 2048, here is what I got:
After using the ADD-ON, here is how the result looks with the window:

You are combining the wrong output values into power spectrum lines. In the FFTW_R2HC output there are windowsSize / 2 + 1 real values at the beginning of result and windowsSize / 2 - 1 imaginary values at the end in reverse order. This is because the imaginary components of the first (0Hz) and last (Nyquist frequency) spectral lines are 0.
int spectrum_lines = windowsSize / 2 + 1;
power_spectrum = (double *)malloc( sizeof(double) * spectrum_lines );
power_spectrum[0] = result[0] * result[0];    /* 0Hz line: real part only */
for ( i = 1 ; i < windowsSize / 2 ; i++ )
    power_spectrum[i] = result[i]*result[i] + result[windowsSize-i]*result[windowsSize-i];
power_spectrum[i] = result[i] * result[i];    /* Nyquist line (i == windowsSize / 2): real part only */
And there is a minor mistake: You should apply the window function only to the input signal and not to the zero-padding part.
Your test program generates 5001 samples of a sinusoid signal, and then you read and analyse the first 512 samples of this signal. The result is that you analyse only a fraction of a period. Because of the hard cut-off, the analysed signal contains a wide spectrum of energy with almost unpredictable energy levels, all the more so because you do not use an accurate value of PI but only 3.14, which is not precise enough for any predictable calculation.
You need to guarantee that an integer number of periods fits exactly into your analysis window of 512 samples. Therefore, you should change your test signal creation program to produce exactly numberOfPeriods periods in the test signal (e.g. numberOfPeriods=1 means that one period of the sinusoid is exactly 512 samples long, 2 => 256, 3 => 512/3, 4 => 128, ...). This way, you are able to generate energy at a specific spectral line. Keep in mind that windowSize must have the same value in both programs, because different sizes make this effort useless.
#define PI 3.141592653589793 // This has to be absolutely exact!
int windowSize = 512; // Total number of created samples in the test signal
int numberOfPeriods = 64; // Total number of sinusoid periods in the test signal
for ( n = 0 ; n < windowSize ; ++n ) {
    value = sin( (2 * PI * numberOfPeriods * n) / windowSize );
    fwrite( &value, sizeof(double), 1, outputFile );
}

Some remarks to your expected output function.
Your input is a function with pure real values.
The result of a DFT has complex values.
So you have to declare the variable out not as double but as fftw_complex *out.
In general the number of dft input values is the same as the number of output values.
However, the output spectrum of a dft contains the complex amplitudes for positive
frequencies as well as for negative frequencies.
In the special case for pure real input, the amplitudes of the positive frequencies are
conjugated complex values of the amplitudes of the negative frequencies.
For that reason, only the frequencies of the positive spectrum are calculated,
which means that the number of complex output values is half
the number of real input values plus one (n/2+1, counting the 0Hz line).
If your input is a simple sinewave, the spectrum contains only a single frequency component.
This is true for 10, 100, 1000 or even more input samples.
All other values are zero. So it doesn't make any sense to work with a huge number of input values.
If the input data set contains a single period, the complex output value is
contained in out[1].
If the input data set contains M complete periods, in your case 5,
the result is stored in out[M], i.e. out[5].
I made some modifications to your code to make some of these facts clearer.
#include <stdio.h>
#include <math.h>
#include "fftw3.h"
// Note: <complex.h> is deliberately not included before fftw3.h, so that
// fftw_complex stays a double[2] and out[idx][0] / out[idx][1] work in C.

int performDFT(int nbrOfInputSamples, char *fileName)
{
    int nbrOfOutputSamples;
    double *in;
    fftw_complex *out;
    fftw_plan p;

    // In the case of pure real input data,
    // the output values of the positive frequencies and the negative frequencies
    // are conjugated complex values.
    // This means that there is no need for calculating both.
    // If you have the complex values for the positive frequencies,
    // you can calculate the values of the negative frequencies just by
    // changing the sign of the value's imaginary part.
    // So the number of complex output values (amplitudes of frequency components)
    // is half the number of the real input values (amplitudes in time domain) plus one:
    nbrOfOutputSamples = nbrOfInputSamples/2 + 1;

    // Create a plan for a 1D DFT with real input and complex output
    in = (double*) fftw_malloc(sizeof(double) * nbrOfInputSamples);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * nbrOfOutputSamples);
    p = fftw_plan_dft_r2c_1d(nbrOfInputSamples, in, out, FFTW_ESTIMATE);

    // Read data from input file to input array
    FILE* inputFile = NULL;
    inputFile = fopen(fileName,"r");
    if (inputFile == NULL) {
        fprintf(stdout,"couldn't open the file %s\n", fileName);
        return -1;
    }
    double value;
    int idx = 0;
    // Never read more values than the input array can hold
    while (idx < nbrOfInputSamples && fscanf(inputFile, "%lf", &value) == 1) {
        in[idx++] = value;
    }
    while (idx < nbrOfInputSamples) {
        in[idx++] = 0.0; // file was shorter than expected
    }
    fclose(inputFile);

    // Perform the dft
    fftw_execute(p);

    // Print output results
    char outputFileName[] = "dftvalues.txt";
    FILE* outputFile = NULL;
    outputFile = fopen(outputFileName,"w+");
    if (outputFile == NULL) {
        fprintf(stdout,"couldn't open the file %s\n", outputFileName);
        return -1;
    }
    double realVal;
    double imagVal;
    double powVal;
    double absVal;
    fprintf(stdout, " Frequency Real Imag Abs Power\n");
    for (idx=0; idx<nbrOfOutputSamples; idx++) {
        realVal = out[idx][0]/nbrOfInputSamples; // Indeed nbrOfInputSamples is correct!
        imagVal = out[idx][1]/nbrOfInputSamples; // Indeed nbrOfInputSamples is correct!
        powVal = 2*(realVal*realVal + imagVal*imagVal);
        absVal = sqrt(powVal/2);
        // The total signal power of a frequency is the sum of the power of the positive and the negative frequency line.
        // Because only the positive spectrum is calculated, the power is multiplied by two.
        // However, there is only one single line in the spectrum for DC
        // (and, for an even number of samples, for the Nyquist frequency).
        // This means, these values must not be doubled.
        if (idx == 0 || (nbrOfInputSamples % 2 == 0 && idx == nbrOfInputSamples/2)) {
            powVal /=2;
        }
        fprintf(outputFile, "%10i %10.4lf %10.4lf %10.4lf %10.4lf\n", idx, realVal, imagVal, absVal, powVal);
        fprintf(stdout, "%10i %10.4lf %10.4lf %10.4lf %10.4lf\n", idx, realVal, imagVal, absVal, powVal);
    }
    fclose(outputFile);

    // Clean up
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
    return 0;
}

int main(int argc, const char * argv[]) {
    // Set basic parameters
    float timeIntervall = 1.0; // in seconds
    int nbrOfSamples = 50; // number of Samples per time intervall, so the unit is S/s
    double timeStep = timeIntervall/nbrOfSamples; // in seconds
    float frequency = 5; // frequency in Hz
    // The period time of the signal is 1/5Hz = 0.2s
    // The number of samples per period is: nbrOfSamples/frequency = (50S/s)/5Hz = 10S
    // The number of periods per time intervall is: frequency*timeIntervall = 5Hz*1.0s = (5/s)*1.0s = 5

    // Open file for writing signal values
    char fileName[] = "sinvalues.txt";
    FILE* outputFile = NULL;
    outputFile = fopen(fileName,"w+");
    if (outputFile == NULL) {
        fprintf(stdout,"couldn't open the file %s\n", fileName);
        return -1;
    }

    // Calculate signal values and write them to file
    double time;
    double value;
    double dcValue = 0.2;
    int idx = 0;
    fprintf(stdout, " SampleNbr Signal value\n");
    // Count samples with an integer so that exactly nbrOfSamples values are written
    for (idx = 0; idx < nbrOfSamples; idx++){
        time = idx*timeStep;
        value = sin(2*M_PI*frequency*time) + dcValue;
        fprintf(outputFile, "%lf\n",value);
        fprintf(stdout, "%10i %15.5f\n",idx, value);
    }
    fclose(outputFile);

    performDFT(nbrOfSamples, fileName);
    return 0;
}
If the input of a dft is pure real, the output is complex in any case.
So you have to use the plan r2c (RealToComplex).
If the signal is sin(2*pi*f*t), starting at t=0, the spectrum contains a single frequency line
at f, which is pure imaginary.
If the signal has an offset in phase, like sin(2*pi*f*t+phi), the single line's value is complex.
If your sampling frequency is fs, the range of the output spectrum is -fs/2 ... +fs/2.
The real parts of the positive and negative frequencies are the same.
The imaginary parts of the positive and negative frequencies have opposite signs.
This is called conjugated complex.
If you have the complex values of the positive spectrum you can calculate the values of the
negative spectrum by changing the sign of the imaginary parts.
For this reason there is no need to compute both the positive and the negative spectrum.
One sideband holds all the information.
Therefore the number of output samples in the plan r2c is half the number
of input samples plus one.
To get the power of a frequency, you have to consider the positive frequency as well
as the negative frequency. However, the plan r2c delivers only the positive half
of the spectrum. So you have to double the power of the positive side to get the total power.
By the way, the documentation of the fftw3 package describes the usage of plans quite well.
You should invest the time to go over the manual.

I'm not sure what your question is. Your results seem reasonable, with the information provided.
As you must know, the PSD is the Fourier transform of the autocorrelation function. With sine wave inputs, your AC function will be periodic, therefore the PSD will have tones, like you've plotted.
My 'answer' is really some thought starters on debugging. It would be easier for all involved if we could post equations. You probably know that there's a signal processing section on SE these days.
First, you should give us a plot of your AC function. The inverse FT of the PSD you've shown will be a linear combination of periodic tones.
Second, try removing the window, just make it a box or skip the step if you can.
Third, try replacing the DFT with the FFT (I only skimmed the fftw3 library docs, maybe this is an option).
Lastly, try inputting white noise. You can use a Bernoulli dist, or just a Gaussian dist. The AC will be a delta function, although the sample AC will not. This should give you a (sample) white PSD distribution.
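A minimal sketch of that last suggestion, under my own assumptions: the noise is +1/-1 Bernoulli, the generator is a small xorshift32 chosen only so that runs are reproducible, and the autocorrelation estimate is the biased one. The function names fill_bernoulli and sample_autocorr are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift32 generator, used only to make the sketch reproducible;
   any decent random source would do. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t s = *state;
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    *state = s;
    return s;
}

/* Fill x with +1 / -1 values chosen with equal probability. */
void fill_bernoulli(double *x, size_t n, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;   /* xorshift state must be non-zero */
    size_t i;
    for (i = 0; i < n; i++)
        x[i] = (xorshift32(&state) & 0x80000000u) ? 1.0 : -1.0;
}

/* Biased sample autocorrelation at one lag: (1/n) * sum of x[t] * x[t+lag]. */
double sample_autocorr(const double *x, size_t n, size_t lag)
{
    double acc = 0.0;
    size_t t;
    for (t = 0; t + lag < n; t++)
        acc += x[t] * x[t + lag];
    return acc / (double)n;
}
```

At lag 0 the estimate is exactly 1 for this signal, and at the other lags it should scatter around 0 with a spread of roughly 1/sqrt(n). Writing the same samples to your input file should then give a roughly flat PSD.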
I hope these suggestions help.


I do *not* want correct rounding for function exp

The GCC implementation of the C mathematical library on Debian systems has apparently an (IEEE 754-2008)-compliant implementation of the function exp, implying that rounding shall always be correct:
(from Wikipedia) The IEEE floating point standard guarantees that add, subtract, multiply, divide, fused multiply–add, square root, and floating point remainder will give the correctly rounded result of the infinite precision operation. No such guarantee was given in the 1985 standard for more complex functions and they are typically only accurate to within the last bit at best. However, the 2008 standard guarantees that conforming implementations will give correctly rounded results which respect the active rounding mode; implementation of the functions, however, is optional.
It turns out that I have hit a case where this feature is actually a hindrance: the exact result of the exp function is often almost exactly halfway between two consecutive double values (1), so the program carries out a lot of further computation, losing up to a factor of 400 (!) in speed. This was in fact the explanation for my (ill-asked :-S) Question #43530011.
(1) More precisely, this happens when the argument of exp turns out to be of the form (2 k + 1) × 2^(-53) with k a rather small integer (like 242 for instance). In particular, the computations involved in pow (1. + x, 0.5) tend to call exp with such an argument when x is of the order of magnitude of 2^(-44).
Since implementations of correct rounding can be so time-consuming in certain circumstances, I guess that the developers have also devised a way to get a slightly less precise result (say, only up to 0.6 ULP or something like this) in a time which is (roughly) bounded for every value of the argument in a given range… (2)
… But how to do this??
(2) What I mean is that I just do not want some exceptional values of the argument like (2 k + 1) × 2^(-53) to be much more time-consuming than most values of the same order of magnitude; but of course I do not mind if some exceptional values of the argument go much faster, or if large arguments (in absolute value) need a larger computation time.
Here is a minimal program showing the phenomenon:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
int main (void)
{
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
    {
      a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
      c += exp (a); // Just to be sure that the compiler will actually perform the computation of exp (a).
    }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %ld\n", (long) (stop - start)); // clock_t is not an int, so cast before printing.
  return 0;
}
Now after gcc -std=c99 program53.c -lm -o program53:
$ ./program53
Clock time spent: 13470008
$ ./program53
Clock time spent: 13292721
$ ./program53
Clock time spent: 13201616
On the other hand, with program52 and program54 (got by replacing 0x20000000000000 by resp. 0x10000000000000 and 0x40000000000000):
$ ./program52
Clock time spent: 83594
$ ./program52
Clock time spent: 69095
$ ./program52
Clock time spent: 54694
$ ./program54
Clock time spent: 86151
$ ./program54
Clock time spent: 74209
$ ./program54
Clock time spent: 78612
Beware, the phenomenon is implementation-dependent! Apparently, among the common implementations, only those of the Debian systems (including Ubuntu) show this phenomenon.
P.-S.: I hope that my question is not a duplicate: I searched for a similar question thoroughly without success, but maybe I did not use the relevant keywords… :-/
To answer the general question on why the library functions are required to give correctly rounded results:
Floating-point is hard, and often times counterintuitive. Not every programmer has read what they should have. When libraries used to allow some slightly inaccurate rounding, people complained about the precision of the library function when their inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people cannot shift the blame to them.
In many cases, specific knowledge about floating point algorithms can produce considerable improvements to accuracy and/or performance, like in the testcase:
Taking the exp() of numbers very close to 0 in floating-point numbers is problematic, since the result is a number close to 1 while all the precision is in the difference to one, so most significant digits are lost. It is more precise (and significantly faster in this testcase) to compute exp(x) - 1 through the C math library function expm1(x). If the exp() itself is really needed, it is still much faster to do expm1(x) + 1.
A similar concern exists for computing log(1 + x), for which there is the function log1p(x).
A quick fix that speeds up the provided testcase:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
int main (void)
{
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
    {
      a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
      c += expm1 (a) + 1; // replace exp() with expm1() + 1
    }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %ld\n", (long) (stop - start)); // clock_t is not an int, so cast before printing.
  return 0;
}
For this case, the timings on my machine are thus:
Original code
Clock time spent: 21543338
Modified code
Clock time spent: 55076
Programmers with advanced knowledge about the accompanying trade-offs may sometimes consider using approximate results where the precision is not critical.
For an experienced programmer it may be possible to write an approximate implementation of a slow function using methods like Newton-Raphson or Taylor/Maclaurin polynomials, to use inexactly rounded specialty functions from libraries like Intel's MKL or AMD's ACML, to relax the floating-point standard compliance of the compiler, to reduce precision to IEEE 754 binary32 (float), or to combine these.
Note that a better description of the problem would enable a better answer.
Regarding your comment to #EOF 's answer, the "write your own" remark from #NominalAnimal seems simple enough here, even trivial, as follows.
Your original code above seems to have a max possible argument for exp() of a=(1+2*0x400)/0x2000...=2.27e-13 (that should really be 2*0x3FF, and I'm counting 13 zeroes after your 0x2000... which makes it 2x16^13). So that 2.27e-13 max argument is very, very small.
And then the trivial taylor expansion is exp(a)=1+a+(a^2)/2+(a^3)/6+... which already gives you all double's precision for such small arguments. Now, you'll have to discard the 1 part, as explained above, and then that just reduces to expm1(a)=a*(1.+a*(1.+a/3.)/2.) And that should go pretty darn quick! Just make sure a stays small. If it gets a little bigger, just add the next term, a^4/24 (you see how to do that?).
I modified the OP's test program as follows to test a little more stuff (discussion follows code)
/* ...i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 16 /*denominator will be (multiplier)xBASE^EXPON*/
#define EXPON 13
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/
int main (int argc, char *argv[]) {
int N = (argc>1?atoi(argv[1]):1e6),
multiplier = (argc>2?atoi(argv[2]):2),
isexp = (argc>3?atoi(argv[3]):1); /* flags to turn on/off exp() */
int isexpm1 = 1; /* and expm1() for timing tests*/
int i, n=0;
double denom = ((double)multiplier)*pow((double)BASE,(double)EXPON);
double a, c=0.0, cm1=0.0, tm1=0.0;
clock_t start = clock();
n=0; c=cm1=tm1=0.0;
/* --- to smooth random fluctuations, do the same type of computation
a large number of (N) times with different values --- */
for (i=0; i<N; i++) {
a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
significant digits, and its last non-zero
digit is at (fixed-point) position 53. */
if ( isexp ) c += exp(a); /* turn this off to time expm1() alone */
if ( isexpm1 ) { /* you can turn this off to time exp() alone, */
cm1 += expm1(a); /* but difference is negligible */
tm1 += taylorm1(a); }
n++; /* count the trials; n is what gets printed as N= below */
} /* --- end-of-for(i) --- */
int nticks = (int)(clock()-start);
printf ("N=%d, denom=%dx%d^%d, Clock time: %d (%.2f secs)\n",
n, multiplier,BASE,EXPON,
nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
c, c-(double)n, cm1, tm1);
return 0;
} /* --- end-of-function main() --- */
Compile and run it as test to reproduce OP's 0x2000... scenario, or run it with (up to three) optional args test #trials multiplier timeexp where #trials defaults to the OP's 1000000, and multiplier defaults to 2 for the OP's 2x16^13 (change it to 4, etc, for her other tests). For the last arg, timeexp, enter a 0 to do only the expm1() (and my unnecessary taylor-like) calculation. The point of that is to show that the bad-timing-cases displayed by the OP disappear with expm1(), which takes "no time at all" regardless of multiplier.
So default runs, test and test 1000000 4, produce (okay, I called the program rounding)...
bash-4.3$ ./rounding
N=1000000, denom=2x16^13, Clock time: 11155070 (11.16 secs)
c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 4
N=1000000, denom=4x16^13, Clock time: 200211 (0.20 secs)
c-n=1.164153e-10, cm1=5.680083e-08, tm1=5.680083e-08
So the first thing you'll note is that the OP's c-n using exp() differs substantially from both cm1==tm1 using expm1() and my taylor approx. If you reduce N they come into agreement, as follows...
N=10, denom=2x16^13, Clock time: 941 (0.00 secs)
c-n=7.140954e-13, cm1=7.127632e-13, tm1=7.127632e-13
bash-4.3$ ./rounding 100
N=100, denom=2x16^13, Clock time: 5506 (0.01 secs)
c-n=1.010392e-11, cm1=1.008393e-11, tm1=1.008393e-11
bash-4.3$ ./rounding 1000
N=1000, denom=2x16^13, Clock time: 44196 (0.04 secs)
c-n=1.134595e-10, cm1=1.140730e-10, tm1=1.140730e-10
bash-4.3$ ./rounding 10000
N=10000, denom=2x16^13, Clock time: 227215 (0.23 secs)
c-n=2.328306e-10, cm1=1.131288e-09, tm1=1.131288e-09
bash-4.3$ ./rounding 100000
N=100000, denom=2x16^13, Clock time: 1206348 (1.21 secs)
c-n=2.328306e-10, cm1=1.133611e-08, tm1=1.133611e-08
And as far as timing of exp() versus expm1() is concerned, see for yourself...
bash-4.3$ ./rounding 1000000 2
N=1000000, denom=2x16^13, Clock time: 11168388 (11.17 secs)
c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 2 0
N=1000000, denom=2x16^13, Clock time: 24064 (0.02 secs)
c-n=-1.000000e+06, cm1=1.136017e-07, tm1=1.136017e-07
Question: you'll note that once the exp() calculation reaches N=10000 trials, its sum remains constant regardless of larger N. Not sure why that would be happening.
Okay, #EOF , "you made me look" with your "heirarchical accumulation" comment. And that indeed works to bring the exp() sum closer (much closer) to the (presumably correct) expm1() sum. The modified code's immediately below followed by a discussion. But one discussion note here: recall multiplier from above. That's gone, and in its same place is expon so that denominator is now 2^expon where the default is 53, matching OP's default (and I believe better matching how she was thinking about it). Okay, and here's the code...
/* ...i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 2 /*denominator=2^EXPON, 2^53=2x16^13 default */
#define EXPON 53
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/
int main (int argc, char *argv[]) {
int N = (argc>1?atoi(argv[1]):1e6),
expon = (argc>2?atoi(argv[2]):EXPON),
isexp = (argc>3?atoi(argv[3]):1), /* flags to turn on/off exp() */
ncparts = (argc>4?atoi(argv[4]):1), /* #partial sums for c */
binsize = (argc>5?atoi(argv[5]):10);/* #doubles to sum in each bin */
int isexpm1 = 1; /* and expm1() for timing tests*/
int i, n=0;
double denom = pow((double)BASE,(double)expon);
double a, c=0.0, cm1=0.0, tm1=0.0;
static double csums[10], cbins[10][65537]; /* c partial sums and heirarchy; static so csums[] starts at zero and the large array stays off the stack */
int nbins[10], ibin=0; /* start at lowest level */
clock_t start = clock();
n=0; c=cm1=tm1=0.0;
if ( ncparts > 65536 ) ncparts=65536; /* array size check */
if ( ncparts > 1 ) for(i=0;i<ncparts;i++) cbins[0][i]=0.0; /*init bin#0*/
/* --- to smooth random fluctuations, do the same type of computation
a large number of (N) times with different values --- */
for (i=0; i<N; i++) {
a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
significant digits, and its last non-zero
digit is at (fixed-point) position 53. */
if ( isexp ) { /* turn this off to time expm1() alone */
double expa = exp(a); /* exp(a) */
c += expa; /* just accumulate in a single "bin" */
if ( ncparts > 1 ) cbins[0][n%ncparts] += expa; } /* accum in ncparts */
if ( isexpm1 ) { /* you can turn this off to time exp() alone, */
cm1 += expm1(a); /* but difference is negligible */
tm1 += taylorm1(a); }
n++; /* count the trials; n also selects the bin above and is printed as N= */
} /* --- end-of-for(i) --- */
int nticks = (int)(clock()-start);
if ( ncparts > 1 ) { /* need to sum the partial-sum bins */
nbins[ibin=0] = ncparts; /* lowest-level has everything */
while ( nbins[ibin] > binsize ) { /* need another heirarchy level */
if ( ibin >= 9 ) break; /* no more bins */
ibin++; /* next available heirarchy bin level */
nbins[ibin] = (nbins[ibin-1]+(binsize-1))/binsize; /*#bins this level*/
for(i=0;i<nbins[ibin];i++) cbins[ibin][i]=0.0; /* init bins */
for(i=0;i<nbins[ibin-1];i++) {
cbins[ibin][(i+1)%nbins[ibin]] += cbins[ibin-1][i]; /*accum in nbins*/
csums[ibin-1] += cbins[ibin-1][i]; } /* accumulate in "one bin" */
} /* --- end-of-while(nprevbins>binsize) --- */
for(i=0;i<nbins[ibin];i++) csums[ibin] += cbins[ibin][i]; /*highest level*/
} /* --- end-of-if(ncparts>1) --- */
printf ("N=%d, denom=%d^%d, Clock time: %d (%.2f secs)\n", n, BASE,expon,
nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
c, c-(double)n, cm1, tm1);
if ( ncparts > 1 ) { printf("\t binsize=%d...\n",binsize);
for (i=0;i<=ibin;i++) /* display heirarchy */
printf("\t level#%d: #bins=%5d, c-n=%e\n",
i,nbins[i],csums[i]-(double)n); }
return 0;
} /* --- end-of-function main() --- */
Okay, and now you can notice two additional command-line args following the old timeexp. They are ncparts for the initial number of bins into which the entire #trials will be distributed. So at the lowest level of the heirarchy, each bin should (modulo bugs:) have the sum of #trials/ncparts doubles. The argument after that is binsize, which will be the number of doubles summed in each bin at every successive level, until the last level has fewer (or equal) #bins as binsize. So here's an example dividing 1000000 trials into 50000 bins, meaning 20doubles/bin at the lowest level, and 5doubles/bin thereafter...
bash-4.3$ ./rounding 1000000 53 1 50000 5
N=1000000, denom=2^53, Clock time: 11129803 (11.13 secs)
c-n=4.656613e-09, cm1=1.136017e-07, tm1=1.136017e-07
level#0: #bins=50000, c-n=4.656613e-09
level#1: #bins=10002, c-n=1.734588e-08
level#2: #bins= 2002, c-n=7.974450e-08
level#3: #bins= 402, c-n=1.059379e-07
level#4: #bins= 82, c-n=1.133885e-07
level#5: #bins= 18, c-n=1.136214e-07
level#6: #bins= 5, c-n=1.138542e-07
Note how the c-n for exp() converges pretty nicely towards the expm1() value. But note how it's best at level#5, and isn't converging uniformly at all. And note if you break the #trials into only 5000 initial bins, you get just as good a result,
bash-4.3$ ./rounding 1000000 53 1 5000 5
N=1000000, denom=2^53, Clock time: 11165924 (11.17 secs)
c-n=3.527384e-08, cm1=1.136017e-07, tm1=1.136017e-07
level#0: #bins= 5000, c-n=3.527384e-08
level#1: #bins= 1002, c-n=1.164153e-07
level#2: #bins= 202, c-n=1.158332e-07
level#3: #bins= 42, c-n=1.136214e-07
level#4: #bins= 10, c-n=1.137378e-07
level#5: #bins= 4, c-n=1.136214e-07
In fact, playing with ncparts and binsize doesn't seem to show much sensitivity, and it's not always "more is better" (i.e., less for binsize) either. So I'm not sure exactly what's going on. Could be a bug (or two), or could be yet another question for #EOF ...???
>>EDIT -- example showing pair addition "binary tree" heirarchy<<
Example below added as per #EOF 's comment
(Note: re-copy preceding code. I had to edit nbins[ibin] calculation for each next level to nbins[ibin]=(nbins[ibin-1]+(binsize-1))/binsize; from nbins[ibin]=(nbins[ibin-1]+2*binsize)/binsize; which was "too conservative" to create ...16,8,4,2 sequence)
bash-4.3$ ./rounding 1024 53 1 512 2
N=1024, denom=2^53, Clock time: 36750 (0.04 secs)
c-n=1.157332e-10, cm1=1.164226e-10, tm1=1.164226e-10
level#0: #bins= 512, c-n=1.159606e-10
level#1: #bins= 256, c-n=1.166427e-10
level#2: #bins= 128, c-n=1.166427e-10
level#3: #bins= 64, c-n=1.161879e-10
level#4: #bins= 32, c-n=1.166427e-10
level#5: #bins= 16, c-n=1.166427e-10
level#6: #bins= 8, c-n=1.166427e-10
level#7: #bins= 4, c-n=1.166427e-10
level#8: #bins= 2, c-n=1.164153e-10
>>EDIT -- to show #EOF's elegant solution in comment below<<
"Pair addition" can be elegantly accomplished recursively, as per #EOF's comment below, which I'm reproducing here. (Note case 0/1 at end-of-recursion to handle n even/odd.)
/* Quoting from EOF's comment...
What I (EOF) proposed is effectively a binary tree of additions:
a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
Like this: Add adjacent pairs of elements, this produces
a new sequence of n/2 elements.
Recurse until only one element is left.
(Note that this will require n/2 elements of storage,
rather than a fixed number of bins like your implementation) */
double trecu(double *vals, double sum, int n) {
  int midn = n/2;
  switch (n) {
    case 0: break;
    case 1: sum += *vals; break;
    default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
  return ( sum );
  } /* --- end-of-function trecu() --- */
This is an "answer"/followup to EOF's preceding comments re his trecu() algorithm and code for his "binary tree summation" suggestion. "Prerequisites" before reading this are reading that discussion. It would be nice to collect all that in one organized place, but I haven't done that yet...
...What I did do was build EOF's trecu() into the test program from the preceding answer that I'd written by modifying the OP's original test program. But then I found that trecu() generated exactly (and I mean exactly) the same answer as the "plain sum" c using exp(), not the sum cm1 using expm1() that we'd expected from a more accurate binary tree summation.
But that test program's a bit (maybe two bits:) "convoluted" (or, as EOF said, "unreadable"), so I wrote a separate smaller test program, given below (with example runs and discussion below that), to separately test/exercise trecu(). Moreover, I also wrote function bintreesum() into the code below, which abstracts/encapsulates the iterative code for binary tree summation that I'd embedded into the preceding test program. In that preceding case, my iterative code indeed came close to the cm1 answer, which is why I'd expected EOF's recursive trecu() to do the same. Long-and-short of it is that, below, same thing happens -- bintreesum() remains close to correct answer, while trecu() gets further away, exactly reproducing the "plain sum".
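A possible reason for that, as far as I can tell from reading trecu() as reproduced here: the running sum is passed into the left half and the result is then passed on into the right half, so every element is still added to one growing total in left-to-right order, which is exactly what the plain loop does. A pairwise ("binary tree") sum in the sense of the quoted comment sums each half on its own and combines the two results once. The sketch below puts the variants side by side; plain_sum, threaded_sum and pairwise_sum are my names, and pairwise_sum is my reading of the comment, not EOF's actual code.

```c
#include <stddef.h>

/* Plain left-to-right accumulation. */
double plain_sum(const double *vals, size_t n)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += vals[i];
    return sum;
}

/* Recursion that threads the running total through both halves, the same
   shape as trecu(): the additions happen in the same order as in plain_sum(). */
double threaded_sum(const double *vals, double sum, size_t n)
{
    size_t midn = n / 2;
    if (n == 0) return sum;
    if (n == 1) return sum + vals[0];
    return threaded_sum(vals + midn, threaded_sum(vals, sum, midn), n - midn);
}

/* Pairwise summation: each half is summed independently, then combined,
   so partial sums of similar size are added to each other. */
double pairwise_sum(const double *vals, size_t n)
{
    size_t midn = n / 2;
    if (n == 0) return 0.0;
    if (n == 1) return vals[0];
    return pairwise_sum(vals, midn) + pairwise_sum(vals + midn, n - midn);
}
```

Calling threaded_sum(vals, 0.0, n) should reproduce plain_sum(vals, n) bit for bit, whereas pairwise_sum(vals, n) keeps the small parts much better for inputs of the form 1 + tiny.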
What we're summing below is just sum(i),i=1...n, which is just the well-known n(n+1)/2. But that's not quite right -- to reproduce OP's problem, summand is not sum(i) alone but rather sum(1+i*10^e), where e (negative, default -10) can be given on the command-line. So for, say, n=5, you don't get 15 but rather 5.000...00015, or for n=6 you get 6.000...00021, etc. And to avoid a long, long format, I printf() sum-n to remove that integer part. Okay??? So here's the code...
/* Quoting from EOF's comment...
What I (EOF) proposed is effectively a binary tree of additions:
a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
Like this: Add adjacent pairs of elements, this produces
a new sequence of n/2 elements.
Recurse until only one element is left. */
#include <stdio.h>
#include <stdlib.h>
double trecu(double *vals, double sum, int n) {
int midn = n/2;
switch (n) {
case 0: break;
case 1: sum += *vals; break;
default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
return sum;
} /* --- end-of-function trecu() --- */
double bintreesum(double *vals, int n, int binsize) {
double binsum = 0.0;
int nbin0 = (n+(binsize-1))/binsize,
nbin1 = (nbin0+(binsize-1))/binsize,
nbins[2] = { nbin0, nbin1 };
double *vbins[2] = {
(double *)malloc(nbin0*sizeof(double)),
(double *)malloc(nbin1*sizeof(double)) },
*vbin0=vbins[0], *vbin1=vbins[1];
int ibin=0, i;
for ( i=0; i<nbin0; i++ ) vbin0[i] = 0.0;
for ( i=0; i<n; i++ ) vbin0[i%nbin0] += vals[i];
while ( nbins[ibin] > 1 ) {
int jbin = 1-ibin; /* other bin, 0<-->1 */
nbins[jbin] = (nbins[ibin]+(binsize-1))/binsize;
for ( i=0; i<nbins[jbin]; i++ ) vbins[jbin][i] = 0.0;
for ( i=0; i<nbins[ibin]; i++ )
vbins[jbin][i%nbins[jbin]] += vbins[ibin][i];
ibin = jbin; /* swap bins for next pass */
} /* --- end-of-while(nbins[ibin]>1) --- */
binsum = vbins[ibin][0];
free((void *)vbins[0]); free((void *)vbins[1]);
return ( binsum );
} /* --- end-of-function bintreesum() --- */
#if defined(TESTTRECU)
#include <math.h>
#define MAXN (2000000)
int main(int argc, char *argv[]) {
int N = (argc>1? atoi(argv[1]) : 1000000 ),
e = (argc>2? atoi(argv[2]) : -10 ),
binsize = (argc>3? atoi(argv[3]) : 2 );
double tens = pow(10.0,(double)e);
double *vals = (double *)malloc(sizeof(double)*MAXN),
sum = 0.0;
int i;
if ( N > MAXN ) N=MAXN;
for ( i=0; i<N; i++ ) vals[i] = 1.0 + tens*(double)(i+1);
for ( i=0; i<N; i++ ) sum += vals[i];
printf(" N=%d, Sum_i=1^N {1.0 + i*%.1e} - N = %.8e,\n"
"\t plain_sum-N = %.8e,\n"
"\t trecu-N = %.8e,\n"
"\t bintreesum-N = %.8e \n",
N, tens, tens*((double)N)*((double)(N+1))/2.0,
sum-(double)N,
trecu(vals,0.0,N)-(double)N,
bintreesum(vals,N,binsize)-(double)N );
free((void *)vals);
return 0;
} /* --- end-of-function main() --- */
#endif
So if you save that as trecu.c, then compile it as
cc -DTESTTRECU trecu.c -lm -o trecu
and then run it with zero to three optional command-line args as
trecu #trials e binsize
Defaults are #trials=1000000 (like OP's program), e=-10, and binsize=2 (for my bintreesum() function to do a binary-tree sum rather than larger-size bins).
And here are some test results illustrating the problem described above,
bash-4.3$ ./trecu
N=1000000, Sum_i=1^N {1.0 + i*1.0e-10} - N = 5.00000500e+01,
plain_sum-N = 5.00000500e+01,
trecu-N = 5.00000500e+01,
bintreesum-N = 5.00000500e+01
bash-4.3$ ./trecu 1000000 -15
N=1000000, Sum_i=1^N {1.0 + i*1.0e-15} - N = 5.00000500e-04,
plain_sum-N = 5.01087168e-04,
trecu-N = 5.01087168e-04,
bintreesum-N = 5.00000548e-04
bash-4.3$ ./trecu 1000000 -16
N=1000000, Sum_i=1^N {1.0 + i*1.0e-16} - N = 5.00000500e-05,
plain_sum-N = 6.67552231e-05,
trecu-N = 6.67552231e-05,
bintreesum-N = 5.00001479e-05
bash-4.3$ ./trecu 1000000 -17
N=1000000, Sum_i=1^N {1.0 + i*1.0e-17} - N = 5.00000500e-06,
plain_sum-N = 0.00000000e+00,
trecu-N = 0.00000000e+00,
bintreesum-N = 4.99992166e-06
So you can see that for the default run, e=-10, everybody's doing everything right. That is, the top line that says "Sum" just does the n(n+1)/2 thing, so presumably displays the right answer. And everybody below that agrees for the default e=-10 test case. But for the e=-15 and e=-16 cases below that, trecu() exactly agrees with the plain_sum, while bintreesum stays pretty close to the right answer. And finally, for e=-17, plain_sum and trecu() have "disappeared", while bintreesum()'s still hanging in there pretty well.
So trecu()'s correctly doing the sum all right, but its recursion's apparently not doing that "binary tree" type of thing that my more straightforward iterative bintreesum()'s apparently doing correctly. And that indeed demonstrates that EOF's suggestion for "binary tree summation" realizes quite an improvement over the plain_sum for these 1+epsilon kind of cases. So we'd really like to see his trecu() recursion work!!! When I originally looked at it, I thought it did work. But that double-recursion (is there a special name for that?) in his default: case is apparently more confusing (at least to me:) than I thought. Like I said, it is doing the sum, but not the "binary tree" thing.
Okay, so who'd like to take on the challenge and explain what's going on in that trecu() recursion? And, maybe more importantly, fix it so it does what's intended. Thanks.
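One possible reading of the result above: trecu() passes the running total sum into the left recursion and then hands that result to the right recursion, so every leaf performs sum += *vals onto the full accumulated total, in index order. That is the same sequence of additions as the plain loop, which would explain the bit-identical output. A minimal sketch of a variant in which each half returns its own partial sum follows; pairwise_sum and plain_sum are hypothetical names, not EOF's code.

```c
#include <assert.h>
#include <stddef.h>

/* Pairwise (binary-tree) summation: each half is summed independently
   and only the two partial sums are added, so the operands at every
   level have similar magnitude. pairwise_sum is a hypothetical name. */
double pairwise_sum(const double *vals, size_t n) {
    assert(vals != NULL || n == 0);
    if (n == 0) return 0.0;
    if (n == 1) return vals[0];
    size_t mid = n / 2;
    return pairwise_sum(vals, mid) + pairwise_sum(vals + mid, n - mid);
}

/* Plain left-to-right accumulation, for comparison. */
double plain_sum(const double *vals, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += vals[i];
    return sum;
}
```

Called as pairwise_sum(vals, N) in place of trecu(vals, 0.0, N), this should track bintreesum() rather than plain_sum for the small-e cases, since no single element is ever added onto a total of order N.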

How to make a Gaussian wave packet move in a numerical simulation of a square barrier in C

I am trying to use Gaussian wave packets to study the transmission probability through a square barrier, using the Trotter-Suzuki formula and the fast Fourier transform (FFT), just as done in this Quantum Python article. But I need to implement it in C. In principle, the wave function should keep its shape before it collides with the square barrier. However, I found that the wave function flattens dramatically with time before it ever reaches the barrier. Can anybody find the problem in the following code?
Here, two files, result and psi.txt, are created to store the initial and the evolved wave function. The first two values on each line are the x coordinate and the probability density of the wave function at that x. The third value on each line of the file result is the square barrier potential. The FFT I use is shown in this C program.
#include <stdio.h>
#include <math.h>
#define h_bar 1.0
#define pi 3.1415926535897932385E0
#define m0 1.0
typedef double real;
typedef struct { real Re; real Im; } complex;
extern void fft(complex x[], int N, int flag);
complex complex_product(complex x, real y_power, real y_scale)
{
    real Re, Im;
    Re = (x.Re*cos(y_power)-x.Im*sin(y_power))*y_scale;
    Im = (x.Re*sin(y_power)+x.Im*cos(y_power))*y_scale;
    x.Re = Re; x.Im = Im;
    return x;
}
real potential(real x, real a)
{
    return (x<0 || x>=a) ? 0 : 1;
}
void main()
{
    int t_steps=20, i, N=pow(2,10), m, n;
    complex psi[N];
    real x0=-2, p0=1, k0=p0/h_bar, x[N], k[N], V[N];
    real sigma=0.5, a=0.1, x_lower=-5, x_upper=5;
    real dt=1, dx=(x_upper-x_lower)/N, dk=2*pi/(dx*N);
    FILE *file;
    file = fopen("result", "w");
    for (n=0; n<N; n++)
    {
        x[n] = x_lower+n*dx;
        k[n] = k0+(n-N*0.5)*dk;
        V[n] = potential(x[n], a);
        psi[n].Re = exp(-pow((x[n]-x0)/sigma, 2)/2)*cos(p0*(x[n]-x0)/h_bar);
        psi[n].Im = exp(-pow((x[n]-x0)/sigma, 2)/2)*sin(p0*(x[n]-x0)/h_bar);
    }
    for (m=0; m<N; m++)
    {
        fprintf(file, "%g %g %g\n", x[m], psi[m].Re*psi[m].Re+psi[m].Im*psi[m].Im, V[m]);
    }
    for (i=0; i<t_steps; i++)
    {
        printf("t_steps=%d\n", i);
        for (n=0; n<N; n++)
        {
            psi[n]=complex_product(psi[n], -V[n]*dt/h_bar, 1);
            psi[n]=complex_product(psi[n], -k[0]*x[n], dx/sqrt(2*pi));//x--->x_mod
        }
        fft(psi, N, 1);//psi: x_mod--->k_mod
        for (m=0; m<N; m++)
        {
            psi[m]=complex_product(psi[m], -m*dk*x[0], 1);//k_mod--->k
            psi[m]=complex_product(psi[m], -h_bar*k[m]*k[m]*dt/(2*m0), 1./N);
            psi[m]=complex_product(psi[m], m*dk*x[0], 1);//k--->k_mod
        }
        fft(psi, N, -1);
        for (n=0; n<N; n++)
        {
            psi[n] = complex_product(psi[n], k[0]*x[n], sqrt(2*pi)/dx);//x_mod--->x
        }
    }
    file = fopen("psi.txt", "w");
    for (m=0; m<N; m++)
    {
        fprintf(file, "%g %g 0\n", x[m], pow((psi[m]).Re, 2)+pow((psi[m]).Im, 2));
    }
}
I use the following Python code to plot the initial and final evolved wave functions:
call: `>>> python result psi.txt`
import matplotlib.pyplot as plt
from sys import argv
for filename in argv[1:]:
    print(filename)
    f = open(filename, 'r')
    lines = [line.strip(" \n").split(" ") for line in f]
    x = [float(line[0]) for line in lines]
    y = [float(line[2]) for line in lines]
    psi = [float(line[1]) for line in lines]
    print("x=%g, max=%g" % (x[psi.index(max(psi))], max(psi)))
    plt.plot(x, y, x, psi)
#plt.xlim([-1.0e-10, 1.0e-10])
plt.ylim([0, 3])
plt.show()
Your code is almost correct, sans the fact that you are missing the initial/final half-step in the real domain and some unnecessary operations (kmod -> k and back), but the main problem is that your initial conditions are really chosen badly. The time evolution of a Gaussian wavepacket results in the uncertainty spreading out quadratically in time:
Given your choice of particle mass and initial wavepacket width, the term in the braces equals 1 + 4t^2. After one timestep, the wavepacket is already significantly wider than initially and after another timestep becomes wider than the entire simulation box. The periodicity implied by the use of FFT results in spatial and frequency aliasing, which together with the overly large timestep is why your final wavefunction looks that strange.
I would advise that you try to replicate exactly the conditions of the Python program, including the fact that the entire system is in a deep potential well (V_border -> +infinity).
The variable i is uninitialised here:
k[n] = k0+(i-N*0.5)*dk;

Convolution Using FFTW3 and PortAudio

Edit (2017, Apr 27)
My fully working code is here. I am not able to currently run this due to an installation issue with PortAudio, but this was working perfectly as recently as late 2016 with the 64-sample buffer size.
Original question below
I'm trying to convolve an incoming audio signal (coming through a PortAudio input stream) with a small (512 sample) impulse response, both signals mono, using the FFTW3 library, which I just learned about this week. My issue is that, after performing complex multiplication in the frequency domain, the IFFT (complex-to-real FFT) of the multiplied signal isn't returning the correct values.
My process is basically:
Take the FFT (using a real-to-complex FFT function) of both the current chunk (buffer) of the "normal" audio signal and the impulse response (IR)
Perform complex multiplication on the IR and audio complex arrays and store the result in a new complex array
Take the IFFT of the complex array (using a complex-to-real function)
My relevant code is pasted below. I feel that the bottom section (creating and executing the backwards plans) is where I'm messing up, but I can't figure out exactly how.
Is my overall approach/structure to performing convolution correct? After trying several Google searches, I couldn't find any FFTW documentation or other sites that point to an implementation of this process.
//framesPerBuffer = 512; is set above
//data->ir_len is also set to 512
int convSigLen = framesPerBuffer + data->ir_len - 1;
//hold time domain audio and IR signals
double *in;
double *in2;
double *inIR;
double *in2IR;
double *convolvedSig;
//hold FFT values for audio and IR
fftw_complex *outfftw;
fftw_complex *outfftwIR;
//hold the frequency-multiplied signal
fftw_complex *outFftMulti;
//hold plans to do real-to-complex FFT
fftw_plan plan_forward;
fftw_plan plan_forwardIR;
//hold plans to do IFFT (complex-to-real)
fftw_plan plan_backward;
fftw_plan plan_backwardIR;
fftw_plan plan_backwardConv;
int nc, ncIR; //number of complex values to store in outfftw arrays
/**** Create the input arrays ****/
//Allocate space
in = fftw_malloc(sizeof(double) * framesPerBuffer );
inIR = fftw_malloc(sizeof(double) * data->ir_len);
//Store framesPerBuffer samples of the audio input to in*
for (i = 0; i < framesPerBuffer; i++)
in[i] = data->file_buff[i];
//Store the impulse response (IR) to inIR*
for (i = 0; i < data->ir_len; i++)
inIR[i] = data->irBuffer[i];
/**** Create the output arrays ****/
nc = framesPerBuffer/2 + 1;
outfftw = fftw_malloc(sizeof(fftw_complex) * nc);
ncIR = nc; //data->ir_len/2 + 1;
outfftwIR = fftw_malloc(sizeof(fftw_complex) * nc);
/**** Create the FFTW forward plans ****/
plan_forward = fftw_plan_dft_r2c_1d(framesPerBuffer, in, outfftw, FFTW_ESTIMATE);
plan_forwardIR = fftw_plan_dft_r2c_1d(data->ir_len, inIR, outfftwIR, FFTW_ESTIMATE);
/*** MULTIPLY FFTs!! ***/
outFftMulti = fftw_malloc(sizeof(fftw_complex) * nc);
for ( i = 0; i < nc; i++ )
{
    //calculate real and imaginary components for the multiplied array
    outFftMulti[i][0] = outfftw[i][0] * outfftwIR[i][0] - outfftw[i][1] * outfftwIR[i][1];
    outFftMulti[i][1] = outfftw[i][0] * outfftwIR[i][1] + outfftw[i][1] * outfftwIR[i][0];
}
/**** Prepare the input arrays to hold the [to be] IFFT'd data ****/
in2 = fftw_malloc(sizeof(double) * framesPerBuffer);
in2IR = fftw_malloc(sizeof(double) * framesPerBuffer);
convolvedSig = fftw_malloc(sizeof(double) * convSigLen);
/**** Prepare the backward plans and execute the IFFT ****/
plan_backward = fftw_plan_dft_c2r_1d(nc, outfftw, in2, FFTW_ESTIMATE);
plan_backwardIR = fftw_plan_dft_c2r_1d(ncIR, outfftwIR, in2IR, FFTW_ESTIMATE);
plan_backwardConv = fftw_plan_dft_c2r_1d(convSigLen, outFftMulti, convolvedSig, FFTW_ESTIMATE);
This is my first post on this site. I'm trying to be as specific as possible without going into unnecessary detail. I would greatly appreciate any help on this.
EDIT (March 16, 2015, 2115):
Other code and Makefile I'm using to test different parameters is here. The overall process is as follows:
Audio signal buffer x has length lenX. Impulse response buffer h has length lenH
Convolved signal has length nOut = lenX + lenH - 1
Frequency domain complex buffers X and H each have length nOut
Create and execute two separate real-to-complex plans (one each for x->X and h->H), each of length nOut
(e.g. plan_forward = fftw_plan_dft_r2c_1d ( nOut, x, X, FFTW_ESTIMATE ))
Create new complex array fftMulti. Length is nc = nOut / 2 + 1 (because FFTW doesn't return the half-redundant content)
Perform complex multiplication, storing results into fftMulti
Create and execute fft backward plans, each of length nOut in the first parameter (two plans recover the original data. The third creates the convolved signal in the time domain)
plan_backwardConv = fftw_plan_dft_c2r_1d(nOut, fftMulti, convolvedSig, FFTW_ESTIMATE);
plan_backward = fftw_plan_dft_c2r_1d ( nOut, X, xRecovered, FFTW_ESTIMATE );
plan_backwardIR = fftw_plan_dft_c2r_1d (nOut, H, hRecovered, FFTW_ESTIMATE);
My issue is that even though I can recover the original signals x and h with the correct values, the convolved signal is displaying very high values (between ~8 and 35), even when dividing each value by nOut when printing.
I can't tell which part(s) of my process are causing issues. Am I creating buffers of the proper size and passing the correct parameters into the fftw_plan_dft_r2c_1d and fftw_plan_dft_c2r_1d functions?
One reason for the unexpected results you are getting is that you do an FFT of length N and an IFFT of length N/2+1 = nc.
The lengths should be the same.
Furthermore, FFTW does not normalize. That means that if you apply y = ifft(fft(a)) to the 4-element vector a = {1,1,1,1}, you get y = {4,4,4,4}.
If you still have trouble, give us code that can be compiled as-is.
I got my question answered on DSP Stack Exchange:
Basically, I didn't zero-pad my time-domain signals before executing the FFT. For some reason I thought that the library did that automatically (like MATLAB does, if I recall correctly), but obviously I was wrong.
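Why the padding matters: multiplying two length-n spectra bin by bin and inverting yields a circular convolution of length n, so the tail of the result wraps around onto the start unless both inputs were first padded to at least lenX + lenH - 1 samples. The sketch below computes that circular convolution directly in the time domain (no FFTW involved) so the wrap-around can be seen on small arrays; circular_convolve and zero_pad are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Circular convolution of two length-n sequences. This is what an
   inverse DFT of the bin-by-bin product of two length-n DFTs returns
   (up to the 1/n scaling). */
void circular_convolve(const double *x, const double *h, double *out, size_t n) {
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += x[j] * h[(i + n - j) % n];
        out[i] = acc;
    }
}

/* Copy len samples into a buffer of padded_len >= len, zeroing the rest. */
void zero_pad(const double *src, size_t len, double *dst, size_t padded_len) {
    assert(padded_len >= len);
    for (size_t i = 0; i < padded_len; i++)
        dst[i] = (i < len) ? src[i] : 0.0;
}
```

For x = {1,2,3} and h = {1,1,1}, the unpadded length-3 result is {6,6,6}, while padding both to length 5 gives the linear convolution {1,3,6,5,3}.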

Kalman Filter implementation - what could be wrong

I am sorry for being this tedious, but I have reviewed my code several times with the help of a dozen articles and my KF still doesn't work. By "doesn't work" I mean that the estimates produced by the KF are wrong. Here is a nice paste of Real, Noised and KF estimated positions (just a small chunk).
My example is the same as in every tutorial I've found - I have a state vector of position and velocity. Position is in meters and represents vertical position in air. My real world case is skydiving (with parachute). In my sample generated data I've assumed we start at 3000m and the velocity is 10m/s.
P.S.: I am pretty sure matrix computations are OK - there must be an error with the logic.
Here I generate data:
void generateData(float** inData, float** noisedData, int x, int y){
    inData[0][0]= 3000; //start position
    inData[1][0]= -10; // 10m/s velocity; minus because we assume it's falling
    noisedData[0][0]= 2998;
    noisedData[1][0]= -10;
    for(int i=1; i<x; i++){
        inData[0][i]= inData[0][i-1] + inData[1][i-1];
        inData[1][i]= inData[1][i-1]; //the velocity doesn't change for simplicity's sake
        noisedData[0][i]=inData[0][i]+(rand()%6-3); //we add noise to real measurement
        noisedData[1][i]=inData[1][i]; //velocity has no noise
    }
}
And this is my implementation (matrices initialization is based on Wikipedia Kalman example):
int main(int argc, char** argv) {
    float** inData = createMatrix(100,2); //2 rows, 100 columns
    float** noisedData = createMatrix(100,2);
    float** estData = createMatrix(100,2);
    generateData(inData, noisedData, 100, 2);
    float sampleRate=0.1; //10hz
    float** A=createMatrix(2,2);
    float** B=createMatrix(1,2);
    float** C=createMatrix(2,1);
    C[0][0]=1; //we measure only position
    float u=1.0; //acceleration magnitude
    float accel_noise=0.2; //acceleration noise
    float measure_noise=1.5; //1.5 m standard deviation
    float R=pow(measure_noise,2); //measure covariance
    float** Q=createMatrix(2,2); //process covariance
    float** P=createMatrix(2,2); //covariance update
    float** P_est=createMatrix(2,2);
    float** K=createMatrix(1,2); //Kalman gain
    float** X_est=createMatrix(1,2); //our estimated state
    X_est[0][0]=3000; X_est[1][0]=10;
    for(int i=0; i<100; i++)
    {
        float** temp;
        float** temp2;
        float** temp3;
        float** C_trans=matrixTranspose(C,2,1);
        temp=matrixMultiply(P_est,C_trans,2,2,1,2); //2x1
        temp2=matrixMultiply(C,P_est,2,1,2,2); //1x2
        temp3=matrixMultiply(temp2,C_trans,2,1,1,2); //1x1
        K[0][0]=temp[0][0]/temp3[0][0]; // 1. KALMAN GAIN
        float diff=noisedData[0][i]-temp[0][0]; //diff between meas and est
        X_est[0][0]=X_est[0][0]+(K[0][0]*diff); // 2. ESTIMATION CORRECTION
        temp[0][0]=1; temp[0][1]=0; temp[1][0]=0; temp[1][1]=1;
        P=matrixMultiply(temp3,P_est,2,2,2,2); // 3. COVARIANCE UPDATE
        X_est[1][0]=temp[1][0]+B[1][0]*u; // 4. PREDICT NEXT STATE
        float** A_inv=getInverse(A,2);
        P_est=matrixAdd(temp2,Q,2,2,2,2); // 5. PREDICT NEXT COVARIANCE
        estData[0][i]=X_est[0][0]; //just saving here for later to write out
    }
    for(int i=0; i<100; i++) printf("%4.2f : %4.2f : %4.2f \n", inData[0][i], noisedData[0][i], estData[0][i]); // just writing out
    return (EXIT_SUCCESS);
}
It looks like you are assuming a rigid body model for the problem. If that is the case, then for the problem you are solving, I would not include the input u in the process update that predicts the next state. Maybe I am missing something, but the input u does not play any role in generating the data.
Let me put it another way: setting u to +1 means your model assumes that the body should move in the +x direction because there is an input in that direction, while the measurements are telling it to go the other way. So if you put a lot of weight on the measurements, the estimate is going to go in the negative direction, but if you put a lot of weight on the model, it will go in the positive direction. Anyway, based on the data generated, I don't see a reason for setting u to anything but zero.
Another thing: your sample period is 0.1 s (a 10 Hz rate), but when you generate the data you are assuming it is one second, since the position changes by -10 meters with every sample.
Here is a matlab/octave implementation.
l = 1000;
Ts = 0.1;
v = -10; % velocity, assumed to be -10 m/s as in the question's generated data
y = 3000; %measurement to be fed to KF
t = [y(1);v]; % truth for checking if its working
for i=2:l
    y(i) = y(i-1) + (v)*Ts;
    t(:,i) = [y(i);v]; % copy to truth vector
    y(i) = y(i) + randn; % noise it up
end
%%%%% Let the filtering begin!
% Define dynamics
A = [1, Ts; 0, 1];
B = [0;0];
C = [1,0];
% Steady State Kalman Gain computed for R = 0.1, Q = [0,0;0,0.1]
K = [0.44166;0.79889];
x_est_post = [3000;0];
for i=2:l
    x_est_pre = A*x_est_post(:,i-1); % Process update! That is our estimate in case no measurement comes in.
    x_est_post(:,i) = x_est_pre + K*(-x_est_pre(1)+y(i));
end
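Since the question is in C, here is a rough C counterpart of the filtering loop above. It is a sketch under the same assumptions as the Octave listing (constant-velocity model, position-only measurement, no control input, fixed gain); State and kf_step are hypothetical names, and the gain values have to be supplied by the caller.

```c
#include <assert.h>

/* State estimate: position (m) and velocity (m/s). */
typedef struct { double pos; double vel; } State;

/* One step of a fixed-gain (steady-state) Kalman filter for the model
   x_k = A x_{k-1}, y_k = pos_k, with A = [1 Ts; 0 1], C = [1 0] and
   B*u = 0. k_pos and k_vel are the two entries of the gain vector K. */
State kf_step(State prev, double y, double Ts, double k_pos, double k_vel) {
    assert(Ts > 0.0);
    /* Process update: the prediction if no measurement came in. */
    State pre = { prev.pos + Ts * prev.vel, prev.vel };
    /* Measurement update: correct both states with the innovation. */
    double innovation = y - pre.pos;
    State post = { pre.pos + k_pos * innovation, pre.vel + k_vel * innovation };
    return post;
}
```

Starting from State s = {3000.0, 0.0} and calling s = kf_step(s, y_i, 0.1, 0.44166, 0.79889) once per measurement mirrors the loop above. Note that the velocity estimate is corrected from the position innovation as well, which the question's code never does.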
You are doing a lot of weird array indexing.
float** A=createMatrix(2,2);
What is the expected outcome of indexing outside of the bounds of the array?

DTMF Goertzel Algorithm Not Working

So I am opening a .raw file of a DTMF tone I generated in audacity. I grabbed a canned goertzel algorithm similar to the one on the wikipedia article. It doesn't seem to decode the correct numbers though.
The decoded number also changes depending on what value of N I pass to the algorithm. As far as I understood, a higher value of N gives better accuracy but should not change which number gets decoded, correct?
Here is the code,
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
double goertzel(short samples[], double freq, int N)
{
    double s_prev = 0.0;
    double s_prev2 = 0.0;
    double coeff, normalizedfreq, power, s;
    int i;
    normalizedfreq = freq / 8000;
    coeff = 2*cos(2*M_PI*normalizedfreq);
    for (i=0; i<N; i++)
    {
        s = samples[i] + coeff * s_prev - s_prev2;
        s_prev2 = s_prev;
        s_prev = s;
    }
    power = s_prev2*s_prev2+s_prev*s_prev-coeff*s_prev*s_prev2;
    return power;
}
int main()
FILE *fp = fopen("9.raw", "rb");
short *buffer;
float *sample;
int sample_size;
int file_size;
int i=0, x=0;
float frequency_row[] = {697, 770, 852, 941};
float frequency_col[] = {1209, 1336, 1477};
float magnitude_row[4];
float magnitude_col[4];
double result;
fseek(fp, 0, SEEK_END);
file_size = ftell(fp);
fseek(fp, 0, SEEK_SET);
buffer = malloc(file_size);
buffer[x] = getc(fp);
buffer[x] = buffer[x]<<8;
buffer[x] = buffer[x] | getc(fp);
buffer[x] = getc(fp);
buffer[x] = buffer[x]<<8;
buffer[x] = buffer[x] | getc(fp);
for(i=0; i<x; i++)
//printf("%#x\n", (unsigned short)buffer[i]);
for(i=0; i<4; i++)
magnitude_row[i] = goertzel(buffer, frequency_row[i], 8000);
for(i=0; i<3; i++)
magnitude_col[i] = goertzel(buffer, frequency_col[i], 8000);
for(i=0; i<4; i++)
if(magnitude_row[i] > magnitude_row[x])
x = i;
printf("Freq: %f\t Mag: %f\n", frequency_row[x], magnitude_row[x]);
for(i=0; i<3; i++)
if(magnitude_col[i] > magnitude_col[x])
x = i;
printf("Freq: %f\t Mag: %f\n", frequency_col[x], magnitude_col[x]);
return 0;
The algorithm is actually tricky to use, even for something as simple as detecting DTMF tones. It is actually effectively a band-pass filter - it singles out a band of frequencies centered around the frequency given. This is actually a good thing - you can't count on your sampled tone to be exactly the frequency you are trying to detect.
The tricky part is attempting to set the bandwidth of the filter - how wide the range of frequencies is that will be filtered to detect a particular tone.
One of the references on the Wikipedia page on the subject (this one to be precise) talks about implementing DTMF tone detection using the Goertzel Algorithm in DSP. The principles are the same for C - to get the bandwidth right you have to use the right combination of provided constants. Apparently there is no simple formula - the paper mentions having to use a brute force search, and provides a list of optimal constants for the DTMF frequencies sampled at 8kHz.
Are you sure the audio data Audacity generated is in big-endian format? You are interpreting it in big-endian, whereas they are normally in little-endian if you run it on x86.
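For comparison, a little-endian decode could look like the sketch below. That the file holds signed 16-bit little-endian samples is an assumption to check against the export settings used in Audacity, and decode_s16le is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode n_samples signed 16-bit little-endian PCM samples from raw
   bytes: low byte first, high byte second. Swap the two byte indices
   to read big-endian data instead. */
void decode_s16le(const unsigned char *bytes, size_t n_samples, int16_t *out) {
    assert(bytes != NULL && out != NULL);
    for (size_t i = 0; i < n_samples; i++) {
        unsigned u = (unsigned)bytes[2 * i] | ((unsigned)bytes[2 * i + 1] << 8);
        /* Map the 16-bit pattern to its two's complement value. */
        out[i] = (u & 0x8000u) ? (int16_t)((int)u - 65536) : (int16_t)u;
    }
}
```

The question's loop shifts the first byte read into the high half, which is the big-endian interpretation. If the file really is little-endian, every sample comes out byte-swapped and the tones are lost.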
There are some interesting answers here.
First, the goertzel is in fact a "sympathetic" oscillator.
That means that the poles are on the unit circle in DSP terms.
The internal variables s, s_prev, s_prev2 will grow without bound if you run the code on a long block of data containing the expected tone (freq) for that detector.
This means that you need to run a kind of integrate-and-dump process to get results.
The Goertzel works best at discriminating between DTMF digits if you run about 105 to 110 samples into it at a time. So set N = 110 and call the goertzel repeatedly as you run through your data.
Incidentally, real DTMF digits may last as little as 60 msec, and you should report their presence if you find more than 40 msec.
With the 110 samples I mentioned, one call covers 110/8000 = 13.75 msec. If you are very fortunate, you will see positive output from 4 consecutive calls to the detector.
In the past I have found that running a pair of detectors in parallel, with staggered start times, will provide better coverage of very short tone bursts.
