What is the significance of the output of this FFT function? - c

I have the following periodic data which has a period of ~2000:
I am trying to discover the period of the data and the offset of the first peak. I have the following FFT function to perform a Fourier Transform:
typedef double _Complex cplx;
void _fft(cplx buf[], cplx out[], int n, int step){
if (step < n) {
_fft(out, buf, n, step*2);
_fft(out+step, buf+step, n, step*2);
for(int i=0; i<n; i+=step*2) {
cplx t = cexp(-I * M_PI * i / n) * out[i + step];
buf[i / 2] = (out[i] + t);
buf[(i + n)/2] = (out[i] - t);
}
}
}
void fft(cplx* buf, int n){
cplx* out = (cplx*)malloc(sizeof(cplx) * n);
memcpy(out, buf, sizeof(cplx)*n);
_fft(buf, out, n, 1);
for(int i=0; i<n; i++){ buf[i] /= n; }
free(out);
}
Which has been adapted from here: Fast Fourier Transformation (C) (link contains a full running example with main function and example data)
I understand that a Fourier Transform converts time series data into frequency data. Each frequency has a amplitude and a phase. However, I am having a hard time understanding the output given by this function. Graphing the output gives me this:
I have tried graphing the real component, the imaginary component, and the magnitude of both components. Each attempt gives a very similar-looking graph.
Am I wrong to assume there should be a spike at ~2000?
Am I miss-interpreting the output of this function?

Am I wrong to assume there should be a spike at ~2000?
Yes, because 2000 is the period you're interested in, not the frequency. It looks like you're running a 32,768-point FFT, so you should expect to find a peak in bin #16 (16 cycles per 32k = periods of approximately 2048 samples), not bin #2000.
If you want something that reports directly in terms of number of samples, instead of frequency, you may want an autocorrelation, instead of an FFT. Your signal would have autocorrelation peaks at lags of 2000, 4000, etc.

Related

Optimize a weighted moving average

Environment : STM32H7 and GCC
Working with a flow of data : 1 sample received from SPI every 250 us
I do a "triangle" weighted moving average with 256 samples, like this but middle sample is weighted 1 and it forms a triangle around it
My samples are stored in uint32_t val[256] circular buffer, it works with a uint8_t write_index
The samples are 24 bits, the max value of a sample is 0x00FFFFFF
uint8_t write_idx =0;
uint32_t val[256];
float coef[256];
void init(void)
{
uint8_t counter=0;
// I calculate my triangle coefs
for(uint16_t c=0;c<256;c++)
{
coef[c]=(c>127)?--counter:++counter;
coef[c]/=128;
}
}
void ACQ_Complete(void)
{
uint32_t moy=0;
// write_idx is meant to wrap
val[write_idx++]= new_sample;
// calc moving average (uint8_t)(c-write_idx) is meant to wrap
for(uint16_t c=0;c<256;c++)
moy += (uint32_t)(val[c]*coef[(uint8_t)(c-write_idx)]);
moy/=128;
}
I have to do the calcs during a 250 us time span, but I measured with a debug GPIO pin that the "moy" part takes 252 us
Code is simulated here
Interesting fact : If I remove the (uint32_t) cast near the end it takes 274 us instead of 252 us
How can I get it done faster ?
I was thinking of using uint32 instead of float for coef (by multiply by 1000 for example) but my uint32 would overflow
This should unquestionably be done in integer. It will be both faster and more accurate.
This processor can do 32x32+64=64 multiply accumulate in a single cycle!
Multiply all your coefficients by a power of 2 (not 1000 mentioned in the comments), and then shift down at the end rather than divide.
uint32_t coef[256];
uint64_t moy = 0;
for(unsigned int c = 0; c < 256; c++)
{
moy += (val[c] * (uint64_t)coef[(c - write_idx) & 0xFFu]);
}
moy >>= N;

How to write C code for a long signal and long kernel convolution

I would like to do a linear convolution for a signal of length 4000*270, with a kernel of length 16000. The signal is not fixed while the kernel is fixed. This needs to be repeated for many times for my purpose, so I want to improve the speed as soon as possible. I can implement this convolution in either R or C.
At first, I tried doing the convolution in R, but the speed cannot satisfy my need. I tried doing it by iteration and it was too slow. I also tried doing it using FFT, but because both signal and kernel are long, FFT didn't improve the speed a lot.
Then I decided to do convolution iteratively in C. But C seems not to be able to handle such amount of calculation and reported error very often. Even when it works, it is still very slow. I also tried doing fft convolution in C, but the program always shut down.
I found this code from a friend of mine and not sure about the original source. I will delete it if there is a copyright issue.This is the C code I used for doing fft in C, but the program cannot handle the long vector with length 2097152 (the smallest power of 2 greater than or equal to the signal vector length).
#define q 3 /* for 2^3 points */
#define N 2097152 /* N-point FFT, iFFT */
typedef float real;
typedef struct{real Re; real Im;} complex;
#ifndef PI
# define PI 3.14159265358979323846264338327950288
#endif
void fft( complex *v, int n, complex *tmp )
{
if(n>1) { /* otherwise, do nothing and return */
int k,m;
complex z, w, *vo, *ve;
ve = tmp;
vo = tmp+n/2;
for(k=0; k<n/2; k++) {
ve[k] = v[2*k];
vo[k] = v[2*k+1];
}
fft( ve, n/2, v ); /* FFT on even-indexed elements of v[] */
fft( vo, n/2, v ); /* FFT on odd-indexed elements of v[] */
for(m=0; m<n/2; m++) {
w.Re = cos(2*PI*m/(double)n);
w.Im = -sin(2*PI*m/(double)n);
z.Re = w.Re*vo[m].Re - w.Im*vo[m].Im; /* Re(w*vo[m]) */
z.Im = w.Re*vo[m].Im + w.Im*vo[m].Re; /* Im(w*vo[m]) */
v[ m ].Re = ve[m].Re + z.Re;
v[ m ].Im = ve[m].Im + z.Im;
v[m+n/2].Re = ve[m].Re - z.Re;
v[m+n/2].Im = ve[m].Im - z.Im;
}
}
return;
}
void ifft( complex *v, int n, complex *tmp )
{
if(n>1) { /* otherwise, do nothing and return */
int k,m;
complex z, w, *vo, *ve;
ve = tmp;
vo = tmp+n/2;
for(k=0; k<n/2; k++) {
ve[k] = v[2*k];
vo[k] = v[2*k+1];
}
ifft( ve, n/2, v ); /* FFT on even-indexed elements of v[] */
ifft( vo, n/2, v ); /* FFT on odd-indexed elements of v[] */
for(m=0; m<n/2; m++) {
w.Re = cos(2*PI*m/(double)n);
w.Im = sin(2*PI*m/(double)n);
z.Re = w.Re*vo[m].Re - w.Im*vo[m].Im; /* Re(w*vo[m]) */
z.Im = w.Re*vo[m].Im + w.Im*vo[m].Re; /* Im(w*vo[m]) */
v[ m ].Re = ve[m].Re + z.Re;
v[ m ].Im = ve[m].Im + z.Im;
v[m+n/2].Re = ve[m].Re - z.Re;
v[m+n/2].Im = ve[m].Im - z.Im;
}
}
return;
}
I found this page talking about long signal convolution https://ccrma.stanford.edu/~jos/sasp/Convolving_Long_Signals.html
But I'm not sure how to use the idea in it. Any thoughts would be truly appreciated and I'm ready to provide more information about my question.
The most common efficient long FIR filter method is to use FFT/IFFT overlap-add (or overlap-save) fast convolution, as per the CCRMA paper you referenced. Just chop your data into shorter blocks more suitable for your FFT library and processor data cache sizes, zero-pad by at least the filter kernel length, FFT filter, and sequentially overlap-add the remainder/tails after each IFFT.
Huge long FFTs will most likely trash your processor's caches, which will likely dominate over any algorithmic O(NlogN) speedup.

Strategy for doing final reduction

I am trying to implement an OpenCL version for doing reduction of a array of float.
To achieve it, I took the following code snippet found on the web :
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
}
This kernel code works well but I would like to compute the final sum by adding all the partial sums of each work group.
Currently, I do this step of final sum by CPU with a simple loop and iterations nWorkGroups.
I saw also another solution with atomic functions but it seems to be implemented for int, not for floats. I think that only CUDA provides atomic functions for float.
I saw also that I could another kernel code which performs this operation of sum but I would like to avoid this solution in order to keep a simple readable source. Maybe I cannot do without this solution...
I must tell you that I use OpenCL 1.2 (returned by clinfo) on a Radeon HD 7970 Tahiti 3GB (I think that OpenCL 2.0 is not supported with my card).
More generally, I would like to get advice about the simplest method to perform this last final summation with my graphics card model and OpenCL 1.2.
If that float's order of magnitude is smaller than exa scale, then:
Instead of
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
You could use
if (local_id == 0)
{
if(strategy==ATOMIC)
{
long integer_part=getIntegerPart(localSums[0]);
atom_add (&totalSumIntegerPart[0] ,integer_part);
long float_part=1000000*getFloatPart(localSums[0]);
// 1000000 for saving meaningful 7 digits as integer
atom_add (&totalSumFloatPart[0] ,float_part);
}
}
this will overflow float part so when you divide it by 1000000 in another kernel, it may have more than 1000000 value so you get its integer part and add it to the real integer part:
float value=0;
if(strategy==ATOMIC)
{
float float_part=getFloatPart_(totalSumFloatPart[0]);
float integer_part=getIntegerPart_(totalSumFloatPart[0])
+ totalSumIntegerPart[0];
value=integer_part+float_part;
}
just a few atomic operations shouldn't be effective on whole kernel time.
Some of these get___part can be written easily already using floor and similar functions. Some need a divide by 1M.
Sorry for previous code.
also It has problem.
CLK_GLOBAL_MEM_FENCE effects only current workgroup.
I confused. =[
If you want to reduction sum by GPU, you should enqueue reduction kernel by NDRangeKernel function after clFinish(commandQueue).
Plaese just take concept.
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
barrier(CLK_GLOBAL_MEM_FENCE);
if(get_group_id(0)==0){
if(local_id < get_num_groups(0)){ // 16384
for(int n=0 ; n<get_num_groups(0) ; n+= group_size )
localSums[local_id] += partialSums[local_id+n];
barrier(CLK_LOCAL_MEM_FENCE);
for(int s=group_size/2;s>0;s/=2){
if(local_id < s)
localSums[local_id] += localSums[local_id+s];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(local_id == 0)
partialSums[0] = localSums[0];
}
}
}

How to make gaussian package move in numerical simulation of a square barrier in C

I am trying to use Gaussian packages to study the transmission probability via Trotter-Suzuki formula and fast Fourier transform (FFT) when confronted with a square barrier, just as done in this Quantum Python article. But I need to realize it using C. In principle, the wave function will remain its shape before the collision with the square barrier. But I found that the wave function becomes flat dramatically with time before colliding with the square barrier. Anybody finds problems in the following codes?
Here, two files - result and psi.txt - are created to store the initial and evolved wave-function. The first two data for each are x coordinates, the probability of the wave function at that x. The third data for each line in file result is the square barrier distribution. The FFT I use is shown in this C program.
#include <stdio.h>
#include <math.h>
#define h_bar 1.0
#define pi 3.1415926535897932385E0
#define m0 1.0
typedef double real;
typedef struct { real Re; real Im; } complex;
extern void fft(complex x[], int N, int flag);
complex complex_product(complex x, real y_power, real y_scale)
{//x*exp(i*y_power)*y_scale
real Re, Im;
Re = (x.Re*cos(y_power)-x.Im*sin(y_power))*y_scale;
Im = (x.Re*sin(y_power)+x.Im*cos(y_power))*y_scale;
x.Re = Re; x.Im = Im;
return x;
}
real potential(real x, real a)
{
return (x<0 || x>=a) ? 0 : 1;
}
void main()
{
int t_steps=20, i, N=pow(2,10), m, n;
complex psi[N];
real x0=-2, p0=1, k0=p0/h_bar, x[N], k[N], V[N];
real sigma=0.5, a=0.1, x_lower=-5, x_upper=5;
real dt=1, dx=(x_upper-x_lower)/N, dk=2*pi/(dx*N);
FILE *file;
file = fopen("result", "w");
//initialize
for (n=0; n<N; n++)
{
x[n] = x_lower+n*dx;
k[n] = k0+(n-N*0.5)*dk;
V[n] = potential(x[n], a);
psi[n].Re = exp(-pow((x[n]-x0)/sigma, 2)/2)*cos(p0*(x[n]-x0)/h_bar);
psi[n].Im = exp(-pow((x[n]-x0)/sigma, 2)/2)*sin(p0*(x[n]-x0)/h_bar);
}
for (m=0; m<N; m++)
fprintf(file, "%g %g %g\n", x[m], psi[m].Re*psi[m].Re+psi[m].Im*psi[m].Im, V[m]);
fclose(file);
for (i=0; i<t_steps; i++)
{
printf("t_steps=%d\n", i);
for (n=0; n<N; n++)
{
psi[n]=complex_product(psi[n], -V[n]*dt/h_bar, 1);
psi[n]=complex_product(psi[n], -k[0]*x[n], dx/sqrt(2*pi));//x--->x_mod
}
fft(psi, N, 1);//psi: x_mod--->k_mod
for (m=0; m<N; m++)
{
psi[m]=complex_product(psi[m], -m*dk*x[0], 1);//k_mod--->k
psi[m]=complex_product(psi[m], -h_bar*k[m]*k[m]*dt/(2*m0), 1./N);
psi[m]=complex_product(psi[m], m*dk*x[0], 1);//k--->k_mod
}
fft(psi, N, -1);
for (n=0; n<N; n++)
psi[n] = complex_product(psi[n], k[0]*x[n], sqrt(2*pi)/dx);//x_mod--->x
}
file = fopen("psi.txt", "w");
for (m=0; m<N; m++)
fprintf(file, "%g %g 0\n", x[m], pow((psi[m]).Re, 2)+pow((psi[m]).Im, 2));
fclose(file);
}
I use the following Python code to plot the initial and final evolved wave functions:
call: `>>> python plot.py result psi.txt`
import matplotlib.pyplot as plt
from sys import argv
for filename in argv[1:]:
print filename
f = open(filename, 'r')
lines = [line.strip(" \n").split(" ") for line in f]
x = [float(line[0]) for line in lines]
y = [float(line[2]) for line in lines]
psi = [float(line[1]) for line in lines]
print "x=%g, max=%g" % (x[psi.index(max(psi))], max(psi))
plt.plot(x, y, x, psi)
#plt.xlim([-1.0e-10, 1.0e-10])
plt.ylim([0, 3])
plt.show()
Your code is almost correct, sans the fact that you are missing the initial/final half-step in the real domain and some unnecessary operations (kmod -> k and back), but the main problem is that your initial conditions are really chosen badly. The time evolution of a Gaussian wavepacket results in the uncertainty spreading out quadratically in time:
Given your choice of particle mass and initial wavepacket width, the term in the braces equals 1 + 4 t2. After one timestep, the wavepacket is already significantly wider than initially and after another timestep becomes wider than the entire simulation box. The periodicity implied by the use of FFT results in spatial and frequency aliasing, which together with the overly large timestep is why your final wavefunction looks that strange.
I would advise that you try to replicate exactly the conditions of the Python program, including the fact that the entire system is in a deep potential well (Vborder -> +oo).
The variable i is uninitialised here:
k[n] = k0+(i-N*0.5)*dk;

DTMF Goertzel Algorithm Not Working

So I am opening a .raw file of a DTMF tone I generated in audacity. I grabbed a canned goertzel algorithm similar to the one on the wikipedia article. It doesn't seem to decode the correct numbers though.
The decoded number also changes depending on what value of N I pass to the algorithm. As far as I understood a higher value of N gives it better accuracy but should not change what number would get decoded correct?
Here is the code,
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
double goertzel(short samples[], double freq, int N)
{
double s_prev = 0.0;
double s_prev2 = 0.0;
double coeff, normalizedfreq, power, s;
int i;
normalizedfreq = freq / 8000;
coeff = 2*cos(2*M_PI*normalizedfreq);
for (i=0; i<N; i++)
{
s = samples[i] + coeff * s_prev - s_prev2;
s_prev2 = s_prev;
s_prev = s;
}
power = s_prev2*s_prev2+s_prev*s_prev-coeff*s_prev*s_prev2;
return power;
}
int main()
{
FILE *fp = fopen("9.raw", "rb");
short *buffer;
float *sample;
int sample_size;
int file_size;
int i=0, x=0;
float frequency_row[] = {697, 770, 852, 941};
float frequency_col[] = {1209, 1336, 1477};
float magnitude_row[4];
float magnitude_col[4];
double result;
fseek(fp, 0, SEEK_END);
file_size = ftell(fp);
fseek(fp, 0, SEEK_SET);
buffer = malloc(file_size);
buffer[x] = getc(fp);
buffer[x] = buffer[x]<<8;
buffer[x] = buffer[x] | getc(fp);
while(!feof(fp))
{
x++;
buffer[x] = getc(fp);
buffer[x] = buffer[x]<<8;
buffer[x] = buffer[x] | getc(fp);
}
for(i=0; i<x; i++)
{
//printf("%#x\n", (unsigned short)buffer[i]);
}
for(i=0; i<4; i++)
{
magnitude_row[i] = goertzel(buffer, frequency_row[i], 8000);
}
for(i=0; i<3; i++)
{
magnitude_col[i] = goertzel(buffer, frequency_col[i], 8000);
}
x=0;
for(i=0; i<4; i++)
{
if(magnitude_row[i] > magnitude_row[x])
x = i;
}
printf("Freq: %f\t Mag: %f\n", frequency_row[x], magnitude_row[x]);
x=0;
for(i=0; i<3; i++)
{
if(magnitude_col[i] > magnitude_col[x])
x = i;
}
printf("Freq: %f\t Mag: %f\n", frequency_col[x], magnitude_col[x]);
return 0;
}
The algorithm is actually tricky to use, even for something as simple as detecting DTMF tones. It is actually effectively a band-pass filter - it singles out a band of frequencies centered around the frequency given. This is actually a good thing - you can't count on your sampled tone to be exactly the frequency you are trying to detect.
The tricky part is attempting to set the bandwidth of the filter - how wide the range of frequencies is that will be filtered to detect a particular tone.
One of the references on the Wikipedia page on the subject (this one to be precise) talks about implementing DTMF tone detection using the Goertzel Algorithm in DSP. The principles are the same for C - to get the bandwidth right you have to use the right combination of provided constants. Apparently there is no simple formula - the paper mentions having to use a brute force search, and provides a list of optimal constants for the DTMF frequencies sampled at 8kHz.
Are you sure the audio data Audacity generated is in big-endian format? You are interpreting it in big-endian, whereas they are normally in little-endian if you run it on x86.
There are some interesting answers here.
First, the goertzel is in fact a "sympathetic" oscillator.
That means that the poles are on the unit circle in DSP terms.
The internal variables s, s_prev, s_prev2 will grow without bound if you run the code on a long block of data containing the expected tone (freq) for that detector.
This means that you need to run a kind of integrate an dump process to get results.
The goertzel works best at discriminating between DTMF digits if you run about 105 to 110 samples into it at a time. So set N = 110 and call the goertzel repeatedly as you run through you data.
Incidentally, real DTMF digits may only last as little as 60 msec and you should report their presence if you find more than 40 msec.
Think about the 110 samples I mentioned, means one call covers 110/8000 = 13.75 msec. If you are very fortunate, then you will see positive output from 4 consecutive iterations of calls to the detector.
In the past I have found that running a pair of detectors in parallel with staggered start times, with provide better coverage of very short tone bursts.

Resources