I have an application where I need to find the position of peaks in a given set of data. The resolution must be much higher than the spacing between the datapoints (i.e. it is not sufficient to find the highest datapoint; instead, a "virtual" peak position has to be estimated from the shape of the peak). A peak consists of about 4 or 5 datapoints. A dataset is acquired every few ms, and the peak detection has to be performed in real time.
I compared several methods in LabVIEW and I found the best result (in terms of resolution and speed) is given by the LabVIEW PeakDetector.vi, which scans the dataset with a moving window (>= 3 points width) and for each position performs a quadratic fit. The resulting quadratic function (a parabola) has a local maximum, which is in turn compared to nearby points.
Now I want to implement the same method in C. The polynomial fit is implemented as follows (using Gaussian elimination):
// Fits y_data from x_start to (x_start + window) with a parabola and returns x_max and y_max
int polymax(uint16_t *y_data, int x_start, int window, double *x_max, double *y_max)
{
    float sum[7], mat[3][4], temp = 0, temp1 = 0, a1, a2, a3;
    int i, j;
    float x[window];
    float y[window];

    for (i = 0; i < window; i++)
        x[i] = (float)i;
    // subtract the first sample so the fitted values stay small
    for (i = 0; i < window; i++)
        y[i] = (float)(y_data[x_start + i] - y_data[x_start]);

    for (i = 0; i < window; i++)
    {
        temp  = temp  + x[i];
        temp1 = temp1 + y[i];
    }
    sum[0] = temp;                                /* sum of x */
    sum[1] = temp1;                               /* sum of y */
    sum[2] = sum[3] = sum[4] = sum[5] = sum[6] = 0;
    for (i = 0; i < window; i++)
    {
        sum[2] = sum[2] + (x[i]*x[i]);            /* sum of x^2   */
        sum[3] = sum[3] + (x[i]*x[i]*x[i]);       /* sum of x^3   */
        sum[4] = sum[4] + (x[i]*x[i]*x[i]*x[i]);  /* sum of x^4   */
        sum[5] = sum[5] + (x[i]*y[i]);            /* sum of x*y   */
        sum[6] = sum[6] + (x[i]*x[i]*y[i]);       /* sum of x^2*y */
    }

    /* normal equations of the least-squares parabola, augmented with the RHS */
    mat[0][0] = window;
    mat[0][1] = mat[1][0] = sum[0];
    mat[0][2] = mat[1][1] = mat[2][0] = sum[2];
    mat[1][2] = mat[2][1] = sum[3];
    mat[2][2] = sum[4];
    mat[0][3] = sum[1];
    mat[1][3] = sum[5];
    mat[2][3] = sum[6];

    /* Gauss-Jordan elimination: clear column 0 of rows 1 and 2 */
    temp  = mat[1][0] / mat[0][0];
    temp1 = mat[2][0] / mat[0][0];
    for (j = 0; j < 4; j++)
    {
        mat[1][j] = mat[1][j] - (mat[0][j] * temp);
        mat[2][j] = mat[2][j] - (mat[0][j] * temp1);
    }
    /* clear column 1 of rows 2 and 0 */
    temp  = mat[2][1] / mat[1][1];
    temp1 = mat[0][1] / mat[1][1];
    for (j = 0; j < 4; j++)
    {
        mat[2][j] = mat[2][j] - (mat[1][j] * temp);
        mat[0][j] = mat[0][j] - (mat[1][j] * temp1);
    }
    /* clear column 2 of rows 0 and 1 */
    temp  = mat[0][2] / mat[2][2];
    temp1 = mat[1][2] / mat[2][2];
    for (j = 0; j < 4; j++)
    {
        mat[0][j] = mat[0][j] - (mat[2][j] * temp);
        mat[1][j] = mat[1][j] - (mat[2][j] * temp1);
    }

    a3 = mat[2][3] / mat[2][2];
    a2 = mat[1][3] / mat[1][1];
    a1 = mat[0][3] / mat[0][0];

    // fitted parabola: y = a3*X^2 + a2*X + a1; a maximum exists only if a3 < 0
    if (a3 < 0)
    {
        temp = -a2 / (2 * a3);
        *x_max = temp + x_start;
        *y_max = (a3*temp*temp + a2*temp + a1) + y_data[x_start];
        return 0;
    }
    else
        return -1;
}
The scan is performed in an outer function, which calls the above function repeatedly and then chooses the highest local y_max.
The above works and peaks are found. However, the noise is much worse than in the LabVIEW counterpart (i.e. I get a strongly oscillating peak position, given the same input dataset and the same parameters). Since the algorithm works, the code above should be conceptually correct, so I suspect a numerical problem, as I simply use floats without any further effort to improve numerical accuracy. Is this a plausible explanation? Does anyone have a tip on where I should look?
Thanks.
PS: I have done my research and found this very good overview and also this question, similar to mine (unfortunately without many answers). I will study these further.
EDIT: I have found that my problem was elsewhere. Improving the algorithm by discarding certain output values (a sort of post-validation, in which a result is only accepted if it lies within the moving window) resolved the issue. Now I am satisfied with the results, i.e. they are comparable to those from LabVIEW. Nevertheless, thanks a lot for your comments.
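For illustration, here is a minimal sketch of that outer scan including the post-validation from the EDIT (my reading of it, not the original code; find_peak() and its bounds handling are assumptions):

#include <stdint.h>

// Slide the window over the dataset, fit each position with polymax() above,
// and keep the best fit whose vertex actually lies inside its window.
int find_peak(uint16_t *y_data, int n, int window, double *px, double *py)
{
    double best_y = 0.0, x, y;
    int found = 0;

    for (int start = 0; start + window <= n; start++)
    {
        if (polymax(y_data, start, window, &x, &y) != 0)
            continue;                          // parabola opens upward: no maximum
        if (x < start || x > start + window - 1)
            continue;                          // post-validation: vertex outside the window
        if (!found || y > best_y)
        {
            best_y = y;
            *px = x;
            *py = y;
            found = 1;
        }
    }
    return found ? 0 : -1;                     // 0 on success, like polymax()
}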
Sorry to be late to the party, but if you have C/C++ it is really easy to port it to C# using VS2013 Express (the free version) and then pull that into LabVIEW using the .NET toolset.
Related
I am working on embedded programming with code written by other people.
The following algorithm is used to calculate an average for a microphone and an accelerometer:
sound_value_Avg = 0;
sound_value = 0;
memset((char *)soundRaw, 0x00, SOUND_COUNT*2);
for (int i2 = 0; i2 < SOUND_COUNT; i2++)
{
    soundRaw[i2] = analogRead(PIN_ANALOG_IN);
    if (i2 == 0)
    {
        sound_value_Avg = soundRaw[i2];
    }
    else
    {
        sound_value_Avg = (sound_value_Avg + soundRaw[i2]) / 2;
    }
}
sound_value = sound_value_Avg;
The accelerometer is handled in a similar way:

n1 = p1
(n2 + p1)/2 = p2
(n3 + p2)/2 = p3
(n4 + p3)/2 = p4
...
avg(n1~nx) = px

This does not seem to be correct.
Can someone explain why he used this algorithm?
Is it a specific technique for sine-like signals, such as noise or vibration?
It appears to be a flawed attempt at maintaining a cumulative mean. The error is in believing that:
A(n+1) = (A(n) + s) / 2
when in fact it should be:
A(n+1) = (A(n) * n + s) / (n + 1)
However it is computationally simpler to maintain a running sum and generate an average in the usual manner:
S = S + s
A(n) = S / n
It is possible that the intent was to avoid overflow when the sum grows large, but the attempt is mathematically flawed.
To see how wrong this approach is, consider (the flawed column is seeded with the first sample):

n   s    True running avg.   (A(n) + s) / 2
-------------------------------------------
1   20   20                  20
2   21   20.5                20.5
3   22   21                  21.25
4   23   21.5                22.125

Because the flawed formula weights the newest sample at 1/2 regardless of n, the result chases recent samples instead of converging on the mean.
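If a running mean really were needed at every step, it can be maintained correctly without a large running sum, using the standard incremental form A(n+1) = A(n) + (s - A(n)) / (n + 1). A minimal sketch, reusing analogRead() and PIN_ANALOG_IN from the question:

float avg = 0.0f;
for (int n = 0; n < SOUND_COUNT; n++)
{
    int s = analogRead(PIN_ANALOG_IN);
    // A(n+1) = A(n) + (s - A(n)) / (n + 1): algebraically the true mean,
    // but no sum is stored that could overflow
    avg += ((float)s - avg) / (float)(n + 1);
}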
In this case, however, nothing is done with the intermediate mean value, so you don't in fact need to maintain a running mean at all. You simply need to accumulate a running sum and calculate the average at the end. For example:
sum = 0 ;
sound_value = 0 ;
for( int i2 = 0; i2 < SOUND_COUNT; i2++ )
{
    soundRaw[i2] = analogRead( PIN_ANALOG_IN ) ;
    sum += soundRaw[i2] ;
}
sound_value = sum / SOUND_COUNT ;
In this you do need to make sure that the data type for sum can accommodate the maximum analogRead() return value multiplied by SOUND_COUNT.
However, you say that this is used for some sort of signal conditioning or processing of both a microphone and an accelerometer. These devices have rather dissimilar bandwidth and dynamics, and it seems rather unlikely that the same filter would suit both. Applying robust DSP techniques such as IIR or FIR filters with suitably calculated coefficients would make a great deal more sense. You would also need a suitable fixed sample rate, which I am willing to bet is not achieved by simply reading the ADC in a loop with no specific timing.
I am implementing an algorithm to compute a graph layout using a force-directed method. I would like to add OpenMP directives to accelerate some loops. After reading some course material, I added some OpenMP directives according to my understanding. The code compiles, but doesn't return the same result as the sequential version.
I wonder if you would be kind enough to look at my code and help me figure out what is going wrong with my OpenMP version.
Please download the archive here:
http://www.mediafire.com/download/3m42wdiq3v77xbh/drawgraph.zip
As you see, the portion of code which I want to parallelize is:
unsigned long graphLayout(Graph * graph, double * coords, unsigned long maxiter)
In particular, these two loops consume a lot of computational resources:
/* compute repulsive forces (electrical: f=-C.K^2/|xi-xj|.Uij) */
for(int j = 0 ; j < graph->nvtxs ; j++) {
    if(i == j) continue;
    double * _xj = _position+j*DIM;
    double dist = DISTANCE(_xi,_xj);
    // power used for repulsive force model (standard is 1/r, 1/r^2 works well)
    // double coef = -C*K*K/dist;        // power 1/r
    double coef = -C*K*K*K/(dist*dist);  // power 1/r^2
    for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}

/* compute attractive forces (spring: f=|xi-xj|^2/K.Uij) */
for(int k = graph->xadj[i] ; k < graph->xadj[i+1] ; k++) {
    int j = graph->adjncy[k]; /* edge (i,j) */
    double * _xj = _position+j*DIM;
    double dist = DISTANCE(_xi,_xj);
    double coef = dist*dist/K;
    for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
Thank you in advance for any help you can provide!
You have data races in your code, e.g. when doing maxmove = nmove; or energy += nforce2;. In multi-threaded code you cannot write to a variable shared by threads unless you use atomic access (#pragma omp atomic read/write/update) or synchronize access to that variable explicitly (critical sections, locks). Read some tutorial about OpenMP first; there are more problems with your code, e.g.
if(nmove > maxmove) maxmove = nmove;
this line will generally not work even with atomics (you would need a so-called compare-and-exchange atomic operation to solve it). A much better solution here is to let each thread calculate its own local maximum and then reduce those into a global maximum.
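For illustration, a minimal self-contained sketch of that pattern (compute_move() is just a stand-in for the per-vertex force computation, not the code from the archive; OpenMP 3.1 and later can also express the maximum directly with reduction(max:maxmove)):

#include <omp.h>
#include <stdio.h>

static double compute_move(int i) { return (double)(i % 17); }  /* placeholder work */

int main(void)
{
    const int n = 100000;
    double maxmove = 0.0, energy = 0.0;

    #pragma omp parallel
    {
        double local_max = 0.0;   /* thread-private: no race */

        /* energy is a plain sum, so a reduction clause handles it */
        #pragma omp for reduction(+:energy)
        for (int i = 0; i < n; i++) {
            double nmove = compute_move(i);
            energy += nmove * nmove;
            if (nmove > local_max)
                local_max = nmove;
        }

        /* combine the per-thread maxima once, inside a critical section */
        #pragma omp critical
        if (local_max > maxmove)
            maxmove = local_max;
    }

    printf("maxmove=%f energy=%f\n", maxmove, energy);
    return 0;
}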
I have a for loop which will run many times and takes a lot of time:
for (int z=0; z<temp; z++)
{
    float findex = a + b * A[z];
    int iindex = findex;
    outArray[z] += inArray[iindex] + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
    a++;
}
I have optimized this code with SSE, but there is no performance improvement! Maybe my SSE code is bad; can anyone help me?
Try using the restrict keyword on inArray and outArray. Otherwise the compiler has to assume that inArray could be == outArray, in which case no parallelization would be possible.
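A hedged sketch of what that could look like (C99; the function name and the parameter types are my assumptions, since the surrounding declarations are not shown in the question):

// restrict promises the compiler that these pointers never alias,
// which is what allows it to vectorize the loop
void interp(float * restrict outArray,
            const float * restrict inArray,
            const float * restrict A,
            int temp, float a, float b)
{
    for (int z = 0; z < temp; z++)
    {
        float findex = a + b * A[z];
        int iindex = (int)findex;
        outArray[z] += inArray[iindex]
                     + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
        a++;
    }
}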
Your loop has a loop-carried dependency when you write to outArray[z]. Your CPU can do more than one floating-point sum at once, but your current loop only allows one sum of outArray[z] at a time. To fix this you should unroll your loop.
for (int z=0; z<temp; z+=2) {
    float findex_v1 = a + b * A[z];
    int iindex_v1 = findex_v1;
    outArray[z] += inArray[iindex_v1] + (findex_v1 - iindex_v1) * (inArray[iindex_v1+1] - inArray[iindex_v1]);

    float findex_v2 = (a+1) + b * A[z+1];
    int iindex_v2 = findex_v2;
    outArray[z+1] += inArray[iindex_v2] + (findex_v2 - iindex_v2) * (inArray[iindex_v2+1] - inArray[iindex_v2]);

    a+=2;
}
In terms of SIMD the problem is that you have to gather non-contiguous data when you access inArray[iindex_v1]. AVX2 has some gather instructions, but I have not tried them. Otherwise it may be best to do the gather without SIMD. All the operations indexed by z access contiguous memory, so that part is easy. Pseudo-code (without unrolling) would look something like this:
int indexa[4];
float inArraya[4];
float dinArraya[4];
float4 a4 = a + float4(0,1,2,3);
for (int z=0; z<temp; z+=4) {
    // use SSE for contiguous memory
    float4 findex4 = a4 + b * float4.load(&A[z]);
    int4 iindex4 = truncate_to_int(findex4);
    // don't use SSE for non-contiguous memory
    iindex4.store(indexa);
    for(int i=0; i<4; i++) {
        inArraya[i]  = inArray[indexa[i]];
        dinArraya[i] = inArray[indexa[i]+1] - inArray[indexa[i]];
    }
    // loading from an array right after writing to it causes a CPU stall
    float4 inArraya4  = float4.load(inArraya);
    float4 dinArraya4 = float4.load(dinArraya);
    // back to SSE
    float4 outArray4 = float4.load(&outArray[z]);
    outArray4 += inArraya4 + (findex4 - to_float(iindex4)) * dinArraya4;
    outArray4.store(&outArray[z]);
    a4 += 4;
}
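For reference, here is a compilable SSE2 translation of that pseudo-code (my own sketch, untested against the original loop; the function signature is an assumption):

#include <emmintrin.h>  // SSE2 intrinsics

void interp_sse(float *outArray, const float *inArray, const float *A,
                int temp, float a, float b)
{
    __m128 a4 = _mm_setr_ps(a, a + 1.0f, a + 2.0f, a + 3.0f);
    const __m128 b4 = _mm_set1_ps(b);
    const __m128 four = _mm_set1_ps(4.0f);
    int indexa[4];
    float inArraya[4], dinArraya[4];

    for (int z = 0; z + 3 < temp; z += 4)   // remainder left to a scalar loop
    {
        // contiguous part in SSE
        __m128 findex4 = _mm_add_ps(a4, _mm_mul_ps(b4, _mm_loadu_ps(&A[z])));
        __m128i iindex4 = _mm_cvttps_epi32(findex4);   // truncate, like (int)findex
        _mm_storeu_si128((__m128i *)indexa, iindex4);

        // non-contiguous gather done in scalar code
        for (int i = 0; i < 4; i++)
        {
            inArraya[i]  = inArray[indexa[i]];
            dinArraya[i] = inArray[indexa[i] + 1] - inArray[indexa[i]];
        }

        // back to SSE for the interpolation and the store
        __m128 frac = _mm_sub_ps(findex4, _mm_cvtepi32_ps(iindex4));
        __m128 out4 = _mm_add_ps(_mm_loadu_ps(&outArray[z]),
                                 _mm_add_ps(_mm_loadu_ps(inArraya),
                                            _mm_mul_ps(frac, _mm_loadu_ps(dinArraya))));
        _mm_storeu_ps(&outArray[z], out4);
        a4 = _mm_add_ps(a4, four);
    }
}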
I have to optimize the following function so it runs faster (note: this is a lower-triangle transpose):
void trans(int **source, int **destination)
{
    for (int i = 0; i < sizee; i++)
    {
        for (int j = i + 1; j < sizee; j++)
        {
            destination[i][j] = source[j][i];
        }
    }
}
I understand that the accesses to source don't have spatial locality, because it is accessed by columns, but I don't understand how to fix this. Any help is appreciated. Thanks.
EDIT: I tried tiling; although the runtime improved, the optimized transpose produces the wrong result:
#define b 2
for (int ii = 0; ii < sizee; ii += b) {
    for (int jj = ii + 1; jj < sizee; jj += b) {
        for (int i = ii; i < std::min(ii+b-1, sizee); i++)
        {
            for (int j = jj; j < std::min(jj+b-1, sizee); j++)
            {
                destination[i][j] = source[j][i];
            }
        }
    }
}
One way of implementing a cache-friendly transpose is to tile the data:
- for each square tile
- load a square tile from source into a temporary buffer
- transpose the tile in place
- write out the transposed tile to its correct location in destination
Choose the tile size so that it fits comfortably within cache.
For further optimisation you can work on the in-place tile transpose routine - there are plenty of micro-optimisations you can do on e.g. an 8x8 or 16x16 in-place transpose.
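As a rough illustration of the tiling idea, here is a simplified sketch (a direct tiled copy on a flat row-major array rather than the question's pointer-to-pointer layout, and without the separate in-place tile-transpose step described above):

#define TILE 16  // chosen so a TILE x TILE block of both arrays fits in L1 cache

void transpose_tiled(int n, const int *src, int *dst)
{
    for (int ii = 0; ii < n; ii += TILE) {
        for (int jj = 0; jj < n; jj += TILE) {
            int imax = (ii + TILE < n) ? ii + TILE : n;
            int jmax = (jj + TILE < n) ? jj + TILE : n;
            // within a tile, reads walk src rows sequentially and writes
            // stay inside one small block of dst, so both stay cache-resident
            for (int i = ii; i < imax; i++)
                for (int j = jj; j < jmax; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}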
Note: this answer was provided for the original version of the question when it was not apparent that the requirement was for a partial transpose. I'm leaving the answer here though as it has some useful comments below.
You can start by inverting your loop. Put j on the outside and i on the inside. Here's why: the following locations are all right next to each other in memory:
source[j][0];
source[j][1];
source[j][2];
source[j][3];
But these locations are not:
source[0][i];
source[1][i];
source[2][i];
source[3][i];
The moment the CPU finishes reading source[j][0] into a register, you have an entire cache line of data in your L1 cache. Take advantage of that by having your reads progress linearly over the address space instead of being scattered.
You can also unroll your loops. The CPU likes it when you can execute lots of instructions with no branching.
for (int j = i + 1; j + 7 < sizee; j += 8)  // remainder iterations handled in a scalar loop
{
    destination[i][j]   = source[j][i];
    destination[i][j+1] = source[j+1][i];
    destination[i][j+2] = source[j+2][i];
    destination[i][j+3] = source[j+3][i];
    destination[i][j+4] = source[j+4][i];
    destination[i][j+5] = source[j+5][i];
    destination[i][j+6] = source[j+6][i];
    destination[i][j+7] = source[j+7][i];
}
If your CPU has prefetching instructions then you can ask it to start loading the next row of data before you have finished with the current block of memory.
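For example, with GCC or Clang you can use __builtin_prefetch to start pulling in a later row while the current one is being copied (a sketch; the prefetch distance of 8 rows is a guess that would need tuning):

for (int j = i + 1; j < sizee; j++)
{
    if (j + 8 < sizee)
        __builtin_prefetch(&source[j + 8][i], 0, 1);  // 0 = for reading, 1 = low temporal locality
    destination[i][j] = source[j][i];
}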
So I am opening a .raw file of a DTMF tone I generated in Audacity. I grabbed a canned Goertzel algorithm similar to the one in the Wikipedia article. It doesn't seem to decode the correct numbers, though.
The decoded number also changes depending on what value of N I pass to the algorithm. As far as I understand, a higher value of N gives better accuracy but should not change which number gets decoded, correct?
Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

double goertzel(short samples[], double freq, int N)
{
    double s_prev = 0.0;
    double s_prev2 = 0.0;
    double coeff, normalizedfreq, power, s;
    int i;

    normalizedfreq = freq / 8000;
    coeff = 2*cos(2*M_PI*normalizedfreq);
    for (i = 0; i < N; i++)
    {
        s = samples[i] + coeff*s_prev - s_prev2;
        s_prev2 = s_prev;
        s_prev = s;
    }
    power = s_prev2*s_prev2 + s_prev*s_prev - coeff*s_prev*s_prev2;
    return power;
}

int main()
{
    FILE *fp = fopen("9.raw", "rb");
    short *buffer;
    float *sample;
    int sample_size;
    int file_size;
    int i = 0, x = 0;
    float frequency_row[] = {697, 770, 852, 941};
    float frequency_col[] = {1209, 1336, 1477};
    float magnitude_row[4];
    float magnitude_col[4];
    double result;

    fseek(fp, 0, SEEK_END);
    file_size = ftell(fp);
    fseek(fp, 0, SEEK_SET);

    buffer = malloc(file_size);
    buffer[x] = getc(fp);
    buffer[x] = buffer[x] << 8;
    buffer[x] = buffer[x] | getc(fp);
    while (!feof(fp))
    {
        x++;
        buffer[x] = getc(fp);
        buffer[x] = buffer[x] << 8;
        buffer[x] = buffer[x] | getc(fp);
    }

    for (i = 0; i < x; i++)
    {
        //printf("%#x\n", (unsigned short)buffer[i]);
    }

    for (i = 0; i < 4; i++)
    {
        magnitude_row[i] = goertzel(buffer, frequency_row[i], 8000);
    }
    for (i = 0; i < 3; i++)
    {
        magnitude_col[i] = goertzel(buffer, frequency_col[i], 8000);
    }

    x = 0;
    for (i = 0; i < 4; i++)
    {
        if (magnitude_row[i] > magnitude_row[x])
            x = i;
    }
    printf("Freq: %f\t Mag: %f\n", frequency_row[x], magnitude_row[x]);

    x = 0;
    for (i = 0; i < 3; i++)
    {
        if (magnitude_col[i] > magnitude_col[x])
            x = i;
    }
    printf("Freq: %f\t Mag: %f\n", frequency_col[x], magnitude_col[x]);

    return 0;
}
The algorithm is actually tricky to use, even for something as simple as detecting DTMF tones. It is effectively a band-pass filter: it singles out a band of frequencies centered on the frequency given. This is a good thing, since you can't count on your sampled tone to be exactly the frequency you are trying to detect.
The tricky part is attempting to set the bandwidth of the filter - how wide the range of frequencies is that will be filtered to detect a particular tone.
One of the references on the Wikipedia page on the subject (this one, to be precise) talks about implementing DTMF tone detection using the Goertzel algorithm on a DSP. The principles are the same in C: to get the bandwidth right you have to use the right combination of the provided constants. Apparently there is no simple formula; the paper mentions having to use a brute-force search, and it provides a list of optimal constants for the DTMF frequencies sampled at 8 kHz.
Are you sure the audio data Audacity generated is in big-endian format? You are interpreting it as big-endian, whereas it would normally be little-endian if you generated it on an x86 machine.
There are some interesting answers here.
First, the Goertzel filter is in fact a "sympathetic" oscillator.
In DSP terms, that means its poles are on the unit circle.
The internal variables s, s_prev and s_prev2 will grow without bound if you run the code on a long block of data containing the expected tone (freq) for that detector.
This means that you need to run a kind of integrate-and-dump process to get results.
The Goertzel filter works best at discriminating between DTMF digits if you run about 105 to 110 samples into it at a time. So set N = 110 and call goertzel() repeatedly as you run through your data.
Incidentally, real DTMF digits may last as little as 60 ms, and you should report a digit's presence if you find more than 40 ms of it.
Think about the 110 samples I mentioned: one call covers 110/8000 = 13.75 ms. If you are very fortunate, you will see positive output from 4 consecutive calls to the detector.
In the past I have found that running a pair of detectors in parallel, with staggered start times, provides better coverage of very short tone bursts.
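To make the block-wise ("integrate and dump") usage concrete, here is a minimal sketch built on the goertzel() function from the question (detect_tone() and the threshold parameter are my assumptions; the threshold has to be calibrated against real signal levels):

#define BLOCK 110   // ~13.75 ms per block at 8 kHz

int detect_tone(short *samples, int total, double freq, double threshold)
{
    int hits = 0;
    for (int off = 0; off + BLOCK <= total; off += BLOCK)
    {
        if (goertzel(&samples[off], freq, BLOCK) > threshold)
            hits++;        // another ~13.75 ms of the tone seen
        else
            hits = 0;      // tone interrupted: start counting again
        if (hits >= 3)     // 3 consecutive blocks ~= 41 ms, past the 40 ms minimum
            return 1;
    }
    return 0;
}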