How to integrate my calculation C code with OpenMP

I am using a dual-channel DAQ card in data-stream mode. I wrote some analysis/calculation code and added it to the main acquisition code, but a FIFO overflow warning always appears once the total acquired data reaches around 6000 MSamples (the DAQ on-board memory is 8 GB). I am well aware that a complicated calculation can slow the system down and cause the overflow, but all of the work I wrote is necessary to my experiment and cannot be removed (unless there is more efficient code that gives the same result). I have heard that OpenMP might be a way to boost the speed, but I am only a beginner in C; how could I apply it to my calculation code?
My computer has 64 GB of RAM and an Intel Core i7 processor. I always turn off unnecessary software when running the data-stream code. The code has been optimized as much as I can, e.g. by simplifying hilbert() and using memcpy to pick out a specific range of data points.
This is how I process the data:
1. Install the FFTW source code for the Hilbert transform.
2. Use a for loop to de-interleave the pi16Buffer data into ch2Buffer.
3. Use memcpy to copy the range of data points I am interested in into another array called ch2newBuffer.
4. Apply hilbert() to ch2newBuffer and take its absolute value.
5. Find the max value of ch1 and of abs(hilbert(ch2newBuffer)).
6. Calculate max(abs(hilbert(ch2))) / max(ch1).
Here is the part of my DAQ code that is in charge of the calculation:
void hilbert(const int16* in, fftw_complex* out, fftw_plan plan_forward, fftw_plan plan_backward)
{
    // copy the data to the complex array
    for (int i = 0; i < N; ++i) {
        out[i][REAL] = in[i];
        out[i][IMAG] = 0;
    }
    // execute the DFT plan (created once by the caller and reused)
    //fftw_plan plan = fftw_plan_dft_1d(N, out, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plan_forward);
    // plan destruction moved to the caller to prevent a memory leak per call
    //fftw_destroy_plan(plan_forward);
    int hN = N >> 1;  // half of the length (N/2)
    int numRem = hN;  // the number of remaining elements
    // multiply the appropriate values by 2
    // (those that should be multiplied by 1 are left intact because they wouldn't change)
    for (int i = 1; i < hN; ++i) {
        out[i][REAL] *= 2;
        out[i][IMAG] *= 2;
    }
    // if the length is even, the number of remaining elements decreases by 1
    if (N % 2 == 0)
        numRem--;
    else if (N > 1) {
        out[hN][REAL] *= 2;
        out[hN][IMAG] *= 2;
    }
    // set the remaining values to 0
    // (multiplying by 0 gives 0, so we don't care about the multiplicands)
    memset(&out[hN + 1][REAL], 0, numRem * sizeof(fftw_complex));
    // execute the IDFT plan (also created by the caller)
    //plan = fftw_plan_dft_1d(N, out, out, FFTW_BACKWARD, FFTW_ESTIMATE);
    fftw_execute(plan_backward);
    // cleanup is done by the caller
    //fftw_destroy_plan(plan_backward);
    //fftw_cleanup();
    // scale the IDFT output
    //for (int i = 0; i < N; ++i) {
    //    out[i][REAL] /= N;
    //    out[i][IMAG] /= N;
    //}
}
float SumBufferData(void* pBuffer, uInt32 u32Size, uInt32 u32SampleBits)
{
    // In this routine we sum up all the samples in the buffer. This function
    // should be replaced with the user's analysis function.
    // (pu8Buffer, pi16Buffer, i, i64Sum, max1, max2, ch2Buffer, ch2newBuffer
    // and Corrected are declared elsewhere in the full program.)
    if ( 8 == u32SampleBits )
    {
        pu8Buffer = (uInt8 *)pBuffer;
        for (i = 0; i < u32Size; i++)
        {
            i64Sum += pu8Buffer[i];
        }
    }
    else
    {
        pi16Buffer = (int16 *)pBuffer;
        fftw_complex hilbertedch2[N];
        fftw_plan plan_forward = fftw_plan_dft_1d(N, hilbertedch2, hilbertedch2, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_plan plan_backward = fftw_plan_dft_1d(N, hilbertedch2, hilbertedch2, FFTW_BACKWARD, FFTW_ESTIMATE);
        ch2Buffer = (int16*)calloc(u32Size / 2, sizeof *ch2Buffer);
        ch2newBuffer = (int16*)calloc(u32Size / 2, sizeof *ch2newBuffer);
        // De-interleave the data from pi16Buffer
        for (i = 0; i < u32Size / 2; i++)
        {
            ch2Buffer[i] = pi16Buffer[i*2 + 1];
        }
        // Pick out the range of data points that we are interested in
        memcpy(ch2newBuffer, &ch2Buffer[6944], 1024 * sizeof(ch2Buffer[0]));
        // Do the Hilbert transform on these data points
        hilbert(ch2newBuffer, hilbertedch2, plan_forward, plan_backward);
        fftw_destroy_plan(plan_forward);
        fftw_destroy_plan(plan_backward);
        // Find the max value in each segment of ch1 and ch2
        for (i = 128; i < 200; i++)
        {
            if (pi16Buffer[i*2] > max1)
                max1 = pi16Buffer[i*2];
        }
        for (i = 0; i < 1024; i++)
        {
            if (fabs(hilbertedch2[i][IMAG]) > max2)
                max2 = fabs(hilbertedch2[i][IMAG]);
        }
        Corrected = max2 / max1 / N; // Calculate the signal correction
    }
    free(ch2Buffer);
    free(ch2newBuffer);
    return Corrected;
}

Loops are typically a good starting point for parallelism, for instance:
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    out[i][REAL] = in[i];
    out[i][IMAG] = 0;
}
or
#pragma omp parallel for reduction(max:max2)
for (i = 0; i < 1024; i++)
{
    float tmp = fabs(hilbertedch2[i][IMAG]);
    max2 = (max2 > tmp) ? max2 : tmp;
}
That being said, you need to profile your code to find out where the execution takes the most time, and try to parallelize those spots if possible. However, looking at what you have posted, I do not see a lot of parallelism opportunity there.
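Since most of the time is likely spent inside the FFTs themselves rather than in your own loops, it may also be worth trying FFTW's built-in multithreading before hand-rolling OpenMP pragmas. A minimal sketch, assuming you link against libfftw3_threads (the wrapper name and the thread count are mine, purely illustrative):

#include <fftw3.h>

// Call once at startup, before creating any plans; plans created
// afterwards will automatically use the requested number of threads.
int setup_threaded_fftw(int nthreads)
{
    if (!fftw_init_threads())           // returns 0 on failure
        return -1;
    fftw_plan_with_nthreads(nthreads);  // e.g. 4 on a quad-core i7
    return 0;
}

Since you reuse the same plans for every buffer, planning once with FFTW_MEASURE instead of FFTW_ESTIMATE may also pay for itself over a long acquisition.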

Related

Why is FFT of (A+B) different from FFT(A) + FFT(B)?

I have been fighting with a very weird bug for almost a month. Asking you guys is my last hope. I wrote a program in C that integrates the 2d Cahn–Hilliard equation using the Implicit Euler (IE) scheme in Fourier (or reciprocal) space:

h_q(t_(n+1)) = (h_q(t_n) + dt*N[h_q(t_n)]) / (1 - dt*L_q)

Where the "hats" mean that we are in Fourier space: h_q(t_(n+1)) and h_q(t_n) are the FTs of h(x,y) at times t_(n+1) and t_n, N[h_q] is the nonlinear operator applied to h_q in Fourier space, and L_q is the linear one, again in Fourier space. I don't want to go too much into the details of the numerical method I am using, since I am sure that the problem is not coming from there (I tried using other schemes).
My code is actually quite simple. Here is the beginning, where basically I declare variables, allocate memory and create the plans for the FFTW routines.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <fftw3.h>
#define pi M_PI

int main(){
    // define lattice size and spacing
    int Nx = 150;       // n of points on x
    int Ny = 150;       // n of points on y
    double dx = 0.5;    // bin size on x and y
    // define simulation time and time step
    long int Nt = 1000; // n of time steps
    double dt = 0.5;    // time step size
    // number of frames to plot (at denominator)
    long int nframes = Nt/100;
    // define the noise
    double rn, drift = 0.05; // punctual drift of h(x)
    srand(666); // seed the RNG
    // other variables
    int i, j, nt; // variables for space and time loops
    // declare FFTW3 routines
    fftw_plan FT_h_hft;       // routine to perform fourier transform
    fftw_plan FT_Nonl_Nonlft;
    fftw_plan IFT_hft_h;      // routine to perform inverse fourier transform
    // declare and allocate memory for real variables
    double *Linft = fftw_alloc_real(Nx*Ny);
    double *Q2 = fftw_alloc_real(Nx*Ny);
    double *qx = fftw_alloc_real(Nx);
    double *qy = fftw_alloc_real(Ny);
    // declare and allocate memory for complex variables
    fftw_complex *dh = fftw_alloc_complex(Nx*Ny);
    fftw_complex *dhft = fftw_alloc_complex(Nx*Ny);
    fftw_complex *Nonl = fftw_alloc_complex(Nx*Ny);
    fftw_complex *Nonlft = fftw_alloc_complex(Nx*Ny);
    // create the FFTW plans
    FT_h_hft = fftw_plan_dft_2d ( Nx, Ny, dh, dhft, FFTW_FORWARD, FFTW_ESTIMATE );
    FT_Nonl_Nonlft = fftw_plan_dft_2d ( Nx, Ny, Nonl, Nonlft, FFTW_FORWARD, FFTW_ESTIMATE );
    IFT_hft_h = fftw_plan_dft_2d ( Nx, Ny, dhft, dh, FFTW_BACKWARD, FFTW_ESTIMATE );
    // open file to store the data
    char acstr[160];
    FILE *fp;
    sprintf(acstr, "CH2d_IE_dt%.2f_dx%.3f_Nt%ld_Nx%d_Ny%d_#f%.ld.dat",dt,dx,Nt,Nx,Ny,Nt/nframes);
After this preamble, I initialise my function h(x,y) with uniform random noise, and I also take its FT. I set the imaginary part of h(x,y), which is dh[i*Ny+j][1] in the code, to 0, since it is a real function. Then I calculate the wavevectors qx and qy, and with them I compute the linear operator of my equation in Fourier space, which is Linft in the code. I consider only minus the fourth derivative of h as the linear term, so that the FT of the linear term is simply -q^4... but again, I don't want to go into the details of my integration method. The question is not about it.
// generate h(x,y) at initial time
for ( i = 0; i < Nx; i++ ) {
    for ( j = 0; j < Ny; j++ ) {
        rn = (double) rand()/RAND_MAX;      // extract a random number between 0 and 1
        dh[i*Ny+j][0] = drift-2.0*drift*rn; // shift of +-drift
        dh[i*Ny+j][1] = 0.0;
    }
}
// execute plan for the first time
fftw_execute (FT_h_hft);
// calculate wavenumbers
for (i = 0; i < Nx; i++) { qx[i] = 2.0*i*pi/(Nx*dx); }
for (i = 0; i < Ny; i++) { qy[i] = 2.0*i*pi/(Ny*dx); }
for (i = 1; i < Nx/2; i++) { qx[Nx-i] = -qx[i]; }
for (i = 1; i < Ny/2; i++) { qy[Ny-i] = -qy[i]; }
// calculate the FT of the linear operator
for ( i = 0; i < Nx; i++ ) {
    for ( j = 0; j < Ny; j++ ) {
        Q2[i*Ny+j] = qx[i]*qx[i] + qy[j]*qy[j];
        Linft[i*Ny+j] = -Q2[i*Ny+j]*Q2[i*Ny+j];
    }
}
Then, finally, comes the time loop. Essentially, what I do is the following:
1. Every once in a while, I save the data to a file and print some information on the terminal. In particular, I print the highest value of the FT of the nonlinear term. I also check if h(x,y) is diverging to infinity (it shouldn't happen!).
2. Calculate h^3 in direct space (that is simply dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0]). Again, the imaginary part is set to 0.
3. Take the FT of h^3.
4. Obtain the complete nonlinear term in reciprocal space (that is N[h_q] in the IE algorithm written above) by computing -q^2*(FT[h^3] - FT[h]). In the code, I am referring to the lines Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]) and the one below, for the imaginary part. I do this because the nonlinear term is the Laplacian of (h^3 - h), and a Laplacian in Fourier space is just a multiplication by -q^2, so N[h_q] = -q^2*(FT[h^3] - FT[h]).
5. Advance in time using the IE method, transform back in direct space, and then normalise.
Here is the code:
for(nt = 0; nt < Nt; nt++) {
    if((nt % nframes)== 0) {
        printf("%.0f %%\n",((double)nt/(double)Nt)*100);
        printf("Nonlft %.15f \n",Nonlft[(Nx/2)*(Ny/2)][0]);
        // write data to file
        fp = fopen(acstr,"a");
        for ( i = 0; i < Nx; i++ ) {
            for ( j = 0; j < Ny; j++ ) {
                fprintf(fp, "%4d %4d %.6f\n", i, j, dh[i*Ny+j][0]);
            }
        }
        fclose(fp);
    }
    // check if h is going to infinity
    if (isnan(dh[1][0])!=0) {
        printf("crashed!\n");
        return 0;
    }
    // calculate nonlinear term h^3 in direct space
    for ( i = 0; i < Nx; i++ ) {
        for ( j = 0; j < Ny; j++ ) {
            Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0];
            Nonl[i*Ny+j][1] = 0.0;
        }
    }
    // Fourier transform of nonlinear term
    fftw_execute (FT_Nonl_Nonlft);
    // second derivative in Fourier space is just multiplication by -q^2
    for ( i = 0; i < Nx; i++ ) {
        for ( j = 0; j < Ny; j++ ) {
            Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]);
            Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][1] -dhft[i*Ny+j][1]);
        }
    }
    // Implicit Euler scheme in Fourier space
    for ( i = 0; i < Nx; i++ ) {
        for ( j = 0; j < Ny; j++ ) {
            dhft[i*Ny+j][0] = (dhft[i*Ny+j][0] + dt*Nonlft[i*Ny+j][0])/(1.0 - dt*Linft[i*Ny+j]);
            dhft[i*Ny+j][1] = (dhft[i*Ny+j][1] + dt*Nonlft[i*Ny+j][1])/(1.0 - dt*Linft[i*Ny+j]);
        }
    }
    // transform h back in direct space
    fftw_execute (IFT_hft_h);
    // normalize
    for ( i = 0; i < Nx; i++ ) {
        for ( j = 0; j < Ny; j++ ) {
            dh[i*Ny+j][0] = dh[i*Ny+j][0] / (double) (Nx*Ny);
            dh[i*Ny+j][1] = dh[i*Ny+j][1] / (double) (Nx*Ny);
        }
    }
}
Last part of the code: empty the memory and destroy FFTW plans.
// terminate the FFTW3 plan and free memory
fftw_destroy_plan (FT_h_hft);
fftw_destroy_plan (FT_Nonl_Nonlft);
fftw_destroy_plan (IFT_hft_h);
fftw_cleanup();
fftw_free(dh);
fftw_free(Nonl);
fftw_free(qx);
fftw_free(qy);
fftw_free(Q2);
fftw_free(Linft);
fftw_free(dhft);
fftw_free(Nonlft);
return 0;
}
If I run this code, I obtain the following output:
0 %
Nonlft 0.0000000000000000000
1 %
Nonlft -0.0000000000001353512
2 %
Nonlft -0.0000000000000115539
3 %
Nonlft 0.0000000001376379599
...
69 %
Nonlft -12.1987455309071730625
70 %
Nonlft -70.1631962517720353389
71 %
Nonlft -252.4941743351609204637
72 %
Nonlft 347.5067875825179726235
73 %
Nonlft 109.3351142318568633982
74 %
Nonlft 39933.1054502610786585137
crashed!
The code crashes before reaching the end and we can see that the Nonlinear term is diverging.
Now, the thing that doesn't make sense to me is that if I change the lines in which I calculate the FT of the Nonlinear term in the following way:
// calculate nonlinear term h^3 -h in direct space
for ( i = 0; i < Nx; i++ ) {
    for ( j = 0; j < Ny; j++ ) {
        Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0] -dh[i*Ny+j][0];
        Nonl[i*Ny+j][1] = 0.0;
    }
}
// Fourier transform of nonlinear term
fftw_execute (FT_Nonl_Nonlft);
// second derivative in Fourier space is just multiplication by -q^2
for ( i = 0; i < Nx; i++ ) {
    for ( j = 0; j < Ny; j++ ) {
        Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][0];
        Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][1];
    }
}
Which means that I am using this definition:

N[h_q] = -q^2 * FT[h^3 - h]

instead of this one:

N[h_q] = -q^2 * (FT[h^3] - FT[h])
Then the code is perfectly stable and no divergence happens! Even for billions of time steps! Why does this happen, since the two ways of calculating Nonlft should be equivalent?
Thank you very much to anyone who will take the time to read all of this and give me some help!
EDIT: To make things even more weird, I should point out that this bug does NOT happen for the same system in 1D. In 1D both methods of calculating Nonlft are stable.
EDIT: I add a short animation of what happens to the function h(x,y) just before crashing. Also: I quickly re-wrote the code in MATLAB, which uses Fast Fourier Transform functions based on the FFTW library, and the bug is NOT happening... the mystery deepens.
I solved it!!
The problem was the calculation of the Nonl term:
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0];
Nonl[i*Ny+j][1] = 0.0;
That needs to be changed to:
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0] -3.0*dh[i*Ny+j][0]*dh[i*Ny+j][1]*dh[i*Ny+j][1];
Nonl[i*Ny+j][1] = -dh[i*Ny+j][1]*dh[i*Ny+j][1]*dh[i*Ny+j][1] +3.0*dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][1];
In other words: I need to consider dh as a complex function (even though it should be real).
Basically, because of stupid rounding errors, the IFT of the FT of a real function (in my case dh), is NOT purely real, but will have a very small imaginary part. By setting Nonl[i*Ny+j][1] = 0.0 I was completely ignoring this imaginary part.
The issue, then, was that I was recursively summing FT(dh), which is dhft, and an object obtained using IFT(FT(dh)), which is Nonlft, while ignoring the residual imaginary parts!
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]);
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][1] -dhft[i*Ny+j][1]);
Obviously, calculating Nonlft as dh^3 -dh and then doing
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][0];
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][1];
avoided the problem of doing this "mixed" sum.
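To see how big this residual imaginary part actually is, here is a small self-contained sketch (my own illustration, not part of the program above) that round-trips a purely real array through FFTW and prints the largest leftover imaginary component:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <fftw3.h>

int main(void)
{
    const int N = 150 * 150;
    fftw_complex *buf = fftw_alloc_complex(N);
    fftw_plan fwd = fftw_plan_dft_1d(N, buf, buf, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_1d(N, buf, buf, FFTW_BACKWARD, FFTW_ESTIMATE);
    // purely real input
    for (int i = 0; i < N; i++) {
        buf[i][0] = (double) rand() / RAND_MAX;
        buf[i][1] = 0.0;
    }
    fftw_execute(fwd);
    fftw_execute(bwd);
    // after FT + IFT + normalisation the imaginary part should be exactly 0,
    // but rounding leaves a tiny nonzero residue (typically around 1e-16)
    double maxim = 0.0;
    for (int i = 0; i < N; i++) {
        double im = fabs(buf[i][1] / N);
        if (im > maxim) maxim = im;
    }
    printf("largest residual imaginary part: %g\n", maxim);
    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
    fftw_free(buf);
    return 0;
}

Harmless on its own, but summed recursively over thousands of time steps, as in the scheme above, the residue can grow until it destabilises the integration.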
Phew... such a relief! I wish I could assign the bounty to myself! :P
EDIT: I'd like to add that, before using the fftw_plan_dft_2d functions, I was using fftw_plan_dft_r2c_2d and fftw_plan_dft_c2r_2d (real-to-complex and complex-to-real), and I was seeing the same bug. However, I suppose that I couldn't have solved it if I hadn't switched to fftw_plan_dft_2d, since the c2r function automatically "chops off" the residual imaginary part coming from the IFT. If this is the case and I'm not missing something, I think that this should be written somewhere on the FFTW website, to prevent users from running into problems like this. Something like "r2c and c2r transforms are not good for implementing pseudospectral methods".
EDIT: I found another SO question that addresses exactly the same problem.

How to implement summation using parallel reduction in OpenCL?

I'm trying to implement a kernel which does parallel reduction. The code below works on occasion; I have not been able to pin down why it goes wrong on the occasions it does.
__kernel void summation(__global float* input, __global float* partialSum, __local float* localSum)
{
    int local_id = get_local_id(0);
    int workgroup_size = get_local_size(0);
    localSum[local_id] = input[get_global_id(0)];
    for(int step = workgroup_size/2; step > 0; step /= 2){
        barrier(CLK_LOCAL_MEM_FENCE);
        if(local_id < step){
            localSum[local_id] += localSum[local_id + step];
        }
    }
    if(local_id == 0){
        partialSum[get_group_id(0)] = localSum[0];
    }
}
Essentially I'm summing the values per work group and storing each work group's total into partialSum, the final summation is done on the host. Below is the code which sets up the values for the summation.
size_t global[1];
size_t local[1];
const int DATA_SIZE = 15000;
float *input = NULL;
float *partialSum = NULL;
int count = DATA_SIZE;
local[0] = 2;
global[0] = count;
input = (float *)malloc(count * sizeof(float));
partialSum = (float *)malloc(global[0]/local[0] * sizeof(float));
int i;
for (i = 0; i < count; i++){
    input[i] = (float)i + 1;
}
I'm thinking it has something to do with the size of the input not being a power of two? I noticed it begins to go off for numbers around 8000 and beyond. Any assistance is welcome. Thanks.
I'm thinking it has something to do with the size of the input not being a power of two?
Yes. Consider what happens when you try to reduce, say, 9 elements. Suppose you launch 1 work-group of 9 work-items:
for (int step = workgroup_size / 2; step > 0; step /= 2){
    // At iteration 0: step = 9 / 2 = 4
    barrier(CLK_LOCAL_MEM_FENCE);
    if (local_id < step) {
        // Branch taken by threads 0 to 3
        // Only 8 numbers added up together!
        localSum[local_id] += localSum[local_id + step];
    }
}
You're never summing the 9th element, hence the reduction is incorrect. An easy solution is to pad the input data with enough zeroes to make the work-group size the immediate next power-of-two.
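A hedged host-side sketch of that padding in plain C (the helper names are mine): round the element count up to the next power of two and zero-fill the tail, which is safe for a sum since zero is the identity element:

#include <stdlib.h>
#include <string.h>

/* smallest power of two >= n */
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* returns a zero-padded copy of input; *padded_count receives the new size */
float *pad_for_reduction(const float *input, size_t count, size_t *padded_count)
{
    size_t n = next_pow2(count);
    float *buf = (float *)calloc(n, sizeof(float)); /* calloc zero-fills the tail */
    if (buf != NULL)
        memcpy(buf, input, count * sizeof(float));
    *padded_count = n;
    return buf;
}

For a different reduction (e.g. min or max) the padding value would have to be that operation's identity element instead of zero.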

OpenCV Fourier Magnitude - doesn't seem correct

I believe I am having a scaling issue in trying to convert the Fourier magnitude spectrum to an image.
I am working on my own visual odometry project to determine the translation and rotation between consecutive frames from a camera input. I have been successful in determining translation using phase correlation of the Fourier transform; however, part of determining the rotation requires the magnitude spectrum to be convolved. Essentially the magnitude I have produced does not seem correct, as below.
Original Image:
Magnitude, with the 'mag = 255*(mag/max)' scaling
Magnitude, without the scaling
Unfortunately I require help with the function I am using to determine the magnitude; I believe my error is in the scaling of the magnitude, but I am unsure exactly where. This issue has had me stuck for some time and your input would be appreciated, thank you.
void iplimage_dft(IplImage* img)
{
    IplImage* img1, * img2;
    fftw_complex* in, * dft, * idft;
    fftw_plan plan_f, plan_b;
    int i, j, k, w, h, N;
    /* Copy input image */
    img1 = cvCloneImage(img);
    w = img1->width;
    h = img1->height;
    N = w * h;
    /* Allocate input data for FFTW */
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    dft = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    /* Create plans */
    plan_f = fftw_plan_dft_2d(w, h, in, dft, FFTW_FORWARD, FFTW_ESTIMATE);
    /* Populate input data in row-major order */
    for (i = 0, k = 0; i < h; i++)
    {
        for (j = 0; j < w; j++, k++)
        {
            in[k][0] = ((uchar*)(img1->imageData + i * img1->widthStep))[j];
            in[k][1] = 0.0;
        }
    }
    /* Forward DFT */
    fftw_execute(plan_f);
    /* Create output image */
    img2 = cvCreateImage(cvSize(w, h), 8, 1);
    /* Find the maximum value among the magnitudes
       (note: k starts at 1, so the DC term is skipped here,
       and the final iteration reads one element past the end of dft) */
    double max = 0;
    double mag = 0;
    for (i = 0, k = 1; i < h; i++){
        for (j = 0; j < w; j++, k++){
            mag = sqrt(pow(dft[k][0],2) + pow(dft[k][1],2));
            if (max < mag)
                max = mag;
        }
    }
    /* Convert DFT result to output image */
    for (i = 0, k = 0; i < h; i++)
    {
        for (j = 0; j < w; j++, k++)
        {
            double mag = sqrt(pow(dft[k][0],2) + pow(dft[k][1],2));
            mag = 255*(mag/max);
            ((uchar*)(img2->imageData + i * img2->widthStep))[j] = mag;
        }
    }
    cvShowImage("iplimage_dft(): original", img1);
    cvShowImage("iplimage_dft(): result", img2);
    //cvSaveImage("iplimage_dft.png", img2, 0);
    cvWaitKey(0);
    /* Free memory */
    fftw_destroy_plan(plan_f);
    fftw_free(in);
    fftw_free(dft);
    cvReleaseImage(&img1);
    cvReleaseImage(&img2);
}
int main( int argc, char** argv )
{
    argv[1] = "image1.jpg";
    IplImage *img3 = cvLoadImage( argv[1], CV_LOAD_IMAGE_GRAYSCALE );
    iplimage_dft(img3);
    return 0;
}
The spectra of many images have characteristics like this: several relatively high peaks with the rest of the field quite small in magnitude. It looks like you're normalizing correctly; it's just that the details are lost because the magnitude of much of the spectrum is very small. I've often found it more useful to use log(mag(spectrum)) (or even log(log(mag(spectrum))) in some cases) to generate an image when you want to inspect details.
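As a sketch of that suggestion (my own adaptation of the scaling loops in the question, untested): take log(1 + mag) before normalising, which compresses the dynamic range so the low-level structure becomes visible; the added 1 avoids log(0) at empty bins:

/* first pass: find the maximum of log(1 + mag) over all N = w*h bins */
double maxlog = 0.0;
for (k = 0; k < w * h; k++) {
    double m = log(1.0 + sqrt(dft[k][0]*dft[k][0] + dft[k][1]*dft[k][1]));
    if (m > maxlog)
        maxlog = m;
}
/* second pass: scale to 0..255 and write the output image */
for (i = 0, k = 0; i < h; i++) {
    for (j = 0; j < w; j++, k++) {
        double m = log(1.0 + sqrt(dft[k][0]*dft[k][0] + dft[k][1]*dft[k][1]));
        ((uchar*)(img2->imageData + i * img2->widthStep))[j] = (uchar)(255.0 * m / maxlog);
    }
}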

Using shared memory in CUDA without reducing threads

Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without a reduction operation.
For example, the CPU code:
for(int i = 0; i < ntr; i++)
{
    for(int j = 0; j < pos * posdir; j++)
    {
        val = x[i] * arr[j];
        if(val > 0.0)
        {
            out[xcount] = val * x[i];
            xcount += 1;
        }
    }
}
Equivalent GPU code:
const int threads = 64;
num_blocks = ntr/threads;

__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[threads];
    __shared__ float t2[threads];
    int gcount = 0;
    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i%posdir];
        }
        __syncthreads();
        for(int i = 0; i < 32; i++) // note: this inner i shadows the outer loop's i
        {
            t2[i] = t1[i] * in1[tid];
            if(t2[i] > 0){
                out1[gcount] = t2[i] * in1[tid];
                gcount = gcount + 1;
            }
        }
    }
    ct[0] = gcount;
}
What I am trying to do here is the following steps:
(1) Store 32 values of in2 in the shared memory variable t1,
(2) For each value of i and in1[tid], calculate t2[i],
(3) If t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount].
But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.
Any suggestions on how to save the value of gcount for each i and tid? As I debug, I find that for block (0,0,0) and thread (0,0,0) I can sequentially see the values of t2 updated. After the CUDA kernel switches focus to block (0,0,0) and thread (32,0,0), the values of out1[0] are re-written again. How can I get/store the values of out1 for each thread and write them to the output?
I tried two approaches so far (suggested by #paseolatis on the NVIDIA forums):
(1) defined offset=tid*32; and replaced out1[gcount] with out1[offset+gcount],
(2) defined
__device__ int totgcount=0; // this line before main()
atomicAdd(&totgcount,1);
out1[totgcount] = t2[i] * in1[tid];

int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int)*1, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // Output looks like this: GPU: xcount = 1928669800
Any suggestions? Thanks in advance!
OK, let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).
Store 32 values of in2 in shared memory variable t1
Your kernel contains this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i%posdir];
}
which is effectively loading the same value from in2 into every value of t1. I suspect you want something more like this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i+threadIdx.x];
}
For each value of i and in1[tid], calculate t2[i],
This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:
float inval = in1[tid];
.......
for(int i = 0; i < 32; i++)
{
    float result = t1[i] * inval;
    ......
if t2[i] > 0 for that particular combination of i, write
t2[i]*in1[tid] to out1[gcount]
This is where the problems really start. Here you do this:
if(t2[i] > 0){
    out1[gcount] = t2[i] * in1[tid];
    gcount = gcount + 1;
}
This is a memory race. gcount is a thread local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. What you must have, for this code to work correctly as written, is to have gcount as a global memory variable and use atomic memory updates to ensure that each thread uses a unique value of gcount each time it outputs a value. But be warned that atomic memory access is very expensive if it is used often (this is why I asked about how many output points there are per kernel launch in a comment).
The resulting kernel might look something like this:
__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];
    float ival = in1[tid];
    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();
        for(int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if(tval > 0){
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
    }
}
Disclaimer: written in browser, never been compiled or tested, use at own risk.....
Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct.
EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol or perhaps cudaGetSymbolAddress and cudaMemset. It might look something like:
const int zero = 0;
cudaMemcpyToSymbol(gcount, &zero, sizeof(int), 0, cudaMemcpyHostToDevice);
Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
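Alternatively, a sketch of the cudaGetSymbolAddress/cudaMemset route mentioned above (same disclaimer applies):

int *d_gcount = NULL;
cudaGetSymbolAddress((void **)&d_gcount, gcount); // device address of the __device__ int
cudaMemset(d_gcount, 0, sizeof(int));             // zero it before each kernel launch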
A better way to do what you are doing is to give each thread its own output, and let it increment its own count and enter values - this way, the double-for loop can happen in parallel in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they'll all overwrite on it.
You should also move the code to copy into shared memory into a separate loop, with a __syncthreads() after. With the __syncthreads() out of the loop, you should get better performance - this means that your shared array will have to be the size of in2 - if this is a problem, there's a better way to deal with this at the end of this answer.
You also should move the threadIdx.x < 32 check to the outside. So your code will look something like this:
if (threadIdx.x < 32) {
    for(int i = threadIdx.x; i < posdir*pos; i += 32) {
        t1[i] = in2[i];
    }
}
__syncthreads();
for(int i = threadIdx.x; i < posdir*pos; i += 32) {
    for(int j = 0; j < 32; j++)
    {
        ...
    }
}
Then put a __syncthreads(), an atomic addition of gcount += count, and a copy from the local output array to a global one - this part is sequential, and will hurt performance. If you can, I would just have a global list of pointers to the arrays for each local one, and put them together on the CPU.
Another change is that you don't need shared memory for t2 - it doesn't help you. And the way you are doing this, it seems like it works only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition this into multiple blocks. You can tailor this to your shared memory constraint. Of course, you don't have a __syncthreads() between blocks, so the threads in each block have to go over the whole range for the inner loop, and a partition of the outer loop.
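For concreteness, here is a rough sketch of that per-thread-output idea (my own illustration, untested; MAX_PER_THREAD and the counts array are assumptions): each thread writes into its own slice of out1 and records how many values it produced, so no thread ever touches another thread's slots, and the host compacts the slices afterwards:

#define MAX_PER_THREAD 256   // assumed upper bound on outputs per thread

__device__ int counts[1024]; // one counter per thread (size is illustrative)

__global__ void test_g_perthread(float *in1, float *in2, float *out1,
                                 int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float ival = in1[tid];
    float *my_out = out1 + tid * MAX_PER_THREAD; // this thread's private slice
    int count = 0;
    for (int i = 0; i < posdir * pos; i++) {
        float tval = in2[i] * ival;
        if (tval > 0.0f && count < MAX_PER_THREAD)
            my_out[count++] = tval * ival;       // no race: the slice is private
    }
    counts[tid] = count;                         // host reads this to compact out1
}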

Doing FFT in realtime

I want to do the FFT of an audio signal in real time, meaning while the person is speaking in the microphone. I will fetch the data (I do this with portaudio, if it would be easier with wavein I would be happy to use that - if you can tell me how). Next I am using the FFTW library - I know how to perform 1D, 2D (real&complex) FFT, but I am not so sure how to do this, since I would have to do a 3D FFT to get frequency, amplitude (this would determine the color gradient) and time. Or is it just a 2D FFT, and I get amplitude and frequency?
I use a Sliding DFT, which is many times faster than an FFT when you need to do a Fourier transform each time a sample arrives in the input buffer.
It's based on the fact that once you have performed a Fourier transform for the last N samples, and a new sample arrives, you can "undo" the effect of the oldest sample and apply the effect of the latest sample in a single pass through the Fourier data! This means that the sliding DFT performance is O(N), compared with O(N*log2(N)) for the FFT. Also, there's no restriction to powers of two for the buffer size to maintain performance.
The complete test program below compares the sliding DFT with fftw. In my production code I've optimized the code below to unreadability, to make it three times faster.
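(For reference, the textbook per-bin update of a sliding DFT is X_k <- (X_k + x_newest - x_oldest) * e^(i*2*pi*k/N); if I read it right, the code below instead keeps the time origin fixed at buffer position 0 and folds the phase of the incoming sample into the coefficient index ci, which avoids the extra per-bin rotation.)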
#include <complex>
#include <iostream>
#include <time.h>
#include <math_defines.h>
#include <float.h>

#define DO_FFTW // libfftw
#define DO_SDFT

#if defined(DO_FFTW)
#pragma comment( lib, "d:\\projects\\common\\fftw\\libfftw3-3.lib" )
namespace fftw {
#include <fftw/fftw3.h>
}
fftw::fftw_plan plan_fwd;
fftw::fftw_plan plan_inv;
#endif

typedef std::complex<double> complex;

// Buffer size, make it a power of two if you want to improve fftw
const int N = 750;
// input signal
complex in[N];
// frequencies of input signal after ft
// Size increased by one because the optimized sdft code writes data to freqs[N]
complex freqs[N+1];
// output signal after inverse ft of freqs
complex out1[N];
complex out2[N];
// forward coeffs -2 PI e^iw -- normalized (divided by N)
complex coeffs[N];
// inverse coeffs 2 PI e^iw
complex icoeffs[N];
// global index for input and output signals
int idx;
// these are just there to optimize (get rid of index lookups in sdft)
complex oldest_data, newest_data;

// initialize e-to-the-i-thetas for theta = 0..2PI in increments of 1/N
void init_coeffs()
{
    for (int i = 0; i < N; ++i) {
        double a = -2.0 * PI * i / double(N);
        coeffs[i] = complex(cos(a)/* / N */, sin(a) /* / N */);
    }
    for (int i = 0; i < N; ++i) {
        double a = 2.0 * PI * i / double(N);
        icoeffs[i] = complex(cos(a), sin(a));
    }
}

// initialize all data buffers
void init()
{
    // clear data
    for (int i = 0; i < N; ++i)
        in[i] = 0;
    // seed rand()
    srand(857);
    init_coeffs();
    oldest_data = newest_data = 0.0;
    idx = 0;
}

// simulating adding data to circular buffer
void add_data()
{
    oldest_data = in[idx];
    newest_data = in[idx] = complex(rand() / double(N));
}

// sliding dft
void sdft()
{
    complex delta = newest_data - oldest_data;
    int ci = 0;
    for (int i = 0; i < N; ++i) {
        freqs[i] += delta * coeffs[ci];
        if ((ci += idx) >= N)
            ci -= N;
    }
}

// sliding inverse dft
void isdft()
{
    complex delta = newest_data - oldest_data;
    int ci = 0;
    for (int i = 0; i < N; ++i) {
        freqs[i] += delta * icoeffs[ci];
        if ((ci += idx) >= N)
            ci -= N;
    }
}

// "textbook" slow dft, nested loops, O(N*N)
void ft()
{
    for (int i = 0; i < N; ++i) {
        freqs[i] = 0.0;
        for (int j = 0; j < N; ++j) {
            double a = -2.0 * PI * i * j / double(N);
            freqs[i] += in[j] * complex(cos(a), sin(a));
        }
    }
}

double mag(complex& c)
{
    return sqrt(c.real() * c.real() + c.imag() * c.imag());
}

void powr_spectrum(double *powr)
{
    for (int i = 0; i < N/2; ++i) {
        powr[i] = mag(freqs[i]);
    }
}

int main(int argc, char *argv[])
{
    const int NSAMPS = N*10;
    clock_t start, finish;

#if defined(DO_SDFT)
    // ------------------------------ SDFT ---------------------------------------------
    init();
    start = clock();
    for (int i = 0; i < NSAMPS; ++i) {
        add_data();
        sdft();
        // Mess about with freqs[] here
        //isdft();
        if (++idx == N) idx = 0; // bump global index
        if ((i % 1000) == 0)
            std::cerr << i << " iters..." << '\r';
    }
    finish = clock();
    std::cout << "SDFT: " << NSAMPS / ((finish-start) / (double)CLOCKS_PER_SEC) << " fts per second." << std::endl;
    double powr1[N/2];
    powr_spectrum(powr1);
#endif

#if defined(DO_FFTW)
    // ------------------------------ FFTW ---------------------------------------------
    plan_fwd = fftw::fftw_plan_dft_1d(N, (fftw::fftw_complex *)in, (fftw::fftw_complex *)freqs, FFTW_FORWARD, FFTW_MEASURE);
    plan_inv = fftw::fftw_plan_dft_1d(N, (fftw::fftw_complex *)freqs, (fftw::fftw_complex *)out2, FFTW_BACKWARD, FFTW_MEASURE);
    init();
    start = clock();
    for (int i = 0; i < NSAMPS; ++i) {
        add_data();
        fftw::fftw_execute(plan_fwd);
        // mess about with freqs here
        //fftw::fftw_execute(plan_inv);
        if (++idx == N) idx = 0; // bump global index
        if ((i % 1000) == 0)
            std::cerr << i << " iters..." << '\r';
    }
    // normalize fftw's output
    for (int j = 0; j < N; ++j)
        out2[j] /= N;
    finish = clock();
    std::cout << "FFTW: " << NSAMPS / ((finish-start) / (double)CLOCKS_PER_SEC) << " fts per second." << std::endl;
    fftw::fftw_destroy_plan(plan_fwd);
    fftw::fftw_destroy_plan(plan_inv);
    double powr2[N/2];
    powr_spectrum(powr2);
#endif

#if defined(DO_SDFT) && defined(DO_FFTW)
    // ------------------------------ ---------------------------------------------
    const double MAX_PERMISSIBLE_DIFF = 1e-11; // DBL_EPSILON;
    double diff;
    // check my ft gives same power spectrum as FFTW
    for (int i = 0; i < N/2; ++i)
        if ( (diff = abs(powr1[i] - powr2[i])) > MAX_PERMISSIBLE_DIFF)
            printf("Values differ by more than %g at index %d. Diff = %g\n", MAX_PERMISSIBLE_DIFF, i, diff);
#endif
    return 0;
}
If you need amplitude, frequency and time in one graph, then the transform is known as a Time-Frequency decomposition. The most popular one is called the Short Time Fourier Transform. It works as follows:
1. Take a small portion of the signal (say 1 second)
2. Window it with a small window (say 5 ms)
3. Compute the 1D fourier transform of the windowed signal.
4. Move the window by a small amount (2.5 ms)
5. Repeat above steps until end of signal.
6. All of this data is entered into a matrix that is then used to create the kind of 3D representation of the signal that shows its decomposition along frequency, amplitude and time.
The length of the window will decide the resolution you are able to obtain in frequency and time domains. Check here for more details on STFT and search for "Robi Polikar"'s tutorials on wavelet transforms for a layman's introduction to the above.
Edit 1:
You take a windowing function (there are innumerable functions out there; here is a list). The most intuitive is a rectangular window, but the most commonly used are the Hamming/Hanning window functions. You can follow the steps below if you have paper and pencil in hand and draw along.
Assume that the signal that you have obtained is 1 sec long and is named x[n]. The windowing function is 5 msec long and is named w[n]. Place the window at the start of the signal (so the end of the window coincides with the 5ms point of the signal) and multiply the x[n] and w[n] like so:
y[n] = x[n] * w[n] - point by point multiplication of the signals.
Take an FFT of y[n].
Then you shift the window by a small amount (say 2.5 msec). So now the window stretches from 2.5ms to 7.5 ms of the signal x[n]. Repeat the multiplication and FFT generation steps. In other words, you have an overlap of 2.5 msec. You will see that changing the length of the window and the overlap gives you different resolutions on the time and Frequency axis.
Once you do this, you need to feed all the data into a matrix and then have it displayed. The overlap is for minimising the errors that might arise at boundaries and also to get more consistent measurements over such short time frames.
P.S: If you had understood STFT and other time-frequency decompositions of a signal, then you should have had no problems with steps 2 and 4. That you have not understood the above mentioned steps makes me feel like you should revisit time-frequency decompositions also.
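For concreteness, here is a minimal STFT skeleton along the lines of the steps above (my own sketch, assuming FFTW and a mono double-precision buffer; the window and hop lengths are placeholders to be chosen as described):

#include <math.h>
#include <fftw3.h>

// Sketch: STFT of 'signal' (nsamples mono samples) with a Hann window.
// 'win' is the window length in samples, 'hop' the hop size (overlap = win - hop).
// Each frame yields win/2+1 complex bins, i.e. one column of the spectrogram.
void stft(const double *signal, int nsamples, int win, int hop)
{
    double *frame = fftw_alloc_real(win);
    fftw_complex *spectrum = fftw_alloc_complex(win / 2 + 1);
    fftw_plan plan = fftw_plan_dft_r2c_1d(win, frame, spectrum, FFTW_MEASURE);

    for (int start = 0; start + win <= nsamples; start += hop) {
        for (int n = 0; n < win; n++) {
            double hann = 0.5 * (1.0 - cos(2.0 * M_PI * n / (win - 1)));
            frame[n] = signal[start + n] * hann;   // window the current slice
        }
        fftw_execute(plan);
        // magnitude of bin k = sqrt(spectrum[k][0]^2 + spectrum[k][1]^2);
        // store it at (time = start, frequency = k) in your spectrogram matrix
    }

    fftw_destroy_plan(plan);
    fftw_free(frame);
    fftw_free(spectrum);
}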
You can create a realtime FFT by choosing a short time-span and analysing (FFT'ing) just that time-span. You can probably get away with just selecting non-overlapping timespans of say 100-500 milliseconds; the analytically purer way to do this would be using a sliding-window (again of e.g. 100-500 ms), but that is often unnecessary and you can show nice graphics with the non-overlapping timespans without much processing power.
Real-time FFT means something completely different from what you just described. It means that, for a given N and X[N], your algorithm produces Fx[i] as i is incremented; the next value is not computed until the computation of the current value has completed. This is completely different from what you just described.
Hardware usually uses FFTs with around 1k-16k points: fixed N, not real-time computation, with a moving-window FFT as described in the previous answers.
