This is my first question. I'm trying to parallelize a 2D Haar transform function in C with OpenMP. I obtained the code here and modified it accordingly.
The program takes a black-and-white image, puts it into a matrix and computes one level of the Haar wavelet transform. At the end it normalizes the values and writes the transformed image to disk.
This is a resulting image (1 level of HDT).
My problem is that the parallelized version runs quite a bit slower than the serial one.
For now I attach a snippet from the main part I want to parallelize (later on I can post all the surrounding code):
void haar_2d ( int m, int n, double u[] )
// m & n are the dimensions (every image is a perfect square)
// u is the input array, in row-major (not column-major!) order
{
int i;
int j;
int k;
double s;
double *v;
int tid, nthreads, chunk;
s = sqrt ( 2.0 );
v = ( double * ) malloc ( m * n * sizeof ( double ) );
for ( j = 0; j < n; j++ )
{
for ( i = 0; i < m; i++ )
{
v[i+j*m] = u[i+j*m];
}
}
/*
Determine K, the largest power of 2 such that K <= M.
*/
k = 1;
while ( k * 2 <= m )
{
k = k * 2;
}
/* Transform all columns. */
while ( n/2 < k ) // just 1 level of transformation
{
k = k / 2;
clock_t begin = clock();
#pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid)
{
tid = omp_get_thread_num();
printf("Thread %d starting...\n",tid);
#pragma omp for schedule (dynamic)
for ( j = 0; j < n; j++ )
{
for ( i = 0; i < k; i++ )
{
v[i +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
}
}
#pragma omp for schedule (dynamic)
for ( j = 0; j < n; j++ )
{
for ( i = 0; i < 2 * k; i++ )
{
u[i+j*m] = v[i+j*m];
}
}
}//end parallel
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf ( "Time for COLUMNS: %f ms\n", time_spent * 1000);
}//end while
// [...]code for rows
free ( v );
return;
}
The timings more or less are:
Time for COLUMNS: 160.519000 ms // parallel
Time for COLUMNS: 62.842000 ms // serial
I have tried to re-arrange the pragmas in lots of different ways, e.g. with static schedule, with sections, tasks and so on, and also re-arranging the data scopes of the variables and dynamically allocating inside parallel regions.
I thought it would be simple to parallelize a two-level for loop, but I have now been struggling with it for two days. I'm asking for your help; I've already checked out nearly all the related questions here, but I'm still not able to move on or, at least, to understand the reasons. Thank you in advance.
(CPU: Intel Core i3-4005U @ 1.70 GHz, 2 cores / 4 threads)
UPDATE:
1) As for m & n: the function is supposed to support rectangular images one day as well, so I just left both parameters there.
2) I figured out that u is actually an ordinary array holding a linearized matrix, stored row by row (I use PGM images).
3) memcpy is a better option, so now I'm using it.
As for the main topic, I've tried to divide the job over n by spawning a task for each chunk, and the result is a little bit faster than the serial code.
Now I know that the input matrix u is in proper row-major order and the two for loops seem to proceed accordingly, but I'm not sure about the timings: using both omp_get_wtime() and clock(), I don't know how to measure the speedup. I ran tests with different image sizes, from 16x16 up to 4096x4096, and the parallel version seems to be slower with clock() but faster with omp_get_wtime() and gettimeofday().
Do you have any suggestions on how to handle this correctly with OpenMP, or at least on how to measure the speedup correctly?
while ( n/2 < k )
{
k = k / 2;
double start_time = omp_get_wtime();
// clock_t begin = clock();
#pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(k)
{
nthreads = omp_get_num_threads();
#pragma omp single
{
printf("Number of threads = %d\n", nthreads);
int chunk = n/nthreads;
printf("Chunks size = %d\n", chunk);
printf("Thread %d is starting the tasks.\n", omp_get_thread_num());
int h;
for(h=0;h<n;h = h + chunk){
printf("FOR CYCLE i=%d\n", h);
#pragma omp task shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(h,k)
{
tid = omp_get_thread_num();
printf("Thread %d starts at %d position\n", tid , h);
for ( j = h; j < h + chunk; j++ )
{
for ( i = 0; i < k; i++ )
{
v[i +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
}
}
}// end task
}//end launching for
#pragma omp taskwait
}//end single
}//end parallel region
// clock_t end = clock();
// double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
// printf ( "COLUMNS: %f ms\n", time_spent * 1000);
double time = omp_get_wtime() - start_time;
printf ( "COLUMNS: %f ms\n", time*1000);
for ( j = 0; j < n; j++ )
{
for ( i = 0; i < 2 * k; i++ )
{
u[i+j*m] = v[i+j*m];
}
}
}//end while
I have a few questions that deeply concern me about your code.
m & n are the dimensions (every image is a perfect square)
Then why are there two size parameters?
u is the input array in column-major order
This is an incredibly bad idea. C uses a row-major ordering for memory, so column-major indexing leads to strided memory access. This is very, very bad for performance. If at all possible, you need to fix this.
Because both u and v are linearized matrices, then this
for (int j = 0; j < n; j++) {
for (int i = 0; i < m; i++) {
v[i + j * m] = u[i + j * m];
}
}
can be replaced with a call to memcpy.
memcpy(v, u, m * n * sizeof(double));
On to your issue. The reason that your version using OpenMP is slower is because all of your threads are doing the same thing. This isn't useful and leads to bad things like false sharing. You need to use each thread's id (tid in your code) to partition the data across the threads; keeping in mind that false sharing is bad.
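As a rough illustration only (my sketch, reusing the names u, v, s, k, m and n from the question, not the OP's actual code), partitioning the column pass by thread id could look like this, with each thread taking one contiguous block of columns so that its writes land in a disjoint region of v:
#pragma omp parallel shared(u, v, s, k, m, n)
{
    int tid      = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    int lo = (n * tid) / nthreads;        /* first column for this thread */
    int hi = (n * (tid + 1)) / nthreads;  /* one past its last column     */

    for (int j = lo; j < hi; j++) {
        for (int i = 0; i < k; i++) {
            v[i + j*m]     = ( u[2*i + j*m] + u[2*i+1 + j*m] ) / s;
            v[k + i + j*m] = ( u[2*i + j*m] - u[2*i+1 + j*m] ) / s;
        }
    }
}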
The problem was that I was using clock() instead of omp_get_wtime(), thanks to Z boson.
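For future readers, here is a minimal stand-alone sketch (not part of the original program) of the difference: on many platforms clock() reports CPU time accumulated across all threads, so a parallel region can look slower even when the elapsed wall-clock time reported by omp_get_wtime() actually drops.
#include <omp.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    double sum = 0.0;
    clock_t c0 = clock();             /* CPU time (summed over threads on many platforms) */
    double  w0 = omp_get_wtime();     /* wall-clock time */

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 100000000L; i++)
        sum += 1.0 / (double)(i + 1);

    double cpu_ms  = 1000.0 * (double)(clock() - c0) / CLOCKS_PER_SEC;
    double wall_ms = 1000.0 * (omp_get_wtime() - w0);
    printf("sum = %f, CPU time: %.1f ms, wall time: %.1f ms\n", sum, cpu_ms, wall_ms);
    return 0;
}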
Related
I have a function in C which I have to parallelize using OpenMP with static scheduling for n threads.
void resolveCollisions(){
int i,j;
double dx,dy,dz,md;
for(i=0;i<bodies-1;i++)
for(j=i+1;j<bodies;j++){
md = masses[i]+masses[j];
dx = fabs(positions[i].x-positions[j].x);
dy = fabs(positions[i].y-positions[j].y);
dz = fabs(positions[i].z-positions[j].z);
if(dx<md && dy<md && dz<md){
vector temp = velocities[i];
velocities[i] = velocities[j];
velocities[j] = temp;
}
}
}
So in order to parallelize this I added a #pragma omp parallel for directive to parallelize the outer loop across the n threads. I also added the static scheduling clause, which I have to use. I also put in num_threads(n), which takes n from the function parameters to set the desired number of threads. I also thought about adding a critical section to prevent race conditions when updating the velocities array.
void resolveCollisions_openMP_static(int n) {
int i, j;
double dx, dy, dz, md;
#pragma omp parallel for schedule(static) num_threads(n)
for (i = 0; i < bodies - 1; i++) {
for (j = i + 1; j < bodies; j++) {
md = masses[i] + masses[j];
dx = fabs(positions[i].x - positions[j].x);
dy = fabs(positions[i].y - positions[j].y);
dz = fabs(positions[i].z - positions[j].z);
if (dx < md && dy < md && dz < md) {
vector temp = velocities[i];
#pragma omp critical
{
velocities[i] = velocities[j];
velocities[j] = temp;
}
}
}
}
}
When I run this function, though, it gives me wrong results. I imagine it has something to do with the inner loop using i to initialize j in j = i + 1. I don't know how to approach fixing this, or whether this is even the actual issue. I would appreciate any help. Thank you.
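For what it's worth, one variant I'm considering (just a sketch, I'm not sure it is the right fix) makes the inner-loop variables private and keeps the read of velocities[i] inside the same critical section as the writes, so another thread cannot change the array between the read and the swap:
void resolveCollisions_openMP_static_v2(int n) {
    int i, j;
    #pragma omp parallel for schedule(static) num_threads(n) private(j)
    for (i = 0; i < bodies - 1; i++) {
        for (j = i + 1; j < bodies; j++) {
            double md = masses[i] + masses[j];
            double dx = fabs(positions[i].x - positions[j].x);
            double dy = fabs(positions[i].y - positions[j].y);
            double dz = fabs(positions[i].z - positions[j].z);
            if (dx < md && dy < md && dz < md) {
                #pragma omp critical
                {
                    /* read and both writes inside one critical section */
                    vector temp   = velocities[i];
                    velocities[i] = velocities[j];
                    velocities[j] = temp;
                }
            }
        }
    }
}
Even then, if a body collides with more than one other body, the order in which the swaps happen still differs from the serial run, so I'm not sure the results are supposed to match exactly.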
I'd like to generate a random matrix with OpenMP as if it were generated by a sequential program, i.e. if a sequential matrix generator outputs a matrix like the following one:
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 0.0 1.0 2.0
3.0 4.0 5.0 6.0
I want the parallel OpenMP version of the same program to generate the same matrix with no interleaved rows.
Here is how I gradually approached the problem.
Given my serial generator C function generating a matrix as a 1D array:
void generate_matrix_array(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
First, I naively applied the #pragma omp parallel for directive to the outer for loop; however, there's no guarantee about row ordering, since thread execution gets interleaved, so the rows get generated in a non-deterministic order.
Adding the ordered option would solve the issue, at the price of making multithreading useless in this particular case.
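For reference, that ordered variant would look roughly like this sketch (same names and the same v[i*rows + j] indexing as above); all the rand() calls happen inside the ordered block, in iteration order, so the output should match the sequential generator, but the work is essentially serialized:
#pragma omp parallel for ordered shared(v)
for (int i = 0; i < rows; i++) {
    #pragma omp ordered
    {
        /* rows are emitted strictly in order of i */
        for (int j = 0; j < columns; j++) {
            v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
        }
    }
}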
In order to solve the issue, I tried to partition by hand the matrix array so that thread i would generate the i-th slice of it:
void generate_matrix_array_par(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
#pragma omp parallel \
shared(v)
{
int tid = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int rows_per_thread = round(rows / (double) nthreads);
int rem_rows = rows % (nthreads - 1) != 0?
rows % (nthreads - 1):
rows_per_thread;
int local_rows = (tid == 0)?
rows_per_thread:
rem_rows;
int lower_row = tid * local_rows;
int upper_row = ((tid + 1) * local_rows);
printf(
"[T%d] receiving %d of %d rows from row %d to %d\n",
tid,
local_rows,
rows,
lower_row,
upper_row - 1
);
printf("\n");
fflush(stdout);
for (int i = lower_row; i < upper_row; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
}
However, despite the matrix array being properly divided among threads, for some reason unknown to me every thread generates its rows into the matrix in a non-deterministic position, i.e. if I generate an 8x8 matrix with 4 threads and thread 3 is assigned rows 4 and 5, it will generate two contiguous rows in the matrix array but in the wrong position every time, as if I hadn't performed any partitioning and the omp parallel for directive were in place.
I skeptically tried, at last, to go back to the naive approach, specifying the shared(v) and schedule(static, 16) options on the omp parallel for directive, and it 'magically' happens to work:
void generate_matrix_array_par(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
int nthreads = omp_get_max_threads();
int chunk_size = (rows * columns) / nthreads;
#pragma omp parallel for \
shared(v) \
schedule(static, chunk_size)
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
The schedule option was added because I read elsewhere that it gets rid of cache conflicts. Edit: it looks like schedule splits the data among threads in a round-robin fashion according to the given chunk size; so if I hand out N/nthreads-sized chunks, the data will be assigned in a single round.
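To see that assignment concretely, a tiny stand-alone sketch like the one below (an illustration only, not part of the generator) prints which thread handles which row under schedule(static, chunk_size):
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int rows = 8;
    int nthreads   = omp_get_max_threads();
    int chunk_size = rows / nthreads;
    if (chunk_size < 1) chunk_size = 1;   /* the chunk size must be positive */

    /* With chunk_size = rows / nthreads, each thread gets one contiguous
       block of rows in a single round-robin pass. */
    #pragma omp parallel for schedule(static, chunk_size)
    for (int i = 0; i < rows; i++) {
        printf("row %d handled by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}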
Any question? YES!!!
Now, I'd like to know whether I missed or got wrong some consideration about the problem, since I'm not convinced about the fairness of my last version of the program, despite the fact that it works.
I'm attempting to parallelize a Jacobi grid solver using OpenMP.
When 1 thread is used:
As it stands, the code executes correctly when only a single thread is assigned, and produces the same results as a reference single-threaded function (not shown).
The while loop breaks when the difference variable is less than "0.01000" (as it should).
When two or more threads are used:
The code runs through the outer while loop only once.
The difference value from the first thread is way above 0.0100 (as it should be), but the difference value given by the other thread(s) drops below it immediately, so the loop breaks without doing any of the calculations.
I've tested a lot of iterations of strategically placing the respective variables in the shared/private/reduction clauses, hoping to get the diff value to accumulate correctly over all the threads used. I get that the "diff" variable should be shared by all threads, but what I've tried has not managed to accumulate the values from all threads. I'm not sure what else I can try.
Thanks for your time and input
int
compute_using_omp_jacobi (grid_t *grid, int num_threads)
{
/////////////////////////////////////////////////////////
int i, j;
int num_iter = 0;
int done = 0;
double diff;
float old, new;
float eps = 1e-2; /* Convergence criteria. */
int num_elements;
omp_set_num_threads(num_threads);
#pragma omp parallel default(none) shared(grid, eps, done, diff) private ( i, j, old, new, num_elements) reduction (+:num_iter)
while(!done) { /* While we have not converged yet. */
diff = 0.0;
num_elements = 0;
#pragma omp for reduction (+: diff) collapse(2)
for (i = 1; i < (grid->dim - 1); i++)
for (j = 1; j < (grid->dim - 1); j++) {
old = grid->element[i * grid->dim + j]; /* Store old value of grid point. */
/* Apply the update rule. */
new = 0.25 * (grid->element[(i - 1) * grid->dim + j] +\
grid->element[(i + 1) * grid->dim + j] +\
grid->element[i * grid->dim + (j + 1)] +\
grid->element[i * grid->dim + (j - 1)]);
grid->element[i * grid->dim + j] = new; /* Update the grid-point value. */
diff = diff + fabs(new - old); /* Calculate the difference in values. */
num_elements++;
//printf ("DIFF %f.", diff);
}
/* End of an iteration. Check for convergence. */
diff = diff/num_elements;
printf ("Iteration %d. DIFF: %f.\n", num_iter, diff);
// printf ("number of elements %d.", num_elements);
num_iter++;
if (diff < eps)
done = 1;
}
return num_iter;
}
You can't parallelize the while loop, since the values for grid->element in each iteration depend on the values from the previous iteration.
You'll have to move the #pragma omp parallel inside the while (to before the first for loop).
num_elements should be named in the reduction clause, and new and old should be declared within the body of the inner for loop.
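Put together, those three suggestions might look like the sketch below (my illustration only, untested; it assumes the same grid_t layout as in the question and keeps the in-place update from the original code):
#include <math.h>
#include <omp.h>

int compute_using_omp_jacobi_v2(grid_t *grid, int num_threads)
{
    int num_iter = 0;
    int done = 0;
    float eps = 1e-2f;                      /* convergence criterion */

    omp_set_num_threads(num_threads);

    while (!done) {                         /* the while loop itself stays serial */
        double diff = 0.0;
        int num_elements = 0;

        /* The parallel region now lives inside the while loop, and both
           accumulators are reduction variables. */
        #pragma omp parallel for collapse(2) reduction(+:diff, num_elements)
        for (int i = 1; i < grid->dim - 1; i++)
            for (int j = 1; j < grid->dim - 1; j++) {
                /* old and new are declared inside the loop body,
                   so every thread has its own copies */
                float old = grid->element[i * grid->dim + j];
                float new = 0.25f * (grid->element[(i - 1) * grid->dim + j] +
                                     grid->element[(i + 1) * grid->dim + j] +
                                     grid->element[i * grid->dim + (j + 1)] +
                                     grid->element[i * grid->dim + (j - 1)]);
                grid->element[i * grid->dim + j] = new;
                diff += fabs(new - old);
                num_elements++;
            }

        diff = diff / num_elements;
        num_iter++;
        if (diff < eps)
            done = 1;
    }
    return num_iter;
}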
I have been fighting with a very weird bug for almost a month. Asking you guys is my last hope. I wrote a program in C that integrates the 2d Cahn–Hilliard equation using the Implicit Euler (IE) scheme in Fourier (or reciprocal) space:
Where the "hats" mean that we are in Fourier space: h_q(t_n+1) and h_q(t_n) are the FTs of h(x,y) at times t_n and t_(n+1), N[h_q] is the nonlinear operator applied to h_q, in Fourier space, and L_q is the linear one, again in Fourier space. I don't want to go too much into the details of the numerical method I am using, since I am sure that the problem is not coming from there (I tried using other schemes).
My code is actually quite simple. Here is the beginning, where basically I declare variables, allocate memory and create the plans for the FFTW routines.
# include <stdlib.h>
# include <stdio.h>
# include <time.h>
# include <math.h>
# include <fftw3.h>
# define pi M_PI
int main(){
// define lattice size and spacing
int Nx = 150; // n of points on x
int Ny = 150; // n of points on y
double dx = 0.5; // bin size on x and y
// define simulation time and time step
long int Nt = 1000; // n of time steps
double dt = 0.5; // time step size
// number of frames to plot (at denominator)
long int nframes = Nt/100;
// define the noise
double rn, drift = 0.05; // punctual drift of h(x)
srand(666); // seed the RNG
// other variables
int i, j, nt; // variables for space and time loops
// declare FFTW3 routine
fftw_plan FT_h_hft; // routine to perform fourier transform
fftw_plan FT_Nonl_Nonlft;
fftw_plan IFT_hft_h; // routine to perform inverse fourier transform
// declare and allocate memory for real variables
double *Linft = fftw_alloc_real(Nx*Ny);
double *Q2 = fftw_alloc_real(Nx*Ny);
double *qx = fftw_alloc_real(Nx);
double *qy = fftw_alloc_real(Ny);
// declare and allocate memory for complex variables
fftw_complex *dh = fftw_alloc_complex(Nx*Ny);
fftw_complex *dhft = fftw_alloc_complex(Nx*Ny);
fftw_complex *Nonl = fftw_alloc_complex(Nx*Ny);
fftw_complex *Nonlft = fftw_alloc_complex(Nx*Ny);
// create the FFTW plans
FT_h_hft = fftw_plan_dft_2d ( Nx, Ny, dh, dhft, FFTW_FORWARD, FFTW_ESTIMATE );
FT_Nonl_Nonlft = fftw_plan_dft_2d ( Nx, Ny, Nonl, Nonlft, FFTW_FORWARD, FFTW_ESTIMATE );
IFT_hft_h = fftw_plan_dft_2d ( Nx, Ny, dhft, dh, FFTW_BACKWARD, FFTW_ESTIMATE );
// open file to store the data
char acstr[160];
FILE *fp;
sprintf(acstr, "CH2d_IE_dt%.2f_dx%.3f_Nt%ld_Nx%d_Ny%d_#f%.ld.dat",dt,dx,Nt,Nx,Ny,Nt/nframes);
After this preamble, I initialise my function h(x,y) with uniform random noise, and I also take its FT. I set the imaginary part of h(x,y), which is dh[i*Ny+j][1] in the code, to 0, since it is a real function. Then I calculate the wavevectors qx and qy and, with them, I compute the linear operator of my equation in Fourier space, which is Linft in the code. I consider only the (negative) fourth derivative of h as the linear term, so that the FT of the linear term is simply -q^4... but again, I don't want to go into the details of my integration method; the question is not about it.
// generate h(x,y) at initial time
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
rn = (double) rand()/RAND_MAX; // extract a random number between 0 and 1
dh[i*Ny+j][0] = drift-2.0*drift*rn; // shift of +-drift
dh[i*Ny+j][1] = 0.0;
}
}
// execute plan for the first time
fftw_execute (FT_h_hft);
// calculate wavenumbers
for (i = 0; i < Nx; i++) { qx[i] = 2.0*i*pi/(Nx*dx); }
for (i = 0; i < Ny; i++) { qy[i] = 2.0*i*pi/(Ny*dx); }
for (i = 1; i < Nx/2; i++) { qx[Nx-i] = -qx[i]; }
for (i = 1; i < Ny/2; i++) { qy[Ny-i] = -qy[i]; }
// calculate the FT of the linear operator
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Q2[i*Ny+j] = qx[i]*qx[i] + qy[j]*qy[j];
Linft[i*Ny+j] = -Q2[i*Ny+j]*Q2[i*Ny+j];
}
}
Then, finally, it comes the time loop. Essentially, what I do is the following:
Every once in a while, I save the data to a file and print some information on the terminal. In particular, I print the highest value of the FT of the Nonlinear term. I also check if h(x,y) is diverging to infinity (it shouldn't happen!),
Calculate h^3 in direct space (that is simply dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0]). Again, the imaginary part is set to 0,
Take the FT of h^3,
Obtain the complete Nonlinear term in reciprocal space (that is, N[h_q] in the IE algorithm written above) by computing -q^2*(FT[h^3] - FT[h]). In the code, I am referring to the line Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]) and the one below it, for the imaginary part. I do this because the nonlinear term of the equation is the Laplacian of (h^3 - h), and a Laplacian in Fourier space is just a multiplication by -q^2, so N[h_q] = -q^2*(FT[h^3] - FT[h]),
Advance in time using the IE method, transform back in direct space, and then normalise.
Here is the code:
for(nt = 0; nt < Nt; nt++) {
if((nt % nframes)== 0) {
printf("%.0f %%\n",((double)nt/(double)Nt)*100);
printf("Nonlft %.15f \n",Nonlft[(Nx/2)*(Ny/2)][0]);
// write data to file
fp = fopen(acstr,"a");
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
fprintf(fp, "%4d %4d %.6f\n", i, j, dh[i*Ny+j][0]);
}
}
fclose(fp);
}
// check if h is going to infinity
if (isnan(dh[1][0])!=0) {
printf("crashed!\n");
return 0;
}
// calculate nonlinear term h^3 in direct space
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0];
Nonl[i*Ny+j][1] = 0.0;
}
}
// Fourier transform of nonlinear term
fftw_execute (FT_Nonl_Nonlft);
// second derivative in Fourier space is just multiplication by -q^2
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]);
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][1] -dhft[i*Ny+j][1]);
}
}
// Implicit Euler scheme in Fourier space
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
dhft[i*Ny+j][0] = (dhft[i*Ny+j][0] + dt*Nonlft[i*Ny+j][0])/(1.0 - dt*Linft[i*Ny+j]);
dhft[i*Ny+j][1] = (dhft[i*Ny+j][1] + dt*Nonlft[i*Ny+j][1])/(1.0 - dt*Linft[i*Ny+j]);
}
}
// transform h back in direct space
fftw_execute (IFT_hft_h);
// normalize
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
dh[i*Ny+j][0] = dh[i*Ny+j][0] / (double) (Nx*Ny);
dh[i*Ny+j][1] = dh[i*Ny+j][1] / (double) (Nx*Ny);
}
}
}
Last part of the code: empty the memory and destroy FFTW plans.
// terminate the FFTW3 plan and free memory
fftw_destroy_plan (FT_h_hft);
fftw_destroy_plan (FT_Nonl_Nonlft);
fftw_destroy_plan (IFT_hft_h);
fftw_cleanup();
fftw_free(dh);
fftw_free(Nonl);
fftw_free(qx);
fftw_free(qy);
fftw_free(Q2);
fftw_free(Linft);
fftw_free(dhft);
fftw_free(Nonlft);
return 0;
}
If I run this code, I obtain the following output:
0 %
Nonlft 0.0000000000000000000
1 %
Nonlft -0.0000000000001353512
2 %
Nonlft -0.0000000000000115539
3 %
Nonlft 0.0000000001376379599
...
69 %
Nonlft -12.1987455309071730625
70 %
Nonlft -70.1631962517720353389
71 %
Nonlft -252.4941743351609204637
72 %
Nonlft 347.5067875825179726235
73 %
Nonlft 109.3351142318568633982
74 %
Nonlft 39933.1054502610786585137
crashed!
The code crashes before reaching the end and we can see that the Nonlinear term is diverging.
Now, the thing that doesn't make sense to me is that if I change the lines in which I calculate the FT of the Nonlinear term in the following way:
// calculate nonlinear term h^3 -h in direct space
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0] -dh[i*Ny+j][0];
Nonl[i*Ny+j][1] = 0.0;
}
}
// Fourier transform of nonlinear term
fftw_execute (FT_Nonl_Nonlft);
// second derivative in Fourier space is just multiplication by -q^2
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][0];
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][1];
}
}
Which means that I am using the definition N[h_q] = -q^2*FT[h^3 - h] instead of N[h_q] = -q^2*(FT[h^3] - FT[h]).
Then the code is perfectly stable and no divergence happens! Even for billions of time steps! Why does this happen, since the two ways of calculating Nonlft should be equivalent?
Thank you very much to anyone who will take the time to read all of this and give me some help!
EDIT: To make things even more weird, I should point out that this bug does NOT happen for the same system in 1D. In 1D both methods of calculating Nonlft are stable.
EDIT: I add a short animation of what happens to the function h(x,y) just before crashing. Also: I quickly re-wrote the code in MATLAB, which uses Fast Fourier Transform functions based on the FFTW library, and the bug is NOT happening... the mystery deepens.
I solved it!!
The problem was the calculation of the Nonl term:
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0];
Nonl[i*Ny+j][1] = 0.0;
That needs to be changed to:
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0] -3.0*dh[i*Ny+j][0]*dh[i*Ny+j][1]*dh[i*Ny+j][1];
Nonl[i*Ny+j][1] = -dh[i*Ny+j][1]*dh[i*Ny+j][1]*dh[i*Ny+j][1] +3.0*dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][1];
In other words: I need to consider dh as a complex function (even though it should be real).
Basically, because of stupid rounding errors, the IFT of the FT of a real function (in my case dh), is NOT purely real, but will have a very small imaginary part. By setting Nonl[i*Ny+j][1] = 0.0 I was completely ignoring this imaginary part.
The issue, then, was that I was recursively summing dhft, which is FT(dh), with an object obtained from IFT(FT(dh)), namely Nonlft, while ignoring the residual imaginary parts!
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]);
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][1] -dhft[i*Ny+j][1]);
Obviously, calculating Nonlft as dh^3 -dh and then doing
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][0];
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][1];
avoided the problem of doing this "mixed" sum.
Phew... such a relief! I wish I could assign the bounty to myself! :P
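For anyone hitting the same thing, a small diagnostic helper like the one below (hypothetical, not part of the code above) makes the residual imaginary part visible; it can be called on dh right after the normalisation step:
#include <math.h>
#include <stdio.h>
#include <fftw3.h>

/* Hypothetical helper: return the largest residual imaginary part of a
   nominally real field stored as an Nx*Ny array of fftw_complex. */
static double max_imag_part(fftw_complex *f, int Nx, int Ny)
{
    double max_im = 0.0;
    for (int i = 0; i < Nx; i++) {
        for (int j = 0; j < Ny; j++) {
            double im = fabs(f[i*Ny + j][1]);
            if (im > max_im) max_im = im;
        }
    }
    return max_im;
}

/* usage, e.g. inside the time loop:
   printf("max |Im(h)| after IFT: %.3e\n", max_imag_part(dh, Nx, Ny));  */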
EDIT: I'd like to add that, before using the fftw_plan_dft_2d functions, I was using fftw_plan_dft_r2c_2d and fftw_plan_dft_c2r_2d (real-to-complex and complex-to-real), and I was seeing the same bug. However, I suppose I couldn't have solved it if I hadn't switched to fftw_plan_dft_2d, since the c2r function automatically "chops off" the residual imaginary part coming from the IFT. If this is the case and I'm not missing something, I think this should be written somewhere on the FFTW website, to prevent users from running into problems like this. Something like "r2c and c2r transforms are not suitable for implementing pseudospectral methods".
EDIT: I found another SO question that addresses exactly the same problem.
I am trying to parallelise gaussian elimination with pivoting using OpenMP.
Below is the relevant section of the code that I wrote:
struct timeval tvBegin, tvEnd;
gettimeofday(&tvBegin, NULL);
for (k=1; k<=n-1; ++k) {
amax = (double) fabs(a[k][k]) ;
m = k;
for (i=k+1; i<=n; i++){ /* Find the row with largest pivot */
xfac = (double) fabs(a[i][k]);
if(xfac > amax) {amax = xfac; m=i;}
}
if(m != k) { /* Row interchanges */
rowx = rowx+1;
temp1 = b[k];
b[k] = b[m];
b[m] = temp1;
for(j=k; j<=n; j++) {
temp = a[k][j];
a[k][j] = a[m][j];
a[m][j] = temp;
}
}
#pragma omp parallel for private(i,j)
for (i=k+1; i<=n; ++i) {
xfac = a[i][k]/a[k][k];
for (j=k+1; j<=n; ++j) {
a[i][j] = a[i][j]-xfac*a[k][j];
}
b[i] = b[i]-xfac*b[k];
}
matrix_print_off (n, n, a);
}
}
gettimeofday(&tvEnd, NULL);
printf("\nTime elapsed in ms: %d\n", diff_ms(tvEnd, tvBegin));
I tested this code with a 1000x1000 matrix. The average time taken to run this code (measured via diff_ms) on a 4-core machine comes out to be the same (2142 ms) as for the sequential version (without the pragmas) of this code. Since there is plenty of parallelism available here, this shouldn't be the case. Could you please let me know where I went wrong?
For reference, I have also attached the diff_ms function below.
int diff_ms(struct timeval t1, struct timeval t2)
{
return (((t1.tv_sec - t2.tv_sec) * 1000) +
(t1.tv_usec - t2.tv_usec)/1000);
}
Thanks!
Inside your parallel section, you have matrix_print_off(). Assuming your print function is thread safe, this will significantly reduce the amount of parallelism you can achieve. Additionally, if matrix_print_off() uses blocking IO, then this function's time may dominate the rest of your function.
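For illustration only, the timed section might then be restructured like this sketch (it assumes the same declarations of a, b, n, k, i, j and xfac as in the question; note that xfac is listed as private here, since in the posted pragma it is shared by default, which is itself a correctness problem):
gettimeofday(&tvBegin, NULL);
for (k = 1; k <= n - 1; ++k) {
    /* ... pivot search and row interchange exactly as in the question ... */

    #pragma omp parallel for private(i, j, xfac)
    for (i = k + 1; i <= n; ++i) {
        xfac = a[i][k] / a[k][k];
        for (j = k + 1; j <= n; ++j)
            a[i][j] = a[i][j] - xfac * a[k][j];
        b[i] = b[i] - xfac * b[k];
    }
}
gettimeofday(&tvEnd, NULL);
matrix_print_off(n, n, a);   /* print once, outside the timed loop */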