How to reduce execution time of FFT in a loop using OpenMP?

Unable to reduce the execution time of multiple FFTs using OpenMP.
I tried parallelizing the outermost loop, but this degraded the performance.
typedef struct { float r; float i; } cmplx_f32_t;

double src[2*128];
double dst[2*128];
double w[128];
cmplx_f32_t data[128][4][256];
int i, j, k;

cffti(128, w);
for (k = 0; k < 128; k++)
{
    for (j = 0; j < 4; j++)
    {
        /* pack the complex input as interleaved re/im pairs */
        for (i = 0; i < 2*128; i += 2)
        {
            src[i]   = data[i/2][j][k].r;
            src[i+1] = data[i/2][j][k].i;
        }
        cfft2(128, src, dst, w, 1);
    }
}
cffti and cfft2 are as given in the example at https://people.sc.fsu.edu/~jburkardt/c_src/fft_openmp/fft_openmp.html
If I disable the #pragma omp directives in fft_openmp.c, the run time is about 11 ms. With the #pragma omp directives enabled, the total execution time is about 220 ms.
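For reference, here is a minimal sketch of what parallelizing the outer loops (with the directives inside cfft2 left disabled) might look like; src_local and dst_local are per-thread buffers introduced here for illustration, and the sketch assumes the (k, j) iterations are independent and that dst_local is consumed inside the loop body:

#pragma omp parallel
{
    double src_local[2*128];   /* per-thread input buffer */
    double dst_local[2*128];   /* per-thread output buffer */
    #pragma omp for collapse(2)
    for (int k = 0; k < 128; k++)
    {
        for (int j = 0; j < 4; j++)
        {
            for (int i = 0; i < 2*128; i += 2)
            {
                src_local[i]   = data[i/2][j][k].r;
                src_local[i+1] = data[i/2][j][k].i;
            }
            cfft2(128, src_local, dst_local, w, 1);
            /* ... use dst_local here ... */
        }
    }
}

Whether this wins depends on the transform size; 128-point FFTs are small, so the per-thread work has to outweigh the cost of entering the parallel region.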

Related

Parallel code with OpenMP takes more time to execute than serial code

I'm trying to make this code run in parallel. It's a chunk of code from a big project. I thought I'd start parallelizing slowly to see step by step whether there is a problem (I don't know if that's a good tactic, so please let me know).
double best_nearby(double delta[MAXVARS], double point[MAXVARS], double prevbest, int nvars)
{
    double z[MAXVARS];
    double minf, ftmp;
    int i;

    minf = prevbest;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for shared(nvars,point,z) private(i)
    for (i = 0; i < nvars; i++)
        z[i] = point[i];
    for (i = 0; i < nvars; i++) {
        z[i] = point[i] + delta[i];
        ftmp = f(z, nvars);
        if (ftmp < minf)
            minf = ftmp;
        else {
            delta[i] = 0.0 - delta[i];
            z[i] = point[i] + delta[i];
            ftmp = f(z, nvars);
            if (ftmp < minf)
                minf = ftmp;
            else
                z[i] = point[i];
        }
    }
    for (i = 0; i < nvars; i++)
        point[i] = z[i];
    return (minf);
}
NUM_THREADS is #defined.
The function has a few more lines, but they are identical in the parallel and serial versions.
The serial code takes about 130 s on average, while the parallel version takes something like 400 s. It baffles me that such a small change can lead to such a large increase in execution time. Any ideas on why this happens? Thank you in advance!
double f(double *x, int n)
{
    double fv;
    int i;

    funevals++;
    fv = 0.0;
    for (i = 0; i < n-1; i++)   /* rosenbrock */
        fv = fv + 100.0*pow((x[i+1]-x[i]*x[i]),2) + pow((x[i]-1.0),2);
    return fv;
}
Currently, you are not parallelizing much. You can start by parallelizing the f function, since it looks computationally demanding:
double f(double *x, int n)
{
    ..
    double fv = 0.0;
    #pragma omp parallel for reduction(+:fv)
    for (int i = 0; i < n-1; i++)
        fv = fv + 100.0*pow((x[i+1]-x[i]*x[i]),2) + pow((x[i]-1.0),2);
    return fv;
}
Test and check the results. After that you can try to expand the scope of the parallelization to also include the outermost loop.
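A minimal sketch of the "test and check" step, assuming fv_serial holds the result of the original serial f on the same input (the names here are placeholders):

double t0 = omp_get_wtime();
double fv_parallel = f(x, n);
double t1 = omp_get_wtime();
printf("parallel f = %.12f (serial f = %.12f), time = %f s\n",
       fv_parallel, fv_serial, t1 - t0);

Because floating-point addition is not associative, the reduction may differ from the serial sum in the last few digits; that is expected.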

Possibly race condition issue when using openmp

The following function reads data from a file in chunks and processes each loaded chunk in turn. To speed this up, I thought I would use OpenMP in the for loop so that the work is divided between the threads, as follows:
void read_process(FILE *fp_read, double *centroids, int total) {
    int i, j, c, dim = 16, chunk_size = 10000, num_itr;
    double *buffer = calloc(total * dim, sizeof(double));

    num_itr = total / chunk_size;
    for (c = 0; c < num_itr; ++c) {
        fread(buffer, sizeof(double), chunk_size * dim, fp_read);
        #pragma omp parallel private(i, j)
        {
            #pragma omp for
            for (i = 0; i < chunk_size; i++) {
                for (j = 0; j < dim; j++) {
                    #pragma omp atomic update
                    centroids[j] += buffer[i * dim + j];
                }
            }
        }
    }
    free(buffer);
    fclose(fp_read);
}
Without OpenMP my code works fine. However, adding the #pragma causes the code to stop and print the word Hangup in the terminal, without further explanation of what it hung on. Some folks on Stack Overflow have answered other questions with this error message saying it is probably a race condition, but I don't think that is the case here because I am using atomic, which serializes the updates to centroids. Am I right? Do you see an issue with my code? How can I improve it?
Thank you very much.
What you want is an array reduction. If you have a compiler that supports OpenMP 4.5, you don't need to change your serial code. You can do:
#pragma omp parallel for private(j) reduction(+:centroids[:dim])
for (i = 0; i < chunk_size; i++) {
    for (j = 0; j < dim; j++) {
        centroids[j] += buffer[i*dim + j];
    }
}
Otherwise you can do the array reduction by hand. Here is one solution
#pragma omp parallel private(j)
{
    double tmp[dim];                        // per-thread partial sums
    for (j = 0; j < dim; j++) tmp[j] = 0;   // a VLA cannot be initialized with = {0}
    #pragma omp for
    for (i = 0; i < chunk_size; i++) {
        for (j = 0; j < dim; j++) {
            tmp[j] += buffer[i*dim + j];
        }
    }
    #pragma omp critical
    for (int k = 0; k < dim; k++) centroids[k] += tmp[k];
}
Your current solution causes massive false sharing, since every thread writes to the same cache lines. Both of the solutions above fix this by giving each thread its own private copy of centroids.
As long as dim << chunk_size, these are good solutions.
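For context, here is a sketch of how the OpenMP 4.5 reduction clause might slot into the original read_process (same assumptions as above: dim is small and the compiler supports array-section reductions; only one chunk is buffered at a time):

void read_process(FILE *fp_read, double *centroids, int total) {
    int i, j, c, dim = 16, chunk_size = 10000, num_itr;
    double *buffer = calloc(chunk_size * dim, sizeof(double));

    num_itr = total / chunk_size;
    for (c = 0; c < num_itr; ++c) {
        fread(buffer, sizeof(double), chunk_size * dim, fp_read);
        // Each thread accumulates into a private copy of centroids[0..dim-1],
        // which OpenMP combines after the loop.
        #pragma omp parallel for private(j) reduction(+:centroids[:dim])
        for (i = 0; i < chunk_size; i++) {
            for (j = 0; j < dim; j++) {
                centroids[j] += buffer[i * dim + j];
            }
        }
    }
    free(buffer);
    fclose(fp_read);
}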

OPENMP - Parallelize Schwarz algorithm with preconditions

I need to parallelize the Schwarz algorithm right below, but I do not know how to deal with the precondition and the fact that there are nested loops.
I have to use OpenMP or MPI.
void ssor_forward_sweep(int n, int i1, int i2, int j1, int j2, int k1, int k2, double* restrict Ax, double w)
{
    #define AX(i,j,k) (Ax[((k)*n+(j))*n+(i)])
    int i, j, k;
    double xx, xn, xe, xu;
    for (k = k1; k < k2; ++k) {
        for (j = j1; j < j2; ++j) {
            for (i = i1; i < i2; ++i) {
                xx = AX(i,j,k);
                xn = (i > 0) ? AX(i-1,j,k) : 0;
                xe = (j > 0) ? AX(i,j-1,k) : 0;
                xu = (k > 0) ? AX(i,j,k-1) : 0;
                AX(i,j,k) = (xx+xn+xe+xu)/6*w;
            }
        }
    }
    #undef AX
}
Taking into account that each iteration uses values computed by previous iterations, how can this function be parallelized to get the best time?
I have already tried parallelizing the loops two by two, and splitting the work into blocks (like a 3D Jacobi stencil), but without success...
Thank you very much!
Unfortunately, the loop-carried data dependency limits the amount of parallelism you can obtain in your nested loops.
You can use tasks with dependences, which is the easiest approach: the OpenMP runtime takes care of the scheduling and you focus only on your algorithm. Another upside is that there is no synchronization at the end of any loop, only between dependent parts of the code.
#pragma omp parallel
#pragma omp single
for (int k = 0; k < k2; k += BLOCK_SIZE) {
    for (int j = 0; j < j2; j += BLOCK_SIZE) {
        for (int i = 0; i < i2; i += BLOCK_SIZE) {
            // Dependences are matched by storage location, so use the first
            // element of each neighbouring block as the dependence token.
            #pragma omp task depend(in: AX(i-BLOCK_SIZE,j,k), AX(i,j-BLOCK_SIZE,k), AX(i,j,k-BLOCK_SIZE)) \
                             depend(out: AX(i,j,k))
            {
                // your code here (the serial sweep over this block)
            }
        }
    }
}
Tasks are sometimes a bit more expensive than parallel loops (depending on workload and synchronization granularity), so another alternative is the wavefront parallelization pattern, which transforms the iteration space so that the elements processed in the inner loop are independent of each other (and you can use a parallel for there).
Whichever approach you choose, I strongly suggest turning your algorithm into a blocked one: restructure your 3 nested loops to do the computation in two stages:
Iterate over fixed-size blocks/cubes (let's call the new induction variables ii, jj and kk).
For each block, call the original serial version of your loop.
The goal of blocking is to increase the granularity of the parallel part, so that the parallelization overhead is not as noticeable.
Here is some pseudocode for the blocking part:
#define min(a,b) ((a)<(b)?(a):(b))

// Inter-block iterations
for (int kk = 0; kk < k2; kk += BLOCK_SIZE) {
    for (int jj = 0; jj < j2; jj += BLOCK_SIZE) {
        for (int ii = 0; ii < i2; ii += BLOCK_SIZE) {
            // Intra-block iterations
            for (int k = kk; k < min(k2, kk+BLOCK_SIZE); k++) {
                for (int j = jj; j < min(j2, jj+BLOCK_SIZE); j++) {
                    for (int i = ii; i < min(i2, ii+BLOCK_SIZE); i++) {
                        // Your code goes here
                    }
                }
            }
        }
    }
}
In the case of the wavefront parallelization, the last step is turning the outer loops (the inter-block iterations) into a wavefront, so that in each step you only iterate over elements that do not depend on each other. In a 3D iteration space, the wavefront is basically a diagonal plane that advances from (0,0,0) to (i2,j2,k2).
I'm going to give an example of the 2D wavefront, because it is easier to understand.
#define min(a,b) ((a)<(b)?(a):(b))
#define max(a,b) ((a)>(b)?(a):(b))

#pragma omp parallel
for (int d = 1; d < i2 + j2; d++) {
    // All elements on the anti-diagonal i + j == d - 1 are independent
    int jlo = max(0, d - i2);   // first valid j on this diagonal
    int jhi = min(d, j2);       // one past the last valid j
    // Iterations in the inner loop are independent
    // Implicit thread barrier (synchronization) at the end of the loop
    #pragma omp for
    for (int j = jlo; j < jhi; j++) {
        int i = d - 1 - j;
        // your code here
    }
}
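For the 3D case, the same idea can be applied at the block level: all blocks whose block indices sum to the same value lie on one diagonal plane and are independent. Here is a hedged sketch combining this with the blocking pseudocode above (BLOCK_SIZE and the serial intra-block sweep are assumed from before):

int nbi = (i2 + BLOCK_SIZE - 1) / BLOCK_SIZE;   // number of blocks per dimension
int nbj = (j2 + BLOCK_SIZE - 1) / BLOCK_SIZE;
int nbk = (k2 + BLOCK_SIZE - 1) / BLOCK_SIZE;

#pragma omp parallel
for (int d = 0; d <= nbi + nbj + nbk - 3; d++) {
    // All blocks with bi + bj + bk == d are independent of each other;
    // the implicit barrier after the loop enforces the plane-by-plane order.
    #pragma omp for collapse(2)
    for (int bk = 0; bk < nbk; bk++) {
        for (int bj = 0; bj < nbj; bj++) {
            int bi = d - bk - bj;
            if (bi < 0 || bi >= nbi) continue;   // block not on this diagonal plane
            // run the serial intra-block loops over the block starting at
            // (bi*BLOCK_SIZE, bj*BLOCK_SIZE, bk*BLOCK_SIZE)
        }
    }
}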

nested loops, inner loop parallelization, reusing threads

Disclaimer: the following is just a dummy example to help understand the problem quickly. If you are thinking about a real-world problem, think of anything involving dynamic programming.
The problem:
We have an n*m matrix, and we want to copy elements from the previous row as in the following code:
for (i = 1; i < n; i++)
    for (j = 0; j < m; j++)
        x[i][j] = x[i-1][j];
Approach:
Outer loop iterations have to be executed in order, so they will be executed sequentially.
The inner loop can be parallelized. We want to minimize the overhead of creating and killing threads, so we would like to create the team of threads just once; however, this seems like an impossible task in OpenMP.
#pragma omp parallel private(j)
{
    for (i = 1; i < n; i++)
    {
        #pragma omp for schedule(dynamic)
        for (j = 0; j < m; j++)
            x[i][j] = x[i-1][j];
    }
}
When we apply the ordered option to the outer loop, the code is executed sequentially, so there is no performance gain.
I am looking for a solution to the scenario above, even if I have to use some workaround.
I am adding my actual code. It is actually slower than the sequential version. Please review:
/* load input */
for (i = 1; i <= n; i++)
    scanf ("%d %d", &in[i][W], &in[i][V]);

/* init */
for (i = 0; i <= wc; i++)
    a[0][i] = 0;

/* compute */
#pragma omp parallel private(i, w, j, jn)
{
    for (i = 1; i <= n; ++i)                // 1 000 000
    {
        j = i % 2;
        jn = j == 1 ? 0 : 1;
        #pragma omp for
        for (w = 0; w <= in[i][W]; w++)     // 1000
            a[j][w] = a[jn][w];
        #pragma omp for
        for (w = in[i][W]+1; w <= wc; w++)  // 350 000
            a[j][w] = max(a[jn][w], in[i][V] + a[jn][w-in[i][W]]);
    }
}
As for measuring, I am using something like this:
double t;
t = omp_get_wtime();
// ...
t = omp_get_wtime() - t;
To sum up the parallelization in OpenMP for this particular case: it is not worth it.
Why?
The operations in the inner loops are simple. The code was compiled with -O3, so the max() call was probably replaced by the function body (inlined).
The overhead of the implicit barriers is probably high enough to offset the performance gain, and the overall overhead is high enough to make the parallel code even slower than the sequential code.
I also found out that there is no real performance gain in such a construct:
#pragma omp parallel private(i, j)
{
    for (i = 1; i < n; i++)
    {
        #pragma omp for
        for (j = 0; j < m; j++)
            x[i][j] = x[i-1][j];
    }
}
because its performance is similar to that of this one:
for (i = 1; i < n; i++)
{
    #pragma omp parallel for private(j)
    for (j = 0; j < m; j++)
        x[i][j] = x[i-1][j];
}
thanks to the built-in thread reuse in GCC's libgomp, according to this article: http://bisqwit.iki.fi/story/howto/openmp/
Since the outer loop cannot be parallelized (without the ordered option), it looks like there is no way to significantly improve the performance of the program in question using OpenMP. If someone feels I did something wrong and that it is possible, I'll be glad to see and test the solution.
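One thing that might be worth testing is removing the implicit barrier between the two inner loops with nowait: both loops write to disjoint ranges of row j and only read row jn, so letting threads flow from the first loop into the second should be safe. This is only a sketch of the idea, not a measured result:

#pragma omp parallel private(i, w, j, jn)
{
    for (i = 1; i <= n; ++i)
    {
        j = i % 2;
        jn = j == 1 ? 0 : 1;
        // nowait: threads may start the second loop before the first one has
        // finished everywhere; the two loops touch disjoint parts of a[j]
        // and only read a[jn], so no barrier is needed between them.
        #pragma omp for nowait
        for (w = 0; w <= in[i][W]; w++)
            a[j][w] = a[jn][w];
        // The implicit barrier at the end of this loop is still required,
        // because row j becomes row jn in the next iteration of i.
        #pragma omp for
        for (w = in[i][W]+1; w <= wc; w++)
            a[j][w] = max(a[jn][w], in[i][V] + a[jn][w-in[i][W]]);
    }
}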

OpenMP with C program

I am having a hard time using OpenMP with C to parallelize this method. I was wondering if anyone could help and possibly tell me what is wrong with my parallelization of this method.
void blur(float **out, float **in) {
    // assumes "padding" to avoid messy border cases
    int i, j, r, c;
    float tmp, term;
    term = 1.0 / 157.0;
    #pragma omp parallel num_threads(8)
    #pragma omp for private(r,c)
    for (i = 0; i < N-4; i++) {
        for (j = 0; j < N-4; j++) {
            tmp = 0.0;
            for (r = 0; r < 5; r++) {
                for (c = 0; c < 5; c++) {
                    tmp += in[i+r][j+c] * mask[r][c];
                }
            }
            out[i+2][j+2] = term * tmp;
        }
    }
}
You should either declare tmp inside the loop:

// inside the j loop, replacing "tmp = 0.0;"
float tmp = 0.0;

or specify tmp as a private variable:

// on the worksharing directive
#pragma omp for private(r,c,tmp)

Otherwise tmp is treated as a variable shared among the threads, and they race on it.
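Putting that together, here is a sketch of the corrected function using the private-clause variant (note that j must be private as well, since it is also written by every thread):

void blur(float **out, float **in) {
    // assumes "padding" to avoid messy border cases
    int i, j, r, c;
    float tmp, term;
    term = 1.0 / 157.0;
    #pragma omp parallel num_threads(8)
    #pragma omp for private(j, r, c, tmp)
    for (i = 0; i < N-4; i++) {
        for (j = 0; j < N-4; j++) {
            tmp = 0.0;
            for (r = 0; r < 5; r++) {
                for (c = 0; c < 5; c++) {
                    tmp += in[i+r][j+c] * mask[r][c];
                }
            }
            out[i+2][j+2] = term * tmp;
        }
    }
}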
