OPENMP - Parallelize Schwarz algorithm with preconditions - c

I need to parallelize the Schwarz algorithm right below, but I do not know how to deal with the preconditioning and the fact that there are nested loops.
I have to use OpenMP or MPI.
void ssor_forward_sweep(int n, int i1, int i2, int j1, int j2, int k1, int k2, double* restrict Ax, double w)
{
#define AX(i,j,k) (Ax[((k)*n+(j))*n+(i)])
    int i, j, k;
    double xx, xn, xe, xu;
    for (k = k1; k < k2; ++k) {
        for (j = j1; j < j2; ++j) {
            for (i = i1; i < i2; ++i) {
                xx = AX(i,j,k);
                xn = (i > 0) ? AX(i-1,j,k) : 0;
                xe = (j > 0) ? AX(i,j-1,k) : 0;
                xu = (k > 0) ? AX(i,j,k-1) : 0;
                AX(i,j,k) = (xx+xn+xe+xu)/6*w;
            }
        }
    }
#undef AX
}
Taking into account that each iteration uses values computed in previous iterations, how can I parallelize this function to get the best time?
I already tried parallelizing the loops two by two, or splitting the work into blocks (like a 3D Jacobi stencil), but without success...
Thank you very much!

Unfortunately, the loop-carried data dependency limits the amount of parallelism you can obtain in your nested loops.
You can use tasks with dependences, which is the easiest approach: the OpenMP runtime takes care of the scheduling and you focus only on your algorithm. Another benefit is that there is no synchronization at the end of any loop, only between dependent parts of the code.
#pragma omp parallel
#pragma omp single
for (int k = 0; k < k2; k += BLOCK_SIZE) {
    for (int j = 0; j < j2; j += BLOCK_SIZE) {
        for (int i = 0; i < i2; i += BLOCK_SIZE) {
            #pragma omp task depend(in: AX(i-1,j,k), AX(i,j-1,k), AX(i,j,k-1)) \
                             depend(out: AX(i,j,k))
            {
                // your code here
            }
        }
    }
}
Tasks are sometimes a bit more expensive than parallel loops (depending on workload and synchronization granularity), so another alternative is the wavefront parallelization pattern, which basically transforms the iteration space so that the elements traversed by the inner loop are independent of each other (so you can use a parallel for there).
Whichever approach you choose, I strongly suggest turning your algorithm into a blocked one: split your 3 nested loops so that the computation is done in two stages:
Iterate among fixed-size blocks/cubes (let's call your new induction variables ii, jj and kk).
For each block, call the original serial version of your loop.
The goal of blocking is to increase the granularity of the parallel part, so that the parallelization overhead is not as noticeable.
Here is some pseudocode for the blocking part:
#define min(a,b) ((a)<(b)?(a):(b))
// Inter-block iterations
for (int kk = 0; kk < k2; kk += BLOCK_SIZE) {
    for (int jj = 0; jj < j2; jj += BLOCK_SIZE) {
        for (int ii = 0; ii < i2; ii += BLOCK_SIZE) {
            // Intra-block iterations
            for (int k = kk; k < min(k2, kk+BLOCK_SIZE); k++) {
                for (int j = jj; j < min(j2, jj+BLOCK_SIZE); j++) {
                    for (int i = ii; i < min(i2, ii+BLOCK_SIZE); i++) {
                        // Your code goes here
                    }
                }
            }
        }
    }
}
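Putting the two ideas together, here is a hedged sketch of a task-based blocked version of the original sweep (the function name, the BLOCK_SIZE value and the clamping with max are my own choices, and i1 = j1 = k1 = 0 is assumed as in the snippets above; each depend item names the first element of a block as a proxy for the whole block):
#define min(a,b) ((a)<(b)?(a):(b))
#define max(a,b) ((a)>(b)?(a):(b))
#define BLOCK_SIZE 64 // tune for your cache and problem size

void ssor_forward_sweep_tasks(int n, int i2, int j2, int k2, double* restrict Ax, double w)
{
#define AX(i,j,k) (Ax[((k)*n+(j))*n+(i)])
    #pragma omp parallel
    #pragma omp single
    for (int kk = 0; kk < k2; kk += BLOCK_SIZE) {
        for (int jj = 0; jj < j2; jj += BLOCK_SIZE) {
            for (int ii = 0; ii < i2; ii += BLOCK_SIZE) {
                // Predecessor indices are clamped so that boundary blocks
                // stay in range; a first block then depends on itself,
                // which is harmless.
                #pragma omp task firstprivate(ii, jj, kk) \
                    depend(in: AX(max(ii-BLOCK_SIZE,0), jj, kk), \
                               AX(ii, max(jj-BLOCK_SIZE,0), kk), \
                               AX(ii, jj, max(kk-BLOCK_SIZE,0))) \
                    depend(out: AX(ii, jj, kk))
                {
                    // Serial sweep over one block (intra-block iterations)
                    for (int k = kk; k < min(k2, kk+BLOCK_SIZE); k++)
                        for (int j = jj; j < min(j2, jj+BLOCK_SIZE); j++)
                            for (int i = ii; i < min(i2, ii+BLOCK_SIZE); i++) {
                                double xx = AX(i,j,k);
                                double xn = (i > 0) ? AX(i-1,j,k) : 0;
                                double xe = (j > 0) ? AX(i,j-1,k) : 0;
                                double xu = (k > 0) ? AX(i,j,k-1) : 0;
                                AX(i,j,k) = (xx+xn+xe+xu)/6*w;
                            }
                }
            }
        }
    }
#undef AX
}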
In the case of the wavefront parallelization, the last step is turning the outer loops (the inter-block iterations) into a wavefront, so that you iterate over elements that are not dependent on each other. In 3D iteration spaces, this is basically a diagonal plane that advances from (0,0,0) to (i2,j2,k2).
I'm going to show an example of the 2D wavefront, because it is easier to understand.
#define min(a,b) ((a)<(b)?(a):(b))
#define max(a,b) ((a)>(b)?(a):(b))
#pragma omp parallel
for (int d = 1; d < i2+j2; d++) {
    // Cells on diagonal d satisfy i + j == d - 1
    // Iterations in the inner loop are independent
    // Implicit thread barrier (synchronization) at the end of the loop
    #pragma omp for
    for (int j = max(0, d-i2); j < min(d, j2); j++) {
        int i = d - 1 - j;
        // your code here
    }
}
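The same idea extends to the 3D blocked iteration space: all blocks whose block indices sum to the same value d lie on one diagonal plane and are mutually independent. Here is a hedged sketch of that extension (nbi, nbj and nbk are the block counts per dimension, and sweep_block is a hypothetical helper that runs the serial intra-block loops shown earlier):
int nbi = (i2 + BLOCK_SIZE - 1) / BLOCK_SIZE; // blocks per dimension
int nbj = (j2 + BLOCK_SIZE - 1) / BLOCK_SIZE;
int nbk = (k2 + BLOCK_SIZE - 1) / BLOCK_SIZE;
#pragma omp parallel
for (int d = 0; d <= nbi + nbj + nbk - 3; d++) {
    // All blocks (bi,bj,bk) with bi + bj + bk == d are independent
    #pragma omp for collapse(2)
    for (int bk = 0; bk < nbk; bk++) {
        for (int bj = 0; bj < nbj; bj++) {
            int bi = d - bk - bj;
            if (bi < 0 || bi >= nbi)
                continue; // this (bk,bj) pair is off the current plane
            sweep_block(bi*BLOCK_SIZE, bj*BLOCK_SIZE, bk*BLOCK_SIZE);
        }
    }
}
The off-plane pairs that hit the continue cause some load imbalance at the beginning and end of the sweep, which is the usual price of wavefront scheduling.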

Related

Clang OpenMP. Find max value in matrix N x N

I need to find the max value in a matrix using OpenMP. It is my first experience with OpenMP; previously I did this task using pthreads.
I wrote this code but it does not work:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

void MatrixFIller(int nrows, int* m) {
    for (int i = 0; i < nrows; i++) {
        for (int j = 0; j < nrows; j++) {
            *(m + i * nrows + j) = rand() % 200;
        }
    }
}

#define dimension 9
#define number_of_threads 4

int main() {
    srand(time(NULL));
    int matrix[dimension][dimension];
    int local_max = -1;
    int final_max = -1;
    int j = 0;
    MatrixFIller(dimension, &matrix[0][0]);
    for (int i = 0; i < dimension; i++) {
        for (int j = 0; j < dimension; j++) {
            printf("%d\t", matrix[i][j]);
        }
        printf("\n");
    }
    omp_set_num_threads(number_of_threads);
    #pragma omp parallel private(local_max)
    {
        #pragma omp for
        for (j = 0; j < dimension * dimension; j++) {
            if (*(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)(j / dimension)))) > local_max) {
                local_max = *(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)((j) / dimension))));
            }
        }
        #pragma omp critical
        if (local_max > final_max) { final_max = local_max; }
    }
    printf("Max value of matrix with dimension %d is %d", dimension, final_max);
}
The idea is that in the pragma for, each thread finds its local max, and after that it is compared with the global max value in the pragma critical. Why is it not correct? Thanks!
When entering the parallel region, local_max is uninitialized: the private clause creates variables that are local to each thread, and that's it; they are not initialized to any value. If you want them to be initialized with the value local_max had before entering the parallel region, you have to use the firstprivate clause instead.
However, it would actually be better to declare (and initialize) local_max inside the parallel region.
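For instance, a minimal sketch of that variant (same matrix, dimension and final_max as in the question; local_max's block scope makes it private to each thread and properly initialized):
#pragma omp parallel
{
    int local_max = -1; // declared inside the region: private and initialized
    #pragma omp for
    for (int j = 0; j < dimension * dimension; j++) {
        int v = matrix[j / dimension][j % dimension];
        if (v > local_max)
            local_max = v;
    }
    #pragma omp critical
    if (local_max > final_max)
        final_max = local_max;
}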
Also, you may have a look at the reduction clause (with the max option), which will make the code even simpler:
#pragma omp parallel for reduction(max:final_max)
for (j = 0; j < dimension * dimension; j++) {
    if (*(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)(j / dimension)))) > final_max) {
        final_max = *(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)((j) / dimension))));
    }
}
EDIT
Following Laci's comment about the incorrectness of the arithmetic: all of your index calculations look correct, but they are not easy to read. Since you have a 2D array from the beginning, it is simpler to use two loops, and possibly tell OpenMP to parallelize them both using the collapse clause. (By the way, as far as possible, declare the loop indices within the for(): this avoids always wondering which ones should be declared private or not.)
#pragma omp parallel for reduction(max:final_max) collapse(2)
for (int i = 0; i < dimension; i++) {
    for (int j = 0; j < dimension; j++) {
        if (matrix[i][j] > final_max) {
            final_max = matrix[i][j];
        }
    }
}

Parallel sections code with nested loops in openmp

I made this parallel code to share the iterations like first and last, first+1 and last-1, ... But I don't know how to improve the code within each of the two parallel sections, because I have an inner loop in the sections and I can't think of any way to simplify it. Thanks.
This isn't about which values are stored in x or y. I use this sections design because the requirement is to execute the iterations from 0 to N like: 0 N, 1 N-1, 2 N-2, but I would like to know if I can optimize the inner loops while maintaining this model.
int x = 0, y = 0, k, i, j, h;
#pragma omp parallel private(i, h) reduction(+:x, y)
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (i = 0; i < N/2; i++)
            {
                C[i] = 0;
                for (j = 0; j < N; j++)
                {
                    C[i] += MAT[i][j] * B[j];
                }
                x += C[i];
            }
        }
        #pragma omp section
        {
            for (h = N-1; h >= N/2; h--)
            {
                C[h] = 0;
                for (k = 0; k < N; k++)
                {
                    C[h] += MAT[h][k] * B[k];
                }
                y += C[h];
            }
        }
    }
}
x = x + y;
Using sections seems like the wrong approach. A pragma omp for seems more appropriate. Also note that you forgot to declare j private.
int x = 0, y = 0, i, j;
#pragma omp parallel private(i, j) reduction(+:x, y)
{
    #pragma omp for nowait
    for (i = 0; i < N/2; i++) {
        // local variable to make life easier on the compiler
        int ci = 0;
        for (j = 0; j < N; j++)
            ci += MAT[i][j] * B[j];
        x += ci;
        C[i] = ci;
    }
    #pragma omp for nowait
    for (i = N/2; i < N; i++) {
        int ci = 0;
        for (j = 0; j < N; j++)
            ci += MAT[i][j] * B[j];
        y += ci;
        C[i] = ci;
    }
}
x = x + y;
Also, I'm not sure, but if you just want x as your final output, you can simplify the code even further:
int x = 0, i, j;
#pragma omp parallel for reduction(+:x) private(i, j)
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        x += MAT[i][j] * B[j];
The section construct is meant to distribute different tasks to different threads, and each section block marks a different task, so you will not be able to execute the iterations in the order you want. I answered you here:
Distribution of loop iterations between threads with a specific order
But I want to clarify that the requirement for using sections is that each block must be independent of the other blocks.
A section gets only one thread, so you can't make the loops parallel. How about:
Make a parallel loop to N at the top level,
then inside each iteration use a conditional to decide whether to accumulate into x or y? (See the sketch below.)
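A minimal sketch of that idea (assuming the same MAT, B, C and N as in the question):
int x = 0, y = 0;
#pragma omp parallel for reduction(+:x, y)
for (int i = 0; i < N; i++) {
    int ci = 0;
    for (int j = 0; j < N; j++)
        ci += MAT[i][j] * B[j];
    C[i] = ci;
    if (i < N/2)
        x += ci; // first half accumulates into x
    else
        y += ci; // second half accumulates into y
}
x = x + y;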
Although @Homer512's solution looks correct to me too.

How to reduce exection time of FFT in a loop using OpenMP?

I am unable to reduce the execution time of multiple FFTs using OpenMP.
I tried parallelizing the outermost loop, but this degraded the performance.
typedef struct { float r; float i; } cmplx_f32_t;

double src[2*128];
double dst[2*128];
double w[128];
cmplx_f32_t data[128][4][256];

cffti(128, w);
for (k = 0; k < 128; k++)
{
    for (j = 0; j < 4; j++)
    {
        for (i = 0; i < 2*32; i++)
        {
            src[i] = data[i/2][j][k].r;
            src[i+1] = data[i/2][j][k].i;
        }
        cfft2(128, src, dst, w, 1);
    }
}
cffti and cfft2 are as given in the example at https://people.sc.fsu.edu/~jburkardt/c_src/fft_openmp/fft_openmp.html
If I disable the #pragma omp directives in the fft_openmp.c file, the run time is about 11 ms. If we use #pragma omp, the total execution time is about 220 ms.
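One thing worth noting when parallelizing the outermost loop of this code: src and dst are shared scratch buffers, so concurrent iterations race on them unless every thread gets its own copies. A hedged sketch of that fix (with cfft2 kept serial, i.e. the #pragma omp directives inside fft_openmp.c disabled; the interleaving loop is written with a stride of 2, which appears to be the intent of the original):
#pragma omp parallel
{
    double src[2*128]; // per-thread scratch buffers instead of shared globals
    double dst[2*128];
    #pragma omp for collapse(2)
    for (int k = 0; k < 128; k++)
    {
        for (int j = 0; j < 4; j++)
        {
            for (int i = 0; i < 2*32; i += 2)
            {
                src[i]   = data[i/2][j][k].r;
                src[i+1] = data[i/2][j][k].i;
            }
            cfft2(128, src, dst, w, 1); // serial FFT on this thread's buffers
        }
    }
}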

Computing entries of a matrix in OpenMP

I am very new to OpenMP, but am trying to write a simple program that generates the entries of a matrix in parallel; namely, for the N by M matrix A, let A(i,j) = i*j. A minimal example is included below:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int i, j, N, M;
    N = 20;
    M = 20;

    int* A;
    A = (int*) calloc(N*M, sizeof(int));

    // compute entries of A in parallel
    #pragma omp parallel for shared(A)
    for (i = 0; i < N; ++i){
        for (j = 0; j < M; ++j){
            A[i*M + j] = i*j;
        }
    }

    // print parallel results
    for (i = 0; i < N; ++i){
        for (j = 0; j < M; ++j){
            printf("%d ", A[i*M + j]);
        }
        printf("\n");
    }

    free(A);
    return 0;
}
The results are not always correct. In theory, I am only parallelizing the outer loop, and each iteration of the for loop does not modify the entries that the other iterations will modify. But I am not sure how to translate this to OpenMP. When doing a similar procedure for a vector array (i.e. just one for loop), there seems to be no issue, e.g.:
#pragma omp parallel for
for (i = 0; i < N; ++i)
{
    v[i] = i*i;
}
Can someone explain to me how to fix this?
The issue in this case is that j is shared between threads, which messes with the control flow of the inner loop. By default, variables declared outside of a parallel region are shared, whereas variables declared inside of a parallel region are private.
Follow the general rule of declaring variables as locally as possible. For these loops this means:
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    for (int j = 0; j < M; ++j) {
        A[i*M + j] = i*j;
    }
}
This makes reasoning about your code much easier, and OpenMP code mostly correct by default. (Note that A is shared by default because it is defined outside of the parallel region.)
Alternatively, you can manually specify private(i,j) shared(A); this is more explicit and can help beginners. However, it creates redundancy and can also be dangerous: private variables are uninitialized even if they had a valid value outside of the parallel region. Therefore I strongly recommend the implicit default approach unless necessary for advanced usage.
According to e.g. this tutorial:
http://supercomputingblog.com/openmp/tutorial-parallel-for-loops-with-openmp/
the declaration of variables outside of a parallelized part is dangerous.
This can be avoided by explicitly making the loop variable of the inner loop private.
For that, change this
#pragma omp parallel for shared(A)
to
#pragma omp parallel for private(j) shared(A)

nested loops, inner loop parallelization, reusing threads

Disclaimer: the following example is just a dummy example to quickly understand the problem. If you are thinking about a real-world problem, think of any dynamic programming problem.
The problem:
We have an n*m matrix, and we want to copy elements from previous row as in the following code:
for (i = 1; i < n; i++)
    for (j = 0; j < m; j++)
        x[i][j] = x[i-1][j];
Approach:
Outer loop iterations have to be executed in order, so they will run sequentially.
The inner loop can be parallelized. We want to minimize the overhead of creating and killing threads, so we would like to create the team of threads just once; however, this seems like an impossible task in OpenMP.
#pragma omp parallel private(j)
{
    for (i = 1; i < n; i++)
    {
        #pragma omp for schedule(dynamic)
        for (j = 0; j < m; j++)
            x[i][j] = x[i-1][j];
    }
}
When we apply the ordered option on the outer loop, the code will be executed in a sequential way, so there will be no performance gain.
I am looking for a solution to the scenario above, even if I have to use some workaround.
I am adding my actual code. It is actually slower than the sequential version. Please review:
/* load input */
for (i = 1; i <= n; i++)
    scanf ("%d %d", &in[i][W], &in[i][V]);

/* init */
for (i = 0; i <= wc; i++)
    a[0][i] = 0;

/* compute */
#pragma omp parallel private(i, w)
{
    for (i = 1; i <= n; ++i) // 1 000 000
    {
        j = i % 2;
        jn = j == 1 ? 0 : 1;

        #pragma omp for
        for (w = 0; w <= in[i][W]; w++) // 1000
            a[j][w] = a[jn][w];

        #pragma omp for
        for (w = in[i][W]+1; w <= wc; w++) // 350 000
            a[j][w] = max(a[jn][w], in[i][V] + a[jn][w-in[i][W]]);
    }
}
As for measuring, I am using something like this:
double t;
t = omp_get_wtime();
// ...
t = omp_get_wtime() - t;
To sum up the parallelization in OpenMP for this particular case: it is not worth it.
Why?
The operations in the inner loops are simple. The code was compiled with -O3, so the max() call was probably inlined.
The overhead of the implicit barrier is probably high enough to cancel out the performance gain, and the overall overhead is high enough to make the parallel code even slower than the sequential code was.
I also found out that there is no real performance gain in a construct like this:
#pragma omp parallel private(i, j)
{
    for (i = 1; i < n; i++)
    {
        #pragma omp for
        for (j = 0; j < m; j++)
            x[i][j] = x[i-1][j];
    }
}
because its performance is similar to that of this one:
for (i = 1; i < n; i++)
{
    #pragma omp parallel for private(j)
    for (j = 0; j < m; j++)
        x[i][j] = x[i-1][j];
}
thanks to the built-in thread reuse in GCC's libgomp, according to this article: http://bisqwit.iki.fi/story/howto/openmp/
Since the outer loop cannot be parallelized (without the ordered option), it looks like there is no way to significantly improve the performance of the program in question using OpenMP. If someone feels I did something wrong and that it is possible, I'll be glad to see and test the solution.
