Clarification on "region cannot be closely nested inside 'parallel for' region" - c

I am trying to understand how reduction works in OpenMP.
I have this simple code that involves reduction.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int N = 100;
int M = 200;
int O = 300;

double r2() {
    return ((double) rand() / (double) RAND_MAX);
}

int main(void) {
    double S = 0;
    double *K = (double*) calloc(M * N, sizeof(double));
    #pragma omp parallel for collapse(2)
    {
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < N; n++) {
                #pragma omp for reduction(+:S)
                for (int o = 0; o < O; o++) {
                    S += r2() - 0.25;
                }
                K[m * N + n] = S;
            }
        }
    }
}
I get this error message:
test.cc:30:1: error: region cannot be closely nested inside 'parallel for' region; perhaps you forget to enclose 'omp for' directive into a parallel region?
#pragma omp for reduction(+:S)
^
It compiles if I do
#pragma omp parallel for reduction(+:S)
Is this the right way to do a nested loop?
EDIT:
Making a change in the original question. I want the parallel and sequential code to have the same result.
#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += o;
        }
        K[m * N + n] = S;
    }
}

IMPORTANT TL;DR: rand is not thread-safe.
From the rand man page:
The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call.
For multithreaded code use (for instance) rand_r instead.
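As an illustration, a thread-safe replacement for the r2() helper above keeps its seed in a variable owned by the calling thread. This is only a minimal sketch: the names r2_threadsafe and example, and the per-thread seeding scheme, are choices made here, not part of the original code.

#include <stdlib.h>
#include <omp.h>

/* Thread-safe variant of r2(): the seed is passed in explicitly, so there is
   no hidden shared state between calls. */
double r2_threadsafe(unsigned int *seed) {
    return (double) rand_r(seed) / (double) RAND_MAX;
}

/* Typical use: one seed per thread, derived here from the thread number. */
double example(int O) {
    double S = 0;
    #pragma omp parallel
    {
        unsigned int myseed = (unsigned int) omp_get_thread_num() + 1u;
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++)
            S += r2_threadsafe(&myseed) - 0.25;
    }
    return S;
}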
I am trying to understand how reduction works in OpenMP.
For the sake of argument let us assume that r2() would yield always the same values.
When multiple threads concurrently modify a certain variable, as in the following code:
double S = 0;
#pragma omp parallel
for (int o = 0; o < O; o++) {
    S += r2() - 0.25;
}
one has a race condition on the updates of the variable S. To solve it, one can use the OpenMP reduction clause. The OpenMP standard states:
The reduction clause can be used to perform some forms of recurrence
calculations (...) in parallel. For parallel and work-sharing
constructs, a private copy of each list item is created, one for each
implicit task, as if the private clause had been used. (...) The
private copy is then initialized as specified above. At the end of the
region for which the reduction clause was specified, the original list
item is updated by combining its original value with the final value
of each of the private copies, using the combiner of the specified
reduction-identifier.
In that case the code would look like the following:
#pragma omp for reduction(+:S)
for (int o = 0; o < O; o++) {
    S += r2() - 0.25;
}
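To see the reduction semantics in isolation, here is a minimal self-contained sketch that compares the reduced sum with a sequential one. It uses deterministic increments instead of r2(), following the assumption above; the names S_seq and S_par are only for illustration.

#include <stdio.h>

int main(void) {
    const int O = 300;
    double S_seq = 0, S_par = 0;

    /* sequential reference */
    for (int o = 0; o < O; o++)
        S_seq += o - 0.25;

    /* each thread accumulates into a private copy of S_par, initialized to 0;
       the copies are combined into S_par at the end of the loop */
    #pragma omp parallel for reduction(+:S_par)
    for (int o = 0; o < O; o++)
        S_par += o - 0.25;

    printf("sequential = %f, parallel = %f\n", S_seq, S_par);
    return 0;
}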
However, in your full code
#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}
you first divide the iterations of the two outer loops among threads with #pragma omp parallel for collapse(2), and then you try to divide the iterations of the innermost loop again with a second worksharing construct, #pragma omp for. That is not allowed: a worksharing region may not be closely nested inside another worksharing region.
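For completeness: a worksharing construct has to be closely nested inside its own parallel region, so a nested #pragma omp parallel for is at least legal. The following is only a sketch of that structure, not a recommendation; nested parallelism is usually disabled by default, and the per-(m, n) local S_local and the deterministic increment are introduced here, so it does not compute the running total from the question.

#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        double S_local = 0;   /* private to this (m, n) pair */
        /* legal: a new (nested) parallel region, not a bare worksharing construct */
        #pragma omp parallel for reduction(+:S_local)
        for (int o = 0; o < O; o++) {
            S_local += o - 0.25;
        }
        K[m * N + n] = S_local;
    }
}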
Is this the right way to do a nested loop?
You could do the following parallelization:
#pragma omp parallel for collapse(2) firstprivate(S)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}
There is no race condition, because the variable S is private (each thread works on its own copy). Moreover, in this case, since the iterations of the two outermost loops are divided among threads, each thread has a unique pair of m and n iterations; consequently, each thread will access a unique position of the array K during its access K[m * N + n].
But the problem is that a version that parallelizes the two outer loops will not yield the same results as its sequential counterpart. This is the case because
for (int o = 0; o < O; o++) {
    S += r2() - 0.25;
}
K[m * N + n] = S;
adds an implicit dependency across the iterations of all three loops.
The value of S depends on the order in which the iterations m, n and o are executed. Therefore, if you divide the iterations of those loops among threads, the value of S for a given m and n will not be the same in the sequential and the parallel versions. This can, however, be solved by parallelizing only the innermost loop and reducing the variable S:
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp parallel for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}
All of this matters (of course) only if you care about the exact values of S; one might argue that, since you are using a function that yields random values, preserving the order of the updates of S is not paramount.
The versions with the thread-safe random generator
Version 1
#pragma omp parallel
{
    unsigned int myseed = omp_get_thread_num();
    /* firstprivate(S): each thread works on its own copy of S, as in the
       firstprivate version above, so there is no data race on S */
    #pragma omp for collapse(2) firstprivate(S)
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            for (int o = 0; o < O; o++) {
                double r = ((double) rand_r(&myseed) / (double) RAND_MAX);
                S += r - 0.25;
            }
            K[m * N + n] = S;
        }
    }
}
Version 2
double *K = (double*) calloc(M * N, sizeof(double));
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp parallel
        {
            unsigned int myseed = omp_get_thread_num();
            #pragma omp for reduction(+:S)
            for (int o = 0; o < O; o++) {
                double r = ((double) rand_r(&myseed) / (double) RAND_MAX);
                S += r - 0.25;
            }
        }
        K[m * N + n] = S;
    }
}
EDIT:
Making a change in the original question. I want the parallel and sequential code to have the same result.
Instead of:
#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += o;
        }
        K[m * N + n] = S;
    }
}
do:
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp parallel for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += o;
        }
        K[m * N + n] = S;
    }
}

Related

parallelize nested for loop

A is a 2D array, n is the matrix size (we are dealing with a square matrix), and threads is the number of threads the user inputs.
#pragma omp parallel for shared(A,n,k) private(i) schedule(static) num_threads(threads)
for(k = 0; k < n - 1; ++k) {
    // for the vectoriser
    for(i = k + 1; i < n; i++) {
        A[i][k] /= A[k][k];
    }
    for(i = k + 1; i < n; i++) {
        long long int j;
        const double Aik = A[i][k];
        for(j = k + 1; j < n; j++) {
            A[i][j] -= Aik * A[k][j];
        }
    }
}
I tried using collapse but it failed; the error it showed was: work-sharing region may not be closely nested inside of work-sharing, ‘loop’, ‘critical’, ‘ordered’, ‘master’, explicit ‘task’ or ‘task loop’ region.
After what I thought was a correct parallelization, the time increased as I executed the code with more threads.
I tried using collapse; this is the output:
26:17: error: collapsed loops not perfectly nested before ‘for’
26 | for(i = k + 1; i < n; i++) {
This is LU decomposition.
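For reference, a common way to parallelize this kind of elimination without collapse is to keep the k loop sequential (row k must be finished before it is used at step k+1) and share only the independent i iterations among threads. This is a sketch under that assumption, reusing A and n from the snippet above, not a tuned implementation:

for (int k = 0; k < n - 1; ++k) {
    /* the updates of different rows i are independent, so they can be shared */
    #pragma omp parallel for schedule(static)
    for (int i = k + 1; i < n; i++) {
        A[i][k] /= A[k][k];
    }
    /* the first parallel region has ended here, so every A[i][k] is final */
    #pragma omp parallel for schedule(static)
    for (int i = k + 1; i < n; i++) {
        const double Aik = A[i][k];
        for (int j = k + 1; j < n; j++) {
            A[i][j] -= Aik * A[k][j];
        }
    }
}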

Wrong result with OpenMP to parallelize GEMM

I know OpenMP shares all variables declared in an outer scope between all workers, and that may be the answer to my question. But I am really confused about why function omp3 delivers the right result while function omp2 delivers a wrong result.
void omp2(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; ++i) {
        #pragma omp parallel for
        for (int ki = 0; ki < k; ++ki) {
            for (int j = 0; j < n; ++j) {
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
            }
        }
    }
}

void omp3(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; ++i) {
        for (int ki = 0; ki < k; ++ki) {
            #pragma omp parallel for
            for (int j = 0; j < n; ++j) {
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
            }
        }
    }
}
The problem is that there is a race condition in this line:
C[i * n + j] += ...
Different threads can read and write the same memory location (C[i * n + j]) simultaneously, which causes a data race. In omp2 this data race can occur, but not in omp3.
The solution (as suggested by @Victor Eijkhout) is to reorder your loops and use a local variable to accumulate the sum of the innermost loop. In this case C[i * n + j] is updated only once, so you get rid of the data race, and the outermost loop can be parallelized (which gives the best performance):
#pragma omp parallel for
for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0;
        for (int ki = 0; ki < k; ++ki) {
            sum += A[i * k + ki] * B[ki * n + j];
        }
        C[i * n + j] += sum;
    }
}
Note that you can also add the collapse(2) clause, which may increase performance further.
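A sketch of that variant (the i and j loops are perfectly nested here, which is what collapse requires):

#pragma omp parallel for collapse(2)
for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0;
        for (int ki = 0; ki < k; ++ki) {
            sum += A[i * k + ki] * B[ki * n + j];
        }
        C[i * n + j] += sum;   /* each (i, j) cell is written by exactly one thread */
    }
}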

Best approach to parallelize BW and FW algorithms

I have implemented the BW and FW substitution algorithms to solve the L and U triangular systems.
The algorithm I implemented runs very fast serially, but I cannot figure out if this is the best way to parallelize it.
I think that I have taken into account every possible data race (on alpha); am I right?
void solveInverse (double **U, double **L, double **P, int rw, int cw) {
    double **inverseA = allocateMatrix(rw,cw);
    double* x = allocateArray(rw);
    double* y = allocateArray(rw);
    double alpha;
    //int i, j, t;
    // Iterate along the columns, so at each iteration we generate a column of the inverse matrix
    for (int j = 0; j < rw; j++) {
        // Lower triangular solve Ly=P
        y[0] = P[0][j];
        #pragma omp parallel for reduction(+:alpha)
        for (int i = 1; i < rw; i++) {
            alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }
        // Upper triangular solve Ux=P
        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        #pragma omp parallel for reduction(+:alpha)
        for (int i = rw-2; (i < rw) && (i >= 0); i--) {
            alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t]*x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }
        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }
    freeMemory(inverseA,rw);
    free(x);
    free(y);
}
After a private discussion with the user dreamcrash, we came to the solution proposed in his comments: creating a pair of vectors x and y for each thread, which will work independently on a single column.
After a discussion with the OP in the comments (which were removed afterwards), we both came to the following conclusions:
You do not need to reduce the alpha variable, because it is reset to zero at the beginning of every iteration and never read after the loops. Instead, make the alpha variable private.
#pragma omp parallel for
for (int i = 1; i < rw; i++) {
    double alpha = 0;
    for (int t = 0; t <= i-1; t++)
        alpha += L[i][t] * y[t];
    y[i] = P[i][j] - alpha;
}
and the same applies to the second parallel region as well.
#pragma omp parallel for
for (int i = rw-2; (i < rw) && (i >= 0); i--) {
    double alpha = 0;
    for (int t = i+1; t < rw; t++)
        alpha += U[i][t]*x[t];
    x[i] = (y[i] - alpha) / U[i][i];
}
Instead of having one parallel region per j iteration, you can extract the parallel region to encapsulate the entire outermost loop and use #pragma omp for instead of #pragma omp parallel for. Notwithstanding, although this approach reduces the number of parallel regions created from rw to only 1, the speedup achieved with this optimization should not be that significant, because an efficient OpenMP implementation will use a pool of threads, where the threads are created for the first parallel region and reused in subsequent parallel regions, saving the overhead of creating and destroying threads.
#pragma omp parallel
{
    for (int j = 0; j < rw; j++)
    {
        y[0] = P[0][j];
        #pragma omp for
        for (int i = 1; i < rw; i++) {
            double alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }
        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        #pragma omp for
        for (int i = rw-2; (i < rw) && (i >= 0); i--) {
            double alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t]*x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }
        #pragma omp for
        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }
}
I have shown you these code transformations so that you can see some potential tricks that you can use in future parallelizations. Unfortunately, as it is, that parallelization will not work.
Why?
Let us look at the first loop:
#pragma omp parallel for
for (int i = 1; i < rw; i++) {
    double alpha = 0;
    for (int t = 0; t <= i-1; t++)
        alpha += L[i][t] * y[t];
    y[i] = P[i][j] - alpha;
}
there is a dependency between y[t] being read in alpha += L[i][t] * y[t]; and y[i] being written in y[i] = P[i][j] - alpha;.
So what you can do instead is parallelize the outermost loop (i.e., assign each column to a thread) and create separate x and y arrays for each thread, so that there are no race conditions during the updates/reads of those arrays.
#pragma omp parallel
{
    double* x = allocateArray(rw);
    double* y = allocateArray(rw);
    #pragma omp for
    for (int j = 0; j < rw; j++)
    {
        y[0] = P[0][j];
        for (int i = 1; i < rw; i++) {
            double alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }
        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        for (int i = rw-2; i >= 0; i--) {
            double alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t]*x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }
        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }
    free(x);
    free(y);
}

Not responding while executing a basic OpenMP (C) program

I am currently new to OpenMP and trying to write a simple OpenMP-C matrix-vector multiplication program. On increasing the matrix size to 750x750 elements, my program stops responding and the window hangs. I would like to know if that is a limitation of my laptop or a data race I am facing.
I am trying to define a matrix A and a vector u and fill them with random elements (0-10). Then I calculate the result vector b.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
    int x_range = 50;
    int y_range = 50;
    int A[x_range][y_range];
    int u[y_range];
    int b[y_range];
    printf("Measuring time resolution %g\n", omp_get_wtick());
    printf("Parallel program start time %g\n", omp_get_wtime());
    #pragma omp parallel num_threads(x_range)
    {
        int b_temp[y_range];
        for (int j = 0; j < y_range; j++)
        {
            b_temp[j] = 0;
        }
        #pragma omp for
        for (int i = 0; i < x_range; i++)
        {
            for (int j = 0; j < y_range; j++)
            {
                A[i][j] = (rand() % 10) + 1;
            }
        }
        #pragma omp for
        for (int j = 0; j < y_range; j++)
        {
            {
                u[j] = (rand() % 10) + 1;
            }
        }
        #pragma omp for
        for (int i = 0; i < x_range; i++)
        {
            for(int j = 0; j < y_range; j++)
            {
                b_temp[i] = b_temp[i] + A[i][j]*u[j];
            }
        }
        #pragma omp critical
        for(int j = 0; j < y_range; j++)
        {
            b[j] = b[j] + b_temp[j];
        }
    }
    printf("parallel program end time %g\n", omp_get_wtime());
    return 0;
}
First off, the operations you're performing cannot have data races, because there are no RAW, WAR, or WAW dependencies. You can read more about them on Wikipedia.
Secondly, your system is hanging because you're creating 750 threads, as dictated by x_range.
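A sketch of how the computation could look instead, under a few assumptions: the thread count is left to the runtime, the random initialization stays serial because rand() is not thread-safe, and x_range == y_range as in the posted snippet so b can be indexed by rows. Each b[i] is written by exactly one thread, so neither b_temp nor the critical section is needed.

/* serial initialization: rand() keeps hidden shared state */
for (int i = 0; i < x_range; i++)
    for (int j = 0; j < y_range; j++)
        A[i][j] = (rand() % 10) + 1;
for (int j = 0; j < y_range; j++)
    u[j] = (rand() % 10) + 1;

/* parallel matrix-vector product: one row of b per iteration */
#pragma omp parallel for
for (int i = 0; i < x_range; i++) {
    int sum = 0;
    for (int j = 0; j < y_range; j++)
        sum += A[i][j] * u[j];
    b[i] = sum;
}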

Why is my program generating random results when I nest it?

I made this parallel matrix multiplication program using nesting of for loops in OpenMP. When I run the program, it displays the answer randomly (mostly), with varying indices of the resultant matrix. Here is the snippet of the code:
#pragma omp parallel for
for(i=0;i<N;i++){
    #pragma omp parallel for
    for(j=0;j<N;j++){
        C[i][j]=0;
        #pragma omp parallel for
        for(m=0;m<N;m++){
            C[i][j]=A[i][m]*B[m][j]+C[i][j];
        }
        printf("C:i=%d j=%d %f \n",i,j,C[i][j]);
    }
}
These are the symptoms of a so-called "race condition", as the commenters already stated.
The threads OpenMP uses are independent of each other, but the results of the individual loop iterations of the matrix multiplication are not: one thread might be at a different position than another, and suddenly you are in trouble if you depend on the order of the results.
You can only parallelize the outermost loop:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main(int argc, char **argv)
{
    int n;
    double **A, **B, **C, **D, t;
    int i, j, k;
    struct timeval start, stop;
    if (argc != 2) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    n = atoi(argv[1]);
    if (n <= 2 || n >= 1000000) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    // make it repeatable
    srand(0xdeadbeef);
    // allocate memory for and initialize A
    A = malloc(sizeof(*A) * n);
    for (i = 0; i < n; i++) {
        A[i] = malloc(sizeof(**A) * n);
        for (j = 0; j < n; j++) {
            A[i][j] = (double) ((rand() % 100) / 99.);
        }
    }
    // do the same for B
    B = malloc(sizeof(*B) * n);
    for (i = 0; i < n; i++) {
        B[i] = malloc(sizeof(**B) * n);
        for (j = 0; j < n; j++) {
            B[i][j] = (double) ((rand() % 100) / 99.);
        }
    }
    // and C but initialize with zero
    C = malloc(sizeof(*C) * n);
    for (i = 0; i < n; i++) {
        C[i] = malloc(sizeof(**C) * n);
        for (j = 0; j < n; j++) {
            C[i][j] = 0.0;
        }
    }
    // ditto with D
    D = malloc(sizeof(*D) * n);
    for (i = 0; i < n; i++) {
        D[i] = malloc(sizeof(**D) * n);
        for (j = 0; j < n; j++) {
            D[i][j] = 0.0;
        }
    }
    // some coarse timing
    gettimeofday(&start, NULL);
    // naive matrix multiplication
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for naive run = %.10g\n", t);
    gettimeofday(&start, NULL);
    #pragma omp parallel shared(A, B, C) private(i, j, k)
    #pragma omp for
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                D[i][j] = D[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for parallel run = %.10g\n", t);
    // check result
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (D[i][j] != C[i][j]) {
                printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
                       D[i][j] - C[i][j]);
            }
        }
    }
    // clean up
    for (i = 0; i < n; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        free(D[i]);
    }
    free(A);
    free(B);
    free(C);
    free(D);
    puts("All ok? Bye");
    exit(EXIT_SUCCESS);
}
(n>2000 might need some patience to get the result)
But that's not fully true. You could (but shouldn't) try to parallelize the innermost loop with something like:
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
    sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
It does not seem to be faster; it is even slower for small n.
With the original code and n = 2500 (only one run):
Timing for naive run = 124.466307
Timing for parallel run = 44.154538
About the same with the reduction:
Timing for naive run = 119.586365
Timing for parallel run = 43.288371
With a smaller n = 500
Timing for naive run = 0.444061
Timing for parallel run = 0.150842
It is already slower with reduction at that size:
Timing for naive run = 0.447894
Timing for parallel run = 0.245481
It might win for very large n but I lack the necessary patience.
Nevertheless, a last one with n = 4000 (OpenMP part only):
Normal:
Timing for parallel run = 174.647404
With reduction:
Timing for parallel run = 179.062463
That difference is still fully inside the error bars.
A better way to multiply large matrices (at ca. n > 100) would be the Strassen algorithm.
Oh: I just used square matrices for convenience, not because they must be of that form! But if you have rectangular matrices with a large aspect ratio, you might try changing the way the loops run; column-first or row-first traversal can make a significant difference here.
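As an illustration of how the loop order matters (a sketch only, not timed here; it reuses A, B, D, i, j, k and n from the listing above and assumes D starts zeroed): swapping the j and k loops makes the innermost loop walk both B and D along rows, which is much friendlier to the cache, while still letting the outermost loop be parallelized.

#pragma omp parallel for private(j, k)
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        const double Aik = A[i][k];          /* loaded once per (i, k) pair */
        for (j = 0; j < n; j++) {
            D[i][j] += Aik * B[k][j];        /* row-wise access to B and D */
        }
    }
}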
