What is the right way to use the task directive in OpenMP (C)?

I am trying to multiply two matrices using OpenMP tasks. This is the basic code:
long i, j, k;
for (i = 0; i < N; i ++)
for (j = 0; j < N; j ++)
for (k = 0; k < N; k ++)
c[i * N + j] += a[i * N + k] * b[k * N + j];
So I want to use tasks at the column level, and I modified the code like this:
long i, j, k;
#pragma omp parallel
{
#pragma omp single
{
for (i = 0; i < N; i ++)
#pragma omp task private(i, j, k)
{
for (j = 0; j < N; j ++)
for (k = 0; k < N; k ++)
c[i * N + j] += a[i * N + k] * b[k * N + j];
}
}
}
When I run the program I get a message like this:
Segmentation fault (core dumped)
Now, I know I'm missing something, but I can't figure out what. Any ideas?

private variables in OpenMP are not initialised and have indeterminate initial values. When the task executes, i would have an indeterminate value, which most likely leads to an out-of-bounds access of c[] and a[].
firstprivate variables are similar to private ones, but their initial value is set to the value that the referenced variable had at the moment the construct is entered. In your case, i has to be firstprivate and not private.
It is also advisable to make i private in the parallel region for a small performance increase. The final code should then look like this (with all data-sharing attributes written explicitly and private variables declared in the scope where they are used):
#pragma omp parallel shared(a, b, c)
{
#pragma omp single
{
long i;
for (i = 0; i < N; i++)
#pragma omp task shared(a, b, c) firstprivate(i)
{
int j, k;
for (j = 0; j < N; j++)
for (k = 0; k < N; k++)
c[i * N + j] += a[i * N + k] * b[k * N + j];
}
}
}
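As a side note (not part of the original answer), if your compiler supports OpenMP 4.5 or later, the same pattern (one task per i iteration) can be written more compactly with the taskloop construct; a minimal sketch:
#pragma omp parallel shared(a, b, c)
{
    #pragma omp single
    {
        /* taskloop turns chunks of i iterations into tasks; the loop
           variable i is implicitly private to each task */
        #pragma omp taskloop
        for (long i = 0; i < N; i++)
            for (long j = 0; j < N; j++)
                for (long k = 0; k < N; k++)
                    c[i * N + j] += a[i * N + k] * b[k * N + j];
    }
}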

Related

Parallelize nested for loop

A is a 2D array, n is the matrix size (the matrix is square), and threads is the number of threads the user inputs.
#pragma omp parallel for shared(A,n,k) private(i) schedule(static) num_threads(threads)
for(k = 0; k < n - 1; ++k) {
// for the vectoriser
for(i = k + 1; i < n; i++) {
A[i][k] /= A[k][k];
}
for(i = k + 1; i < n; i++) {
long long int j;
const double Aik = A[i][k];
for(j = k + 1; j < n; j++) {
A[i][j] -= Aik * A[k][j];
}
}
}
I tried using collapse but it failed; the error it showed was: work-sharing region may not be closely nested inside of work-sharing, ‘loop’, ‘critical’, ‘ordered’, ‘master’, explicit ‘task’ or ‘task loop’ region.
After what I thought was a correct change, the time increased as I executed the code with more threads.
I tried using collapse again; this is the output:
26:17: error: collapsed loops not perfectly nested before ‘for’
26 | for(i = k + 1; i < n; i++) {
This is LU decomposition.
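One common workaround (a sketch, not part of the question; it assumes A, n, i, k, and threads are declared as in the snippet above): collapse requires perfectly nested loops, and the outer k loop carries a dependency from one iteration to the next, so keep k sequential and parallelize only the independent i iterations:
for (k = 0; k < n - 1; ++k) {
    /* scale the k-th column below the pivot; the i iterations are independent */
    #pragma omp parallel for schedule(static) num_threads(threads)
    for (i = k + 1; i < n; i++) {
        A[i][k] /= A[k][k];
    }
    /* update the trailing submatrix; the i iterations are independent, and the
       implicit barrier at the end of the region above guarantees A[i][k] is final */
    #pragma omp parallel for schedule(static) num_threads(threads)
    for (i = k + 1; i < n; i++) {
        const double Aik = A[i][k];
        for (long long int j = k + 1; j < n; j++) {
            A[i][j] -= Aik * A[k][j];
        }
    }
}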

Wrong result with OpenMP to parallelize GEMM

I know OpenMP shares all variables declared in an outer scope between all workers, and that may be the answer to my question. But I am really confused about why function omp3 delivers the right result while function omp2 delivers a wrong one.
void omp2(double *A, double *B, double *C, int m, int k, int n) {
for (int i = 0; i < m; ++i) {
#pragma omp parallel for
for (int ki = 0; ki < k; ++ki) {
for (int j = 0; j < n; ++j) {
C[i * n + j] += A[i * k + ki] * B[ki * n + j];
}
}
}
}
void omp3(double *A, double *B, double *C, int m, int k, int n) {
for (int i = 0; i < m; ++i) {
for (int ki = 0; ki < k; ++ki) {
#pragma omp parallel for
for (int j = 0; j < n; ++j) {
C[i * n + j] += A[i * k + ki] * B[ki * n + j];
}
}
}
}
The problem is that there is a race condition in this line:
C[i * n + j] += ...
Different threads can read and write the same memory location (C[i * n + j]) simultaneously, which causes a data race. In omp2 this data race can occur, but not in omp3.
The solution (as suggested by @Victor Eijkhout) is to reorder your loops and use a local variable to accumulate the sum of the innermost loop. In this case C[i * n + j] is updated only once, so you get rid of the data race, and the outermost loop can be parallelized (which gives the best performance):
#pragma omp parallel for
for (int i = 0; i < m; ++i) {
for (int j = 0; j < n; ++j) {
double sum=0;
for (int ki = 0; ki < k; ++ki) {
sum += A[i * k + ki] * B[ki * n + j];
}
C[i * n + j] += sum;
}
}
Note that you can use the collapse(2) clause, which may increase performance.
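For illustration, that version could look like the following (a sketch; whether it actually pays off depends on m, n, and the number of threads):
/* collapse(2) distributes the combined i x j iteration space among threads;
   sum and ki are private because they are declared inside the loop body */
#pragma omp parallel for collapse(2)
for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0;
        for (int ki = 0; ki < k; ++ki) {
            sum += A[i * k + ki] * B[ki * n + j];
        }
        C[i * n + j] += sum;
    }
}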

Clarification on "region cannot be closely nested inside 'parallel for' region"

I am trying to understand how reduction works in OpenMP.
I have this simple code that involves reduction.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int N = 100;
int M = 200;
int O = 300;
double r2() {
return ((double) rand() / (double) RAND_MAX);
}
int main(void) {
double S = 0;
double *K = (double*) calloc(M * N, sizeof(double));
#pragma omp parallel for collapse(2)
{
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
#pragma omp for reduction(+:S)
for (int o = 0; o < O; o++) {
S += r2() - 0.25;
}
K[m * N + n] = S;
}
}
}
}
I get this error message
test.cc:30:1: error: region cannot be closely nested inside 'parallel for' region; perhaps you forget to enclose 'omp for' directive into a parallel region?
#pragma omp for reduction(+:S)
^
It compiles if I do
#pragma omp parallel for reduction(+:S)
Is this the right way to do a nested loop?
EDIT:
I am making a change to the original question: I want the parallel and sequential code to have the same result.
#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
#pragma omp for reduction(+:S)
for (int o = 0; o < O; o++) {
S += o;
}
K[m * N + n] = S;
}
}
IMPORTANT TL;DR: rand is not thread-safe.
From the rand man page:
The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call.
For multithreaded code use (for instance) rand_r instead.
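For example, a thread-safe counterpart of the r2() helper could take a per-thread seed (a sketch; the name r2_r and the seeding choice are illustrative, not from the question):
/* each thread keeps its own seed, e.g. initialised from omp_get_thread_num() */
double r2_r(unsigned int *seed) {
    return (double) rand_r(seed) / (double) RAND_MAX;
}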
I am trying to understand how reduction works in OpenMP.
For the sake of argument, let us assume that r2() always yields the same value.
When one's code has multiple threads concurrently modifying a certain variable, and the code looks like the following:
double S = 0;
#pragma omp parallel
for (int o = 0; o < O; o++) {
S += r2() - 0.25;
}
one has a race condition on the updates of the variable S. To solve it, one can use the OpenMP reduction clause; from the OpenMP standard one can read:
The reduction clause can be used to perform some forms of recurrence
calculations (...) in parallel. For parallel and work-sharing
constructs, a private copy of each list item is created, one for each
implicit task, as if the private clause had been used. (...) The
private copy is then initialized as specified above. At the end of the
region for which the reduction clause was specified, the original list
item is updated by combining its original value with the final value
of each of the private copies, using the combiner of the specified
reduction-identifier.
In that case the code would look like the following:
#pragma omp for reduction(+:S)
for (int o = 0; o < O; o++) {
S += r2() - 0.25;
}
However, in your full code
#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
#pragma omp for reduction(+:S)
for (int o = 0; o < O; o++) {
S += r2() - 0.25;
}
K[m * N + n] = S;
}
}
you first divide the iterations of the two outer loops using #pragma omp parallel for collapse(2), and then you try to divide the iterations of the innermost loop again with a different work-sharing directive, #pragma omp for, and this is not allowed (a work-sharing region may not be closely nested inside another work-sharing region).
Is this the right way to do a nested loop?
You could do the following parallelization:
#pragma omp parallel for collapse(2) firstprivate (S)
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
for (int o = 0; o < O; o++) {
S += r2() - 0.25;
}
K[m * N + n] = S;
}
}
There is no race condition because the variable S is private. Moreover, in this case, since the iterations of the two outermost loops are divided among threads, each thread has a unique pair of m and n iterations; consequently, each thread writes to a unique position of the array K when it accesses K[m * N + n].
But the problem is that a version that parallelizes the two outer loops will not yield the same results as its sequential counterpart. This is the case because
for (int o = 0; o < O; o++) {
S += r2() - 0.25;
}
K[m * N + n] = S;
adds an implicit dependency across all the iterations of the three loops.
The value of S depends on the order in which the iterations m, n and o are executed. Therefore, if you divide the iterations of those loops among threads, the value of S for a given m and n will not be the same as when the code is executed sequentially. This can be solved by parallelizing only the innermost loop and reducing the variable S:
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
#pragma omp parallel for reduction(+:S)
for (int o = 0; o < O; o++) {
S += r2() - 0.25;
}
K[m * N + n] = S;
}
}
All of this is (of course) only important if you care about the values of S; one might argue that since you are using a function that yields random values, keeping the order of the values of S is not paramount.
The versions with a thread-safe random generator:
Version 1
#pragma omp parallel firstprivate(S)
{
unsigned int myseed = omp_get_thread_num();
#pragma omp for collapse(2)
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
for (int o = 0; o < O; o++) {
double r = ((double) rand_r(&myseed) / (double) RAND_MAX);
S += r - 0.25;
}
K[m * N + n] = S;
}
}
}
Version 2
double *K = (double*) calloc(M * N, sizeof(double));
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
#pragma omp parallel
{
unsigned int myseed = omp_get_thread_num();
#pragma omp for reduction(+:S)
for (int o = 0; o < O; o++) {
double r = ((double) rand_r(&myseed) / (double) RAND_MAX);
S += r - 0.25;
}
}
K[m * N + n] = S;
}
}
EDIT:
Making a change in the original question. I want the parallel and
sequential code to have the same result.
Instead of:
#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
#pragma omp for reduction(+:S)
for (int o = 0; o < O; o++) {
S += o;
}
K[m * N + n] = S;
}
}
do:
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
#pragma omp parallel for reduction(+:S)
for (int o = 0; o < O; o++) {
S += o;
}
K[m * N + n] = S;
}
}

OpenMP nested for loop without collapse directive

I'm trying to implement matrix multiplication with OpenMP.
I found that the collapse directive is usually used for nested for loops.
So I also used collapse as below (this code belongs to the feedforward pass of a neural network):
for (i = 0; i < net->num_layer-1; i++) {
#pragma omp parallel for num_threads(thread) private(j) collapse(2)
for (j = 0; j < net->mini_batch_size; j++) {
for (k = 0; k < net->layer_size[i+1]; k++) {
#pragma omp simd reduction(+:sum)
for (l = 0; l < net->layer_size[i]; l++) {
sum = sum + NEURON(net, i, j, l) * WEIGHT(net, i, l, k);
}
ZS(net, i+1, j, k) = sum + BIAS(net, i+1, k);
NEURON(net, i+1, j, k) = sigmoid(ZS(net, i+1, j, k));
sum = 0.0;
}
}
}
But what I want to know is: is there any way to parallelize this without collapse? I also tried the code below, but the performance is below my expectations.
omp_set_nested(1);
for (i = 0; i < net->num_layer-1; i++) {
#pragma omp parallel for num_threads(net->mini_batch_size) private(j)
for (j = 0; j < net->mini_batch_size; j++) {
#pragma omp parallel for num_threads(net->layer_size[i+1]) private(j, k)
for (k = 0; k < net->layer_size[i+1]; k++) {
#pragma omp simd reduction(+:sum)
for (l = 0; l < net->layer_size[i]; l++) {
sum = sum + NEURON(net, i, j, l) * WEIGHT(net, i, l, k);
}
ZS(net, i+1, j, k) = sum + BIAS(net, i+1, k);
NEURON(net, i+1, j, k) = sigmoid(ZS(net, i+1, j, k));
sum = 0.0;
}
}
}
The thread variable is a constant 100; it will be tuned later, but I have fixed it for now. As for performance: with collapse it takes 0.75 s, and without collapse 25 s.
So please let me know whether there is any way to implement the nested loop without collapse that performs better than collapse.
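One possible sketch of a version without collapse and without nested parallel regions (not from the question; it assumes the NEURON, WEIGHT, ZS, BIAS macros and sigmoid behave as in the snippets above, and declares the inner indices and sum inside the loops so they are private to each iteration):
for (i = 0; i < net->num_layer - 1; i++) {
    /* parallelize only the mini-batch loop; j is implicitly private
       because it is the loop variable of the parallel for */
    #pragma omp parallel for num_threads(thread)
    for (j = 0; j < net->mini_batch_size; j++) {
        for (int k = 0; k < net->layer_size[i+1]; k++) {
            double sum = 0.0;
            #pragma omp simd reduction(+:sum)
            for (int l = 0; l < net->layer_size[i]; l++) {
                sum += NEURON(net, i, j, l) * WEIGHT(net, i, l, k);
            }
            ZS(net, i+1, j, k) = sum + BIAS(net, i+1, k);
            NEURON(net, i+1, j, k) = sigmoid(ZS(net, i+1, j, k));
        }
    }
}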

OpenMP parallel code slower

I have two loops which I am parallelizing
#pragma omp parallel for
for (i = 0; i < ni; i++)
for (j = 0; j < nj; j++) {
C[i][j] = 0;
for (k = 0; k < nk; ++k)
C[i][j] += A[i][k] * B[k][j];
}
#pragma omp parallel for
for (i = 0; i < ni; i++)
for (j = 0; j < nl; j++) {
E[i][j] = 0;
for (k = 0; k < nj; ++k)
E[i][j] += C[i][k] * D[k][j];
}
Strangely, the sequential execution is much faster than the parallel version above, even when using a large number of threads. Am I doing something wrong? Note that all arrays are global. Does this make a difference?
The iterations of your parallel outer loops share the index variables (j and k) of their inner loops. This certainly makes your code somewhat slower than you probably expected it to be, i.e., your loops are not "embarrassingly" (or "delightfully") parallel, and parallel loop iterations need to somehow access these variables in shared memory.
What is worse is that, because of this, your code contains race conditions. As a result, it will behave nondeterministically. In other words: your implementation of parallel matrix multiplication is now incorrect! (Go ahead and check the results of your computations. ;))
What you want to do is make sure that all iterations of your outer loops have their own private copies of the index variables j and k. You can achieve this either by declaring these variables within the scope of the parallel loops:
int i;
#pragma omp parallel for
for (i = 0; i < ni; i++) {
int j1, k1; /* explicit local copies */
for (j1 = 0; j1 < nj; j1++) {
C[i][j1] = 0;
for (k1 = 0; k1 < nk; ++k1)
C[i][j1] += A[i][k1] * B[k1][j1];
}
}
#pragma omp parallel for
for (i = 0; i < ni; i++) {
int j2, k2; /* explicit local copies */
for (j2 = 0; j2 < nl; j2++) {
E[i][j2] = 0;
for (k2 = 0; k2 < nj; ++k2)
E[i][j2] += C[i][k2] * D[k2][j2];
}
}
or by declaring them as private in your loop pragmas:
int i, j, k;
#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
for (j = 0; j < nj; j++) {
C[i][j] = 0;
for (k = 0; k < nk; ++k)
C[i][j] += A[i][k] * B[k][j];
}
#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
for (j = 0; j < nl; j++) {
E[i][j] = 0;
for (k = 0; k < nj; ++k)
E[i][j] += C[i][k] * D[k][j];
}
Will these changes make your parallel implementation faster than your sequential implementation? Hard to say; it depends on your problem size. Parallelisation (in particular parallelisation through OpenMP) comes with some overhead. Only if you spawn enough parallel work will the gain of distributing work over parallel threads outweigh the incurred overhead costs.
To find out how much work is enough for your code and your software/hardware platform, I advise experimenting by running your code with different matrix sizes. Then, if you also expect "too small" matrix sizes as inputs for your computation, you may want to make parallel processing conditional (for example, by decorating your loop pragmas with if clauses):
#pragma omp parallel for private (j, k) if(ni * nj * nk > THRESHOLD)
for (i = 0; i < ni; i++) {
...
}
#pragma omp parallel for private (j, k) if(ni * nl * nj > THRESHOLD)
for (i = 0; i < ni; i++) {
...
}
