I have two loops which I am parallelizing:
#pragma omp parallel for
for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
        C[i][j] = 0;
        for (k = 0; k < nk; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }

#pragma omp parallel for
for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
        E[i][j] = 0;
        for (k = 0; k < nj; ++k)
            E[i][j] += C[i][k] * D[k][j];
    }
Strangely, the sequential execution is much faster than the parallel version above, even when using a large number of threads. Am I doing something wrong? Note that all arrays are global. Does that make a difference?
The iterations of your parallel outer loops share the index variables (j and k) of their inner loops. This alone makes your code slower than you probably expected: your loops are not "embarrassingly" (or "delightfully") parallel, because every parallel loop iteration has to access these shared variables in memory.
What is worse, because of this your code contains race conditions. As a result, it behaves nondeterministically. In other words: your implementation of parallel matrix multiplication is now incorrect! (Go ahead and check the results of your computations. ;))
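A quick way to check, as a minimal sketch (the helper function and the name Cref are illustrative, not part of your code): compute the product once sequentially into a reference matrix and count the entries that differ from the parallel result.

/* Illustrative check: count entries where the parallel result C differs
   from a sequentially computed reference Cref (assumed to exist). */
static int count_mismatches(int ni, int nj, double C[ni][nj], double Cref[ni][nj])
{
    int mismatches = 0;
    for (int i = 0; i < ni; i++)
        for (int j = 0; j < nj; j++)
            if (C[i][j] != Cref[i][j])
                mismatches++;
    return mismatches;
}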
What you want to do is make sure that all iterations of your outer loops have their own private copies of the index variables j and k. You can achieve this either by declaring these variables within the scope of the parallel loops:
int i;

#pragma omp parallel for
for (i = 0; i < ni; i++) {
    int j1, k1; /* explicit local copies */
    for (j1 = 0; j1 < nj; j1++) {
        C[i][j1] = 0;
        for (k1 = 0; k1 < nk; ++k1)
            C[i][j1] += A[i][k1] * B[k1][j1];
    }
}

#pragma omp parallel for
for (i = 0; i < ni; i++) {
    int j2, k2; /* explicit local copies */
    for (j2 = 0; j2 < nl; j2++) {
        E[i][j2] = 0;
        for (k2 = 0; k2 < nj; ++k2)
            E[i][j2] += C[i][k2] * D[k2][j2];
    }
}
or by declaring them as private in your loop pragmas:
int i, j, k;

#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
        C[i][j] = 0;
        for (k = 0; k < nk; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }

#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
        E[i][j] = 0;
        for (k = 0; k < nj; ++k)
            E[i][j] += C[i][k] * D[k][j];
    }
Will these changes make your parallel implementation faster than your sequential implementation? Hard to say. It depends on your problem size. Parallelisation (in particular parallelisation through OpenMP) comes with some overhead. Only if you spawn enough parallel work will the gain of distributing that work over threads outweigh the incurred overhead.
To find out how much work is enough for your code and your software/hardware platform, I advise experimenting by running your code with different matrix sizes. Then, if you also expect matrix sizes that are "too small" as inputs for your computation, you may want to make parallel processing conditional (for example, by decorating your loop pragmas with an if clause):
#pragma omp parallel for private (j, k) if(ni * nj * nk > THRESHOLD)
for (i = 0; i < ni; i++) {
    ...
}

#pragma omp parallel for private (j, k) if(ni * nl * nj > THRESHOLD)
for (i = 0; i < ni; i++) {
    ...
}
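As a rough sketch of such an experiment (everything below is illustrative: the flattened arrays, the sizes, and the placeholder THRESHOLD are assumptions rather than your original code), you could time the conditional loop for several sizes with omp_get_wtime() and pick the threshold where the parallel version starts to win:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define THRESHOLD 100000   /* placeholder value; tune for your platform */

static void multiply(int ni, int nj, int nk, double *A, double *B, double *C)
{
    int i, j, k;
    #pragma omp parallel for private(j, k) if(ni * nj * nk > THRESHOLD)
    for (i = 0; i < ni; i++)
        for (j = 0; j < nj; j++) {
            C[i * nj + j] = 0;
            for (k = 0; k < nk; ++k)
                C[i * nj + j] += A[i * nk + k] * B[k * nj + j];
        }
}

int main(void)
{
    int sizes[] = { 64, 128, 256, 512 };
    for (int s = 0; s < 4; s++) {
        int n = sizes[s];
        double *A = malloc(sizeof(double) * n * n);
        double *B = malloc(sizeof(double) * n * n);
        double *C = malloc(sizeof(double) * n * n);
        for (int x = 0; x < n * n; x++) { A[x] = 1.0; B[x] = 2.0; }

        double t0 = omp_get_wtime();
        multiply(n, n, n, A, B, C);
        double t1 = omp_get_wtime();
        printf("n = %4d: %.4f s\n", n, t1 - t0);

        free(A); free(B); free(C);
    }
    return 0;
}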
A is a 2D array, n is the matrix size (the matrix is square), and threads is the number of threads given by the user.
#pragma omp parallel for shared(A,n,k) private(i) schedule(static) num_threads(threads)
for (k = 0; k < n - 1; ++k) {
    // for the vectoriser
    for (i = k + 1; i < n; i++) {
        A[i][k] /= A[k][k];
    }
    for (i = k + 1; i < n; i++) {
        long long int j;
        const double Aik = A[i][k];
        for (j = k + 1; j < n; j++) {
            A[i][j] -= Aik * A[k][j];
        }
    }
}
I tried using collapse but it failed; the error it showed was: work-sharing region may not be closely nested inside of work-sharing, ‘loop’, ‘critical’, ‘ordered’, ‘master’, explicit ‘task’ or ‘task loop’ region.
After what I thought was the correct change, the time increased as I ran the code with more threads.
I tried using collapse again; this is the output:
26:17: error: collapsed loops not perfectly nested before ‘for’
26 | for(i = k + 1; i < n; i++) {
This is LU decomposition.
I have implemented the backward (BW) and forward (FW) substitution algorithms to solve the L and U triangular systems.
The algorithm I implemented runs very fast serially, but I cannot figure out whether this is the best way to parallelize it.
I think I have taken into account every possible data race (on alpha); am I right?
void solveInverse (double **U, double **L, double **P, int rw, int cw) {
    double **inverseA = allocateMatrix(rw,cw);
    double* x = allocateArray(rw);
    double* y = allocateArray(rw);
    double alpha;

    //int i, j, t;

    // Iterate along the columns, so at each iteration we generate a column of the inverse matrix
    for (int j = 0; j < rw; j++) {

        // Lower triangular solve Ly=P
        y[0] = P[0][j];
        #pragma omp parallel for reduction(+:alpha)
        for (int i = 1; i < rw; i++) {
            alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }

        // Upper triangular solve Ux=P
        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        #pragma omp parallel for reduction(+:alpha)
        for (int i = rw-2; (i < rw) && (i >= 0); i--) {
            alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t]*x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }

        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }

    freeMemory(inverseA,rw);
    free(x);
    free(y);
}
After a private discussion with the user dreamcrash, we arrived at the solution proposed in his comments: creating a pair of vectors x and y for each thread, so that each thread works independently on a single column.
After a discussion with the OP in the comments (which were removed afterwards), we both came to the following conclusions:
You do not need to reduce the alpha variable, because it is set back to zero at the start of every iteration anyway. Instead, make alpha private:
#pragma omp parallel for
for (int i = 1; i < rw; i++) {
    double alpha = 0;
    for (int t = 0; t <= i-1; t++)
        alpha += L[i][t] * y[t];
    y[i] = P[i][j] - alpha;
}
and the same applies to the second parallel region as well.
#pragma omp parallel for
for (int i = rw-2; (i < rw) && (i >= 0); i--) {
    double alpha = 0;
    for (int t = i+1; t < rw; t++)
        alpha += U[i][t]*x[t];
    x[i] = (y[i] - alpha) / U[i][i];
}
Instead of having one parallel region per j iteration, you can extract the parallel region so that it encapsulates the entire outermost loop, and use #pragma omp for instead of #pragma omp parallel for. Although this reduces the number of parallel regions created from rw to just 1, the speedup from this optimization should not be that significant, because an efficient OpenMP implementation uses a pool of threads that is initialized on the first parallel region and reused on subsequent ones, already saving most of the overhead of creating and destroying threads.
#pragma omp parallel
{
    for (int j = 0; j < rw; j++)
    {
        y[0] = P[0][j];
        #pragma omp for
        for (int i = 1; i < rw; i++) {
            double alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }

        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        #pragma omp for
        for (int i = rw-2; (i < rw) && (i >= 0); i--) {
            double alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t]*x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }

        #pragma omp for
        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }
}
I have shown you these code transformations so that you can see some potential tricks to use in future parallelizations. Unfortunately, as it is, that parallelization will not work.
Why?
Let us look at the first loop:
#pragma omp parallel for
for (int i = 1; i < rw; i++) {
    double alpha = 0;
    for (int t = 0; t <= i-1; t++)
        alpha += L[i][t] * y[t];
    y[i] = P[i][j] - alpha;
}
There is a dependency between y[t] being read in alpha += L[i][t] * y[t]; and y[i] being written in y[i] = P[i][j] - alpha;. Iteration i reads the y values written by all earlier iterations, so the iterations of this loop cannot safely run concurrently.
So what you can do instead is parallelize the outermost loop (i.e., assign each column to a thread) and create separate x and y arrays for each thread, so that there are no race conditions when those arrays are updated and read.
#pragma omp parallel
{
    double* x = allocateArray(rw);
    double* y = allocateArray(rw);

    #pragma omp for
    for (int j = 0; j < rw; j++)
    {
        y[0] = P[0][j];
        for (int i = 1; i < rw; i++) {
            double alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }

        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        for (int i = rw-2; i >= 0; i--) {
            double alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t]*x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }

        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }

    free(x);
    free(y);
}
I am having issues with performance using OpenMP. I am comparing a single-threaded program that does not use OpenMP against an app that uses OpenMP. Looking at results online that compare matrix chain multiplication programs, the OpenMP implementation is 2 to 3 times as fast, but my implementation runs at the same speed for both apps. Is the way I am implementing OpenMP incorrect? Any pointers on OpenMP and how to implement it correctly? Any help is much appreciated. Thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main( int argc , char *argv[] )
{
    srand(time(0));
    if ( argc != 2 )
    {
        printf("Usage: %s <size of nxn matrices>\n", argv[0]);
        return 1;
    }
    int n = atoi( argv[1] );
    int a, b;
    double A[n][n], B[n][n], C[n][n];
    FILE *fp;
    fp = fopen("/home/mkj0002/CPE631/Homework2/ArrayTry/matrixResults", "w+"); //For the LeCASA machine

    for(a = 0; a < n; a++)
    {
        for(b = 0; b < n; b++)
        {
            A[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
            A[a][b] = (double)rand();                    //Number between 0 and RAND_MAX
            B[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
            B[a][b] = (double)rand();                    //Number between 0 and RAND_MAX
            C[a][b] = 0.0;
        }
    }

    #pragma omp parallel shared(A,B,C)
    {
        int i,j,k;
        #pragma omp for schedule(guided,n)
        for(i = 0; i < n; ++i)
        {
            for(j = 0; j < n; ++j)
            {
                double sum = 0;
                for(k = 0; k < n; ++k)
                {
                    sum += A[i][k] * B[k][j];
                }
                C[i][j] = sum;
                fprintf(fp,"0.4lf",C[i][j]);
            }
        }
    }

    if(fp)
    {
        fclose(fp);
    }
    fp = NULL;
    return 0;
}
(1) Don't perform I/O inside your parallel region. You'll see an immediate speedup when you move the fprintf out of the loop and write the contents of C to the file afterwards, in contiguous blocks.
(2) After you've done the above, you should change your scheduling to static, because each iteration does exactly the same amount of work and there's no longer a need to incur the overhead of fancier scheduling.
(3) Furthermore, to better utilize caching, you should swap your j and k loops. To see this, imagine accessing just your B variable in your current loops.
for(j = 0; j < n; ++j)
{
    for(k = 0; k < n; ++k)
    {
        B[k][j] += 5.0;
    }
}
You can see how this accesses B as if it were stored in Fortran's column-major format. A better alternative is:
for(k = 0; k < n; ++k)
{
    for(j = 0; j < n; ++j)
    {
        B[k][j] += 5.0;
    }
}
Coming back to your example though, we still have to deal with the sum variable. An easy suggestion is to store the row of sums you're currently computing and then save them all once you're done with the current row.
Combining all 3 steps, we get something like:
#pragma omp parallel shared(A,B,C)
{
    int i,j,k;
    double sum[n]; // one for each j
    #pragma omp for schedule(static)
    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n; ++j)
            sum[j] = 0;

        for(k = 0; k < n; ++k)
        {
            for(j = 0; j < n; ++j)
            {
                sum[j] += A[i][k] * B[k][j];
            }
        }

        for(j = 0; j < n; ++j)
            C[i][j] = sum[j];
    }
}
// perform I/O here using contiguous blocks of C variable
Hope that helps.
EDIT: As per #Zboson's suggestion, it would be even easier to simply remove sum[j] entirely and replace it with C[i][j] throughout the program.
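A sketch of that variant, reusing A, B, C, n, a, b, and fp from the question, with the file output moved after the parallel region as per point (1):

#pragma omp parallel shared(A,B,C)
{
    int i,j,k;
    #pragma omp for schedule(static)
    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n; ++j)
            C[i][j] = 0.0;

        for(k = 0; k < n; ++k)
            for(j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];
    }
}

// I/O after the parallel region, one row of C at a time
for(a = 0; a < n; a++)
    for(b = 0; b < n; b++)
        fprintf(fp, "%0.4lf ", C[a][b]);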
I am trying to parallelize just the innermost loop of matrix multiplication. However, whenever there is more than 1 thread, the matrix multiplication does not store the correct values in the output array, and I am trying to figure out why.
void matrix() {
    int i,j,k,sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for shared(sum,i,j) private(k)
            for (k = 0; k < N; k++) {
                #pragma omp critical
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
I also tried using:
void matrix() {
    int i,j,k,sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for shared(sum,i,j) private(k)
            for (k = 0; k < N; k++) {
                #pragma omp atomic
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
But that didn't work either. I also tried it without the second #pragma, and with:
void matrixC() {
    int i,j,k,sum,np;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
I'm new to OpenMP, but from everything I've read online, at least one of these solutions should work. I know it's probably a problem with a race condition while adding to sum, but I have no idea why it still gets the wrong sums.
EDIT: Here is a more complete version of the code:
double A[N][N];
double B[N][N];
double C[N][N];
int CHOOSE = CH;

void matrixSequential() {
    int i,j,k,sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            for (k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

void matrixParallel() {
    int i,j,k,sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for shared (i,j) private(k) reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

int main(int argc, const char * argv[]) {
    //populating arrays
    int i,j;
    for(i=0; i < N; i++){
        for(j=0; j < N; j++){
            A[i][j] = i+j;
            B[i][j] = i+j;
        }
    }

    for(i=0; i < N; i++){
        for(j=0; j < N; j++){
            C[i][j] = 0;
        }
    }

    if (CHOOSE == 0) {
        matrixSequential();
    }
    else if(CHOOSE == 1) {
        matrixParallel();
    }

    //checking for correctness
    double sum;
    for(i=0; i < N; i++){
        sum += C[i][i];
    }
    printf("Sum of diagonal elements of array C: %f \n", sum);
    return 0;
}
Making sum a reduction variable is the right way of doing this and should work (see https://computing.llnl.gov/tutorials/openMP/#REDUCTION). Note that you still have to declare your shared and private variables, such as k.
Update
After you updated the question to provide an MVCE, #Zboson found the actual bug: you declare the arrays as double but accumulate into an int (sum).
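In other words, a minimal fix (sketched on the matrixParallel function above; the rest of the program stays the same) is to accumulate into a double:

void matrixParallel() {
    int i,j,k;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            double sum = 0;   /* was: int sum */
            #pragma omp parallel for private(k) reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}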
IEEE floating-point arithmetic is not associative, i.e. (a+b)+c is not necessarily equal to a+(b+c). Therefore the order in which you reduce an array matters. When you distribute the array elements among different threads, the order changes from that of a sequential sum. The same thing can happen using SIMD. See for example this excellent question on using SIMD to do a reduction: An accumulated computing error in SSE version of algorithm of the sum of squared differences.
Your compiler won't normally use associative floating-point arithmetic unless you tell it to, e.g. with -Ofast, -ffast-math, or -fassociative-math with GCC. For example, in order to use auto-vectorization (SIMD) for a reduction, the compiler requires associative math.
However, when you use OpenMP it automatically assumes associative math, at least for distributing the chunks (within the chunks the compiler still won't use associative arithmetic unless you tell it to), thereby breaking strict IEEE floating-point ordering. Many people are not aware of this.
Since the result of the reduction depends on the order, you may be interested in reducing the numerical uncertainty. One solution is to use Kahan summation.
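For reference, a minimal Kahan summation sketch (illustrative; not part of the code above):

/* Kahan (compensated) summation: the running compensation c recovers
   the low-order bits that plain addition would drop. */
double kahan_sum(const double *v, int n)
{
    double sum = 0.0, c = 0.0;
    for (int i = 0; i < n; i++) {
        double y = v[i] - c;
        double t = sum + y;
        c = (t - sum) - y;   /* the part of y that did not make it into sum */
        sum = t;
    }
    return sum;
}

Note that flags such as -ffast-math allow the compiler to simplify the compensation away, so code like this should be built with standard floating-point semantics.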
I am trying to multiply two matrices using OpenMP tasks. This is the basic code:
long i, j, k;
for (i = 0; i < N; i ++)
    for (j = 0; j < N; j ++)
        for (k = 0; k < N; k ++)
            c[i * N + j] += a[i * N + k] * b[k * N + j];
So, I want to use tasks at the column level, and I modified the code like this:
long i, j, k;
#pragma omp parallel
{
    #pragma omp single
    {
        for (i = 0; i < N; i ++)
            #pragma omp task private(i, j, k)
            {
                for (j = 0; j < N; j ++)
                    for (k = 0; k < N; k ++)
                        c[i * N + j] += a[i * N + k] * b[k * N + j];
            }
    }
}
When I run the program I get a message like this:
Segmentation fault (core dumped)
Now, I know I'm missing something, but I can't figure out what. Any ideas?
private variables in OpenMP are not initialised and have indeterminate initial values. When the task executes, i would therefore hold a garbage value, most likely leading to out-of-bounds accesses of c[] and a[].
firstprivate variables are similar to private, but have their initial value set to the value that the referenced variable had at the moment the construct is entered. In your case i has to be firstprivate and not private.
Also, it is advisable to make i private in the parallel region for a small performance increase. Thus the final code should look like this (with all data-sharing attributes written explicitly and private variables declared in the scopes where they are used):
#pragma omp parallel shared(a, b, c)
{
    #pragma omp single
    {
        long i;
        for (i = 0; i < N; i++)
            #pragma omp task shared(a, b, c) firstprivate(i)
            {
                int j, k;
                for (j = 0; j < N; j++)
                    for (k = 0; k < N; k++)
                        c[i * N + j] += a[i * N + k] * b[k * N + j];
            }
    }
}
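As a side note, if your compiler supports OpenMP 4.5 or newer, the taskloop construct expresses the same idea more compactly; a minimal sketch under that assumption:

#pragma omp parallel shared(a, b, c)
#pragma omp single
{
    /* taskloop creates one task per chunk of i iterations;
       the loop variable i is private to each task automatically */
    #pragma omp taskloop
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++)
            for (long k = 0; k < N; k++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}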