Wrong result when using OpenMP to parallelize GEMM - c

I know OpenMP shares all variables declared in an outer scope between all workers, and that may be the answer to my question. But I'm really confused about why the function omp3 delivers the right result while the function omp2 delivers a wrong one.
void omp2(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; ++i) {
        #pragma omp parallel for
        for (int ki = 0; ki < k; ++ki) {
            for (int j = 0; j < n; ++j) {
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
            }
        }
    }
}
void omp3(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; ++i) {
        for (int ki = 0; ki < k; ++ki) {
            #pragma omp parallel for
            for (int j = 0; j < n; ++j) {
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
            }
        }
    }
}

The problem is that there is a race condition on this line:
C[i * n + j] += ...
Different threads can read and write the same memory location (C[i * n + j]) simultaneously, which causes a data race. In omp2 this data race can occur, because the parallelized ki loop lets several threads update the same C[i * n + j]; in omp3 the parallelized j loop gives each thread its own set of j indices, so it cannot.
The solution (as suggested by @Victor Eijkhout) is to reorder your loops and use a local variable to accumulate the sum of the innermost loop. That way C[i * n + j] is updated only once, the data race is gone, and the outermost loop can be parallelized (which gives the best performance):
#pragma omp parallel for
for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0;
        for (int ki = 0; ki < k; ++ki) {
            sum += A[i * k + ki] * B[ki * n + j];
        }
        C[i * n + j] += sum;
    }
}
Note that you can add a collapse(2) clause, which may increase the performance further.
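For reference, here is a minimal sketch of the same kernel with collapse(2), assuming the same A, B, C, m, k and n as above; the i and j loops are fused into a single iteration space, which can help when m alone is too small to keep all threads busy:
#pragma omp parallel for collapse(2)
for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {      // i and j are perfectly nested, so they can be collapsed
        double sum = 0;
        for (int ki = 0; ki < k; ++ki) {
            sum += A[i * k + ki] * B[ki * n + j];
        }
        C[i * n + j] += sum;           // each (i, j) pair is executed by exactly one thread
    }
}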

Related

Why does this openMP loop not return the correct computation?

My code aims at combining two arrays into a third one, and doing this multiple times. I'd like to parallelize it using OpenMP, but the result is not what I expect it to be: the very first check already fails at a[0].
The MWE is:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int i = 0;
    int j = 0;
    long length = 10000;
    long nSamples = 1000;
    double *a, *b, *c;
    double start = 0;
    double end = 0;
    int nthreads = 0;

    // Array allocation and initialisation
    a = (double *)malloc(length * sizeof(double));
    b = (double *)malloc(length * sizeof(double));
    c = (double *)malloc(length * sizeof(double));
    for (i = 0; i < length; i++)
    {
        a[i] = 0.0;
        b[i] = 0.0;
        c[i] = 0.0;
    }

    // Get the maximum number of threads and evaluate the number of threads per group
    #pragma omp parallel shared(nthreads)
    {
        nthreads = omp_get_num_threads() / 2;
    }

    #pragma omp parallel for shared(nthreads, a, b, c) num_threads(nthreads) collapse(2)
    for (i = 0; i < nSamples; i++)
    {
        for (j = 0; j < length; j++)
        {
            a[j] = a[j] + 2.0;
            b[j] = b[j] + 2.0;
            c[j] = a[j] + b[j];
        }
    }

    /* Check correctness */
    for (i = 0; i < length; i++)
    {
        if (a[i] != 2.0 * nSamples)
        {
            printf("a not equal at %d\n", i);
            break;
        }
        if (b[i] != 2.0 * nSamples)
        {
            printf("b not equal at %d\n", i);
            break;
        }
        if (c[i] != 4.0 * nSamples)
        {
            printf("c not equal at %d\n", i);
            break;
        }
    }

    free(a);
    free(b);
    free(c);
    return 0;
}
If I only parallelize the inner loop, the check works fine. To achieve this, I use:
for (i = 0; i < nSamples; i++)
{
    #pragma omp parallel for private(j) shared(a, b, c) num_threads(nthreads)
    for (j = 0; j < length; j++)
    {
        a[j] = a[j] + 2.0;
        b[j] = b[j] + 2.0;
        c[j] = a[j] + b[j];
    }
}
I know that OpenMP treats each iteration of a loop as independent of the other iterations. I assume this causes, e.g., iteration i=5 on thread 1 to write into a[j] while iteration i=9 runs on another thread, causing the difference. Is this correct? Can this be avoided using OpenMP?
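The reasoning is essentially right: with collapse(2) different values of i run concurrently, and all of them update the same a[j], b[j] and c[j]. One way to keep a single combined parallel loop without that race is sketched below (my own sketch, not from the original thread, reusing a, b, c, length, nSamples and nthreads from the MWE): parallelize over j so every thread owns a disjoint set of array elements, and run the nSamples accumulation sequentially inside each element.
#pragma omp parallel for private(i) num_threads(nthreads)
for (j = 0; j < length; j++)
{
    for (i = 0; i < nSamples; i++)
    {
        a[j] = a[j] + 2.0;
        b[j] = b[j] + 2.0;
        c[j] = a[j] + b[j];
    }
}
Here j is privatized automatically as the loop variable of the parallel for, while i has to be made private explicitly because it is declared at the top of main and would otherwise be shared.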

why is this matrix multiplication algorithm faster than the other?

int mmult_omp(double *c,
              double *a, int aRows, int aCols,
              double *b, int bRows, int bCols, int numThreads)
{
    int i, j, k;

    // First version: i-k-j loop order
    for (i = 0; i < aRows; i++) {
        for (j = 0; j < bCols; j++) {
            c[i*bCols + j] = 0;
        }
        for (k = 0; k < aCols; k++) {
            for (j = 0; j < bCols; j++) {
                c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];
            }
        }
    }

    // Second version: i-j-k loop order
    for (i = 0; i < aRows; i++) {
        for (j = 0; j < bCols; j++) {
            c[i*bCols + j] = 0;
            for (k = 0; k < aCols; k++) {
                c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];
            }
        }
    }
    return 0;
}
Why is the first algorithm faster than the second? I've timed both with C's time library, and the first one is consistently faster.
This code is very hard to read; I had to copy it and reformat it to see which loops were which. I'm not really sure why one is faster, but here's a great resource to see why. (A likely explanation, for what it's worth: in the first version's i-k-j order the innermost loop walks b and c with stride-1 accesses, while in the second version's i-j-k order the innermost loop strides through b by bCols elements, which is much less cache-friendly.)
Here are links to inspect the assembly output:
link for #1
link for #2

Multiplying 1D matrices like 2D matrices

I am trying to multiply a 1D array as if it were a 2D matrix in C.
Here is one example of the result I get with the 2D version: a matrix C with the values
{
    {2, 3},
    {6, 11}
};
Here is the code for the 2D array in C:
void multiply(int n, double ** a, double ** b, double ** c) {
int i, j, k;
for (i = 1; i < n; i++){
for (j = 1; j < n; j++){
for (k = 1; k < n; k++){
c[i][j] += a[i][k] * b[k][j];
}
}
}
}
Now I am trying to do the same, but with the matrices stored as 1D arrays. Here is the code for the 1D version:
void multiply(int n, double * a, double * b, double * c) {
    int i, j, k;
    for (i = 0; i < n*n; i++) {
        for (j = 0; j < n*n; j++) {
            for (k = 0; k < n*n; k++) {
                c[i] += a[j]*b[k];
            }
        }
    }
}
After running it, I get the result {14400, 14400,14400,14400} instead of {2,3,6,11}
It looks like you just want to do matrix multiplication while storing the matrices in 1D arrays instead of 2D ones. Not sure why you'd want to do that, but you could do something like this:
void multiply(int n, double *a, double *b, double *c) {
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
}
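As a quick usage sketch with made-up 2x2 inputs (the matrices from the original picture were not included, so these values are only illustrative), showing the row-major indexing where element (i, j) of an n-by-n matrix lives at index i * n + j:
#include <stdio.h>

int main(void) {
    // Hypothetical example data: a = [[0,1],[2,3]], b = [[1,1],[1,1]]
    double a[4] = {0, 1, 2, 3};
    double b[4] = {1, 1, 1, 1};
    double c[4] = {0, 0, 0, 0};   // must start zeroed, multiply() only accumulates

    multiply(2, a, b, c);

    // Expected product: [[1,1],[5,5]]
    for (int i = 0; i < 2; i++)
        printf("%g %g\n", c[i * 2 + 0], c[i * 2 + 1]);
    return 0;
}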

Why is my program generating random results when I nest it?

I made this parallel matrix multiplication program using nested for loops with OpenMP. When I run the program, it displays the answer randomly (mostly), with varying indices of the resultant matrix. Here is the snippet of the code:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp parallel for
    for (j = 0; j < N; j++) {
        C[i][j] = 0;
        #pragma omp parallel for
        for (m = 0; m < N; m++) {
            C[i][j] = A[i][m]*B[m][j] + C[i][j];
        }
        printf("C:i=%d j=%d %f \n", i, j, C[i][j]);
    }
}
These are the symptoms of a so-called race condition, as the commenters already stated.
The threads OpenMP uses are independent of each other, but the intermediate results of the matrix multiplication are not: one thread might be at a different position than another, and you get in trouble as soon as you depend on the order of the results.
You can only parallelize the outermost loop:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    int n;
    double **A, **B, **C, **D, t;
    int i, j, k;
    struct timeval start, stop;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    n = atoi(argv[1]);
    if (n <= 2 || n >= 1000000) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    // make it repeatable
    srand(0xdeadbeef);

    // allocate memory for and initialize A
    A = malloc(sizeof(*A) * n);
    for (i = 0; i < n; i++) {
        A[i] = malloc(sizeof(**A) * n);
        for (j = 0; j < n; j++) {
            A[i][j] = (double) ((rand() % 100) / 99.);
        }
    }

    // do the same for B
    B = malloc(sizeof(*B) * n);
    for (i = 0; i < n; i++) {
        B[i] = malloc(sizeof(**B) * n);
        for (j = 0; j < n; j++) {
            B[i][j] = (double) ((rand() % 100) / 99.);
        }
    }

    // and C but initialize with zero
    C = malloc(sizeof(*C) * n);
    for (i = 0; i < n; i++) {
        C[i] = malloc(sizeof(**C) * n);
        for (j = 0; j < n; j++) {
            C[i][j] = 0.0;
        }
    }

    // ditto with D
    D = malloc(sizeof(*D) * n);
    for (i = 0; i < n; i++) {
        D[i] = malloc(sizeof(**D) * n);
        for (j = 0; j < n; j++) {
            D[i][j] = 0.0;
        }
    }

    // some coarse timing
    gettimeofday(&start, NULL);

    // naive matrix multiplication
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for naive run = %.10g\n", t);

    gettimeofday(&start, NULL);
    #pragma omp parallel shared(A, B, C) private(i, j, k)
    #pragma omp for
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                D[i][j] = D[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for parallel run = %.10g\n", t);

    // check result
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (D[i][j] != C[i][j]) {
                printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
                       D[i][j] - C[i][j]);
            }
        }
    }

    // clean up
    for (i = 0; i < n; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        free(D[i]);
    }
    free(A);
    free(B);
    free(C);
    free(D);
    puts("All ok? Bye");
    exit(EXIT_SUCCESS);
}
(n>2000 might need some patience to get the result)
But that is not entirely true. You could (but shouldn't) try to parallelize the innermost loop with something like
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
    sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
It does not seem to be faster; it is even slower for small n.
With the original code and n = 2500 (only one run):
Timing for naive run = 124.466307
Timing for parallel run = 44.154538
About the same with the reduction:
Timing for naive run = 119.586365
Timing for parallel run = 43.288371
With a smaller n = 500
Timing for naive run = 0.444061
Timing for parallel run = 0.150842
It is already slower with reduction at that size:
Timing for naive run = 0.447894
Timing for parallel run = 0.245481
It might win for very large n but I lack the necessary patience.
Nevertheless, a last one with n = 4000 (OpenMP part only):
Normal:
Timing for parallel run = 174.647404
With reduction:
Timing for parallel run = 179.062463
That difference is still well within the error bars.
A better way to multiply large matrices (at roughly n > 100) would be the Strassen algorithm, which brings the arithmetic complexity down from O(n^3) to about O(n^2.81).
Oh, and I just used square matrices for convenience, not because they have to be of that form! But if you have rectangular matrices with a large aspect ratio, you might try to change the way the loops run; column-first versus row-first can make a significant difference here.
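To illustrate that last point with a sketch of my own (hypothetical helper names, row-major storage assumed): row-first traversal streams through memory contiguously, while column-first traversal jumps by the row length on every access, which hurts once the matrix no longer fits in cache.
// Sums a row-major rows x cols matrix two ways.
double sum_row_first(const double *M, int rows, int cols)
{
    double sum = 0.0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)   // stride-1 accesses
            sum += M[i * cols + j];
    return sum;
}

double sum_col_first(const double *M, int rows, int cols)
{
    double sum = 0.0;
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)   // stride-cols accesses
            sum += M[i * cols + j];
    return sum;
}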

What is the right way to use the task directive in OpenMP?

I am trying to multiply two matrices using OpenMP tasks. This is the basic code:
long i, j, k;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            c[i * N + j] += a[i * N + k] * b[k * N + j];
So I want to use tasks at the column level, and I modified the code like this:
long i, j, k;
#pragma omp parallel
{
    #pragma omp single
    {
        for (i = 0; i < N; i++)
            #pragma omp task private(i, j, k)
            {
                for (j = 0; j < N; j++)
                    for (k = 0; k < N; k++)
                        c[i * N + j] += a[i * N + k] * b[k * N + j];
            }
    }
}
When I run the program I get a message like this:
Segmentation fault (core dumped)
Now, I know I'm missing some piece, but I can't figure out what. Any ideas?
private variables in OpenMP are not initialised and have indeterminate initial values. When the task executes, i would have a random value and therefore probably lead to an out-of-bounds access of c[] and a[].
firstprivate variables are similar to private ones, but their initial value is set to the value that the referenced variable had at the moment the construct is entered. In your case i has to be firstprivate and not private.
It is also advisable to make i private in the parallel region for a small performance increase. Thus the final code should look like this (with all variable-sharing classes written explicitly and private variables declared in the scopes where they are used):
#pragma omp parallel shared(a, b, c)
{
    #pragma omp single
    {
        long i;
        for (i = 0; i < N; i++)
            #pragma omp task shared(a, b, c) firstprivate(i)
            {
                int j, k;
                for (j = 0; j < N; j++)
                    for (k = 0; k < N; k++)
                        c[i * N + j] += a[i * N + k] * b[k * N + j];
            }
    }
}
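For comparison, here is a hedged sketch (not from the original answer) of the same per-row tasking expressed with the taskloop construct available since OpenMP 4.5; it assumes the same a, b, c and N as above and lets the runtime handle the data sharing of the loop variable:
#pragma omp parallel shared(a, b, c)
#pragma omp single
#pragma omp taskloop
for (long i = 0; i < N; i++)
    for (long j = 0; j < N; j++)
        for (long k = 0; k < N; k++)
            c[i * N + j] += a[i * N + k] * b[k * N + j];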