OpenMP Performance Issues with Matrix Multiplication (C)

I am having performance issues with OpenMP. I am comparing a single-threaded program that does not use OpenMP against the same program built with OpenMP. From results online comparing matrix multiplication programs, the OpenMP implementation should be 2 to 3 times as fast, but my implementation runs at the same speed for both apps. Is the way I am implementing OpenMP incorrect? Any pointers on OpenMP and how to use it correctly? Any help is much appreciated. Thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[])
{
    srand(time(0));
    if (argc != 2)
    {
        printf("Usage: %s <size of nxn matrices>\n", argv[0]);
        return 1;
    }
    int n = atoi(argv[1]);
    int a, b;
    double A[n][n], B[n][n], C[n][n];
    FILE *fp = fopen("/home/mkj0002/CPE631/Homework2/ArrayTry/matrixResults", "w+"); // For the LeCASA machine
    if (!fp)
    {
        perror("fopen");
        return 1;
    }
    for (a = 0; a < n; a++)
    {
        for (b = 0; b < n; b++)
        {
            A[a][b] = ((double)rand() / (double)RAND_MAX); // Number between 0 and 1
            B[a][b] = ((double)rand() / (double)RAND_MAX); // Number between 0 and 1
            C[a][b] = 0.0;
        }
    }
    #pragma omp parallel shared(A, B, C)
    {
        int i, j, k;
        #pragma omp for schedule(guided, n)
        for (i = 0; i < n; ++i)
        {
            for (j = 0; j < n; ++j)
            {
                double sum = 0;
                for (k = 0; k < n; ++k)
                {
                    sum += A[i][k] * B[k][j];
                }
                C[i][j] = sum;
                fprintf(fp, "%0.4lf", C[i][j]);
            }
        }
    }
    fclose(fp);
    return 0;
}

(1) Don't perform I/O inside your parallel region. You'll see an immediate speedup once you move it out and write the C entries to the file in bulk afterwards.
(2) Once you've done that, change your scheduling to static: each iteration now does exactly the same amount of computation, so there's no longer a need to incur the overhead of fancier scheduling.
(3) Furthermore, to better utilize the cache, you should swap your j and k loops. To see why, imagine accessing just your B variable in your current loop order:
for (j = 0; j < n; ++j)
{
    for (k = 0; k < n; ++k)
    {
        B[k][j] += 5.0;
    }
}
You can see how this accesses B as if it were stored in Fortran's column-major format, striding across a whole row of memory on every step. A better alternative is:
for (k = 0; k < n; ++k)
{
    for (j = 0; j < n; ++j)
    {
        B[k][j] += 5.0;
    }
}
Coming back to your example, we still have to deal with the sum variable. An easy approach is to keep a row of running sums and store them all into C once you're done with the current row.
Combining all 3 steps, we get something like:
#pragma omp parallel shared(A, B, C)
{
    int i, j, k;
    double sum[n]; // one for each j
    #pragma omp for schedule(static)
    for (i = 0; i < n; ++i)
    {
        for (j = 0; j < n; ++j)
            sum[j] = 0;
        for (k = 0; k < n; ++k)
        {
            for (j = 0; j < n; ++j)
            {
                sum[j] += A[i][k] * B[k][j];
            }
        }
        for (j = 0; j < n; ++j)
            C[i][j] = sum[j];
    }
}
// perform I/O here using contiguous blocks of C variable
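For instance, a minimal sketch of that I/O step (reusing fp, n, and C from the code above) could look like:
// after the parallel region: write the finished matrix sequentially,
// one row of contiguous C entries per line
for (int i = 0; i < n; ++i)
{
    for (int j = 0; j < n; ++j)
    {
        fprintf(fp, "%0.4lf ", C[i][j]);
    }
    fprintf(fp, "\n");
}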
Hope that helps.
EDIT: As per @Zboson's suggestion, it would be even easier to remove sum[j] entirely and write directly into C[i][j] throughout.
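Under that suggestion the body shrinks to a sketch like this (it relies on C being zero-initialized before the parallel region, as in the original code):
#pragma omp parallel shared(A, B, C)
{
    int i, j, k;
    #pragma omp for schedule(static)
    for (i = 0; i < n; ++i)
    {
        for (k = 0; k < n; ++k)
        {
            for (j = 0; j < n; ++j)
            {
                C[i][j] += A[i][k] * B[k][j]; // C[a][b] was set to 0.0 earlier
            }
        }
    }
}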

Related

Multiplying matrix openMP is slower than sequential

I am new to C and have created a program that builds two arrays and then multiplies them using OpenMP. When I compare the two versions, the sequential one is quicker than the OpenMP one.
#include <stdio.h>
#include <omp.h>
#include <time.h>

#define SIZE 1000

int arrayOne[SIZE][SIZE];
int arrayTwo[SIZE][SIZE];
int arrayThree[SIZE][SIZE];

int main()
{
    int i = 0, j = 0, k = 0, sum = 0;
    // fill the first array
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            arrayOne[i][j] = 2;
        }
    }
    // fill the second array
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            arrayTwo[i][j] = 3;
        }
    }
    clock_t begin = clock();
    // matrix multiplication (no use of OpenMP)
    for (i = 0; i < SIZE; ++i) {
        for (j = 0; j < SIZE; ++j) {
            sum = 0;
            for (k = 0; k < SIZE; ++k) {
                sum = sum + arrayOne[i][k] * arrayTwo[k][j];
            }
            arrayThree[i][j] = sum;
        }
    }
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Time taken without openMp: %f \n", time_spent);
    // matrix multiplication using OpenMP
    printf("---------------------\n");
    clock_t new_begin = clock();
    #pragma omp parallel private(i, j, sum, k) shared(arrayOne, arrayTwo, arrayThree)
    {
        #pragma omp for schedule(static)
        for (i = 0; i < SIZE; i++) {
            for (j = 0; j < SIZE; j++) {
                sum = 0;
                for (k = 0; k < SIZE; k++) {
                    sum = sum + arrayOne[i][k] * arrayTwo[k][j];
                }
                arrayThree[i][j] = sum;
            }
        }
    }
    clock_t new_end = clock();
    double new_time_spent = (double)(new_end - new_begin) / CLOCKS_PER_SEC;
    printf("Time taken WITH openMp: %f ", new_time_spent);
    return 0;
}
The sequential version takes 0.265000 seconds while the OpenMP version takes 0.563000.
I have no idea why; any solutions?
Update: I switched the code to global arrays and made them larger, but it still takes double the run time.
OpenMP needs to create and destroy threads, which results in extra overhead. Such overhead is quite small, but for a very small workload like yours it can still be significant.
To use OpenMP more efficiently, you should give it a large workload (larger matrices) so that thread-related overhead is no longer the dominant factor in your runtime.
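A side note on measurement, assuming your numbers come from the clock() calls in your code: on many platforms clock() sums CPU time across all threads, which can make a parallel region look slower than it really is. A sketch using OpenMP's omp_get_wtime(), which reports wall-clock time, reusing the arrays and index variables from your program:
// wall-clock timing of the OpenMP section with omp_get_wtime()
double t_start = omp_get_wtime();
#pragma omp parallel private(i, j, sum, k) shared(arrayOne, arrayTwo, arrayThree)
{
    #pragma omp for schedule(static)
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            sum = 0;
            for (k = 0; k < SIZE; k++) {
                sum = sum + arrayOne[i][k] * arrayTwo[k][j];
            }
            arrayThree[i][j] = sum;
        }
    }
}
double t_end = omp_get_wtime();
printf("Wall-clock time WITH OpenMP: %f\n", t_end - t_start);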

Not responding while executing a basic OpenMP (C) program

I am new to OpenMP and am trying to write a simple OpenMP C matrix-vector multiplication program. When I increase the matrix size to 750x750 elements, my program stops responding and the window hangs. I would like to know whether that is a limitation of my laptop or a data race I am running into.
I define a matrix A and a vector u, fill them with random elements (0-10), and then calculate the result vector b.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int x_range = 50;
    int y_range = 50;
    int A[x_range][y_range];
    int u[y_range];
    int b[y_range];
    printf("Measuring time resolution %g\n", omp_get_wtick());
    printf("Parallel program start time %g\n", omp_get_wtime());
    for (int j = 0; j < y_range; j++)
    {
        b[j] = 0; // b must start at zero before the threads accumulate into it
    }
    #pragma omp parallel num_threads(x_range)
    {
        int b_temp[y_range];
        for (int j = 0; j < y_range; j++)
        {
            b_temp[j] = 0;
        }
        #pragma omp for
        for (int i = 0; i < x_range; i++)
        {
            for (int j = 0; j < y_range; j++)
            {
                A[i][j] = (rand() % 10) + 1;
            }
        }
        #pragma omp for
        for (int j = 0; j < y_range; j++)
        {
            u[j] = (rand() % 10) + 1;
        }
        #pragma omp for
        for (int i = 0; i < x_range; i++)
        {
            for (int j = 0; j < y_range; j++)
            {
                b_temp[i] = b_temp[i] + A[i][j] * u[j];
            }
        }
        #pragma omp critical
        for (int j = 0; j < y_range; j++)
        {
            b[j] = b[j] + b_temp[j];
        }
    }
    printf("parallel program end time %g\n", omp_get_wtime());
    return 0;
}
First off, the operations you're performing cannot have a data race, because there are no RAW, WAR, or WAW dependencies between them. You can read more about data dependencies on Wikipedia.
Secondly, your system is hanging because you're creating 750 threads, as dictated by x_range.
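One way out, sketched below, is to decouple the thread count from the matrix dimension: let the runtime pick a sensible default (typically one thread per core), or cap it at an explicit small number.
// let the OpenMP runtime choose the number of threads
#pragma omp parallel
{
    /* ... same body as above ... */
}

// or cap it explicitly, independent of x_range
#pragma omp parallel num_threads(8)
{
    /* ... same body as above ... */
}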

Why is my program generating random results when I nest it?

I made this parallel matrix multiplication program using nested for loops in OpenMP. When I run the program, the displayed answer is (mostly) random, with the values at the indices of the resultant matrix varying between runs. Here is the snippet of the code:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp parallel for
    for (j = 0; j < N; j++) {
        C[i][j] = 0;
        #pragma omp parallel for
        for (m = 0; m < N; m++) {
            C[i][j] = A[i][m] * B[m][j] + C[i][j];
        }
        printf("C:i=%d j=%d %f \n", i, j, C[i][j]);
    }
}
These are the symptoms of a so-called "race condition", as the commenters already stated.
The threads OpenMP uses are independent of each other, but the individual loop iterations of the matrix multiplication are not: one thread may be at a different position than another, and you are suddenly in trouble if you depend on the order of the results.
You can only parallelize the outermost loop:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    int n;
    double **A, **B, **C, **D, t;
    int i, j, k;
    struct timeval start, stop;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    n = atoi(argv[1]);
    if (n <= 2 || n >= 1000000) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    // make it repeatable
    srand(0xdeadbeef);
    // allocate memory for and initialize A
    A = malloc(sizeof(*A) * n);
    for (i = 0; i < n; i++) {
        A[i] = malloc(sizeof(**A) * n);
        for (j = 0; j < n; j++) {
            A[i][j] = (double) ((rand() % 100) / 99.);
        }
    }
    // do the same for B
    B = malloc(sizeof(*B) * n);
    for (i = 0; i < n; i++) {
        B[i] = malloc(sizeof(**B) * n);
        for (j = 0; j < n; j++) {
            B[i][j] = (double) ((rand() % 100) / 99.);
        }
    }
    // and C, but initialize with zero
    C = malloc(sizeof(*C) * n);
    for (i = 0; i < n; i++) {
        C[i] = malloc(sizeof(**C) * n);
        for (j = 0; j < n; j++) {
            C[i][j] = 0.0;
        }
    }
    // ditto with D
    D = malloc(sizeof(*D) * n);
    for (i = 0; i < n; i++) {
        D[i] = malloc(sizeof(**D) * n);
        for (j = 0; j < n; j++) {
            D[i][j] = 0.0;
        }
    }
    // some coarse timing
    gettimeofday(&start, NULL);
    // naive matrix multiplication
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for naive run = %.10g\n", t);

    gettimeofday(&start, NULL);
    #pragma omp parallel shared(A, B, D) private(i, j, k)
    #pragma omp for
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                D[i][j] = D[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for parallel run = %.10g\n", t);

    // check result
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (D[i][j] != C[i][j]) {
                printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
                       D[i][j] - C[i][j]);
            }
        }
    }
    // clean up
    for (i = 0; i < n; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        free(D[i]);
    }
    free(A);
    free(B);
    free(C);
    free(D);
    puts("All ok? Bye");
    exit(EXIT_SUCCESS);
}
(n>2000 might need some patience to get the result)
But that's not fully true. You could (but shouldn't) try to parallelize the innermost loop with something like:
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
    sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
This does not seem to be faster, and is even slower with small n.
With the original code and n = 2500 (only one run):
Timing for naive run = 124.466307
Timing for parallel run = 44.154538
About the same with the reduction:
Timing for naive run = 119.586365
Timing for parallel run = 43.288371
With a smaller n = 500:
Timing for naive run = 0.444061
Timing for parallel run = 0.150842
It is already slower with reduction at that size:
Timing for naive run = 0.447894
Timing for parallel run = 0.245481
It might win for very large n but I lack the necessary patience.
Nevertheless, a last one with n = 4000 (OpenMP part only):
Normal:
Timing for parallel run = 174.647404
With reduction:
Timing for parallel run = 179.062463
That difference is still well within the error bars.
A better way to multiply large matrices (at ca. n > 100) would be Strassen's algorithm.
Oh, and I just used square matrices for convenience, not because they must be of that form! If you have rectangular matrices with a large aspect ratio, you might try changing the way the loops run; column-first versus row-first order can make a significant difference here.

OpenMP Matrix Multiplication Critical Section

I am trying to parallelize just the innermost loop of matrix multiplication. However, whenever there is more than 1 thread, the matrix multiplication does not store the correct values in the output array, and I am trying to figure out why.
void matrix() {
    int i, j, k, sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for shared(sum, i, j) private(k)
            for (k = 0; k < N; k++) {
                #pragma omp critical
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
I also tried using:
void matrix() {
    int i, j, k, sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for shared(sum, i, j) private(k)
            for (k = 0; k < N; k++) {
                #pragma omp atomic
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
But that didn't work either. I also tried it without the second #pragma, and with:
void matrixC() {
    int i, j, k, sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
I'm new to OpenMP, but from everything I've read online, at least one of these solutions should work. I know it's probably a problem with a race condition while adding to sum, but I have no idea why it's still getting the wrong sums.
EDIT: Here is a more complete version of the code:
double A[N][N];
double B[N][N];
double C[N][N];
int CHOOSE = CH;

void matrixSequential() {
    int i, j, k, sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            for (k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

void matrixParallel() {
    int i, j, k, sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for shared(i, j) private(k) reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

int main(int argc, const char * argv[]) {
    // populating arrays
    int i, j;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = i + j;
        }
    }
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            C[i][j] = 0;
        }
    }
    if (CHOOSE == 0) {
        matrixSequential();
    }
    else if (CHOOSE == 1) {
        matrixParallel();
    }
    // checking for correctness
    double sum = 0;
    for (i = 0; i < N; i++) {
        sum += C[i][i];
    }
    printf("Sum of diagonal elements of array C: %f \n", sum);
    return 0;
}
Making sum a reduction variable is the right way of doing this and should work (see https://computing.llnl.gov/tutorials/openMP/#REDUCTION). Note that you still have to declare your shared and private variables, such as k.
Update
After you updated your question to provide an MVCE, @Zboson found the actual bug: you were declaring the arrays as double but accumulating them into an int.
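The fix is a one-line change to the accumulator type; a sketch of the corrected routine, using the variable names from the question:
void matrixParallel() {
    int i, j, k;
    double sum; // was int: accumulating double products into an int truncates and can overflow
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0.0;
            #pragma omp parallel for shared(i, j) private(k) reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}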
IEEE floating-point arithmetic is not associative, i.e. (a+b)+c is not necessarily equal to a+(b+c). Therefore the order in which you reduce an array matters. When you distribute the array elements among different threads, the order changes from that of a sequential sum. The same thing can happen using SIMD. See for example this excellent question on using SIMD to do a reduction: An accumulated computing error in SSE version of algorithm of the sum of squared differences.
Your compiler won't normally use associative floating-point arithmetic unless you tell it to, e.g. with -Ofast, -ffast-math, or -fassociative-math with GCC. For example, in order to use auto-vectorization (SIMD) for a reduction, the compiler requires associative math.
However, when you use OpenMP, it automatically assumes associative math, at least for distributing the chunks (within the chunks the compiler still won't use associative arithmetic unless you tell it to), breaking IEEE floating-point rules. Many people are not aware of this.
Since the reduction depends on the order, you may be interested in a result that reduces the numerical uncertainty. One solution is to use Kahan summation.
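For illustration, a minimal sketch of Kahan (compensated) summation applied to one dot product of the multiplication above; this is the standard textbook pattern, not code from the question:
// compensated summation: c carries the low-order bits lost by each add
// (compile without -ffast-math, or the compensation may be optimized away)
double sum = 0.0, c = 0.0;
for (k = 0; k < N; k++) {
    double y = A[i][k] * B[k][j] - c;
    double t = sum + y;
    c = (t - sum) - y; // algebraically zero; numerically the rounding error
    sum = t;
}
C[i][j] = sum;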

OpenMP parallel code slower

I have two loops which I am parallelizing:
#pragma omp parallel for
for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
        C[i][j] = 0;
        for (k = 0; k < nk; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }

#pragma omp parallel for
for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
        E[i][j] = 0;
        for (k = 0; k < nj; ++k)
            E[i][j] += C[i][k] * D[k][j];
    }
Strangely, the sequential execution is much faster than the parallel version above, even when using a large number of threads. Am I doing something wrong? Note that all arrays are global. Does this make a difference?
The iterations of your parallel outer loops share the index variables (j and k) of their inner loops. This alone makes your code slower than you probably expected: your loops are not "embarrassingly" (or "delightfully") parallel, since every parallel iteration has to access these variables in shared memory.
What is worse, because of this your code contains race conditions. As a result, it behaves nondeterministically. In other words: your implementation of parallel matrix multiplication is now incorrect! (Go ahead and check the results of your computations. ;))
What you want to do is make sure that all iterations of your outer loops have their own private copies of the index variables j and k. You can achieve this either by declaring these variables within the scope of the parallel loops:
int i;
#pragma omp parallel for
for (i = 0; i < ni; i++) {
    int j1, k1; /* explicit local copies */
    for (j1 = 0; j1 < nj; j1++) {
        C[i][j1] = 0;
        for (k1 = 0; k1 < nk; ++k1)
            C[i][j1] += A[i][k1] * B[k1][j1];
    }
}

#pragma omp parallel for
for (i = 0; i < ni; i++) {
    int j2, k2; /* explicit local copies */
    for (j2 = 0; j2 < nl; j2++) {
        E[i][j2] = 0;
        for (k2 = 0; k2 < nj; ++k2)
            E[i][j2] += C[i][k2] * D[k2][j2];
    }
}
or by declaring them as private in your loop pragmas:
int i, j, k;
#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
        C[i][j] = 0;
        for (k = 0; k < nk; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }

#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
        E[i][j] = 0;
        for (k = 0; k < nj; ++k)
            E[i][j] += C[i][k] * D[k][j];
    }
Will these changes make your parallel implementation faster than your sequential one? Hard to say; it depends on your problem size. Parallelisation (in particular parallelisation through OpenMP) comes with some overhead, and only if you spawn enough parallel work will the gain of distributing it over threads outweigh that overhead.
To find out how much work is enough for your code and your software/hardware platform, I advise experimenting with different matrix sizes. If you also expect "too small" matrix sizes as inputs to your computation, you may want to make parallel processing conditional, for example by decorating your loop pragmas with if clauses:
#pragma omp parallel for private(j, k) if(ni * nj * nk > THRESHOLD)
for (i = 0; i < ni; i++) {
    ...
}

#pragma omp parallel for private(j, k) if(ni * nl * nj > THRESHOLD)
for (i = 0; i < ni; i++) {
    ...
}
