I made this parallel matrix multiplication program using nesting of for loops in OpenMP. When I run the program the displays the answer randomly ( mostly ) with varying indice of the resultant matrix. Here is the snippet of the code :
#pragma omp parallel for
for(i=0;i<N;i++){
#pragma omp parallel for
for(j=0;j<N;j++){
C[i][j]=0;
#pragma omp parallel for
for(m=0;m<N;m++){
C[i][j]=A[i][m]*B[m][j]+C[i][j];
}
printf("C:i=%d j=%d %f \n",i,j,C[i][j]);
}
}
These are the symptoms of a so called "race conditions" as the commenters already stated.
The threads OpenMP uses are independent of each other but the results of the individual loops of the matrix multiplication are not, so one thread might be at a different position than the other one and suddenly you are in trouble if you depend on the order of the results.
You can only parallelize the outmost loop:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main(int argc, char **argv)
{
int n;
double **A, **B, **C, **D, t;
int i, j, k;
struct timeval start, stop;
if (argc != 2) {
fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
exit(EXIT_FAILURE);
}
n = atoi(argv[1]);
if (n <= 2 || n >= 1000000) {
fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
exit(EXIT_FAILURE);
}
// make it repeatable
srand(0xdeadbeef);
// allocate memory for and initialize A
A = malloc(sizeof(*A) * n);
for (i = 0; i < n; i++) {
A[i] = malloc(sizeof(**A) * n);
for (j = 0; j < n; j++) {
A[i][j] = (double) ((rand() % 100) / 99.);
}
}
// do the same for B
B = malloc(sizeof(*B) * n);
for (i = 0; i < n; i++) {
B[i] = malloc(sizeof(**B) * n);
for (j = 0; j < n; j++) {
B[i][j] = (double) ((rand() % 100) / 99.);
}
}
// and C but initialize with zero
C = malloc(sizeof(*C) * n);
for (i = 0; i < n; i++) {
C[i] = malloc(sizeof(**C) * n);
for (j = 0; j < n; j++) {
C[i][j] = 0.0;
}
}
// ditto with D
D = malloc(sizeof(*D) * n);
for (i = 0; i < n; i++) {
D[i] = malloc(sizeof(**D) * n);
for (j = 0; j < n; j++) {
D[i][j] = 0.0;
}
}
// some coarse timing
gettimeofday(&start, NULL);
// naive matrix multiplication
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
gettimeofday(&stop, NULL);
t = ((stop.tv_sec - start.tv_sec) * 1000000u +
stop.tv_usec - start.tv_usec) / 1.e6;
printf("Timing for naive run = %.10g\n", t);
gettimeofday(&start, NULL);
#pragma omp parallel shared(A, B, C) private(i, j, k)
#pragma omp for
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
D[i][j] = D[i][j] + A[i][k] * B[k][j];
}
}
}
gettimeofday(&stop, NULL);
t = ((stop.tv_sec - start.tv_sec) * 1000000u +
stop.tv_usec - start.tv_usec) / 1.e6;
printf("Timing for parallel run = %.10g\n", t);
// check result
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
if (D[i][j] != C[i][j]) {
printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
D[i][j] - C[i][j]);
}
}
}
// clean up
for (i = 0; i < n; i++) {
free(A[i]);
free(B[i]);
free(C[i]);
free(D[i]);
}
free(A);
free(B);
free(C);
free(D);
puts("All ok? Bye");
exit(EXIT_SUCCESS);
}
(n>2000 might need some patience to get the result)
But it's not fully true. You could (but shouldn't) try to get the innermost loop with something like
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
Does not seem to be faster, is even slower with small n.
With the original code and n = 2500 (only one run):
Timing for naive run = 124.466307
Timing for parallel run = 44.154538
About the same with the reduction:
Timing for naive run = 119.586365
Timing for parallel run = 43.288371
With a smaller n = 500
Timing for naive run = 0.444061
Timing for parallel run = 0.150842
It is already slower with reduction at that size:
Timing for naive run = 0.447894
Timing for parallel run = 0.245481
It might win for very large n but I lack the necessary patience.
Nevertheless, a last one with n = 4000 (OpenMP part only):
Normal:
Timing for parallel run = 174.647404
With reduction:
Timing for parallel run = 179.062463
That difference is still fully inside the error-bars.
A better way to multiply large matrices (at ca. n>100 ) would be the Schönhage-Straßen algorithm.
Oh: I just used square matrices for convenience not because they must be of that form! But if you have rectangular matrices with a large length-ratio you might try to change the way the loops run; column-first or row-first can make a significant difference here.
Related
My code aims at combining two arrays into a third one, and doing this multiple times. I'd like to parallelize it using the means of openMP, but the result is not what I expect it to be. Namely, already the very first check breaks at a[0]!=0.
The MWE is:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char *argv[])
{
int i = 0;
int j = 0;
long length = 10000;
long nSamples = 1000;
double *a, *b, *c;
double start = 0;
double end = 0;
int nthreads = 0;
// Array Allocation and Initialisation
a = (double *)malloc(length * sizeof(double));
b = (double *)malloc(length * sizeof(double));
c = (double *)malloc(length * sizeof(double));
for (i = 0; i < length; i++)
{
a[i] = 0.0;
b[i] = 0.0;
c[i] = 0.0;
}
// Get the maximum number of threads and evaluate the number of threads per group
#pragma omp parallel shared(nthreads)
{
nthreads = omp_get_num_threads() / 2;
}
#pragma omp parallel for shared(nthreads, a, b, c) num_threads(nthreads) collapse(2)
for (i = 0; i < nSamples; i++)
{
for (j = 0; j < length; j++)
{
a[j] = a[j] + 2.0;
b[j] = b[j] + 2.0;
c[j] = a[j] + b[j];
}
}
/*Check correctness*/
for (i = 0; i < length; i++)
{
if (a[i] != 2.0 * nSamples)
{
printf("a not equal at %d\n", i);
break;
}
if (b[i] != 2.0 * nSamples)
{
printf("b ot equal at %d\n", i);
break;
}
if (c[i] != 4.0 * nSamples)
{
printf("c ot equal at %d\n", i);
break;
}
}
free(a);
free(b);
free(c);
return 0;
}
If I only parallelize the inner loop, the check works fine. To achieve this, I use:
for (i = 0; i < nSamples; i++)
{
#pragma omp parallel for private(j) shared(a, b, c) num_threads(nthreads)
for (j = 0; j < length; j++)
{
a[j] = a[j] + 2.0;
b[j] = b[j] + 2.0;
c[j] = a[j] + b[j];
}
}
I know that openMP treats each iteration of a loop as independent of other iterations. I assume this causes e.g. iteration i=5 on thread 1 to write into a[j] of iteration i=9 of another thread, causing the difference. Is this correct? Can this be avoided, using the means of openMP?
So, I was trying to write a program to do matrix multiplication using multiple threads and then plot a graph between the time taken and the number of threads used.
I used the following approach:
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
pthread_mutex_t lock;
#define M 200
#define N 300
#define P 400
#define X 2 // Number of Threads
#define RED "\x1b[31m"
#define GREEN "\x1b[32m"
int A[M][N], B[N][P], C[M][P], D[M][P];
int row = 0;
void *matrixMulti(void *arg)
{
pthread_mutex_lock(&lock);
int i = row++;
for (int j = 0; j < P; j++)
{
C[i][j] = 0;
for (int k = 0; k < N; k++)
{
C[i][j] += A[i][k] * B[k][j];
}
}
pthread_exit(NULL);
pthread_mutex_unlock(&lock);
}
void matrixMultiplicationWithoutThreading();
void matrixMultiplicationWithThreading();
void verifyIfBothMatrixAreSame();
int main()
{
int m, n, p;
// A: m*n Matrix, B: n*p Matrix
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++)
A[i][j] = rand() % 10;
// scanf("%d", &A[i][j]);
for (int i = 0; i < N; i++)
for (int j = 0; j < P; j++)
B[i][j] = rand() % 10;
// scanf("%d", &B[i][j]);
struct timeval start, end;
gettimeofday(&start, NULL);
matrixMultiplicationWithoutThreading();
gettimeofday(&end, NULL);
double time = (end.tv_sec - start.tv_sec) * 1e6;
time = (time + end.tv_usec - start.tv_usec) * 1e-6;
printf("The time taken by simple matrix calculation without threding is %0.6f\n", time);
struct timeval start_th, end_th;
gettimeofday(&start_th, NULL);
matrixMultiplicationWithThreading();
gettimeofday(&end_th, NULL);
time = (end_th.tv_sec - start_th.tv_sec) * 1e6;
time = (time + end_th.tv_usec - start_th.tv_usec) * 1e-6;
printf("The time taken by using the Threading Method with %d threads is %0.6f\n", X, time);
verifyIfBothMatrixAreSame();
}
void matrixMultiplicationWithThreading()
{
pthread_t threads[X];
for (int i = 0; i < X; i++)
{
threads[i] = (pthread_t)-1;
}
// Computation Started:
for (int i = 0; i < M; i++)
{
// At any moment only X threads at max are working
if (threads[i] == (pthread_t)-1)
pthread_create(&threads[i % X], NULL, matrixMulti, NULL);
else
{
pthread_join(threads[i % X], NULL);
pthread_create(&threads[i % X], NULL, matrixMulti, NULL);
}
}
for (int i = 0; i < X; i++)
pthread_join(threads[i], NULL);
// Computation Done:
}
void matrixMultiplicationWithoutThreading()
{
// Computation Started:
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
{
D[i][j] = 0;
for (int k = 0; k < N; k++)
D[i][j] += A[i][k] * B[k][j];
}
// Computation Done:
}
void verifyIfBothMatrixAreSame()
{
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
{
if (C[i][j] != D[i][j])
{
printf(RED "\nMatrix's are not equal something wrong with the computation\n");
return;
}
}
printf(GREEN "\nBoth Matrixes are equal thus verifying the computation\n");
}
Now, this code works sometimes, and sometimes it doesn't, like the result does not match the actual result. Similarly, this code gives a segmentation fault in one of the Linux virtual machines. Also, even when it works correctly, it doesn't give the asymptotically decreasing graph. Instead, the time is almost constant with arbitrary variations with the thread number.
Can someone help with this, like why this is happening? I found multiple solutions to this problem on the internet; some of them don't work (rarely but it happens), but I haven't seen my approach yet; it might be an issue I think. So, can anyone comment on using pthread_create(&threads[i % X], NULL, matrixMulti, NULL), like why this is not a good idea?
EDITED:
I have tried taking the suggestion and optimising the code, I have not done the Matrix multiplication efficient method, as we were asked to do the O(n^3) method, but I have tried doing the threading correctly. Is this correct?
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <math.h>
#define M 2
#define N 2
#define P 2
#define X 40 // Number of Threads
#define RED "\x1b[31m"
#define GREEN "\x1b[32m"
int t = 0; // Computation done by the first usedXFullthreads
int usedXFull = 0;
int A[M][N], B[N][P], C[M][P], D[M][P];
int row = 0;
void *matrixMulti(void *arg)
{
int* l = (int *)arg;
int n = *l;
int i = 0, j = 0, k = 0, comp = 0;
if (n <= usedXFull)
{
i = n * t / (N * P);
j = (n * t - N * P * i) / N;
k = n * t - N * P * i - N * j;
if (n == usedXFull)
comp = M * N * P - usedXFull * t;
else
comp = t;
}
while (comp)
{
if (i == M)
printf(RED "Some fault in the code\n\n");
C[i][j] += A[i][k] * B[k][j];
comp--;
k++;
if (k == N)
{
j++;
if (j == P)
{
i++;
j = 0;
}
k = 0;
}
}
return NULL;
}
void matrixMultiplicationWithoutThreading();
void matrixMultiplicationWithThreading();
void verifyIfBothMatrixAreSame();
int main()
{
int m, n, p;
// A: m*n Matrix, B: n*p Matrix
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++)
A[i][j] = rand() % 10;
// scanf("%d", &A[i][j]);
for (int i = 0; i < N; i++)
for (int j = 0; j < P; j++)
B[i][j] = rand() % 10;
// scanf("%d", &B[i][j]);
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
C[i][j] = 0;
struct timeval start, end;
gettimeofday(&start, NULL);
matrixMultiplicationWithoutThreading();
gettimeofday(&end, NULL);
double time = (end.tv_sec - start.tv_sec) * 1e6;
time = (time + end.tv_usec - start.tv_usec) * 1e-6;
printf("The time taken by simple matrix calculation without threding is %0.6f\n", time);
struct timeval start_th, end_th;
gettimeofday(&start_th, NULL);
matrixMultiplicationWithThreading();
gettimeofday(&end_th, NULL);
time = (end_th.tv_sec - start_th.tv_sec) * 1e6;
time = (time + end_th.tv_usec - start_th.tv_usec) * 1e-6;
printf("The time taken by using the Threading Method with %d threads is %0.6f\n", X, time);
verifyIfBothMatrixAreSame();
}
void matrixMultiplicationWithThreading()
{
int totalComp = M * N * P; // Total Computation
t = ceil((double)totalComp / (double)X);
usedXFull = totalComp / t;
int computationByLastUsedThread = totalComp - t * usedXFull;
int computationIndex[X];
pthread_t threads[X];
// Computation Started:
for (int i = 0; i < X; i++)
{
computationIndex[i] = i;
int rc = pthread_create(&threads[i], NULL, matrixMulti, (void *)&computationIndex[i]);
if (rc)
{
printf(RED "ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
for (int i = 0; i < X; i++)
pthread_join(threads[i], NULL);
// Computation Done:
}
void matrixMultiplicationWithoutThreading()
{
// Computation Started:
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
{
D[i][j] = 0;
for (int k = 0; k < N; k++)
D[i][j] += A[i][k] * B[k][j];
}
// Computation Done:
}
void verifyIfBothMatrixAreSame()
{
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
{
if (C[i][j] != D[i][j])
{
printf(RED "\nMatrix's are not equal something wrong with the computation\n");
return;
}
}
printf(GREEN "\nBoth Matrixes are equal thus verifying the computation\n");
}
There are many issues in the code. Here are some points:
lock is not initialized with pthread_mutex_init which is required (nor freed).
There is no need for locks in a matrix multiplication: work sharing should be preferred instead (especially since the current lock make your code run fully serially).
Using pthread_exit is generally rather a bad idea, at least it is here. Consider just returning NULL. Besides, returning something is mandatory in matrixMulti. Please enable compiler warnings so to detect such a thing.
There is an out of bound of threads[i] in the 0..M based loop.
There is no need to create M threads. You can create 2 threads and divide the work in 2 even parts along the M-based dimension. Creating M threads while allowing only 2 threads to run simultaneously just add more overhead for no reason (it takes time for thread to be created and scheduled by the OS).
It is generally better to dynamically allocate large arrays than using static global C arrays.
It is better to avoid global variables and use the arg parameter so to get thread-specific data.
To design a fast matrix multiplication, please consider reading this article. For example, the ijk loop nest is very inefficient and should really not be used for sake of performance (not efficient in cache). Besides, note you can use a BLAS library for that (they are highly optimized and easy to use) though I guess this is a homework. Additionally, note that you can use OpenMP instead of pthread so to make the code shorter and easy to read.
I am new to C and have created a program that creates two arrays and then multiplies them using openMP. When I compare them the sequential is quicker than the openMP way.
#include <stdio.h>
#include <stdio.h>
#include <omp.h>
#include <time.h>
#define SIZE 1000
int arrayOne[SIZE][SIZE];
int arrayTwo[SIZE][SIZE];
int arrayThree[SIZE][SIZE];
int main()
{
int i=0, j=0, k=0, sum = 0;
//creation of the first array
for(i = 0; i < SIZE; i++){
for(j = 0; j < SIZE; j++){
arrayOne[i][j] = 2;
/*printf("%d \t", arrayOne[i][j]);*/
}
}
//creation of the second array
for(i = 0; i < SIZE; i++){
for(j = 0; j < SIZE; j++){
arrayTwo[i][j] = 3;
/*printf("%d \t", arrayTwo[i][j]);*/
}
}
clock_t begin = clock();
//Matrix Multiplication (No use of openMP)
for (i = 0; i < SIZE; ++i) {
for (j = 0; j < SIZE; ++j) {
for (k = 0; k < SIZE; ++k) {
sum = sum + arrayOne[i][k] * arrayTwo[k][j];
}
arrayThree[i][j] = sum;
sum = 0;
}
}
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Time taken without openMp: %f \n", time_spent);
//Matrix Multiplication Using openMP
printf("---------------------\n");
clock_t new_begin = clock();
#pragma omp parallel private(i, j, sum, k) shared (arrayOne, arrayTwo, arrayThree)
{
#pragma omp for schedule(static)
for (i = 0; i < SIZE; i++) {
for(j = 0; j < SIZE; j++) {
for(k = 0; k < SIZE; k++) {
sum = sum + arrayOne[i][k] * arrayTwo[k][j];
}
arrayThree[i][j] = sum;
sum = 0;
}
}
}
clock_t new_end = clock();
double new_time_spent = (double)(new_end - new_begin) / CLOCKS_PER_SEC;
printf("Time taken WITH openMp: %f ", new_time_spent);
return 0;
}
The sequential way takes 0.265000 while the openMP takes 0.563000.
I have no idea why, any solutions?
Updated code to global arrays and make them larger but still takes double the run time.
OpenMP needs to create and destroy threads, which results in extra overhead. Such overhead is quite small, but for a very small workload, like yours, it can still be significant.
to use OpenMP more efficiently, you should give it a large workload (larger matrices) and make thread-related overhead not the dominant factor in your runtime.
I am currently new to OpenMp and trying to write a simple OpenMP-C matrix-vector multiplication program. On increasing the matrix size to 750x750 elements, my program stops responding and the window hangs. I would like to know if that is a limitation of my laptop or is it a data-race condition I am facing.
I am trying to define a matrix A and a vector u and put random elements (0-10). Then I am calculating the vector result b.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int x_range = 50;
int y_range = 50;
int A[x_range][y_range];
int u[y_range];
int b[y_range];
printf("Measuring time resolution %g\n", omp_get_wtick());
printf("Parallel program start time %g\n", omp_get_wtime());
#pragma omp parallel num_threads(x_range)
{
int b_temp[y_range];
for (int j = 0; j < y_range; j++)
{
b_temp[j] = 0;
}
#pragma omp for
for (int i = 0; i < x_range; i++)
{
for (int j = 0; j < y_range; j++)
{
A[i][j] = (rand() % 10) + 1;
}
}
#pragma omp for
for (int j = 0; j < y_range; j++)
{
{
u[j] = (rand() % 10) + 1;
}
}
#pragma omp for
for (int i = 0; i < x_range; i++)
{
for(int j = 0; j < y_range; j++)
{
b_temp[i] = b_temp[i] + A[i][j]*u[j];
}
}
#pragma omp critical
for(int j = 0; j < y_range; j++)
{
b[j] = b[j] + b_temp[j];
}
}
printf("parallel program end time %g\n", omp_get_wtime());
return 0;
}
First off, operations you're performing cannot have data race conditions, because there's no RAW , WAR , WAW dependency. You can read more about them in wiki.
Secondly, Your system is hanging because you're creating 750 threads as dictated by x_range
I am having issues with the performance using OpenMp. I am trying to test the results of a single threaded program not using OpenMP and an app using OpenMP. By looking at results online that are comparing matrix chain multiplication programs the openMP implementation is 2 to 3 times as fast, but my implementation is the same speed for both apps. Is the way I am implementing openMP incorrect? Any pointers on openMP and how to correctly implement it? Any help is much appreciated. Thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main( int argc , char *argv[] )
{
srand(time(0));
if ( argc != 2 )
{
printf("Usage: %s <size of nxn matrices>\n", argv[0]);
return 1;
}
int n = atoi( argv[1] );
int a, b;
double A[n][n], B[n][n], C[n][n];
FILE *fp;
fp = fopen("/home/mkj0002/CPE631/Homework2/ArrayTry/matrixResults", "w+"); //For the LeCASA machine
for(a = 0; a < n; a++)
{
for(b = 0; b < n; b++)
{
A[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
A[a][b] = (double)rand(); //Number between 0 and RAND_MAX
B[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
B[a][b] = (double)rand(); //Number between 0 and RAND_MAX
C[a][b] = 0.0;
}
}
#pragma omp parallel shared(A,B,C)
{
int i,j,k;
#pragma omp for schedule(guided,n)
for(i = 0; i < n; ++i)
{
for(j = 0; j < n; ++j)
{
double sum = 0;
for(k = 0; k < n; ++k)
{
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
fprintf(fp,"0.4lf",C[i][j]);
}
}
}
if(fp)
{
fclose(fp);
}
fp = NULL;
return 0;
}
(1) Don't perform I/O inside your parallel region. You'll see instantaneous speedup when you move that out and write many C variables simultaneously to file.
(2) After you've done the above, you should then change your scheduling to static because each loop will be doing the exact same amount of computations and there's no longer a need to incur the overhead from fancy scheduling.
(3) Furthermore, to better utilize caching, you should swap your j and k loops. To see this, imagine accessing just your B variable in your current loops.
for(j = 0; j < n; ++j)
{
for(k = 0; k < n; ++k)
{
B[k][j] += 5.0;
}
}
You can see how this accesses B as if it was stored in Fortran's column-major format. More info can be found here. A better alternative is:
for(k = 0; k < n; ++k)
{
for(j = 0; j < n; ++j)
{
B[k][j] += 5.0;
}
}
Coming back to your example though, we still have to deal with the sum variable. An easy suggestion would be storing the row of current sums you're computing and then saving them all once you're done with your current loop.
Combining all 3 steps, we get something like:
#pragma omp parallel shared(A,B,C)
{
int i,j,k;
double sum[n]; // one for each j
#pragma omp for schedule(static)
for(i = 0; i < n; ++i)
{
for(j = 0; j < n; ++j)
sum[j] = 0;
for(k = 0; k < n; ++k)
{
for(j = 0; j < n; ++j)
{
sum[j] += A[i][k] * B[k][j];
}
}
for(j = 0; j < n; ++j)
C[i][j] = sum[j];
}
}
// perform I/O here using contiguous blocks of C variable
Hope that helps.
EDIT: As per #Zboson's suggestion, it would be even easier to simply remove sum[j] entirely and replace it with C[i][j] throughout the program.