Multiplying matrix openMP is slower than sequential

I am new to C and have written a program that creates two arrays and multiplies them, once sequentially and once with OpenMP. When I compare the two, the sequential version is quicker than the OpenMP one.
#include <stdio.h>
#include <stdio.h>
#include <omp.h>
#include <time.h>
#define SIZE 1000
int arrayOne[SIZE][SIZE];
int arrayTwo[SIZE][SIZE];
int arrayThree[SIZE][SIZE];
int main()
{
int i=0, j=0, k=0, sum = 0;
//creation of the first array
for(i = 0; i < SIZE; i++){
for(j = 0; j < SIZE; j++){
arrayOne[i][j] = 2;
/*printf("%d \t", arrayOne[i][j]);*/
}
}
//creation of the second array
for(i = 0; i < SIZE; i++){
for(j = 0; j < SIZE; j++){
arrayTwo[i][j] = 3;
/*printf("%d \t", arrayTwo[i][j]);*/
}
}
clock_t begin = clock();
//Matrix Multiplication (No use of openMP)
for (i = 0; i < SIZE; ++i) {
for (j = 0; j < SIZE; ++j) {
for (k = 0; k < SIZE; ++k) {
sum = sum + arrayOne[i][k] * arrayTwo[k][j];
}
arrayThree[i][j] = sum;
sum = 0;
}
}
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Time taken without openMp: %f \n", time_spent);
//Matrix Multiplication Using openMP
printf("---------------------\n");
clock_t new_begin = clock();
#pragma omp parallel private(i, j, sum, k) shared (arrayOne, arrayTwo, arrayThree)
{
#pragma omp for schedule(static)
for (i = 0; i < SIZE; i++) {
for(j = 0; j < SIZE; j++) {
for(k = 0; k < SIZE; k++) {
sum = sum + arrayOne[i][k] * arrayTwo[k][j];
}
arrayThree[i][j] = sum;
sum = 0;
}
}
}
clock_t new_end = clock();
double new_time_spent = (double)(new_end - new_begin) / CLOCKS_PER_SEC;
printf("Time taken WITH openMp: %f ", new_time_spent);
return 0;
}
The sequential version takes 0.265000 s while the OpenMP version takes 0.563000 s.
I have no idea why. Any solutions?
Update: I changed the code to use global arrays and made them larger, but the OpenMP version still takes double the run time.

OpenMP needs to create and destroy threads, which adds overhead. That overhead is small, but for a very small workload like yours it can still be significant.
To use OpenMP more efficiently, give it a larger workload (larger matrices) so that thread-related overhead is no longer the dominant factor in your runtime.
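A separate thing worth ruling out, offered as an assumption about the asker's setup rather than something stated in the thread: clock() measures processor time, and on POSIX systems that is summed over all threads of the process, so a parallel region can report more "time" even when it finishes sooner on the wall clock. In addition, private(sum) gives each thread an uninitialized copy of sum. The sketch below times the loop with omp_get_wtime() and declares sum inside the loop. The names wall_seconds, fill_inputs and multiply_parallel, and the reduced SIZE, are illustrative only.

```c
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define SIZE 200 /* hypothetical size, smaller than in the question, to keep the sketch quick */

static int a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];

/* Wall-clock seconds: omp_get_wtime() when built with OpenMP, a C11 fallback otherwise. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
#endif
}

/* Fill the inputs the way the question does: all 2s and all 3s. */
void fill_inputs(void)
{
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            a[i][j] = 2;
            b[i][j] = 3;
        }
    }
}

/* Multiply a by b into c in parallel; returns the elapsed wall-clock seconds. */
double multiply_parallel(void)
{
    double start = wall_seconds();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            int sum = 0; /* declared here, so it is per-thread and initialized */
            for (int k = 0; k < SIZE; k++) {
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }
    return wall_seconds() - start;
}
```

Calling fill_inputs() and then printing the value returned by multiply_parallel() from main gives a wall-clock figure that can be compared against the sequential loop timed the same way. With GCC or Clang, OpenMP is enabled with -fopenmp.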

Related

Not responding during executing basic OpenMP (C) program

I am new to OpenMP and trying to write a simple OpenMP C matrix-vector multiplication program. When I increase the matrix size to 750x750 elements, my program stops responding and the window hangs. I would like to know whether that is a limitation of my laptop or a data race I am facing.
I am trying to define a matrix A and a vector u and put random elements (0-10). Then I am calculating the vector result b.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int x_range = 50;
int y_range = 50;
int A[x_range][y_range];
int u[y_range];
int b[y_range];
printf("Measuring time resolution %g\n", omp_get_wtick());
printf("Parallel program start time %g\n", omp_get_wtime());
#pragma omp parallel num_threads(x_range)
{
int b_temp[y_range];
for (int j = 0; j < y_range; j++)
{
b_temp[j] = 0;
}
#pragma omp for
for (int i = 0; i < x_range; i++)
{
for (int j = 0; j < y_range; j++)
{
A[i][j] = (rand() % 10) + 1;
}
}
#pragma omp for
for (int j = 0; j < y_range; j++)
{
{
u[j] = (rand() % 10) + 1;
}
}
#pragma omp for
for (int i = 0; i < x_range; i++)
{
for(int j = 0; j < y_range; j++)
{
b_temp[i] = b_temp[i] + A[i][j]*u[j];
}
}
#pragma omp critical
for(int j = 0; j < y_range; j++)
{
b[j] = b[j] + b_temp[j];
}
}
printf("parallel program end time %g\n", omp_get_wtime());
return 0;
}

First off, the operations you're performing cannot have data races, because there is no RAW, WAR, or WAW dependency between them. You can read more about these in the wiki.
Secondly, your system is hanging because you're creating 750 threads, as dictated by num_threads(x_range).
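To make the point about the thread count concrete, here is a minimal sketch, assuming the result vector needs one entry per matrix row. It drops num_threads(x_range) so the OpenMP runtime picks the team size, and each iteration writes only its own b[i], so no per-thread b_temp copy or critical section is needed. The function name matvec and the flat row-major layout are illustrative choices, not the asker's code.

```c
/* Computes b = A * u, where A is a rows x cols matrix stored row-major
   in a flat array (a hypothetical layout chosen for this sketch). */
void matvec(int rows, int cols, const int *A, const int *u, int *b)
{
    /* No num_threads clause: the runtime decides how many threads to use. */
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        int acc = 0; /* local to the iteration, so there is nothing to share */
        for (int j = 0; j < cols; j++) {
            acc += A[i * cols + j] * u[j];
        }
        b[i] = acc; /* each iteration writes a distinct element of b */
    }
}
```

Filling A and u sequentially before the call also sidesteps calling rand() from several threads at once, which the C standard does not guarantee to be safe.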

Sum of matrix elements on a parallel region resulting on wrong answers on OpenMP

I was doing an activity at my university that requires populating a matrix of [2000][2000] elements and then calculating, in parallel, the sum of all elements that are multiples of 5.
At first I tried using a 5 x 5 matrix. I computed a partial sum (sumP) of the elements and then added the partial sums into a variable called sum inside a critical region.
On my university computer the partial sum was receiving garbage values (like 36501) when the values must be lower than 100; I noticed that it only happened on [0][i] (row zero) of the matrix.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 5
int main() {
int i, j, k, l;
int sum = 0;
int sumP = 0;
int A[N][N];
printf("sumP : %i\n", sumP );
printf("sum: %i\n", sum);
#pragma omp parallel shared (A) private (i, j)
{
#pragma omp for
for (i = 0; i < N; i++) {
for(j = 0; j < N; j++){
A[i][j] = i%5;
printf("Number: %i, pos[%i][%i]\n", A[i][j], i, j);
}
}
}
#pragma omp parallel shared(A, sum) private (k, l, sumP)
{
#pragma omp for
for (k = 0; k < N; k++) {
for (l = 0; l < N; l++){
if (A[l][k] % 5 == 0 && A[l][k] != 0){
sumP = sumP + A[k][l];
printf("numero: %i, pos [%i],[%i] sumP: %i\n", A[k][l], k, l, sumP);
}
}
}
#pragma omp critical
sum += sumP;
}
//printf("sumP: %i\n", sumP);
printf("sum: %i\n", sum);
return (0);
}
I tested it by setting sumP to 0 between the "for" statements, and it worked:
#pragma omp parallel shared(A, soma) private (k, l, somap2)
{
#pragma omp for
for (k = 0; k < N; k++) {
sumP = 0;
for (l = 0; l < N; l++){
When I tested it at home it worked without having to set sumP to 0 (for the partial sum "sumP") as I did above, but now the final sum result is not correct...

You observe this behavior because private variables in OpenMP are uninitialized. To be precise, they are initialized as if you would have a local variable without an explicit initialization. Which means it is undefined what value they have initially. You observe different behavior on different systems because some combinations of compiler, options, and OS use this "undefined" differently. Your code is incorrect in any case, even if it sometimes produces the correct result.
Now, you can set it to zero as you tried. However, I would generally suggest declaring variables as locally as possible instead. This makes reasoning about the (parallel) code much easier, and you can omit the "private/shared" declarations. So your code would look like this:
#pragma omp parallel
{
int sumP = 0;
#pragma omp for
for (int k = 0; k < N; k++) {
for (int l = 0; l < N; l++) {
if (A[l][k] % 5 == 0 && A[l][k] != 0) {
sumP = sumP + A[k][l];
printf("numero: %i, pos [%i],[%i] sumP: %i\n", A[k][l], k, l, sumP);
}
}
}
#pragma omp critical
sum += sumP;
}
In addition to that, there is another way to drastically simplify this code by using a reduction:
#pragma omp parallel for reduction(+:sum)
for (int k = 0; k < N; k++) {
for (int l = 0; l < N; l++) {
if (A[l][k] % 5 == 0 && A[l][k] != 0) {
sum += A[k][l];
}
}
}
The compiler will basically do the same thing for you (but better) and the code is much cleaner.
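If the variable has to stay declared outside the parallel region, the firstprivate clause is another option: unlike private, it initializes each thread's copy from the value the variable had before the region. A minimal sketch under that assumption follows. The function name sum_multiples_of_five and the flat row-major array are hypothetical, and the same element is used for both the test and the sum.

```c
/* Sums the nonzero multiples of 5 in an n x n matrix stored row-major
   in a flat array (hypothetical layout for this sketch). */
int sum_multiples_of_five(int n, const int *A)
{
    int sum = 0;
    int sumP = 0; /* firstprivate copies this 0 into every thread's sumP */

    #pragma omp parallel firstprivate(sumP)
    {
        #pragma omp for
        for (int k = 0; k < n; k++) {
            for (int l = 0; l < n; l++) {
                int v = A[k * n + l];
                if (v % 5 == 0 && v != 0) {
                    sumP += v; /* per-thread partial sum */
                }
            }
        }
        /* Combine the partial sums one thread at a time. */
        #pragma omp critical
        sum += sumP;
    }
    return sum;
}
```

The reduction form shown above remains the shorter choice; this variant only illustrates how the initial value reaches the private copies.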

Considering that your code would spend most of its time on I/O, it would be a good idea to comment out the printf calls.
As I understand it, sumP should contain the partial sum of your inner loop.
The pragmas have been compressed for readability:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 1000
int main() {
int i, j;
int sum = 0;
int sumP = 0;
int A[N][N]; // will cause segfault with large N
printf("sumP : %i\n", sumP );
printf("sum: %i\n", sum);
#pragma omp parallel for shared (A) private (i, j)
for (i = 0; i < N; i++) {
for(j = 0; j < N; j++){
A[i][j] = i%5; // populate array with numbers in [0,1,2,3,4]
//printf("Number: %i, pos[%i][%i]\n", A[i][j], i, j);
}
}
#pragma omp parallel for shared(A) private (i, j, sumP) reduction(+: sum)
for (i = 0; i< N; i++) { // outer (parallel)loop
sumP = 0; // initialize partial sum
for (j = 0; j < N; j++){ // inner sequential loop
//if (A[i][j] % 5 == 0 && A[i][j] != 0){ // Explain this condition
sumP += A[i][j];
//printf("numero: %i, pos [%i],[%i] sumP: %i\n", A[i][j], i, j, sumP);
//}
}
//printf("sumP: %i\n", sumP);
sum += sumP; // add partial sum
}
//printf("sumP: %i\n", sumP);
printf("sum: %i\n", sum);
return (0);
}

Why is my program generating random results when I nest it?

I wrote this parallel matrix multiplication program using nested for loops in OpenMP. When I run the program, it displays the answer randomly (mostly), with varying indices of the resultant matrix. Here is a snippet of the code:
#pragma omp parallel for
for(i=0;i<N;i++){
#pragma omp parallel for
for(j=0;j<N;j++){
C[i][j]=0;
#pragma omp parallel for
for(m=0;m<N;m++){
C[i][j]=A[i][m]*B[m][j]+C[i][j];
}
printf("C:i=%d j=%d %f \n",i,j,C[i][j]);
}
}

These are the symptoms of a so-called "race condition", as the commenters already stated.
The threads OpenMP uses are independent of each other, but the results of the individual loops of the matrix multiplication are not, so one thread might be at a different position than another, and suddenly you are in trouble if you depend on the order of the results.
You can only parallelize the outermost loop:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main(int argc, char **argv)
{
int n;
double **A, **B, **C, **D, t;
int i, j, k;
struct timeval start, stop;
if (argc != 2) {
fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
exit(EXIT_FAILURE);
}
n = atoi(argv[1]);
if (n < 2 || n >= 1000000) {
fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
exit(EXIT_FAILURE);
}
// make it repeatable
srand(0xdeadbeef);
// allocate memory for and initialize A
A = malloc(sizeof(*A) * n);
for (i = 0; i < n; i++) {
A[i] = malloc(sizeof(**A) * n);
for (j = 0; j < n; j++) {
A[i][j] = (double) ((rand() % 100) / 99.);
}
}
// do the same for B
B = malloc(sizeof(*B) * n);
for (i = 0; i < n; i++) {
B[i] = malloc(sizeof(**B) * n);
for (j = 0; j < n; j++) {
B[i][j] = (double) ((rand() % 100) / 99.);
}
}
// and C but initialize with zero
C = malloc(sizeof(*C) * n);
for (i = 0; i < n; i++) {
C[i] = malloc(sizeof(**C) * n);
for (j = 0; j < n; j++) {
C[i][j] = 0.0;
}
}
// ditto with D
D = malloc(sizeof(*D) * n);
for (i = 0; i < n; i++) {
D[i] = malloc(sizeof(**D) * n);
for (j = 0; j < n; j++) {
D[i][j] = 0.0;
}
}
// some coarse timing
gettimeofday(&start, NULL);
// naive matrix multiplication
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
gettimeofday(&stop, NULL);
t = ((stop.tv_sec - start.tv_sec) * 1000000u +
stop.tv_usec - start.tv_usec) / 1.e6;
printf("Timing for naive run = %.10g\n", t);
gettimeofday(&start, NULL);
#pragma omp parallel shared(A, B, D) private(i, j, k)
#pragma omp for
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
D[i][j] = D[i][j] + A[i][k] * B[k][j];
}
}
}
gettimeofday(&stop, NULL);
t = ((stop.tv_sec - start.tv_sec) * 1000000u +
stop.tv_usec - start.tv_usec) / 1.e6;
printf("Timing for parallel run = %.10g\n", t);
// check result
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
if (D[i][j] != C[i][j]) {
printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
D[i][j] - C[i][j]);
}
}
}
// clean up
for (i = 0; i < n; i++) {
free(A[i]);
free(B[i]);
free(C[i]);
free(D[i]);
}
free(A);
free(B);
free(C);
free(D);
puts("All ok? Bye");
exit(EXIT_SUCCESS);
}
(n>2000 might need some patience to get the result)
That is not the whole truth, though. You could (but shouldn't) also try to parallelize the innermost loop with something like
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
This does not seem to be faster, and it is even slower with small n.
With the original code and n = 2500 (only one run):
Timing for naive run = 124.466307
Timing for parallel run = 44.154538
About the same with the reduction:
Timing for naive run = 119.586365
Timing for parallel run = 43.288371
With a smaller n = 500
Timing for naive run = 0.444061
Timing for parallel run = 0.150842
It is already slower with reduction at that size:
Timing for naive run = 0.447894
Timing for parallel run = 0.245481
It might win for very large n but I lack the necessary patience.
Nevertheless, a last one with n = 4000 (OpenMP part only):
Normal:
Timing for parallel run = 174.647404
With reduction:
Timing for parallel run = 179.062463
That difference is still fully inside the error-bars.
A better way to multiply large matrices (at ca. n > 100) would be the Strassen algorithm.
Oh: I just used square matrices for convenience not because they must be of that form! But if you have rectangular matrices with a large length-ratio you might try to change the way the loops run; column-first or row-first can make a significant difference here.
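Applied to the asker's snippet, the advice could look like the sketch below. It assumes row-major matrices in flat arrays and uses the hypothetical names matmul_outer_parallel and print_matrix. Only the outer loop is shared among threads, the inner loop variables and the accumulator are declared inside the loop so every thread has its own, and printing happens after the parallel loop so the output appears in index order.

```c
#include <stdio.h>

/* C = A * B for n x n matrices stored row-major in flat arrays
   (a layout assumed for this sketch). */
void matmul_outer_parallel(int n, const double *A, const double *B, double *C)
{
    /* Only the i loop is distributed; j, m and acc are private by construction. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int m = 0; m < n; m++) {
                acc += A[i * n + m] * B[m * n + j];
            }
            C[i * n + j] = acc; /* each (i, j) is written by exactly one thread */
        }
    }
}

/* Sequential printing, so the lines come out in a deterministic order. */
void print_matrix(int n, const double *C)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            printf("C:i=%d j=%d %f \n", i, j, C[i * n + j]);
        }
    }
}
```

Calling print_matrix after matmul_outer_parallel returns keeps the I/O out of the parallel region entirely.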

OpenMP Performance Issues with Matrix Multiplication

I am having performance issues with OpenMP. I am comparing a single-threaded program that does not use OpenMP against an app that does. Judging by results online that compare matrix chain multiplication programs, the OpenMP implementation should be 2 to 3 times as fast, but my implementation runs at the same speed in both apps. Is the way I am implementing OpenMP incorrect? Any pointers on OpenMP and how to implement it correctly? Any help is much appreciated. Thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main( int argc , char *argv[] )
{
srand(time(0));
if ( argc != 2 )
{
printf("Usage: %s <size of nxn matrices>\n", argv[0]);
return 1;
}
int n = atoi( argv[1] );
int a, b;
double A[n][n], B[n][n], C[n][n];
FILE *fp;
fp = fopen("/home/mkj0002/CPE631/Homework2/ArrayTry/matrixResults", "w+"); //For the LeCASA machine
for(a = 0; a < n; a++)
{
for(b = 0; b < n; b++)
{
A[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
A[a][b] = (double)rand(); //Number between 0 and RAND_MAX
B[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
B[a][b] = (double)rand(); //Number between 0 and RAND_MAX
C[a][b] = 0.0;
}
}
#pragma omp parallel shared(A,B,C)
{
int i,j,k;
#pragma omp for schedule(guided,n)
for(i = 0; i < n; ++i)
{
for(j = 0; j < n; ++j)
{
double sum = 0;
for(k = 0; k < n; ++k)
{
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
fprintf(fp,"%0.4lf",C[i][j]);
}
}
}
if(fp)
{
fclose(fp);
}
fp = NULL;
return 0;
}

(1) Don't perform I/O inside your parallel region. You'll see instantaneous speedup when you move that out and write many C variables simultaneously to file.
(2) After you've done the above, you should then change your scheduling to static because each loop will be doing the exact same amount of computations and there's no longer a need to incur the overhead from fancy scheduling.
(3) Furthermore, to better utilize caching, you should swap your j and k loops. To see this, imagine accessing just your B variable in your current loops.
for(j = 0; j < n; ++j)
{
for(k = 0; k < n; ++k)
{
B[k][j] += 5.0;
}
}
You can see how this accesses B as if it were stored in Fortran's column-major format, striding through memory instead of reading it contiguously. A better alternative is:
for(k = 0; k < n; ++k)
{
for(j = 0; j < n; ++j)
{
B[k][j] += 5.0;
}
}
Coming back to your example though, we still have to deal with the sum variable. An easy suggestion would be storing the row of current sums you're computing and then saving them all once you're done with your current loop.
Combining all 3 steps, we get something like:
#pragma omp parallel shared(A,B,C)
{
int i,j,k;
double sum[n]; // one for each j
#pragma omp for schedule(static)
for(i = 0; i < n; ++i)
{
for(j = 0; j < n; ++j)
sum[j] = 0;
for(k = 0; k < n; ++k)
{
for(j = 0; j < n; ++j)
{
sum[j] += A[i][k] * B[k][j];
}
}
for(j = 0; j < n; ++j)
C[i][j] = sum[j];
}
}
// perform I/O here using contiguous blocks of C variable
Hope that helps.
EDIT: As per @Zboson's suggestion, it would be even easier to simply remove sum[j] entirely and replace it with C[i][j] throughout the program.
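A sketch of what that last suggestion might look like, assuming row-major n x n matrices in flat arrays (the function name matmul_ikj and that layout are illustrative, not the asker's code). The loops run in i, k, j order so that both B and C are walked contiguously, and the products accumulate directly into C.

```c
/* C = A * B with the j and k loops swapped; matrices are n x n,
   row-major, in flat arrays (a layout assumed for this sketch). */
void matmul_ikj(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        /* Clear row i of C before accumulating into it. */
        for (int j = 0; j < n; j++) {
            C[i * n + j] = 0.0;
        }
        for (int k = 0; k < n; k++) {
            double a_ik = A[i * n + k]; /* constant across the inner loop */
            for (int j = 0; j < n; j++) {
                C[i * n + j] += a_ik * B[k * n + j]; /* contiguous in B and C */
            }
        }
    }
}
```

Each thread owns whole rows of C, so no synchronization is needed, and any file output can be done in a single pass after the function returns.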

OpenMP Matrix Multiplication Critical Section

I am trying to parallelize just the innermost loop of matrix multiplication. However, whenever there is more than 1 thread, the matrix multiplication does not store the correct values in the output array, and I am trying to figure out why.
void matrix() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++){
sum = 0;
#pragma omp parallel for shared(sum,i,j) private(k)
for (k = 0; k < N; k++) {
#pragma omp critical
sum = sum + A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
I also tried using:
void matrix() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++){
sum = 0;
#pragma omp parallel for shared(sum,i,j) private(k)
for (k = 0; k < N; k++) {
#pragma omp atomic
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
But that didn't work either. I also tried it without the second #pragma, and with:
void matrixC() {
int i,j,k,sum,np;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++){
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < N; k++) {
sum = sum + A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
I'm new to OpenMP but from everything I've read online, at least one of these solutions should work. I know it's probably a problem with a race condition while adding to sum, but I have no idea why it's still getting the wrong sums.
EDIT: Here is a more complete version of the code:
double A[N][N];
double B[N][N];
double C[N][N];
int CHOOSE = CH;
void matrixSequential() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
sum = 0;
for (k = 0; k < N; k++) {
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
void matrixParallel() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++){
sum = 0;
#pragma omp parallel for shared (i,j) private(k) reduction(+:sum)
for (k = 0; k < N; k++) {
sum = sum + A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
int main(int argc, const char * argv[]) {
//populating arrays
int i,j;
for(i=0; i < N; i++){
for(j=0; j < N; j++){
A[i][j] = i+j;
B[i][j] = i+j;
}
}
for(i=0; i < N; i++){
for(j=0; j < N; j++){
C[i][j] = 0;
}
}
if (CHOOSE == 0) {
matrixSequential();
}
else if(CHOOSE == 1) {
matrixParallel();
}
//checking for correctness
double sum;
for(i=0; i < N; i++){
sum += C[i][i];
}
printf("Sum of diagonal elements of array C: %f \n", sum);
return 0;
}

Making sum a reduction variable is the right way of doing this and should work (see https://computing.llnl.gov/tutorials/openMP/#REDUCTION). Note that you still have to declare your shared and private variables, such as k.
Update
After you updated the question to provide an MCVE, @Zboson found the actual bug: you were declaring the arrays as double but adding them as int.
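A minimal sketch of the corrected inner loop, assuming the fix is simply to give the accumulator the same type as the array elements. The function name dot is hypothetical; it stands in for the k loop of the question. With an int accumulator, every `sum += ...` converts the double result back to int, which drops any fractional part and can overflow for large values.

```c
/* Inner-loop reduction with a double accumulator.
   x plays the role of row i of A, y of column j of B (hypothetical helper). */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0; /* double, matching the element type of the arrays */
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; k++) {
        sum += x[k] * y[k];
    }
    return sum;
}
```

The check in main should use an initialized double as well, since the question's final `double sum;` is read before it is ever assigned.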

IEEE floating-point arithmetic is not associative, i.e. (a+b)+c is not necessarily equal to a+(b+c). Therefore the order in which you reduce an array matters. When you distribute the array elements among different threads, the order changes from that of a sequential sum. The same thing can happen with SIMD. See for example this excellent question about using SIMD to do a reduction: An accumulated computing error in SSE version of algorithm of the sum of squared differences.
Your compiler won't normally use associative floating-point arithmetic unless you tell it to, e.g. with -Ofast, -ffast-math, or -fassociative-math with GCC. For example, in order to use auto-vectorization (SIMD) for a reduction, the compiler requires associative math.
However, when you use OpenMP it automatically assumes associative math, at least for distributing the chunks (within the chunks the compiler still won't use associative arithmetic unless you tell it to), which breaks IEEE floating-point rules. Many people are not aware of this.
Since the reduction depends on the order you may be interested in a result which reduces the numerical uncertainty. One solution is to use Kahan summation.
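For reference, a minimal serial sketch of Kahan (compensated) summation. The function name kahan_sum is an illustrative choice, and how to combine it with an OpenMP reduction (for example one compensated sum per thread) is left open here.

```c
#include <stddef.h>

/* Kahan summation: c carries the low-order bits that each addition loses. */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double c = 0.0; /* running compensation */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c; /* apply the correction from the previous step */
        double t = sum + y;  /* low-order bits of y may be lost here */
        c = (t - sum) - y;   /* recover the lost part, with opposite sign */
        sum = t;
    }
    return sum;
}
```

For an array holding 1.0 followed by ten copies of 1e-16, a plain left-to-right double sum stays at 1.0 because each small term is below half a unit in the last place, while the compensated sum ends up above 1.0. Flags that allow reassociation, such as -ffast-math, may let the compiler simplify the compensation away, so this is best compiled without them.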
