How do I parallelize this function with OpenMP in C?
int zeroRow(int **A, int n) {
    int i, j, sum, num = 0;
    for (i = 0; i < n; i++) {
        sum = 0;
        for (j = 0; j < n; j++) {
            sum += A[i][j];
        }
        if (sum == 0) {
            num++;
        }
    }
    return num;
}
I did this; please check whether it is the right procedure.
int zeroRow(int **A, int n) {
    int num = 0;
    #pragma omp parallel for reduction(+:num);
    for (int i = 0; i < n; i++) {
        int sum = 0;
        for (int j = 0; j < n; j++) {
            sum += A[i][j];
        }
        if (sum == 0) {
            num++;
        }
    }
    return num;
}
Please tell me whether what I have done is right or wrong. I have parallelized the outer loop using a reduction, so each thread gets its own copy of num.
It looks correctly parallelized.
The only thing you should add is a clause specifying how A is used.
You are relying on the default data-sharing attribute being shared; you should state it explicitly with
#pragma omp parallel for reduction(+:num) default(shared)
or
#pragma omp parallel for reduction(+:num) shared(A)
Also, you do not need the semicolon (;) at the end of the pragma line; some compilers tolerate it, but others complain, so it is best left out.
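For reference, here is a minimal sketch of the whole function with the data sharing spelled out. Using default(none) is an extra strictness choice (not something required above); it forces every variable to be listed explicitly:

int zeroRow(int **A, int n) {
    int num = 0;
    // default(none): every variable used inside must be listed explicitly;
    // i, j, and sum are declared inside the loop, so they are private automatically.
    #pragma omp parallel for reduction(+:num) default(none) shared(A, n)
    for (int i = 0; i < n; i++) {
        int sum = 0;
        for (int j = 0; j < n; j++) {
            sum += A[i][j];
        }
        if (sum == 0) {
            num++;
        }
    }
    return num;
}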
I have a function that I want to parallelize. This is the serial version.
void parallelCSC_SpMV(float *x, float *b)
{
int i, j;
for(i = 0; i < numcols; i++)
{
for(j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
{
b[irem[j] - 1] += xrem[j]*x[i];
}
}
}
I figured a decent way to do this was to have each thread write to a private copy of the b array (which does not need a critical section, because it is a private copy); after a thread is done, it then copies its results into the actual b array. Here is my code.
void parallelCSC_SpMV(float *x, float *b)
{
int i, j, k;
#pragma omp parallel private(i, j, k)
{
float* b_local = (float*)malloc(sizeof(b));
#pragma omp for nowait
for(i = 0; i < numcols; i++)
{
for(j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
{
float current_add = xrem[j]*x[i];
int index = irem[j] - 1;
b_local[index] += current_add;
}
}
for (k = 0; k < sizeof(b) / sizeof(b[0]); k++)
{
// Separate question: Is this if statement allowed?
//if (b_local[k] == 0) { continue; }
#pragma omp atomic
b[k] += b_local[k];
}
}
}
However, I get a segmentation fault as a result of the second for loop. I do not need a "#pragma omp for" on that loop because I want each thread to execute it fully. If I comment out the body of that for loop, there is no segmentation fault. I am not sure what the issue would be.
You're probably trying to access an out-of-range position in the dynamic array b_local.
Note that sizeof(b) returns the size in bytes of a float* (the size of a pointer), not the size of the array it points to.
If you want to know the size of the array you are passing to the function, I would suggest adding it to the function's parameters.
void parallelCSC_SpMV(float *x, float *b, int b_size){
...
float* b_local = (float*) malloc(sizeof(float)*b_size);
...
}
Also, if the size of colptrs is numcols, I would be careful with colptrs[i+1]: when i == numcols - 1 you will have another out-of-range access.
First, as pointed out by Jim Cownie:
In all of these answers, b_local is uninitialised, yet you are adding
to it. You need to use calloc instead of malloc
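Putting the accepted answer's b_size parameter together with that comment, the straightforward fix would look roughly like this (a sketch only, assuming the same globals numcols, colptrs, irem, and xrem as in the question):

void parallelCSC_SpMV(float *x, float *b, int b_size)
{
    #pragma omp parallel
    {
        // calloc, not malloc: the += below relies on the buffer starting at zero
        float *b_local = (float*) calloc(b_size, sizeof(float));
        #pragma omp for nowait
        for (int i = 0; i < numcols; i++) {
            for (int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++) {
                b_local[irem[j] - 1] += xrem[j] * x[i];
            }
        }
        // every thread folds its private buffer into the shared b
        for (int k = 0; k < b_size; k++) {
            #pragma omp atomic
            b[k] += b_local[k];
        }
        free(b_local);
    }
}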
Just to add to the accepted answer, I think you can try the following approach to avoid calling malloc in parallel and also the overhead of #pragma omp atomic.
void parallelCSC_SpMV(float *x, float *b, int b_size, int num_threads) {
float* b_local[num_threads];
for(int i = 0; i < num_threads; i++)
b_local[i] = calloc(b_size, sizeof(float));
#pragma omp parallel num_threads(num_threads)
{
int tid = omp_get_thread_num();
#pragma omp for
for(int i = 0; i < numcols; i++){
for(int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++){
float current_add = xrem[j]*x[i];
int index = irem[j] - 1;
b_local[tid][index] += current_add;
}
}
}
for(int id = 0; id < num_threads; id++)
{
#pragma omp for simd
for (int k = 0; k < b_size; k++)
{
b[k] += b_local[id][k];
}
free(b_local[id]);
}
}
I have not tested the performance of this, so please feel free to do so and provide feedback.
You can optimize further by reusing the original b for the master thread instead of creating a b_local for it, as follows:
void parallelCSC_SpMV(float *x, float *b, int b_size, int num_threads) {
float* b_local[num_threads-1];
for(int i = 0; i < num_threads-1; i++)
b_local[i] = calloc(b_size, sizeof(float));
#pragma omp parallel num_threads(num_threads)
{
int tid = omp_get_thread_num();
float *thread_b = (tid == 0) ? b : b_local[tid-1];
#pragma omp for
for(int i = 0; i < numcols; i++){
for(int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++){
float current_add = xrem[j]*x[i];
int index = irem[j] - 1;
thread_b[index] += current_add;
}
}
}
for(int id = 0; id < num_threads-1; id++)
{
#pragma omp for simd
for (int k = 0; k < b_size; k++)
{
b[k] += b_local[id][k];
}
free(b_local[id]);
}
}
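As a further note, if the compiler supports OpenMP 4.5 array-section reductions, the hand-rolled per-thread buffers can be expressed as a reduction clause instead. A sketch under that assumption, using the same globals as the question:

void parallelCSC_SpMV(float *x, float *b, int b_size)
{
    // OpenMP 4.5+: each thread gets a private, zero-initialized copy of
    // b[0:b_size], and the copies are summed back into b after the loop.
    #pragma omp parallel for reduction(+:b[:b_size])
    for (int i = 0; i < numcols; i++) {
        for (int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++) {
            b[irem[j] - 1] += xrem[j] * x[i];
        }
    }
}

Whether this beats the manual version depends on how the runtime allocates and combines the private copies, so it is worth benchmarking both.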
I am new to OpenMP and trying to write a simple OpenMP C matrix-vector multiplication program. On increasing the matrix size to 750x750 elements, my program stops responding and the window hangs. I would like to know whether that is a limitation of my laptop or a data race I am facing.
I define a matrix A and a vector u and fill them with random elements (0-10). Then I calculate the result vector b.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int x_range = 50;
int y_range = 50;
int A[x_range][y_range];
int u[y_range];
int b[y_range];
printf("Measuring time resolution %g\n", omp_get_wtick());
printf("Parallel program start time %g\n", omp_get_wtime());
#pragma omp parallel num_threads(x_range)
{
int b_temp[y_range];
for (int j = 0; j < y_range; j++)
{
b_temp[j] = 0;
}
#pragma omp for
for (int i = 0; i < x_range; i++)
{
for (int j = 0; j < y_range; j++)
{
A[i][j] = (rand() % 10) + 1;
}
}
#pragma omp for
for (int j = 0; j < y_range; j++)
{
{
u[j] = (rand() % 10) + 1;
}
}
#pragma omp for
for (int i = 0; i < x_range; i++)
{
for(int j = 0; j < y_range; j++)
{
b_temp[i] = b_temp[i] + A[i][j]*u[j];
}
}
#pragma omp critical
for(int j = 0; j < y_range; j++)
{
b[j] = b[j] + b_temp[j];
}
}
printf("parallel program end time %g\n", omp_get_wtime());
return 0;
}
First off, the operations you're performing cannot have data-race conditions, because there is no RAW, WAR, or WAW dependency. You can read more about them on Wikipedia.
Secondly, your system is hanging because you're creating 750 threads, as dictated by num_threads(x_range).
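If it helps, here is a minimal sketch of the multiplication step with the thread count left to the runtime (variable names are taken from the question; the random initialization of A and u is omitted):

// Let the runtime pick the thread count instead of num_threads(x_range).
#pragma omp parallel for
for (int i = 0; i < x_range; i++) {
    int row_sum = 0;
    for (int j = 0; j < y_range; j++) {
        row_sum += A[i][j] * u[j];
    }
    b[i] = row_sum; // each row i belongs to exactly one thread
}

Because every thread writes to distinct elements of b, the per-thread b_temp buffer and the critical section from the question are not needed here.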
I was doing an activity at my university that requires populating a matrix of [2000][2000] elements and then calculating, in parallel, the sum of all elements that are multiples of 5.
At first I tried using a 5 x 5 matrix; I computed a partial sum (sumP) of the elements and then added the partial sums into a variable called sum inside a critical region.
On my university computer the partial sum was receiving garbage values (like 36501) when the values should be lower than 100; I noticed that it only happened on [0][i] (row zero) of the matrix.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 5
int main() {
int i, j, k, l;
int sum = 0;
int sumP = 0;
int A[N][N];
printf("sumP : %i\n", sumP );
printf("sum: %i\n", sum);
#pragma omp parallel shared (A) private (i, j)
{
#pragma omp for
for (i = 0; i < N; i++) {
for(j = 0; j < N; j++){
A[i][j] = i%5;
printf("Number: %i, pos[%i][%i]\n", A[i][j], i, j);
}
}
}
#pragma omp parallel shared(A, sum) private (k, l, sumP)
{
#pragma omp for
for (k = 0; k < N; k++) {
for (l = 0; l < N; l++){
if (A[l][k] % 5 == 0 && A[l][k] != 0){
sumP = sumP + A[k][l];
printf("numero: %i, pos [%i],[%i] sumP: %i\n", A[k][l], k, l, sumP);
}
}
}
#pragma omp critical
sum += sumP;
}
//printf("sumP: %i\n", sumP);
printf("sum: %i\n", sum);
return (0);
}
I tested it setting sumP to 0 between the "for" statements, and it worked:
#pragma omp parallel shared(A, sum) private (k, l, sumP)
{
#pragma omp for
for (k = 0; k < N; k++) {
sumP = 0;
for (l = 0; l < N; l++){
When I tested it at home it worked without having to set sumP to 0 (the partial sum), like I did above, but now the final sum result is not correct...
You observe this behavior because private variables in OpenMP are uninitialized. To be precise, they behave like local variables without an explicit initialization, which means their initial value is undefined. You observe different behavior on different systems because some combinations of compiler, options, and OS use this "undefined" differently. Your code is incorrect in any case, even if it sometimes produces the correct result.
You can set the variable to zero as you tried. However, I would generally suggest declaring variables as locally as possible instead. This makes reasoning about the (parallel) code much easier, and you can omit the private/shared declarations. Your code would then look like this:
#pragma omp parallel
{
int sumP = 0;
#pragma omp for
for (int k = 0; k < N; k++) {
for (int l = 0; l < N; l++) {
if (A[l][k] % 5 == 0 && A[l][k] != 0) {
sumP = sumP + A[k][l];
printf("numero: %i, pos [%i],[%i] sumP: %i\n", A[k][l], k, l, sumP);
}
}
}
#pragma omp critical
sum += sumP;
}
In addition to that, there is another way to drastically simplify this code by using a reduction:
#pragma omp parallel for reduction(+:sum)
for (int k = 0; k < N; k++) {
for (int l = 0; l < N; l++) {
if (A[l][k] % 5 == 0 && A[l][k] != 0) {
sum += A[k][l];
}
}
}
The compiler will basically do the same thing for you (but better) and the code is much cleaner.
Considering that your code would spend most of its time doing I/O, it would be a good idea to comment out the printf calls.
As I understand it, sumP should contain the partial sum of your inner loop.
The pragmas have been compressed for readability.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 1000
int main() {
int i, j;
int sum = 0;
int sumP = 0;
int A[N][N]; // will cause segfault with large N
printf("sumP : %i\n", sumP );
printf("sum: %i\n", sum);
#pragma omp parallel for shared (A) private (i, j)
for (i = 0; i < N; i++) {
for(j = 0; j < N; j++){
A[i][j] = i%5; // populate array with numbers in [0,1,2,3,4]
//printf("Number: %i, pos[%i][%i]\n", A[i][j], i, j);
}
}
#pragma omp parallel for shared(A) private (i, j, sumP) reduction(+: sum)
for (i = 0; i< N; i++) { // outer (parallel)loop
sumP = 0; // initialize partial sum
for (j = 0; j < N; j++){ // inner sequential loop
//if (A[i][j] % 5 == 0 && A[i][j] != 0){ // Explain this condition
sumP += A[i][j];
//printf("numero: %i, pos [%i],[%i] sumP: %i\n", A[i][j], i, j, sumP);
//}
}
//printf("sumP: %i\n", sumP);
sum += sumP; // add partial sum
}
//printf("sumP: %i\n", sumP);
printf("sum: %i\n", sum);
return (0);
}
I am having performance issues with OpenMP. I am comparing a single-threaded program that does not use OpenMP with an app that does. Looking at results online that compare matrix chain multiplication programs, the OpenMP implementation is 2 to 3 times as fast, but my implementation runs at the same speed for both apps. Is the way I am implementing OpenMP incorrect? Any pointers on OpenMP and how to implement it correctly? Any help is much appreciated. Thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main( int argc , char *argv[] )
{
srand(time(0));
if ( argc != 2 )
{
printf("Usage: %s <size of nxn matrices>\n", argv[0]);
return 1;
}
int n = atoi( argv[1] );
int a, b;
double A[n][n], B[n][n], C[n][n];
FILE *fp;
fp = fopen("/home/mkj0002/CPE631/Homework2/ArrayTry/matrixResults", "w+"); //For the LeCASA machine
for(a = 0; a < n; a++)
{
for(b = 0; b < n; b++)
{
A[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
A[a][b] = (double)rand(); //Number between 0 and RAND_MAX
B[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
B[a][b] = (double)rand(); //Number between 0 and RAND_MAX
C[a][b] = 0.0;
}
}
#pragma omp parallel shared(A,B,C)
{
int i,j,k;
#pragma omp for schedule(guided,n)
for(i = 0; i < n; ++i)
{
for(j = 0; j < n; ++j)
{
double sum = 0;
for(k = 0; k < n; ++k)
{
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
fprintf(fp,"0.4lf",C[i][j]);
}
}
}
if(fp)
{
fclose(fp);
}
fp = NULL;
return 0;
}
(1) Don't perform I/O inside your parallel region. You'll see an immediate speedup when you move that out and write the elements of C to the file afterwards in contiguous blocks.
(2) After you've done the above, you should change your scheduling to static, because each iteration does exactly the same amount of computation and there's no longer a need to incur the overhead of fancier scheduling.
(3) Furthermore, to make better use of caching, you should swap your j and k loops. To see why, imagine accessing just your B variable in your current loops.
for(j = 0; j < n; ++j)
{
for(k = 0; k < n; ++k)
{
B[k][j] += 5.0;
}
}
You can see how this accesses B as if it were stored in Fortran's column-major format: each increment of k jumps a whole row ahead in memory, which defeats the cache. A better alternative is:
for(k = 0; k < n; ++k)
{
for(j = 0; j < n; ++j)
{
B[k][j] += 5.0;
}
}
Coming back to your example though, we still have to deal with the sum variable. An easy suggestion would be storing the row of current sums you're computing and then saving them all once you're done with your current loop.
Combining all 3 steps, we get something like:
#pragma omp parallel shared(A,B,C)
{
int i,j,k;
double sum[n]; // one for each j
#pragma omp for schedule(static)
for(i = 0; i < n; ++i)
{
for(j = 0; j < n; ++j)
sum[j] = 0;
for(k = 0; k < n; ++k)
{
for(j = 0; j < n; ++j)
{
sum[j] += A[i][k] * B[k][j];
}
}
for(j = 0; j < n; ++j)
C[i][j] = sum[j];
}
}
// perform I/O here using contiguous blocks of C variable
Hope that helps.
EDIT: As per #Zboson's suggestion, it would be even easier to simply remove sum[j] entirely and replace it with C[i][j] throughout the program.
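For reference, here is a sketch of that simplified version (same arrays as above; an illustration rather than a tested drop-in). The file output is done afterwards; note that the format string in the question is missing its %, so something like "%0.4lf" is needed:

#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i)
{
    for (int j = 0; j < n; ++j)
        C[i][j] = 0.0;
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            C[i][j] += A[i][k] * B[k][j];
}
// I/O after the parallel region, one contiguous row of C at a time
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        fprintf(fp, "%0.4lf ", C[i][j]);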
I am trying to parallelize just the innermost loop of matrix multiplication. However, whenever there is more than 1 thread, the matrix multiplication does not store the correct values in the output array, and I am trying to figure out why.
void matrix() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++){
sum = 0;
#pragma omp parallel for shared(sum,i,j) private(k)
for (k = 0; k < N; k++) {
#pragma omp critical
sum = sum + A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
I also tried using:
void matrix() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++){
sum = 0;
#pragma omp parallel for shared(sum,i,j) private(k)
for (k = 0; k < N; k++) {
#pragma omp atomic
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
But that didn't work either. I also tried it without the second #pragma, and with:
void matrixC() {
int i,j,k,sum,np;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++){
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < N; k++) {
sum = sum + A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
I'm new to OpenMP, but from everything I've read online, at least one of these solutions should work. I know it's probably a problem with a race condition while adding to sum, but I have no idea why it's still getting the wrong sums.
EDIT: Here is a more complete version of the code:
double A[N][N];
double B[N][N];
double C[N][N];
int CHOOSE = CH;
void matrixSequential() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
sum = 0;
for (k = 0; k < N; k++) {
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
void matrixParallel() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++){
sum = 0;
#pragma omp parallel for shared (i,j) private(k) reduction(+:sum)
for (k = 0; k < N; k++) {
sum = sum + A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
int main(int argc, const char * argv[]) {
//populating arrays
int i,j;
for(i=0; i < N; i++){
for(j=0; j < N; j++){
A[i][j] = i+j;
B[i][j] = i+j;
}
}
for(i=0; i < N; i++){
for(j=0; j < N; j++){
C[i][j] = 0;
}
}
if (CHOOSE == 0) {
matrixSequential();
}
else if(CHOOSE == 1) {
matrixParallel();
}
//checking for correctness
double sum;
for(i=0; i < N; i++){
sum += C[i][i];
}
printf("Sum of diagonal elements of array C: %f \n", sum);
return 0;
}
Making sum a reduction variable is the right way of doing this and should work (see https://computing.llnl.gov/tutorials/openMP/#REDUCTION). Note that you still have to declare your shared and private variables, such as k.
Update
After you updated your question to provide an MCVE, #Zboson found the actual bug: you declare the arrays as double but accumulate them into an int sum.
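A sketch of the fix implied by that diagnosis, with sum declared as double to match A and B and the reduction kept (k is declared inside the loop here, which makes it private automatically):

void matrixParallel() {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0; // double, matching the element type of A and B
            #pragma omp parallel for reduction(+:sum)
            for (int k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}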
IEEE floating-point arithmetic is not associative, i.e. (a+b)+c is not necessarily equal to a+(b+c). Therefore the order in which you reduce an array matters. When you distribute the array elements among different threads, the order changes from that of a sequential sum. The same thing can happen using SIMD. See for example this excellent question on using SIMD to do a reduction: An accumulated computing error in SSE version of algorithm of the sum of squared differences.
Your compiler won't normally use associative floating-point arithmetic unless you tell it to, e.g. with -Ofast, -ffast-math, or -fassociative-math with GCC. For example, in order to auto-vectorize (SIMD) a reduction, the compiler requires associative math.
However, when you use OpenMP it automatically assumes associative math, at least for distributing the chunks (within the chunks the compiler still won't use associative arithmetic unless you tell it to), breaking IEEE floating-point rules. Many people are not aware of this.
Since the reduction depends on the order, you may be interested in a result that reduces the numerical uncertainty. One solution is to use Kahan summation.
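For illustration, here is a minimal sketch of Kahan (compensated) summation for a plain float array; note that aggressive flags such as -ffast-math are allowed to optimize the compensation away:

// Compensated (Kahan) summation: tracks the low-order bits lost in each addition.
float kahan_sum(const float *a, int n) {
    float sum = 0.0f; // running total
    float c   = 0.0f; // running compensation for lost low-order bits
    for (int i = 0; i < n; i++) {
        float y = a[i] - c;   // apply the correction to the next term
        float t = sum + y;    // low-order bits of y are lost in this addition
        c = (t - sum) - y;    // recover what was lost
        sum = t;              // new running total
    }
    return sum;
}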