I am new to the OpenMP, not sure what was wrong with this code, the results are not making sense.
Thanks.
#include <omp.h>
#include <stdio.h>
#define N 20
int cnt = 0;
int A[N];
int main (int argc, char *argv[]) {
#pragma omp parallel for
for (int i = 0; i <= N; i++) {
if ((i%2)==0) cnt++;
A[i] = cnt;
printf("i=%d, cnt=%d\n", i, cnt);
}
printf("outside the parallel cnt=%d\n", cnt);
for (int i = 0; i <= N; i++)
printf("A[%d]=%d\n", i, A[i]);
}
Edit:
the cnt outside the parallel region should be 11, most time it was correct, but sometime it gave me 10. For array A I understand why the values do not match with the indices, but I would hope the array A be like this following, is it possible ?
A[0]=1 A[1]=1 A[2]=2 A[3]=2 A[4]=3 A[5]=3 A[6]=4 A[7]=4 A[8]=5 A[9]=5 A[10]=6
A[11]=6 A[12]=7 A[13]=7 A[14]=8 A[15]=8 A[16]=9 A[17]=9 A[18]=10 A[19]=10
A[20]=11
Your code has multiple bugs. Let's address the silly one first. You write to N+1 elements but only allocate N elements. Change N to 21 and then change
for (int i = 0; i <= N; i++)
to
for (int i = 0; i < N; i++)
But your code has another more subtle bug. You're using an induction variable. I don't know an easy way to use induction variables with OpenMP.
In your case one easy fix is not use an induction variable and instead do
#pragma omp parallel for
for (int i = 0; i < N; i++) {
int j = i / 2 + 1;
A[i] = j;
}
cnt = N/2;
You can also use a reduction for the final value of cnt but it's redundant and less efficient.
#pragma omp parallel for reduction(+:cnt)
for (int i = 0; i < N; i++) {
if ((i % 2) == 0) cnt++;
int j = i / 2 + 1;
A[i] = j;
}
If you really want to use an induction variable then you have to do something like this:
#pragma omp parallel
{
int ithread = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int start = ithread*N/nthreads;
int finish = (ithread + 1)*N/nthreads;
int j = start / 2;
if (start % 2) j++;
for (int i = start; i < finish; i++) {
if ((i % 2) == 0) j++;
A[i] = j;
}
}
cnt = N/2;
You can also use a reduction for the final value of cnt but as is clear in the code below it's redundant.
#pragma omp parallel reduction(+:cnt)
{
int ithread = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int start = ithread*N/nthreads;
int finish = (ithread + 1)*N/nthreads;
int j = start / 2;
if (start % 2) j++;
for (int i = start; i <finish; i++) {
if ((i % 2) == 0) {
j++; cnt++;
}
A[i] = j;
}
}
Related
My code aims at combining two arrays into a third one, and doing this multiple times. I'd like to parallelize it using the means of openMP, but the result is not what I expect it to be. Namely, already the very first check breaks at a[0]!=0.
The MWE is:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char *argv[])
{
int i = 0;
int j = 0;
long length = 10000;
long nSamples = 1000;
double *a, *b, *c;
double start = 0;
double end = 0;
int nthreads = 0;
// Array Allocation and Initialisation
a = (double *)malloc(length * sizeof(double));
b = (double *)malloc(length * sizeof(double));
c = (double *)malloc(length * sizeof(double));
for (i = 0; i < length; i++)
{
a[i] = 0.0;
b[i] = 0.0;
c[i] = 0.0;
}
// Get the maximum number of threads and evaluate the number of threads per group
#pragma omp parallel shared(nthreads)
{
nthreads = omp_get_num_threads() / 2;
}
#pragma omp parallel for shared(nthreads, a, b, c) num_threads(nthreads) collapse(2)
for (i = 0; i < nSamples; i++)
{
for (j = 0; j < length; j++)
{
a[j] = a[j] + 2.0;
b[j] = b[j] + 2.0;
c[j] = a[j] + b[j];
}
}
/*Check correctness*/
for (i = 0; i < length; i++)
{
if (a[i] != 2.0 * nSamples)
{
printf("a not equal at %d\n", i);
break;
}
if (b[i] != 2.0 * nSamples)
{
printf("b ot equal at %d\n", i);
break;
}
if (c[i] != 4.0 * nSamples)
{
printf("c ot equal at %d\n", i);
break;
}
}
free(a);
free(b);
free(c);
return 0;
}
If I only parallelize the inner loop, the check works fine. To achieve this, I use:
for (i = 0; i < nSamples; i++)
{
#pragma omp parallel for private(j) shared(a, b, c) num_threads(nthreads)
for (j = 0; j < length; j++)
{
a[j] = a[j] + 2.0;
b[j] = b[j] + 2.0;
c[j] = a[j] + b[j];
}
}
I know that openMP treats each iteration of a loop as independent of other iterations. I assume this causes e.g. iteration i=5 on thread 1 to write into a[j] of iteration i=9 of another thread, causing the difference. Is this correct? Can this be avoided, using the means of openMP?
I have a function that I want to parallelize. This is the serial version.
void parallelCSC_SpMV(float *x, float *b)
{
int i, j;
for(i = 0; i < numcols; i++)
{
for(j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
{
b[irem[j] - 1] += xrem[j]*x[i];
}
}
}
I figured a decent way to do this was to have each thread write to a private copy of the b array (which does not need to be a protected critical section because its a private copy), after the thread is done, it will then copy its results to the actual b array. Here is my code.
void parallelCSC_SpMV(float *x, float *b)
{
int i, j, k;
#pragma omp parallel private(i, j, k)
{
float* b_local = (float*)malloc(sizeof(b));
#pragma omp for nowait
for(i = 0; i < numcols; i++)
{
for(j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
{
float current_add = xrem[j]*x[i];
int index = irem[j] - 1;
b_local[index] += current_add;
}
}
for (k = 0; k < sizeof(b) / sizeof(b[0]); k++)
{
// Separate question: Is this if statement allowed?
//if (b_local[k] == 0) { continue; }
#pragma omp atomic
b[k] += b_local[k];
}
}
}
However, I get a segmentation fault as a result of the second for loop. I do not need to a "#pragma omp for" on that loop because I want each thread to execute it fully. If I comment out the content inside the for loop, no segmentation fault. I am not sure what the issue would be.
You're probabily trying to access an out of range position in the dynamic array b_local.
See that sizeof(b) will return the size in bytes of float* (size of a float pointer).
If you want to know the size of the array that you are passing to the function, i would suggest you add it to the parameters of the function.
void parallelCSC_SpMV(float *x, float *b, int b_size){
...
float* b_local = (float*) malloc(sizeof(float)*b_size);
...
}
And, if the size of colptrs is numcols i would be careful with colptrs[i+1], since when i=numcols-1 will have another out of range problem.
First, as pointed out by Jim Cownie:
In all of these answers, b_local is uninitialised, yet you are adding
to it. You need to use calloc instead of malloc
Just to add to the accepted answer, I thing you can try the following approach to avoid calling malloc in parallel, and also the overhead of calling #pragma omp atomic.
void parallelCSC_SpMV(float *x, float *b, int b_size, int num_threads) {
float* b_local[num_threads];
for(int i = 0; i < num_threads; i++)
b_local[i] = calloc(b_size, sizeof(float));
#pragma omp parallel num_threads(num_threads)
{
int tid = omp_get_thread_num();
#pragma omp for
for(int i = 0; i < numcols; i++){
for(int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++){
float current_add = xrem[j]*x[i];
int index = irem[j] - 1;
b_local[tid][index] += current_add;
}
}
}
for(int id = 0; id < num_threads; id++)
{
#pragma omp for simd
for (int k = 0; k < b_size; k++)
{
b[k] += b_local[id][k];
}
free(b_local[id]);
}
}
I have not tested the performance of this, so please feel free to do so and provide feedback.
You can further optimize by instead of creating a local_b for the master thread just reused the original b, as follows:
void parallelCSC_SpMV(float *x, float *b, int b_size, int num_threads) {
float* b_local[num_threads-1];
for(int i = 0; i < num_threads-1; i++)
b_local[i] = calloc(b_size, sizeof(float));
#pragma omp parallel num_threads(num_threads)
{
int tid = omp_get_thread_num();
float *thread_b = (tid == 0) ? b : b_local[tid-1];
#pragma omp for
for(int i = 0; i < numcols; i++){
for(int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++){
float current_add = xrem[j]*x[i];
int index = irem[j] - 1;
thread_b[index] += current_add;
}
}
}
for(int id = 0; id < num_threads-1; id++)
{
#pragma omp for simd
for (int k = 0; k < b_size; k++)
{
b[k] += b_local[id][k];
}
free(b_local[id]);
}
}
I am currently new to OpenMp and trying to write a simple OpenMP-C matrix-vector multiplication program. On increasing the matrix size to 750x750 elements, my program stops responding and the window hangs. I would like to know if that is a limitation of my laptop or is it a data-race condition I am facing.
I am trying to define a matrix A and a vector u and put random elements (0-10). Then I am calculating the vector result b.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int x_range = 50;
int y_range = 50;
int A[x_range][y_range];
int u[y_range];
int b[y_range];
printf("Measuring time resolution %g\n", omp_get_wtick());
printf("Parallel program start time %g\n", omp_get_wtime());
#pragma omp parallel num_threads(x_range)
{
int b_temp[y_range];
for (int j = 0; j < y_range; j++)
{
b_temp[j] = 0;
}
#pragma omp for
for (int i = 0; i < x_range; i++)
{
for (int j = 0; j < y_range; j++)
{
A[i][j] = (rand() % 10) + 1;
}
}
#pragma omp for
for (int j = 0; j < y_range; j++)
{
{
u[j] = (rand() % 10) + 1;
}
}
#pragma omp for
for (int i = 0; i < x_range; i++)
{
for(int j = 0; j < y_range; j++)
{
b_temp[i] = b_temp[i] + A[i][j]*u[j];
}
}
#pragma omp critical
for(int j = 0; j < y_range; j++)
{
b[j] = b[j] + b_temp[j];
}
}
printf("parallel program end time %g\n", omp_get_wtime());
return 0;
}
First off, operations you're performing cannot have data race conditions, because there's no RAW , WAR , WAW dependency. You can read more about them in wiki.
Secondly, Your system is hanging because you're creating 750 threads as dictated by x_range
I made this parallel matrix multiplication program using nesting of for loops in OpenMP. When I run the program the displays the answer randomly ( mostly ) with varying indice of the resultant matrix. Here is the snippet of the code :
#pragma omp parallel for
for(i=0;i<N;i++){
#pragma omp parallel for
for(j=0;j<N;j++){
C[i][j]=0;
#pragma omp parallel for
for(m=0;m<N;m++){
C[i][j]=A[i][m]*B[m][j]+C[i][j];
}
printf("C:i=%d j=%d %f \n",i,j,C[i][j]);
}
}
These are the symptoms of a so called "race conditions" as the commenters already stated.
The threads OpenMP uses are independent of each other but the results of the individual loops of the matrix multiplication are not, so one thread might be at a different position than the other one and suddenly you are in trouble if you depend on the order of the results.
You can only parallelize the outmost loop:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main(int argc, char **argv)
{
int n;
double **A, **B, **C, **D, t;
int i, j, k;
struct timeval start, stop;
if (argc != 2) {
fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
exit(EXIT_FAILURE);
}
n = atoi(argv[1]);
if (n <= 2 || n >= 1000000) {
fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
exit(EXIT_FAILURE);
}
// make it repeatable
srand(0xdeadbeef);
// allocate memory for and initialize A
A = malloc(sizeof(*A) * n);
for (i = 0; i < n; i++) {
A[i] = malloc(sizeof(**A) * n);
for (j = 0; j < n; j++) {
A[i][j] = (double) ((rand() % 100) / 99.);
}
}
// do the same for B
B = malloc(sizeof(*B) * n);
for (i = 0; i < n; i++) {
B[i] = malloc(sizeof(**B) * n);
for (j = 0; j < n; j++) {
B[i][j] = (double) ((rand() % 100) / 99.);
}
}
// and C but initialize with zero
C = malloc(sizeof(*C) * n);
for (i = 0; i < n; i++) {
C[i] = malloc(sizeof(**C) * n);
for (j = 0; j < n; j++) {
C[i][j] = 0.0;
}
}
// ditto with D
D = malloc(sizeof(*D) * n);
for (i = 0; i < n; i++) {
D[i] = malloc(sizeof(**D) * n);
for (j = 0; j < n; j++) {
D[i][j] = 0.0;
}
}
// some coarse timing
gettimeofday(&start, NULL);
// naive matrix multiplication
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
gettimeofday(&stop, NULL);
t = ((stop.tv_sec - start.tv_sec) * 1000000u +
stop.tv_usec - start.tv_usec) / 1.e6;
printf("Timing for naive run = %.10g\n", t);
gettimeofday(&start, NULL);
#pragma omp parallel shared(A, B, C) private(i, j, k)
#pragma omp for
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
D[i][j] = D[i][j] + A[i][k] * B[k][j];
}
}
}
gettimeofday(&stop, NULL);
t = ((stop.tv_sec - start.tv_sec) * 1000000u +
stop.tv_usec - start.tv_usec) / 1.e6;
printf("Timing for parallel run = %.10g\n", t);
// check result
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
if (D[i][j] != C[i][j]) {
printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
D[i][j] - C[i][j]);
}
}
}
// clean up
for (i = 0; i < n; i++) {
free(A[i]);
free(B[i]);
free(C[i]);
free(D[i]);
}
free(A);
free(B);
free(C);
free(D);
puts("All ok? Bye");
exit(EXIT_SUCCESS);
}
(n>2000 might need some patience to get the result)
But it's not fully true. You could (but shouldn't) try to get the innermost loop with something like
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
Does not seem to be faster, is even slower with small n.
With the original code and n = 2500 (only one run):
Timing for naive run = 124.466307
Timing for parallel run = 44.154538
About the same with the reduction:
Timing for naive run = 119.586365
Timing for parallel run = 43.288371
With a smaller n = 500
Timing for naive run = 0.444061
Timing for parallel run = 0.150842
It is already slower with reduction at that size:
Timing for naive run = 0.447894
Timing for parallel run = 0.245481
It might win for very large n but I lack the necessary patience.
Nevertheless, a last one with n = 4000 (OpenMP part only):
Normal:
Timing for parallel run = 174.647404
With reduction:
Timing for parallel run = 179.062463
That difference is still fully inside the error-bars.
A better way to multiply large matrices (at ca. n>100 ) would be the Schönhage-Straßen algorithm.
Oh: I just used square matrices for convenience not because they must be of that form! But if you have rectangular matrices with a large length-ratio you might try to change the way the loops run; column-first or row-first can make a significant difference here.
I am having issues with the performance using OpenMp. I am trying to test the results of a single threaded program not using OpenMP and an app using OpenMP. By looking at results online that are comparing matrix chain multiplication programs the openMP implementation is 2 to 3 times as fast, but my implementation is the same speed for both apps. Is the way I am implementing openMP incorrect? Any pointers on openMP and how to correctly implement it? Any help is much appreciated. Thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main( int argc , char *argv[] )
{
srand(time(0));
if ( argc != 2 )
{
printf("Usage: %s <size of nxn matrices>\n", argv[0]);
return 1;
}
int n = atoi( argv[1] );
int a, b;
double A[n][n], B[n][n], C[n][n];
FILE *fp;
fp = fopen("/home/mkj0002/CPE631/Homework2/ArrayTry/matrixResults", "w+"); //For the LeCASA machine
for(a = 0; a < n; a++)
{
for(b = 0; b < n; b++)
{
A[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
A[a][b] = (double)rand(); //Number between 0 and RAND_MAX
B[a][b] = ((double)rand()/(double)RAND_MAX); //Number between 0 and 1
B[a][b] = (double)rand(); //Number between 0 and RAND_MAX
C[a][b] = 0.0;
}
}
#pragma omp parallel shared(A,B,C)
{
int i,j,k;
#pragma omp for schedule(guided,n)
for(i = 0; i < n; ++i)
{
for(j = 0; j < n; ++j)
{
double sum = 0;
for(k = 0; k < n; ++k)
{
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
fprintf(fp,"0.4lf",C[i][j]);
}
}
}
if(fp)
{
fclose(fp);
}
fp = NULL;
return 0;
}
(1) Don't perform I/O inside your parallel region. You'll see instantaneous speedup when you move that out and write many C variables simultaneously to file.
(2) After you've done the above, you should then change your scheduling to static because each loop will be doing the exact same amount of computations and there's no longer a need to incur the overhead from fancy scheduling.
(3) Furthermore, to better utilize caching, you should swap your j and k loops. To see this, imagine accessing just your B variable in your current loops.
for(j = 0; j < n; ++j)
{
for(k = 0; k < n; ++k)
{
B[k][j] += 5.0;
}
}
You can see how this accesses B as if it was stored in Fortran's column-major format. More info can be found here. A better alternative is:
for(k = 0; k < n; ++k)
{
for(j = 0; j < n; ++j)
{
B[k][j] += 5.0;
}
}
Coming back to your example though, we still have to deal with the sum variable. An easy suggestion would be storing the row of current sums you're computing and then saving them all once you're done with your current loop.
Combining all 3 steps, we get something like:
#pragma omp parallel shared(A,B,C)
{
int i,j,k;
double sum[n]; // one for each j
#pragma omp for schedule(static)
for(i = 0; i < n; ++i)
{
for(j = 0; j < n; ++j)
sum[j] = 0;
for(k = 0; k < n; ++k)
{
for(j = 0; j < n; ++j)
{
sum[j] += A[i][k] * B[k][j];
}
}
for(j = 0; j < n; ++j)
C[i][j] = sum[j];
}
}
// perform I/O here using contiguous blocks of C variable
Hope that helps.
EDIT: As per #Zboson's suggestion, it would be even easier to simply remove sum[j] entirely and replace it with C[i][j] throughout the program.