Why is my nested OpenMP program taking more time to execute? - c

My OpenMP matrix multiplication program, which uses nested for loops, takes more execution time than the non-nested version of the parallel program. This is the block where I have used nested parallelisation.
omp_set_nested(1);
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp parallel for
    for (j = 0; j < N; j++) {
        C[i][j] = 0.;                      // set initial value of resulting matrix C = 0
        #pragma omp parallel for
        for (m = 0; m < N; m++) {
            C[i][j] = A[i][m] * B[m][j] + C[i][j];
        }
        printf("C: i=%d j=%d %f\n", i, j, C[i][j]);
    }
}
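The non-nested version being compared against is not shown in the question. For reference, a minimal self-contained sketch that parallelises only the outer loop might look like the following (N and the test data are assumptions, not the asker's values). With nesting enabled, each inner #pragma omp parallel can start its own team of threads, which adds overhead on every iteration.
#include <stdio.h>

#define N 512                          /* assumed size; the original N is not shown */

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    int i, j, m;

    for (i = 0; i < N; i++)            /* fill A and B with test data */
        for (j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
        }

    #pragma omp parallel for private(j, m)   /* one parallel region, outer loop only */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            C[i][j] = 0.;
            for (m = 0; m < N; m++)
                C[i][j] += A[i][m] * B[m][j];
        }

    printf("C[0][0]=%f C[N-1][N-1]=%f\n", C[0][0], C[N - 1][N - 1]);
    return 0;
}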

Related

Using omp_get_num_threads() inside the parallel section

I would like to set the number of threads in OpenMP. When I use
omp_set_num_threads(2);
printf("nthread = %d\n", omp_get_num_threads());
#pragma omp parallel for
...
I see nthread = 1. As explained here, the reported number of threads belongs to the serial section, which is always 1. However, when I move the call to the line after #pragma, I get a compilation error saying that a for loop is expected after the #pragma. So, how can I fix that?
Well, yeah, omp parallel for expects a loop on the next line. You can call omp_get_num_threads() inside that loop. Outside a parallel section, you can call omp_get_max_threads() for the maximum number of threads that will be spawned. This is what you are looking for.
int max_threads = omp_get_max_threads();
#pragma omp parallel for
for (...) {
    int current_threads = omp_get_num_threads();
    assert(current_threads == max_threads);
}

#pragma omp parallel
{
    int current_threads = omp_get_num_threads();
    #pragma omp for
    for (...) {
        ...
    }
}
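Putting the two calls together, a minimal compilable sketch (built with an OpenMP-enabled compiler, e.g. gcc -fopenmp) could look like this:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(2);                                   /* request two threads             */
    printf("max threads = %d\n", omp_get_max_threads());      /* valid outside a parallel region */

    #pragma omp parallel
    {
        #pragma omp single
        printf("team size   = %d\n", omp_get_num_threads());  /* valid inside a parallel region  */

        #pragma omp for
        for (int i = 0; i < 4; i++)
            printf("iteration %d on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}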

Why does the result of #pragma omp parallel look like this

#include <stdio.h>

int main()
{
    int a = 0, b = 0;
    #pragma omp parallel num_threads(16)
    {
        // #pragma omp single
        a++;
        // #pragma omp critical
        b++;
    }
    printf("single: %d -- critical: %d\n", a, b);
}
Why is my output single: 5 -- critical: 5 here?
And why is it output single: 3 -- critical: 3 when num_threads(4)?
I should not write code like this, right? If the threads are interfering with each other here (I guess), why is the result consistent?
You have race conditions, therefore the values of a and b are undefined. To correct it you can
a) use reduction (it is the best solution):
#pragma omp parallel num_threads(16) reduction(+:a,b)
b) use atomic operations (generally it is less efficient):
#pragma omp atomic
a++;
#pragma omp atomic
b++;
c) use critical section (i.e. locks, which is the worst solution):
#pragma omp critical
{
    a++;
    b++;
}
That's just pure luck, helped by the inherent timing of the operations. Since this is the first parallel section in your program, the threads have to be created rather than reused. That occupies the main thread for a while and results in the threads starting one after another, so the chance of the increments overlapping is lower. Try adding a #pragma omp barrier between the two increments to make the race condition more likely to show up.
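Putting option (a) from above together as a minimal, self-contained sketch: each thread increments its own private copies of a and b once, and the copies are summed into the originals at the end of the region.
#include <stdio.h>

int main(void)
{
    int a = 0, b = 0;
    #pragma omp parallel num_threads(16) reduction(+:a, b)
    {
        a++;    /* each thread increments its private copy                  */
        b++;    /* the private copies are summed into a and b at region end */
    }
    printf("a: %d -- b: %d\n", a, b);   /* should print 16 for both if 16 threads run */
    return 0;
}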

OpenMP minimum value array

I have the original code:
min = INT_MAX;
for (i = 0; i < N; i++)
    if (A[i] < min)
        min = A[i];
for (i = 0; i < N; i++)
    A[i] = A[i] - min;
I want to get the parallel version of this and I did this:
min = INT_MAX;
#pragma omp parallel private(i)
{
    minl = INT_MAX;
    #pragma omp for
    for (i = 0; i < N; i++)
        if (A[i] < minl)
            minl = A[i];
    #pragma omp critical
    {
        if (minl < min)
            min = minl;
    }
    #pragma omp for
    for (i = 0; i < N; i++)
        A[i] = A[i] - min;
}
Is the parallel code right? I was wondering if it is necessary to write #pragma omp barrier before #pragma omp critical so that I make sure that all the minimums are calculated before calculating the global minimum.
You do not need a barrier before the critical section: there is already an implicit barrier at the end of the first worksharing loop, and in any case there is no need for all the minl values to be computed before one thread enters the critical section. What the code does need is a #pragma omp barrier between the critical section and the second loop; without it, a thread can start computing A[i] = A[i] - min before the other threads have merged their minl into min.
Further, you do not need to explicitly declare the loop iteration variable i private; the iteration variable of a worksharing loop is made private automatically.
You can improve the code by using a reduction instead of your manual merging of minl:
#pragma omp for reduction(min:min)
for (i = 0; i < N; i++)
    if (A[i] < min)
        min = A[i];
Note: The min operator for reduction is available since OpenMP 3.1.
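A minimal, self-contained version of the reduction-based code might look like this (the array contents and N are made up for illustration). The implicit barrier at the end of the reduction loop guarantees that min is final before the second loop reads it.
#include <stdio.h>
#include <limits.h>

#define N 8

int main(void)
{
    int A[N] = {7, 3, 9, 1, 4, 8, 2, 6};
    int min = INT_MAX;
    int i;

    #pragma omp parallel
    {
        #pragma omp for reduction(min:min)
        for (i = 0; i < N; i++)
            if (A[i] < min)
                min = A[i];
        /* implicit barrier here: every thread sees the final min */

        #pragma omp for
        for (i = 0; i < N; i++)
            A[i] = A[i] - min;
    }

    for (i = 0; i < N; i++)
        printf("%d ", A[i]);   /* the array shifted so its minimum becomes 0 */
    printf("\n");
    return 0;
}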

OpenMP for loop parallelization

I have an algorithm in which I have made groups of structures (i.e. an array of structures). I want each group to be processed by a single thread. I am giving the code below. In the for loop following #pragma omp for, I want i=0 to be executed in one thread, i=1 in another, and so on. Kindly help and tell me whether I am doing this correctly.
omp_set_num_threads(4);
#pragma omp parallel shared(min,sgb,div,i) private(th_id)
{
    th_id = omp_get_thread_num();
    printf("Thread %d\n", th_id);
    scanf("%c", &ch);
    #pragma omp for schedule(static,CHUNKSIZE)
    for (i = 0; i < div; i++)
    {
        sgb[i] = pso(sgb[i], kmax, c1, c2);
        min[i] = sgb[i].gbest;
        printf("in distribute gbest=%f x=%f y=%f index=%d\n", sgb[i].gbest, sgb[i].bestp[0], sgb[i].bestp[1], sgb[i].index);
    }
    #pragma omp barrier
    //fclose(fp);
    m = min[0];
    for (j = 0; j < div; j++)
    {
        printf("after barrier gbest=%f x=%f y=%f\n", sgb[j].gbest, sgb[j].bestp[0], sgb[j].bestp[1]);
        if (m > min[j])
        {
            m = min[j];
            k = j;
        }
    }
}
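For reference, here is a hypothetical, self-contained sketch of the pattern being asked about: one array element per thread, with the team size fixed before entering the parallel region. The struct, values, and per-group work are illustrative stand-ins, not the asker's real pso code.
#include <stdio.h>
#include <omp.h>

#define DIV 4                                   /* number of groups */

struct group { double gbest; };                 /* stand-in for the asker's structure */

int main(void)
{
    struct group sgb[DIV] = {{4.0}, {1.5}, {3.0}, {2.5}};
    double min[DIV];
    double m;
    int i, j;

    omp_set_num_threads(DIV);                   /* one thread per group                 */
    #pragma omp parallel for schedule(static,1) /* i = 0,1,... each on its own thread   */
    for (i = 0; i < DIV; i++) {
        min[i] = sgb[i].gbest;                  /* stand-in for the real per-group work */
        printf("group %d handled by thread %d\n", i, omp_get_thread_num());
    }

    m = min[0];                                 /* find the global minimum serially     */
    for (j = 1; j < DIV; j++)
        if (min[j] < m)
            m = min[j];
    printf("global minimum gbest = %f\n", m);
    return 0;
}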

Can "#pragma omp parallel for " be used inside a loop?

Can "#pragma omp parallel for" be used inside a loop in the following form:
for (i = 0; i < ...; i += 1)
{
    #pragma omp parallel for
    for (j = 0; j < ...; j += 1)
    {
        // some code ...
    }
}
Will this just parallelize the loop on 'j'?
Thanks in advance!
Yes, it can be used like that. But compiler directives have to be on a line of their own
for( ... )
{
#pragma omp parallel for
for( ...
//..
Also, this will indeed only execute the inner loop in parallel. If you need both loops to execute in parallel, you would need a second #pragma omp parallel for above the outer loop, which means nested parallelism and has to be enabled explicitly; when the loops are perfectly nested, a collapse(2) clause on a single #pragma omp parallel for is the more usual way to parallelise both together.
It can be used like you said, but it is not something good to have. If I were you and wanted to avoid a fork/join on every iteration of the outer loop, I would do it like this:
#pragma omp parallel for
for (j loop)
{
    for (i loop)
    {
        // Modify your code if necessary.
    }
}
In the above snippet, if your maximum thread count is 10 and your j length is 100, there will be 10 threads and each will run 10 iterations of j.
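As a concrete illustration of the original form (parallel for on the inner loop only, re-entering the parallel region on every outer iteration), here is a minimal, self-contained sketch; the loop bounds and the work inside are made up for the example, and the reduction clause is only there to keep the made-up accumulation race-free.
#include <stdio.h>

#define NI 4
#define NJ 1000

int main(void)
{
    double total[NI];
    int i, j;

    for (i = 0; i < NI; i++) {                   /* outer loop runs serially         */
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)  /* only the j loop is parallelised  */
        for (j = 0; j < NJ; j++)
            s += (double)i * j;
        total[i] = s;
    }

    for (i = 0; i < NI; i++)
        printf("total[%d] = %f\n", i, total[i]);
    return 0;
}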
