how to avoid overhead of openMP in nested loops - c

I have two versions of code that produce equivalent results where I am trying to parallelize only the inner loop of a nested for loop. I am not getting much speedup but I didn't expect a 1-to-1 since I'm trying only to parallelize the inner loop.
My main question is why these two versions have similar runtimes? Doesn't the second version fork threads only once and avoid the overhead of starting new threads on every iteration over i as in the first version?
The first version of code starts up threads on every iteration of the outer loop like this:
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp parallel for private(j) reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.847s
user 0m5.628s
sys 0m0.212s
OMP_NUM_THREADS=4
final=1000
real 0m4.017s
user 0m15.612s
sys 0m0.336s
The second version of code starts threads once(?) before the outer loop and parallelizes the inner loop like this:
#pragma omp parallel private(i,j)
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp barrier
#pragma omp for reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
#pragma omp single
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.476s
user 0m4.964s
sys 0m0.504s
OMP_NUM_THREADS=4
final=1000
real 0m4.347s
user 0m15.984s
sys 0m1.204s
Why isn't the second version much faster than the first? Doesn't it avoid the overhead of starting threads on every loop iteration or am I doing something wrong?

An OpenMP implementation may use thread pooling to eliminate the overhead of starting threads on encountering a parallel construct. A pool of OMP_NUM_THREADS threads is started for the first parallel construct, and after the construct is completed the slave threads are returned to the pool. These idle threads can be reallocated when a later parallel construct is encountered.
See for example this explanation of thread pooling in the Sun Studio OpenMP implementation.

You appear to be retracing the steps of Amdahl's Law: It speaks of parallel process vs it's own overhead. One thing that Amadhl found was regardless of how much parallelism you put into a program, it will always have to same speedup to begin with. Parallelism only starts to improve run time/performance when the program requires enough work to compensate the extra processing power.

Related

OpenMP - Overhead when Spawning and Terminating Threads in for-loop

I'm fairly new to OpenMP and I have some Monte Carlo code I am trying to parallelise.
I have a for-loop which must be ran serially which calls the new_value() function:
for(int i = 0; i < MAX_VAL; i++)
new_value();
This function opens a parallel region on each call:
void new_value()
{
#pragma omp parallel default(shared)
{
int thread_rank = omp_get_thread_num();
#pragma omp for schedule(static)
for(int i = 0; i < N; i++)
arr[i] = update(thread_rank);
}
}
Which works but there is a significant amount of overhead associated with the spawning and terminating of threads; I was wondering if anyone knew a way to spawn the threads (and attain thread_rank) before entering the loop without parallelising the loop?
There are several questions asking the same thing but they are either wrong or unanswered, examples of which include:
This question which asks a similar thing and the answer suggests creating a parallel region and then using #pragma omp single on the outer-most loop, but as 'Joe C' said in the answer comments, this does not work. I can confirm that the program just hangs.
This question asks the exact same thing but the (unticked) answer is just to parallelise the outer-most loop running the loop 4000 * num_threads which is neither what the asker wanted nor what I want.
The answer to your second question is actually correct.
#pragma omp parallel
for(int i = 0; i < MAX_VAL; i++)
new_value();
void new_value()
{
int thread_rank = omp_get_thread_num();
#pragma omp for schedule(static)
for(int i = 0; i < N; i++)
arr[i] = update(thread_rank);
}
Is correct and exactly what you want. It has the same semantic as the code in your question. The difference is there is only one parallel region and that the loop variable i is now computed by the whole team. Note that the outer loop is not parallelized in a worksharing manner (omp parallel for).
So when this code is run, num_threads threads will execute the loop header once new_value and reach the omp for all with their private i == 0. They will share the work of the inner loop. Then they will wait until everyone completed the loop at an implicit barrier, increment their private i and repeat... I hope it is clear now that this is the same behavior with respect to the inner loop as before, with less thread management overhead.

Translating sequential code into openMP parallel construct

I have the following piece of code that I would like to write in openmp.
My code abstractly looks like the following
I start first with dividing N=100 iterations equally among p=10pieces and I store the allocated iterations for every piece in a vector
Nvec[1]={0,1,..,9}
Nvec[2]={10,11,..,19}
Nvec[p]={N-9,..,N}
then I loop on the iterations
for(k=0;k<p;k++){\\loop on each piece of Nvec
for(j=0;j<2;j++){\\here is a nested loop
for(i=Nvec[k][0];i<Nvec[k][p];i++){
\\then I loop between the first and
\\last value of the array corresponding to piece k
}
}
Now, as you can see the code is sequential with a total of 2*100=200 iterations, I wanted to parallelize it using OpenMp with the absolute condition to keep the order of iterations!
I tried the following
#pragma omp parallel for schedule(static) collapse(2)
{
for(j=0;j<2;j++){
for(i=0;i<n;i++){
\\loop code here
}
}
}
this setting doesn't keep the order of the iterations as in the sequential version.
In the sequential version, each chunk is processed entirely with j=0 and then entirely with j=1.
In my openMP version, every thread takes a chunk of iterations and process it entirely with j=0. In a way all threads treats either j=0 or j=1 cases. Every worker with p=10 processes 200/10=20 iterations, problem is all iterations are j=0 or j=1.
How can I make sure that every thread get a chunk of iterations, performs the loop code with j=0 on all the iterations, then j=1 on the same chunk of iterations?
EDIT
what I want exactly for every chunk of 20 iterations
worker 1
j:0
i:1--->10
j:1
i:1--->10
worker p
j:0
i:90--->99
j:1
i:90--->99
the openMP code above does
worker 1
j:0
i:1--->20
worker p
j:1
i:80--->99
It's actually simple - just make the outer j-loop non-worksharing:
#pragma omp parallel
for (int j = 0; j < 2; j++) {
#pragma omp for schedule(static)
for (int i = 0; i < 10; i++) {
...
}
}
If you use the static schedule, OpenMP guarantees, that each worker will get to handle the same range of is for both j=0 and j=1.
Note: You moving the parallel construct to the outer loop is merely an optimization to avoid thread management overhead. The code works similarly if you just place a parallel for in-between the two loops.

Splitting OpenMP threads on unbalanced tree

I am trying to make tree operations like summing up numbers in all the leaves in a tree work in parallel using OpenMP. The problem I encounter is that the tree I work on is unbalanced (number of children vary and then how big branches are vary as well).
I currently have recursive functions working on those trees. What I am trying to achieve is this:
1)Split the threads at first possible opportunity, say it's a node with 2 children
2)Continue splitting from both resulting threads for at least 2-3 levels so all the threads are at work
It would look like this:
if (node->depth <= 3) {
#pragma omp parallel
{
#pragma omp schedule(dynamic)
for (int i = 0; i < node->children_no; i++) {
int local_sum;
local_sum = sum_numbers(node->children[i])
#pragma omp critical
{
global_sum += local_sum;
}
}
}
} else {
/*run the for loop without parallel region*/
}
The problem here is that when I allow nested parallelism it seems OpenMP creates a lot of threads in new teams. What I would like to achieve is this:
1)Every thread creating a new team can't take more threads than MAX_THREADS
2)Once a for loop is over in one subtree the others still working for loops in bigger subtrees take over the now idle threads to finish their job faster
That way I hope there is never more threads than necessary but they are all working all the time as long as there are more unfinished tasks in all for loops combined than created threads.
From the docs it looks like parallel for uses only threads already created in parallel region. Is it possible to make it work as described or do I need to change the implementation to list the tasks form various branches first and then run parallel for loop over that list?
Just for the record, I'll write an answer to this question based on High Performance Mark's comment (a comment on which I agree, too). The usage of OpenMP tasks here will add flexibility to the parallelism even if the tree is unbalanced, support recursivity and spawn enough work for all the threads (despite you should explore this using tools such as Vampir, Paraver and/or HPCToolkit).
The resulting code could look like
if (node->depth <= 3) {
#pragma omp parallel shared (global_sum)
{
for (int i = 0; i < node->children_no; i++) {
int local_sum;
#pragma omp single
#pragma omp task
{
local_sum = sum_numbers(node->children[i])
#pragma omp critical
global_sum += local_sum;
}
}
}
} else {
/*run the for loop without parallel region*/
}

OpenMP average of an array

I'm trying to learn OpenMP for a program I'm writing. For part of it I'm trying to implement a function to find the average of a large array. Here is my code:
double mean(double* mean_array){
double mean = 0;
omp_set_num_threads( 4 );
#pragma omp parallel for reduction(+:mean)
for (int i=0; i<aSize; i++){
mean = mean + mean_array[i];
}
printf("hello %d\n", omp_get_thread_num());
mean = mean/aSize;
return mean;
}
However if I run the code it runs slower than the sequential version. Also for the print statement I get:
hello 0
hello 0
Which doesn't make much sense to me, shouldn't there be 4 hellos?
Any help would be appreciated.
First, the reason why you are not seeing 4 "hello"s, is because the only part of the program which is executed in parallel is the so called parallel region enclosed within an #pragma omp parallel. In your code that is the loop body (since the omp parallel directive is attached to the for statement), the printf is in the sequential part of the program.
rewriting the code as follows would do the trick:
double mean = 0;
#pragma omp parallel num_threads(4)
{
#pragma omp for reduction(+:mean)
for (int i=0; i<aSize; i++) {
mean += mean_array[i];
}
mean /= aSize;
printf("hello %d\n", omp_get_thread_num());
}
Second, the fact your program runs slower than the sequential version, it can depend on multiple factors. First of all, you need to make sure the array is large enough so that the overhead of creating those threads (which usually happens when the parallel region is created) is negligible. Also, for small arrays you may be running into "cache false sharing" issues in which threads are competing for the same cache line causing performance degradation.

How to nest parallel loops in a sequential loop with OpenMP

I am currently working on a matrix computation with OpenMP. I have several loops in my code, and instead on calling for each loop #pragma omp parallel for[...] (which create all the threads and destroy them right after) I would like to create all of them at the beginning, and delete them at the end of the program in order to avoid overhead.
I want something like :
#pragma omp parallel
{
#pragma omp for[...]
for(...)
#pragma omp for[...]
for(...)
}
The problem is that I have some parts those have to be execute by only one thread, but in a loop, which contains loops those have to be execute in parallel... This is how it looks:
//have to be execute by only one thread
int a=0,b=0,c=0;
for(a ; a<5 ; a++)
{
//some stuff
//loops which have to be parallelize
#pragma omp parallel for private(b,c) schedule(static) collapse(2)
for (b=0 ; b<8 ; b++);
for(c=0 ; c<10 ; c++)
{
//some other stuff
}
//end of the parallel zone
//stuff to be execute by only one thread
}
(The loop boundaries are quite small in my example. In my program the number of iterations can goes until 20.000...)
One of my first idea was to do something like this:
//have to be execute by only one thread
#pragma omp parallel //creating all the threads at the beginning
{
#pragma omp master //or single
{
int a=0,b=0,c=0;
for(a ; a<5 ; a++)
{
//some stuff
//loops which have to be parallelize
#pragma omp for private(b,c) schedule(static) collapse(2)
for (b=0 ; b<8 ; b++);
for(c=0 ; c<10 ; c++)
{
//some other stuff
}
//end of the parallel zone
//stuff to be execute by only one thread
}
}
} //deleting all the threads
It doesn't compile, I get this error from gcc: "work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region".
I know it surely comes from the "wrong" nesting, but I can't understand why it doesn't work. Do I need to add a barrier before the parallel zone ? I am a bit lost and don't know how to solve it.
Thank you in advance for your help.
Cheers.
Most OpenMP runtimes don't "create all the threads and destroy them right after". The threads are created at the beginning of the first OpenMP section and destroyed when the program terminates (at least that's how Intel's OpenMP implementation does it). There's no performance advantage from using one big parallel region instead of several smaller ones.
Intel's runtimes (which is open source and can be found here) has options to control what threads do when they run out of work. By default they'll spin for a while (in case the program immediately starts a new parallel section), then they'll put themselves to sleep. If the do sleep, it will take a bit longer to start them up for the next parallel section, but this depends on the time between regions, not the syntax.
In the last of your code outlines you declare a parallel region, inside that use a master directive to ensure that only the master thread executes a block, and inside the master block attempt to parallelise a loop across all threads. You claim to know that the compiler errors arise from incorrect nesting but wonder why it doesn't work.
It doesn't work because distributing work to multiple threads within a region of code which only one thread will execute doesn't make any sense.
Your first pseudo-code is better, but you probably want to extend it like this:
#pragma omp parallel
{
#pragma omp for[...]
for(...)
#pragma omp single
{ ... }
#pragma omp for[...]
for(...)
}
The single directive ensures that the block of code it encloses is only executed by one thread. Unlike the master directive single also implies a barrier at exit; you can change this behaviour with the nowait clause.

Resources