Splitting OpenMP threads on unbalanced tree

Splitting OpenMP threads on unbalanced tree - c

I am trying to make tree operations like summing up numbers in all the leaves in a tree work in parallel using OpenMP. The problem I encounter is that the tree I work on is unbalanced (number of children vary and then how big branches are vary as well).
I currently have recursive functions working on those trees. What I am trying to achieve is this:
1)Split the threads at first possible opportunity, say it's a node with 2 children
2)Continue splitting from both resulting threads for at least 2-3 levels so all the threads are at work
It would look like this:
if (node->depth <= 3) {
#pragma omp parallel
{
#pragma omp schedule(dynamic)
for (int i = 0; i < node->children_no; i++) {
int local_sum;
local_sum = sum_numbers(node->children[i])
#pragma omp critical
{
global_sum += local_sum;
}
}
}
} else {
/*run the for loop without parallel region*/
}
The problem here is that when I allow nested parallelism it seems OpenMP creates a lot of threads in new teams. What I would like to achieve is this:
1)Every thread creating a new team can't take more threads than MAX_THREADS
2)Once a for loop is over in one subtree the others still working for loops in bigger subtrees take over the now idle threads to finish their job faster
That way I hope there is never more threads than necessary but they are all working all the time as long as there are more unfinished tasks in all for loops combined than created threads.
From the docs it looks like parallel for uses only threads already created in parallel region. Is it possible to make it work as described or do I need to change the implementation to list the tasks form various branches first and then run parallel for loop over that list?

Just for the record, I'll write an answer to this question based on High Performance Mark's comment (a comment on which I agree, too). The usage of OpenMP tasks here will add flexibility to the parallelism even if the tree is unbalanced, support recursivity and spawn enough work for all the threads (despite you should explore this using tools such as Vampir, Paraver and/or HPCToolkit).
The resulting code could look like
if (node->depth <= 3) {
#pragma omp parallel shared (global_sum)
{
for (int i = 0; i < node->children_no; i++) {
int local_sum;
#pragma omp single
#pragma omp task
{
local_sum = sum_numbers(node->children[i])
#pragma omp critical
global_sum += local_sum;
}
}
}
} else {
/*run the for loop without parallel region*/
}

Related

OpenMP recursive tasks with shared data results in a huge slowdown

I am trying to implement a n-queens solver with OpenMP tasks. However, the game board is set in main function and I am giving it to a function.
So far, I have is:
bool solve_NQueens(int board[N][N], int col)
{
if (col == N)
{
// #pragma omp critical
// print_solution(board);
#pragma omp critical
SOLUTION_EXISTS = true;
return true;
}
for (int i = 0; i < N; i++)
{
if (can_be_placed(board, i, col))
{
int new_board[N][N];
board[i][col] = 1;
copy(board, new_board);
#pragma omp task firstprivate(col)
solve_NQueens(new_board, col + 1);
board[i][col] = 0;
}
}
return SOLUTION_EXISTS;
}
The initial call to this function in the main is:
#pragma omp parallel if(omp_get_num_threads() > 1)
{
#pragma omp single
{
#pragma omp taskgroup
{
solve_NQueens(board, 0);
}
}
}
Parameters:
// these are global
#define N 14
bool SOLUTION_EXISTS = false;
// these are in main
int board[N][N];
memset(board, 0, sizeof(board));
Compiler:
gcc
Number of threads: 4
I used taskgroup to wait all tasks before getting the result and I had to copy the game board for each task (which is a hard job when N is set to 14 since there are 356k solutions).
I tried to make board firstprivate or private, use taskwait inside and outside of the loop, use taskgroup inside the for loop and so on. I need some advice to optimize this logic.
Note: putting a taskgroup in the for loop under the if clause also helps, but this is much slower than expected.

First of all, there is a huge issue in your code: solve_NQueens can submit the tasks recursively and return before all the tasks are actually completed. You need to put a synchronization before the return so the value of SOLUTION_EXISTS will be valid (using either a #pragma omp taskwait or a #pragma omp taskgroup).
In terms of performance, there is multiple issues.
The main problem is that to many tasks are created: you create a task in each recursive call. While creating few tasks bring the needed parallelism, creating to much of them also introduces a significant overhead. This overhead can be much higher than the execution of the tail calls. A cut-off strategy to can be implemented to reduce the overhead: the general idea is to create tasks only for the first recursive calls. In your case, you can do it with a clause if(col < 3) at the end of the #pragma omp task. Please note that 3 is an arbitrary value, you may need to tune this threshold.
Moreover, board is copied (twice) during the task creation (since it is a static array and default variables required by an OpenMP task are implicitly copied). Your additional copy is not needed and the line board[i][col] = 0; is useless *if the code is compiled with the OpenMP support (otherwise pragma are ignored and this is not true*). However, the additional overhead introduced should not be critical if you fix the problem described above.

OpenMP unequal load without for loop

I have an OpenMP code that looks like the following
while(counter < MAX) {
#pragma omp parallel reduction(+:counter)
{
// do monte carlo stuff
// if a certain condition is met, counter is incremented
}
}
Hence, the idea is that the parallel section gets executed by the available threads as long as the counter is below a certain value. Depending on the scenario (I am doing MC stuff here, so it is random), the computations might take long than others, so that there is an imbalance between the workers here which becomes apparent because of the implicit barrier at the end of the parallel section.
It seems like #pragma omp parallel for might have ways to circumvent this (i.e. nowait directive/dynamic scheduling), but I can't use this, as I don't know an upper iteration number for the for loop.
Any ideas/design patterns how to deal with such a situation?
Best regards!

Run everything in a single parallel section and access the counter atomically.
int counter = 0;
#pragma omp parallel
while(1) {
int local_counter;
#pragma omp atomic read
local_counter = counter;
if (local_counter >= MAX) {
break;
}
// do monte carlo stuff
// if a certain condition is met, counter is incremented
if (certain_condition) {
#pragma omp atomic update
counter++;
}
}
You can't check directly in the while condition, because of the atomic access.
Note that this code will overshoot, i.e. counter > MAX is possible after the parallel section. Keep in mind that counter is shared and read/updated by many threads.

how to avoid overhead of openMP in nested loops

I have two versions of code that produce equivalent results where I am trying to parallelize only the inner loop of a nested for loop. I am not getting much speedup but I didn't expect a 1-to-1 since I'm trying only to parallelize the inner loop.
My main question is why these two versions have similar runtimes? Doesn't the second version fork threads only once and avoid the overhead of starting new threads on every iteration over i as in the first version?
The first version of code starts up threads on every iteration of the outer loop like this:
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp parallel for private(j) reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.847s
user 0m5.628s
sys 0m0.212s
OMP_NUM_THREADS=4
final=1000
real 0m4.017s
user 0m15.612s
sys 0m0.336s
The second version of code starts threads once(?) before the outer loop and parallelizes the inner loop like this:
#pragma omp parallel private(i,j)
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp barrier
#pragma omp for reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
#pragma omp single
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.476s
user 0m4.964s
sys 0m0.504s
OMP_NUM_THREADS=4
final=1000
real 0m4.347s
user 0m15.984s
sys 0m1.204s
Why isn't the second version much faster than the first? Doesn't it avoid the overhead of starting threads on every loop iteration or am I doing something wrong?

An OpenMP implementation may use thread pooling to eliminate the overhead of starting threads on encountering a parallel construct. A pool of OMP_NUM_THREADS threads is started for the first parallel construct, and after the construct is completed the slave threads are returned to the pool. These idle threads can be reallocated when a later parallel construct is encountered.
See for example this explanation of thread pooling in the Sun Studio OpenMP implementation.

You appear to be retracing the steps of Amdahl's Law: It speaks of parallel process vs it's own overhead. One thing that Amadhl found was regardless of how much parallelism you put into a program, it will always have to same speedup to begin with. Parallelism only starts to improve run time/performance when the program requires enough work to compensate the extra processing power.

How to nest parallel loops in a sequential loop with OpenMP

I am currently working on a matrix computation with OpenMP. I have several loops in my code, and instead on calling for each loop #pragma omp parallel for[...] (which create all the threads and destroy them right after) I would like to create all of them at the beginning, and delete them at the end of the program in order to avoid overhead.
I want something like :
#pragma omp parallel
{
#pragma omp for[...]
for(...)
#pragma omp for[...]
for(...)
}
The problem is that I have some parts those have to be execute by only one thread, but in a loop, which contains loops those have to be execute in parallel... This is how it looks:
//have to be execute by only one thread
int a=0,b=0,c=0;
for(a ; a<5 ; a++)
{
//some stuff
//loops which have to be parallelize
#pragma omp parallel for private(b,c) schedule(static) collapse(2)
for (b=0 ; b<8 ; b++);
for(c=0 ; c<10 ; c++)
{
//some other stuff
}
//end of the parallel zone
//stuff to be execute by only one thread
}
(The loop boundaries are quite small in my example. In my program the number of iterations can goes until 20.000...)
One of my first idea was to do something like this:
//have to be execute by only one thread
#pragma omp parallel //creating all the threads at the beginning
{
#pragma omp master //or single
{
int a=0,b=0,c=0;
for(a ; a<5 ; a++)
{
//some stuff
//loops which have to be parallelize
#pragma omp for private(b,c) schedule(static) collapse(2)
for (b=0 ; b<8 ; b++);
for(c=0 ; c<10 ; c++)
{
//some other stuff
}
//end of the parallel zone
//stuff to be execute by only one thread
}
}
} //deleting all the threads
It doesn't compile, I get this error from gcc: "work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region".
I know it surely comes from the "wrong" nesting, but I can't understand why it doesn't work. Do I need to add a barrier before the parallel zone ? I am a bit lost and don't know how to solve it.
Thank you in advance for your help.
Cheers.

Most OpenMP runtimes don't "create all the threads and destroy them right after". The threads are created at the beginning of the first OpenMP section and destroyed when the program terminates (at least that's how Intel's OpenMP implementation does it). There's no performance advantage from using one big parallel region instead of several smaller ones.
Intel's runtimes (which is open source and can be found here) has options to control what threads do when they run out of work. By default they'll spin for a while (in case the program immediately starts a new parallel section), then they'll put themselves to sleep. If the do sleep, it will take a bit longer to start them up for the next parallel section, but this depends on the time between regions, not the syntax.

In the last of your code outlines you declare a parallel region, inside that use a master directive to ensure that only the master thread executes a block, and inside the master block attempt to parallelise a loop across all threads. You claim to know that the compiler errors arise from incorrect nesting but wonder why it doesn't work.
It doesn't work because distributing work to multiple threads within a region of code which only one thread will execute doesn't make any sense.
Your first pseudo-code is better, but you probably want to extend it like this:
#pragma omp parallel
{
#pragma omp for[...]
for(...)
#pragma omp single
{ ... }
#pragma omp for[...]
for(...)
}
The single directive ensures that the block of code it encloses is only executed by one thread. Unlike the master directive single also implies a barrier at exit; you can change this behaviour with the nowait clause.

Openmp: increase for loop iteration number

I have this parallel for loop
struct p
{
int n;
double *l;
}
#pragma omp parallel for default(none) private(i) shared(p)
for (i = 0; i < p.n; ++i)
{
DoSomething(p, i);
}
Now, it is possible that inside DoSomething(), p.n is increased because new elements are added to p.l. I'd like to process these elements in a parallel fashion. OpenMP manual states that parallel for can't be used with lists, so DoSomething() adds these p.l's new elements to another list which is processed sequentially and then it is joined back with p.l. I don't like this workaround. Anyone knows a cleaner way to do this?

A construct to support dynamic execution was added to OpenMP 3.0 and it is the task construct. Tasks are added to a queue and then executed as concurrently as possible. A sample code would look like this:
#pragma omp parallel private(i)
{
#pragma omp single
for (i = 0; i < p.n; ++i)
{
#pragma omp task
DoSomething(p, i);
}
}
This will spawn a new parallel region. One of the threads will execute the for loop and create a new OpenMP task for each value of i. Each different DoSomething() call will be converted to a task and will later execute inside an idle thread. There is a problem though: if one of the tasks add new values to p.l, it might happen after the creator thread has already exited the for loop. This could be fixed using task synchronisation constructs and an outer loop like this:
#pragma omp single
{
i = 0;
while (i < p.n)
{
for (; i < p.n; ++i)
{
#pragma omp task
DoSomething(p, i);
}
#pragma omp taskwait
#pragma omp flush
}
}
The taskwait construct makes for the thread to wait until all queued tasks are executed. If new elements were added to the list, the condition of the while would become true again and a new round of tasks creation will happen. The flush construct is supposed to synchronise the memory view between threads and e.g. update optimised register variables with the value from the shared storage.
OpenMP 3.0 is supported by all modern C compilers except MSVC, which is stuck at OpenMP 2.0.