OpenMP recursive tasks with shared data result in a huge slowdown - c

I am trying to implement an n-queens solver with OpenMP tasks. However, the game board is set up in the main function and I am passing it to a function.
What I have so far is:
bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        // #pragma omp critical
        // print_solution(board);
        #pragma omp critical
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            int new_board[N][N];
            board[i][col] = 1;
            copy(board, new_board);
            #pragma omp task firstprivate(col)
            solve_NQueens(new_board, col + 1);
            board[i][col] = 0;
        }
    }
    return SOLUTION_EXISTS;
}
The initial call to this function in main is:
#pragma omp parallel if(omp_get_num_threads() > 1)
{
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            solve_NQueens(board, 0);
        }
    }
}
Parameters:
// these are global
#define N 14
bool SOLUTION_EXISTS = false;
// these are in main
int board[N][N];
memset(board, 0, sizeof(board));
Compiler:
gcc
Number of threads: 4
I used taskgroup to wait for all tasks before reading the result, and I had to copy the game board for each task (which is expensive when N is set to 14, since there are 356k solutions).
I tried making board firstprivate or private, using taskwait inside and outside of the loop, using taskgroup inside the for loop, and so on. I need some advice on optimizing this logic.
Note: putting a taskgroup inside the for loop, under the if clause, also helps, but this is much slower than expected.

First of all, there is a major issue in your code: solve_NQueens can submit tasks recursively and return before all of those tasks have actually completed. You need to add a synchronization point before the return so that the value of SOLUTION_EXISTS is valid (using either a #pragma omp taskwait or a #pragma omp taskgroup).
In terms of performance, there are multiple issues.
The main problem is that too many tasks are created: you create a task in each recursive call. While creating a few tasks provides the needed parallelism, creating too many of them introduces significant overhead. This overhead can be much higher than the execution time of the tail calls. A cut-off strategy can be implemented to reduce the overhead: the general idea is to create tasks only for the first levels of the recursion. In your case, you can do it with an if(col < 3) clause at the end of the #pragma omp task directive. Please note that 3 is an arbitrary value; you may need to tune this threshold.
Moreover, the board is copied (twice) during task creation, since new_board is a statically-sized array and the variables an OpenMP task captures by default are implicitly copied. Your additional copy is not needed, and the line board[i][col] = 0; is useless if the code is compiled with OpenMP support (otherwise the pragmas are ignored and this no longer holds). However, the additional overhead introduced should not be critical once you fix the problem described above.
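Putting the two fixes together (a synchronization before the return and a cut-off on task creation), a minimal sketch could look like the following. This is not code from the answer above, just an illustration: the cut-off value 3 is arbitrary, and can_be_placed/copy are the helpers from the question.
bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        #pragma omp critical
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            int new_board[N][N];
            board[i][col] = 1;
            copy(board, new_board); // new_board is implicitly firstprivate in the task
            // Cut-off: only the first levels create deferred tasks; deeper
            // calls are executed immediately by the encountering thread.
            #pragma omp task firstprivate(col) if(col < 3)
            solve_NQueens(new_board, col + 1);
            board[i][col] = 0;
        }
    }
    // Wait for the child tasks so the value of SOLUTION_EXISTS is valid.
    #pragma omp taskwait
    return SOLUTION_EXISTS;
}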

Related

How to run a static parallel for loop without the main thread

I want to execute a function with multiple threads, without using the main thread. This is what I want:
# pragma omp parallel num_threads(9)
{
    // do something
    # pragma omp for schedule(static,1)
    for(int i = 0; i < 10; i++)
        func(i); // random stuff
}
So I want func() to be executed just by 8 threads, without main thread. Is that possible somehow?
Yes, you can do it. However, you will have to implement the functionality of
#pragma omp for schedule(static,1)
yourself, since explicitly using that construct makes the compiler automatically divide the iterations of the loop among the threads in the team, including the master thread of that team, which in your code example is also the main thread. The code could look like the following:
# pragma omp parallel num_threads(9)
{
    // do something
    int thread_id = omp_get_thread_num();
    int total_threads = omp_get_num_threads();
    if(thread_id != 0) // all threads but the master thread
    {
        thread_id--; // shift all the ids
        total_threads = total_threads - 1;
        for(int i = thread_id; i < 10; i += total_threads)
            func(i); // random stuff
    }
    #pragma omp barrier
}
First, we ensure that all threads except the master execute the loop to be parallelized (i.e., if(thread_id != 0)); then we divide the iterations of the loop among the remaining threads (i.e., for(int i = thread_id; i < 10; i += total_threads)); and finally we ensure that all threads wait for each other at the end of the parallel region (i.e., #pragma omp barrier).
If it isn't important which thread skips the loop, another option would be to combine sections with the loop. This means nested parallelism, which one should be very careful with, but it should work:
#pragma omp parallel sections num_threads(2)
{
    #pragma omp section
    { /* work for one thread */ }
    #pragma omp section
    {
        #pragma omp parallel for num_threads(8) schedule(static, 1)
        for (int i = 0; i < N; ++i) { /* ... */ }
    }
}
The main problem here is that most likely one of those sections will take much longer than the other, meaning that in the worst case (the loop finishing faster than the first section) all but one thread are doing nothing most of the time.
If you really need the master thread to be outside the parallel region this might work (not tested):
#pragma omp parallel num_threads(2)
{
    #pragma omp master
    { /* work for master thread, other thread is NOT waiting */ }
    #pragma omp single
    {
        #pragma omp parallel for num_threads(8) schedule(static, 1)
        for (int i = 0; i < N; ++i) { /* ... */ }
    }
}
There is no guarantee that the master thread won't be computing the single region as well, but if your cores aren't over-occupied it should at least be unlikely. One could even argue that if the second thread from the outer parallel region doesn't reach the single region in time, it is better that the master thread also has a chance of going in there, even if that means the second thread doesn't get anything to do.
As the single region should only have an implicit barrier at its end, while the master region doesn't contain any implicit barriers, they should potentially be executed in parallel as long as the master region comes before the single region. This assumes that the single region is well-implemented, such that every thread has a chance of computing it. This isn't guaranteed by the standard, I think.
EDIT:
These solutions require nested parallelism to work, which is disabled by default in most implementations. It can be activated via the environment variable OMP_NESTED or by calling omp_set_nested().
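As a minimal sketch (not part of the original answer), nested parallelism could be enabled like this before the first parallel region is entered:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_nested(1); // same effect as setting OMP_NESTED=true in the environment

    #pragma omp parallel num_threads(2)
    {
        // without nesting enabled, this inner region would run with a single thread
        #pragma omp parallel for num_threads(4)
        for (int i = 0; i < 8; ++i)
            printf("outer thread %d, inner thread %d, i = %d\n",
                   omp_get_ancestor_thread_num(1), omp_get_thread_num(), i);
    }
    return 0;
}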

Recursive task creation results in segmentation fault in OpenMP

I am trying to implement an n-queens solver with OpenMP. My serial code works fine, but when I try to do task parallelism on it, I get a segmentation fault or empty rows/columns.
Here is my implementation:
#define N 8
bool SOLUTION_EXISTS = false; // THIS IS GLOBAL

bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        #pragma omp critical
        print_solution(board);
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            #pragma omp taskgroup
            {
                #pragma omp task private(col) shared(i) firstprivate(board)
                {
                    board[i][col] = 1;
                    SOLUTION_EXISTS = solve_NQueens(board, col + 1) || SOLUTION_EXISTS;
                    board[i][col] = 0;
                }
            }
        }
    }
    return SOLUTION_EXISTS;
}
And the first call to this function is:
#pragma omp parallel
{
    #pragma omp single
    {
        solve_NQueens(board, 0);
    }
}
When I make col private, it gives a segmentation fault. If I do not specify any variable scope, ambiguous and wrong solutions are printed.
And I am using gcc 4.8.5
Solution
There is a segmentation fault because you use private(col). With private, col is not copied into the task and is not even initialized. Use firstprivate(col) to make a proper copy of col.
Advice
omp taskgroup will make your code run sequentially, since there is an implicit barrier at the end of its scope. It is probably better to avoid it (e.g. by using an omp taskwait at the end of the loop and changing the rest of the code a bit).
If you do change that, please note that i must be copied using firstprivate rather than shared.
Moreover, avoid using global variables like SOLUTION_EXISTS in parallel code. This generally causes a lot of issues, from vicious bugs to slow code. And if you still need/want to do it, the variables used by multiple threads must be protected, for example with omp atomic or omp critical directives.
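Putting that advice together, a rough sketch of the loop could look like the one below. This is not the exact code from the answer: my_board and the memcpy-based copy are added here so each task works on its own board instead of racing on the caller's array, and the unprotected global write is replaced by a critical section.
for (int i = 0; i < N; i++)
{
    if (can_be_placed(board, i, col))
    {
        int my_board[N][N];                        // per-branch copy, implicitly firstprivate in the task
        memcpy(my_board, board, sizeof(my_board)); // requires <string.h>
        my_board[i][col] = 1;

        #pragma omp task firstprivate(i, col)
        {
            if (solve_NQueens(my_board, col + 1))
            {
                #pragma omp critical               // protect the shared global
                SOLUTION_EXISTS = true;
            }
        }
    }
}
#pragma omp taskwait // one synchronization per call instead of a taskgroup around every task
return SOLUTION_EXISTS;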

OpenMP unequal load without for loop

I have an OpenMP code that looks like the following
while(counter < MAX) {
    #pragma omp parallel reduction(+:counter)
    {
        // do monte carlo stuff
        // if a certain condition is met, counter is incremented
    }
}
Hence, the idea is that the parallel section gets executed by the available threads as long as the counter is below a certain value. Depending on the scenario (I am doing MC stuff here, so it is random), some computations might take longer than others, so there is an imbalance between the workers, which becomes apparent because of the implicit barrier at the end of the parallel section.
It seems like #pragma omp parallel for might have ways to circumvent this (e.g. the nowait clause or dynamic scheduling), but I can't use it, as I don't know an upper iteration count for the for loop.
Any ideas/design patterns how to deal with such a situation?
Best regards!
Run everything in a single parallel section and access the counter atomically.
int counter = 0;
#pragma omp parallel
while(1) {
    int local_counter;
    #pragma omp atomic read
    local_counter = counter;
    if (local_counter >= MAX) {
        break;
    }
    // do monte carlo stuff
    // if a certain condition is met, counter is incremented
    if (certain_condition) {
        #pragma omp atomic update
        counter++;
    }
}
You can't check directly in the while condition, because of the atomic access.
Note that this code will overshoot, i.e. counter > MAX is possible after the parallel section. Keep in mind that counter is shared and read/updated by many threads.

Splitting OpenMP threads on unbalanced tree

I am trying to make tree operations, like summing up the numbers in all the leaves of a tree, work in parallel using OpenMP. The problem I encounter is that the tree I work on is unbalanced (the number of children varies, and so does the size of each branch).
I currently have recursive functions working on those trees. What I am trying to achieve is this:
1) Split the threads at the first possible opportunity, say at a node with 2 children.
2) Continue splitting from both resulting threads for at least 2-3 levels so all the threads are at work.
It would look like this:
if (node->depth <= 3) {
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < node->children_no; i++) {
            int local_sum;
            local_sum = sum_numbers(node->children[i]);
            #pragma omp critical
            {
                global_sum += local_sum;
            }
        }
    }
} else {
    /* run the for loop without a parallel region */
}
The problem here is that when I allow nested parallelism, OpenMP seems to create a lot of threads in new teams. What I would like to achieve is this:
1) Every thread creating a new team can't take more threads than MAX_THREADS.
2) Once a for loop is over in one subtree, the for loops still working in bigger subtrees take over the now idle threads to finish their job faster.
That way I hope there are never more threads than necessary, but they are all working all the time, as long as there are more unfinished tasks in all the for loops combined than created threads.
From the docs it looks like parallel for uses only threads already created in the parallel region. Is it possible to make it work as described, or do I need to change the implementation to first list the tasks from the various branches and then run a parallel for loop over that list?
Just for the record, I'll write an answer to this question based on High Performance Mark's comment (a comment with which I agree). The use of OpenMP tasks here adds flexibility to the parallelism even if the tree is unbalanced, supports recursion, and spawns enough work for all the threads (although you should explore this using tools such as Vampir, Paraver and/or HPCToolkit).
The resulting code could look like this:
if (node->depth <= 3) {
    #pragma omp parallel shared(global_sum)
    #pragma omp single
    for (int i = 0; i < node->children_no; i++) {
        int local_sum;
        #pragma omp task
        {
            local_sum = sum_numbers(node->children[i]);
            #pragma omp critical
            global_sum += local_sum;
        }
    }
} else {
    /* run the for loop without a parallel region */
}
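If the recursion itself should also generate tasks (rather than only the top-level loop), a hedged sketch of a task-based sum_numbers might look like the one below. The fields children_no, children and depth are taken from the question; the struct name, the leaf field value and the cut-off level 3 are assumptions made here for illustration.
struct tree_node {
    int value;                  // leaf payload (assumed field)
    int depth;                  // depth of this node, as used in the question
    int children_no;
    struct tree_node **children;
};

int sum_numbers(struct tree_node *node)
{
    if (node->children_no == 0)
        return node->value;     // leaf

    int sum = 0;
    for (int i = 0; i < node->children_no; i++)
    {
        // Only shallow levels spawn deferred tasks; deeper levels recurse sequentially.
        #pragma omp task shared(sum) if(node->depth <= 3)
        {
            int partial = sum_numbers(node->children[i]);
            #pragma omp atomic
            sum += partial;
        }
    }
    #pragma omp taskwait        // children finished before sum goes out of scope
    return sum;
}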

Partially parallel loops using openmp tasks

Prerequisites:
parallel engine: OpenMP 3.1+ (can be OpenMP 4.0 if needed)
parallel constructs: OpenMP tasks
compiler: gcc 4.9.x (supports OpenMP 4.0)
Input:
C code with loops
loops have cross-iteration data dependencies: iteration "i+1" needs data from iteration "i" (only this kind of dependency, nothing else)
the loop body can be partially dependent
the loop cannot be split into two loops; the loop body should remain solid
anything reasonable can be added to the loop or to the loop body function definition
Code sample:
(Here conf/config/configData variables are used for illustration purposes only, the main interest is within value/valueData variables.)
void loopFunc(const char* config, int* value)
{
    int conf;
    conf = prepare(config); // independent, does not change "config"
    *value = process(conf, *value); // dependent, takes prev., produce next
    return;
}

int main()
{
    int N = 100;
    char* configData; // never changes
    int valueData = 0; // initial value
    …
    for (int i = 0; i < N; i++)
    {
        loopFunc(configData, &valueData);
    }
    …
}
Need to:
parallelise loop using omp tasks (omp for / omp sections cannot be used)
“prepare” functions should be executed in parallel with other “prepare” or “process” functions
“process” functions should be ordered according to data dependency
What has been proposed and implemented:
define an integer flag
assign the number of the first iteration to it
every iteration that needs data waits for the flag to equal its own iteration number
update the flag value when the data for the next iteration is ready
Like this:
(I remind you that the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)
void loopFunc(const char* config, int* value, volatile int *parSync, int iteration)
{
    int conf;
    conf = prepare(config); // independent, does not change "config"
    while (*parSync != iteration) // wait for previous to be ready
    {
        #pragma omp taskyield
    }
    *value = process(conf, *value); // dependent, takes prev., produce next
    *parSync = iteration + 1; // inform next about readiness
    return;
}

int main()
{
    int N = 100;
    char* configData; // never changes
    int valueData = 0; // initial value
    volatile int parallelSync = 0;
    …
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++)
    {
        #pragma omp task shared(configData, valueData, parallelSync) firstprivate(i)
        loopFunc(configData, &valueData, &parallelSync, i);
    }
    #pragma omp taskwait
    …
}
What happened:
It fails. :)
The reason is that an OpenMP task occupies an OpenMP thread.
For example, suppose we define 5 OpenMP threads (as in the code above).
The "for" loop generates 100 tasks.
The OpenMP runtime assigns 5 arbitrary tasks to the 5 threads and starts them.
If the task with i=0 is not among the started tasks (which happens from time to time), the executing tasks wait forever, occupy their threads forever, and the task with i=0 is never started.
What's next?
I have no other ideas for how to implement the required mode of computation.
Current solution
Thanks to #parallelgeek below for the idea.
int main()
{
    int N = 10;
    char* configData; // never changes
    int valueData = 0; // initial value
    volatile int parallelSync = 0;
    int workers;
    volatile int workingTasks = 0;
    ...
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    {
        workers = omp_get_num_threads()-1; // reserve 1 thread for task generation
        for (int i = 0; i < N; i++)
        {
            while (workingTasks >= workers)
            {
                #pragma omp taskyield
            }
            #pragma omp atomic update
            workingTasks++;
            #pragma omp task shared(configData, valueData, parallelSync, workingTasks) firstprivate(i)
            {
                loopFunc(configData, &valueData, &parallelSync, i);
                #pragma omp atomic update
                workingTasks--;
            }
        }
        #pragma omp taskwait
    }
}
AFAIK, volatile doesn't prevent hardware reordering; that's why you could end up with a mess in memory, because the data has not been written yet while the flag is already seen as true by the consuming thread.
That's why a little piece of advice: use C11 atomics instead in order to ensure visibility of the data. As far as I can see, gcc 4.9 supports C11 (see C11Status in GCC).
You could try to divide the generated tasks into groups of K tasks, where K == ThreadNum, and start generating subsequent tasks (after the tasks in the first group are generated) only after one of the running tasks has finished. That way you maintain the invariant that at any time only K tasks are running, scheduled on K threads.
Inter-task dependencies could also be handled by using atomic flags from C11.
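A sketch of what the C11-atomic version of the flag might look like (not from the original answer; only the synchronization changes compared to the code above, with atomic_load/atomic_store from <stdatomic.h> providing the visibility and ordering that volatile does not guarantee):
#include <stdatomic.h>

void loopFunc(const char *config, int *value, atomic_int *parSync, int iteration)
{
    int conf = prepare(config);               // independent part, may run in parallel

    // wait until the previous iteration has published its result
    while (atomic_load(parSync) != iteration)
    {
        #pragma omp taskyield
    }

    *value = process(conf, *value);           // dependent part, strictly ordered

    atomic_store(parSync, iteration + 1);     // publish readiness for the next iteration
}

// in main: declare "atomic_int parallelSync = 0;" and pass &parallelSync to the tasks as before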
