Recursive task creation results in segmentation fault in OpenMP - c

I am trying to implement an N-queens solver with OpenMP. My serial code works fine, but when I add task parallelism to it, I get a segmentation fault or empty rows/columns.
Here is my implementation:
#define N 8
bool SOLUTION_EXISTS = false; // THIS IS GLOBAL
bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        #pragma omp critical
        print_solution(board);
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            #pragma omp taskgroup
            {
                #pragma omp task private(col) shared(i) firstprivate(board)
                {
                    board[i][col] = 1;
                    SOLUTION_EXISTS = solve_NQueens(board, col + 1) || SOLUTION_EXISTS;
                    board[i][col] = 0;
                }
            }
        }
    }
    return SOLUTION_EXISTS;
}
And the first call to this function is:
#pragma omp parallel
{
    #pragma omp single
    {
        solve_NQueens(board, 0);
    }
}
When I make col private, I get a segmentation fault. If I do not specify any data-sharing clauses, ambiguous and wrong solutions are printed.
I am using GCC 4.8.5.

Solution
The segmentation fault occurs because you use private(col): col is not copied from the enclosing scope and is left uninitialized inside the task. Use firstprivate(col) to make a proper copy of col.
Advice
omp taskgroup makes your code run sequentially, since there is an implicit barrier at the end of its scope. It is probably better to avoid it (e.g. by using an omp taskwait at the end of the loop and changing the rest of the code a bit).
If you do change that, please note that i must be copied with firstprivate rather than shared.
Moreover, avoid using global variables like SOLUTION_EXISTS in parallel code. This generally causes a lot of issues, from vicious bugs to slow code. If you still need/want to do it, any variable used by multiple threads must be protected, for example with omp atomic or omp critical directives.
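Putting these pieces together, a minimal sketch of the corrected recursion could look like the following. It reuses N, SOLUTION_EXISTS, can_be_placed and print_solution from the question, needs <string.h> for memcpy, and borrows the per-task board copy from the first related question below, since firstprivate(board) only copies the decayed pointer, not the cells:

bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        #pragma omp critical
        print_solution(board);
        #pragma omp atomic write
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            #pragma omp task firstprivate(i, col)     // proper copies, not private(col)
            {
                int local[N][N];                      // per-task copy of the board
                memcpy(local, board, sizeof(local));  // board itself is never written
                local[i][col] = 1;
                solve_NQueens(local, col + 1);
            }
        }
    }
    #pragma omp taskwait   // wait for the child tasks instead of a serializing taskgroup
    return SOLUTION_EXISTS;
}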

Related

OpenMP recursive tasks with shared data results in a huge slowdown

I am trying to implement an n-queens solver with OpenMP tasks. The game board is set in the main function and passed to a solver function.
What I have so far is:
bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        // #pragma omp critical
        // print_solution(board);
        #pragma omp critical
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            int new_board[N][N];
            board[i][col] = 1;
            copy(board, new_board);
            #pragma omp task firstprivate(col)
            solve_NQueens(new_board, col + 1);
            board[i][col] = 0;
        }
    }
    return SOLUTION_EXISTS;
}
The initial call to this function in main is:
#pragma omp parallel if(omp_get_num_threads() > 1)
{
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            solve_NQueens(board, 0);
        }
    }
}
Parameters:
// these are global
#define N 14
bool SOLUTION_EXISTS = false;
// these are in main
int board[N][N];
memset(board, 0, sizeof(board));
Compiler:
gcc
Number of threads: 4
I used taskgroup to wait for all tasks before reading the result, and I had to copy the game board for each task (which is expensive when N is set to 14, since there are roughly 365k solutions).
I tried making board firstprivate or private, using taskwait inside and outside the loop, using a taskgroup inside the for loop, and so on. I need some advice on optimizing this logic.
Note: putting a taskgroup in the for loop under the if clause also helps, but this is much slower than expected.
First of all, there is a huge issue in your code: solve_NQueens can submit tasks recursively and return before all the tasks are actually completed. You need to put a synchronization point before the return so that the value of SOLUTION_EXISTS is valid (using either a #pragma omp taskwait or a #pragma omp taskgroup).
In terms of performance, there are multiple issues.
The main problem is that too many tasks are created: you create a task in each recursive call. While creating a few tasks brings the needed parallelism, creating too many of them introduces significant overhead, which can be much higher than the cost of simply executing the tail calls. A cut-off strategy can be implemented to reduce this overhead: the general idea is to create tasks only for the first recursive calls. In your case, you can do it with an if(col < 3) clause at the end of the #pragma omp task. Please note that 3 is an arbitrary value; you may need to tune this threshold.
Moreover, the board is copied twice per task creation: once by your explicit copy() call and once implicitly, since new_board is a fixed-size array and the variables referenced by an OpenMP task are copied (firstprivate) by default. Your additional copy is not needed, and the line board[i][col] = 0; is useless if the code is compiled with OpenMP support (otherwise the pragmas are ignored and this is not true). However, the extra overhead should not be critical once you fix the problem described above.
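For illustration, here is a minimal sketch of how the two fixes could be combined (reusing N, SOLUTION_EXISTS, can_be_placed and copy from the question). Placing the queen directly in the per-task copy means board itself is never modified, which removes the board[i][col] = 0 line:

bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        #pragma omp critical
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            int new_board[N][N];
            copy(board, new_board);
            new_board[i][col] = 1;  // modify the copy, not the shared board
            #pragma omp task firstprivate(col) if(col < 3)  // cut-off: deeper calls run inline
            solve_NQueens(new_board, col + 1);
        }
    }
    #pragma omp taskwait  // child tasks must complete before the flag is read
    return SOLUTION_EXISTS;
}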

Is it beneficial to parallelize variable declaration?

I wonder whether it is beneficial, when writing a parallel program, to move variable declarations into the parallel section. Amdahl's law says that the larger the parallel portion of the program, the better, but I don't see the point of parallelizing variable declarations and return statements. For example, this is the normal parallel code:
#include <omp.h>

int main(void) {
    int a = 0;
    int b[5];
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 5; ++i) {
            b[i] = a;
        }
    }
    return 0;
}
Will it be beneficial regarding Amdahl's law to write this (so 100% of the program is parallel):
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int a = 0;
        int b[5];
        #pragma omp for
        for (int i = 0; i < 5; ++i) {
            b[i] = a;
        }
        return 0;
    }
}
These codes are not equivalent: in the first case, a and b are shared variables (since shared is the default data-sharing attribute for variables declared outside the construct); in the second case, they are thread-private variables that do not exist beyond the scope of the parallel region.
Besides, the return statement within the parallel region in the second piece of code is illegal and must cause a compilation error.
As seen for instance in this OpenMP 4.0 reference card:
An OpenMP executable directive applies to the succeeding structured block or an OpenMP construct. Each directive starts with #pragma omp. The remainder of the directive follows the conventions of the C and C++ standards for compiler directives. A structured-block is a single statement or a compound statement with a single entry at the top and a single exit at the bottom.
A block that contains the return statement is not a structured-block since it does not have a single exit at the bottom (i.e. the closing brace } is not the only exit since return is another one). It may not legally follow the #pragma omp parallel directive.
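To make the second example legal while keeping the declarations inside the parallel region, the return simply has to move outside the region so the block regains its single exit. This sketch fixes only the legality issue; the semantic difference noted above, that a and b become thread-private, remains:

#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int a = 0;  // thread-private: one copy per thread
        int b[5];   // thread-private, gone once the region ends
        #pragma omp for
        for (int i = 0; i < 5; ++i) {
            b[i] = a;
        }
    }               // single exit at the bottom: a valid structured block
    return 0;       // legal: outside the parallel region
}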

openmp data dependency/ wait for var

How can I solve the following problem with OpenMP?
Edit: Panni is right, my parallel approach doesn't work, so what's a nice way to do this?
Situation:
Function call order:
calc();
move();
function calc() parallelizes a loop via #pragma omp parallel for and changes an array containing structs with x and y values.
function move() parallelizes a loop via #pragma omp for only and accesses the values changed in calc().
Question:
How can I make sure in OpenMP that the first function has set the values before the second function accesses them?
More specifically:
#include <...>

static b_t *b;

static void calc() {
    #pragma omp parallel for private(j)
    for (i = 0; i < n - 1; i++) {
        for (j = i + 1; j < n; j++) {
            b[i].f.x += some_value;
            b[i].f.y += some_value;
            b[j].f.x -= some_value;
            b[j].f.y -= some_value;
        }
    }
}

static void move() {
    #pragma omp for
    for (i = 0; i < n; i++) {
        d.x = b[i].f.x;
        d.y = b[i].f.y;
    }
}

int main() {
    for (t = 0; t < end; t += dt) {
        calc();
        move();
    }
}
So, how can I make sure move() only runs after calc() has finished? I thought about #pragma omp task depend(in/out), but it doesn't seem to work for me.
Since you're calling the functions calc() and move() in sequential order, calc() will always be executed first.
The #pragma omp parallel for in calc() has an implicit barrier at its end, so the program runs calc(), waits until all of its iterations are completed (no matter how many threads are used), and only then continues with move().
Panni has said exactly what I wanted to say.
It's a shame I still can't comment, and I didn't want to post an answer just to say that.
So instead, I guess you may be suffering a performance issue for two reasons:
imbalanced workload scheduling over a triangular iteration space (see the sketch below);
a cache-miss problem if the cache can't hold the whole b array.
The move function looks weird to me, so I didn't take it into consideration.
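Neither answer shows code for the imbalance point, but as a hypothetical sketch, a dynamic schedule is one common way to even out a triangular iteration space (the chunk size 16 is a guess to tune; n, b and some_value are the question's names). Note that the writes to b[j] race between threads in the original loop as well and would need atomics or privatization to be fully correct:

static void calc(void) {
    #pragma omp parallel for schedule(dynamic, 16)  // small chunks balance long early rows
    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 1; j < n; j++) {
            b[i].f.x += some_value;
            b[i].f.y += some_value;
            b[j].f.x -= some_value;  // racy across i-iterations on different threads:
            b[j].f.y -= some_value;  // would need omp atomic or per-thread buffers
        }
    }
}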

Counting does not work properly in OpenMP

I have the function
void collatz(int startNumber, int endNumber, int* iter, int nThreads)
{
    int i, n, counter;
    int isodd; /* 1 if n is odd, 0 if even */
    #pragma omp parallel for
    for (i = startNumber; i <= endNumber; i++)
    {
        counter = 0;
        n = i;
        omp_set_num_threads(nThreads);
        while (n > 1)
        {
            isodd = n % 2;
            if (isodd)
                n = 3*n + 1;
            else
                n /= 2;
            counter++;
        }
        iter[i - startNumber] = counter;
    }
}
It works as I expect when running serially (i.e. compiling without OpenMP, or commenting out #pragma omp parallel for and omp_set_num_threads(nThreads);). However, the parallel version produces wrong results, and I think it is because the counter variable needs to be set to zero at the beginning of each loop iteration, and perhaps another thread works with a non-zeroed counter value. But even if I use #pragma omp parallel for private(counter), the problem still occurs. What am I missing?
I compile the program as C89.
Inside your OpenMP parallel region, you assign values to the scalar variables counter, n and isodd. They therefore cannot simply be shared, as they are by default; you need to pay extra attention to them.
A quick analysis shows that their values are only meaningful inside the parallel region, and only for the current thread, so it becomes clear that they need to be declared private.
Adding a private(counter, n, isodd) clause to your #pragma omp parallel directive should fix the issue.
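For reference, a sketch of the fixed function. The only change the answer calls for is the private clause; moving omp_set_num_threads() out of the loop is a small extra cleanup, since calling it inside the parallel loop has no useful effect there:

void collatz(int startNumber, int endNumber, int* iter, int nThreads)
{
    int i, n, counter;
    int isodd; /* 1 if n is odd, 0 if even */
    omp_set_num_threads(nThreads); /* set once, before the parallel region */
    #pragma omp parallel for private(counter, n, isodd)
    for (i = startNumber; i <= endNumber; i++)
    {
        counter = 0;
        n = i;
        while (n > 1)
        {
            isodd = n % 2;
            if (isodd)
                n = 3*n + 1;
            else
                n /= 2;
            counter++;
        }
        iter[i - startNumber] = counter; /* one distinct element per i: safe */
    }
}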

Partially parallel loops using openmp tasks

Prerequisites:
parallel engine: OpenMP 3.1+ (can be OpenMP 4.0 if needed)
parallel constructs: OpenMP tasks
compiler: gcc 4.9.x (supports OpenMP 4.0)
Input:
C code with loops
the loop has a cross-iteration data dependency: iteration "i+1" needs data from iteration "i" (only this kind of dependency, nothing else)
the loop body can be partially dependent
the loop cannot be split into two loops; the loop body should remain solid
anything reasonable can be added to the loop or the loop-body function definition
Code sample:
(Here the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)
void loopFunc(const char* config, int* value)
{
    int conf;
    conf = prepare(config);         // independent, does not change "config"
    *value = process(conf, *value); // dependent, takes prev., produces next
    return;
}

int main()
{
    int N = 100;
    char* configData;  // never changes
    int valueData = 0; // initial value
    …
    for (int i = 0; i < N; i++)
    {
        loopFunc(configData, &valueData);
    }
    …
}
Need to:
parallelise loop using omp tasks (omp for / omp sections cannot be used)
“prepare” functions should be executed in parallel with other “prepare” or “process” functions
“process” functions should be ordered according to data dependency
What has been proposed and implemented:
define an integer flag
assign it the number of the first iteration
every iteration, when it needs data, waits for the flag to become equal to its own iteration number
update the flag value when the data for the next iteration is ready
Like this:
(A reminder: the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)
void loopFunc(const char* config, int* value, volatile int *parSync, int iteration)
{
    int conf;
    conf = prepare(config);          // independent, does not change "config"
    while (*parSync != iteration)    // wait for previous to be ready
    {
        #pragma omp taskyield
    }
    *value = process(conf, *value);  // dependent, takes prev., produces next
    *parSync = iteration + 1;        // inform next about readiness
    return;
}

int main()
{
    int N = 100;
    char* configData;  // never changes
    int valueData = 0; // initial value
    volatile int parallelSync = 0;
    …
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++)
    {
        #pragma omp task shared(configData, valueData, parallelSync) firstprivate(i)
        loopFunc(configData, &valueData, &parallelSync, i);
    }
    #pragma omp taskwait
    …
}
What happened:
It fails. :)
The reason is that an OpenMP task occupies an OpenMP thread.
For example, suppose we define 5 OpenMP threads (as in the code above).
The for loop generates 100 tasks.
The OpenMP runtime assigns 5 arbitrary tasks to the 5 threads and starts them.
If the task with i=0 is not among the started tasks (which happens from time to time), the started tasks wait forever, occupy their threads forever, and the task with i=0 never gets started.
What's next?
I have no other ideas for how to implement the required mode of computation.
Current solution
Thanks to #parallelgeek below for the idea.
int main()
{
    int N = 10;
    char* configData;  // never changes
    int valueData = 0; // initial value
    volatile int parallelSync = 0;
    int workers;
    volatile int workingTasks = 0;
    ...
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    {
        workers = omp_get_num_threads() - 1; // reserve 1 thread for task generation
        for (int i = 0; i < N; i++)
        {
            while (workingTasks >= workers)
            {
                #pragma omp taskyield
            }
            #pragma omp atomic update
            workingTasks++;
            #pragma omp task shared(configData, valueData, parallelSync, workingTasks) firstprivate(i)
            {
                loopFunc(configData, &valueData, &parallelSync, i);
                #pragma omp atomic update
                workingTasks--;
            }
        }
        #pragma omp taskwait
    }
}
AFAIK, volatile doesn't prevent hardware reordering; that's why you could end up with a mess in memory where the data is not yet written while the flag is already seen as true by the consuming thread.
So, a little piece of advice: use C11 atomics instead, in order to ensure the visibility of the data. As far as I can see, GCC 4.9 supports C11 (see C11Status in GCC).
You could try to divide the generated tasks into groups of K tasks, where K == ThreadNum, and start generating subsequent tasks (after the tasks in the first group are generated) only when one of the running tasks finishes. That way you maintain the invariant that at any time only K tasks are running, scheduled on K threads.
Inter-task dependencies could also be met by using atomic flags from C11.
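As a hypothetical sketch of that suggestion, the volatile flag could become an atomic_int from <stdatomic.h> (prepare and process are the question's functions; this keeps the taskyield spin but gives the store/load pair the visibility guarantees volatile lacks):

#include <stdatomic.h>

void loopFunc(const char* config, int* value, atomic_int *parSync, int iteration)
{
    int conf = prepare(config);                // independent part, runs in parallel
    while (atomic_load(parSync) != iteration)  // wait for the previous iteration
    {
        #pragma omp taskyield
    }
    *value = process(conf, *value);            // ordered, dependent part
    atomic_store(parSync, iteration + 1);      // publish readiness to the next task
}

In main, parallelSync would correspondingly be declared as atomic_int instead of volatile int.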
