Partially parallel loops using OpenMP tasks - C

Prerequisites:
parallel engine: OpenMP 3.1+ (can be OpenMP 4.0 if needed)
parallel constructs: OpenMP tasks
compiler: gcc 4.9.x (supports OpenMP 4.0)
Input:
C code with loops
loops have a cross-iteration data dependency: iteration "i+1" needs data from iteration "i" (only this kind of dependency, nothing else)
the loop body can be partially dependent
the loop cannot be split into two loops; the loop body must remain intact
anything reasonable can be added to the loop or to the loop body function definition
Code sample:
(Here the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)
void loopFunc(const char* config, int* value)
{
    int conf;
    conf = prepare(config);          // independent, does not change "config"
    *value = process(conf, *value);  // dependent, takes prev. value, produces next
    return;
}
int main()
{
    int N = 100;
    char* configData;  // never changes
    int valueData = 0; // initial value
    …
    for (int i = 0; i < N; i++)
    {
        loopFunc(configData, &valueData);
    }
    …
}
Need to:
parallelise loop using omp tasks (omp for / omp sections cannot be used)
"prepare" functions should be executed in parallel with other "prepare" or "process" functions
"process" functions should be ordered according to the data dependency
What has been proposed and implemented:
define an integer flag
assign it the number of the first iteration
each iteration, when it needs data, waits for the flag to equal its own iteration number
update the flag value when the data for the next iteration is ready
Like this:
(A reminder: the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)
void loopFunc(const char* config, int* value, volatile int* parSync, int iteration)
{
    int conf;
    conf = prepare(config);          // independent, does not change "config"
    while (*parSync != iteration)    // wait for previous iteration to be ready
    {
        #pragma omp taskyield
    }
    *value = process(conf, *value);  // dependent, takes prev. value, produces next
    *parSync = iteration + 1;        // inform the next iteration about readiness
    return;
}
int main()
{
    int N = 100;
    char* configData;  // never changes
    int valueData = 0; // initial value
    volatile int parallelSync = 0;
    …
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++)
    {
        #pragma omp task shared(configData, valueData, parallelSync) firstprivate(i)
        loopFunc(configData, &valueData, &parallelSync, i);
    }
    #pragma omp taskwait
    …
}
What happened:
It fails. :)
The reason is that an OpenMP task occupies an OpenMP thread.
For example, suppose we define 5 OpenMP threads (as in the code above).
The "for" loop generates 100 tasks.
The OpenMP runtime assigns 5 arbitrary tasks to the 5 threads and starts them.
If the task with i=0 is not among the started tasks (which happens from time to time), the running tasks wait forever, occupying their threads forever, and the task with i=0 never gets started.
What's next?
I have no other ideas for how to implement the required mode of computation.
Current solution
Thanks to #parallelgeek below for the idea.
int main()
{
    int N = 10;
    char* configData;  // never changes
    int valueData = 0; // initial value
    volatile int parallelSync = 0;
    int workers;
    volatile int workingTasks = 0;
    ...
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    {
        workers = omp_get_num_threads() - 1; // reserve 1 thread for task generation
        for (int i = 0; i < N; i++)
        {
            while (workingTasks >= workers)
            {
                #pragma omp taskyield
            }
            #pragma omp atomic update
            workingTasks++;
            #pragma omp task shared(configData, valueData, parallelSync, workingTasks) firstprivate(i)
            {
                loopFunc(configData, &valueData, &parallelSync, i);
                #pragma omp atomic update
                workingTasks--;
            }
        }
        #pragma omp taskwait
    }
}

AFAIK volatile doesn't prevent hardware reordering; that's why you could end up with a mess in memory where the data is not written yet while the flag is already seen by the consuming thread as true.
That's why a little piece of advice: use C11 atomics instead, in order to ensure the visibility of the data. As far as I can see, gcc 4.9 supports C11 (see the C11Status page on the GCC wiki).
You could try to divide the generated tasks into groups of K tasks, where K == ThreadNum, and start generating a subsequent task (after the tasks in the first group are generated) only once any of the running tasks has finished. That way you keep the invariant that at any time only K tasks are running, scheduled on K threads.
Inter-task dependencies could also be met by using atomic flags from C11.
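For illustration, here is a minimal sketch (not from the answer above) of the flag-based loopFunc rewritten with C11 atomics, assuming the same prepare/process helpers as earlier: the release store publishes the write to *value before the flag update, and the acquire load on the consumer side pairs with it.

#include <stdatomic.h>

void loopFunc(const char* config, int* value, atomic_int* parSync, int iteration)
{
    int conf = prepare(config);      // independent part, runs concurrently
    // acquire load: once the flag equals our iteration, the previous
    // iteration's write to *value is guaranteed to be visible
    while (atomic_load_explicit(parSync, memory_order_acquire) != iteration)
    {
        #pragma omp taskyield
    }
    *value = process(conf, *value);  // dependent part
    // release store: publish *value before signalling the next iteration
    atomic_store_explicit(parSync, iteration + 1, memory_order_release);
}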

Related

OpenMP recursive tasks with shared data results in a huge slowdown

I am trying to implement an n-queens solver with OpenMP tasks. However, the game board is set in the main function and I am passing it to a function.
So far, what I have is:
bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        // #pragma omp critical
        // print_solution(board);
        #pragma omp critical
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            int new_board[N][N];
            board[i][col] = 1;
            copy(board, new_board);
            #pragma omp task firstprivate(col)
            solve_NQueens(new_board, col + 1);
            board[i][col] = 0;
        }
    }
    return SOLUTION_EXISTS;
}
The initial call to this function in main is:
#pragma omp parallel if(omp_get_num_threads() > 1)
{
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            solve_NQueens(board, 0);
        }
    }
}
Parameters:
// these are global
#define N 14
bool SOLUTION_EXISTS = false;
// these are in main
int board[N][N];
memset(board, 0, sizeof(board));
Compiler:
gcc
Number of threads: 4
I used taskgroup to wait for all tasks before getting the result, and I had to copy the game board for each task (which is a hard job when N is set to 14, since there are 356k solutions).
I tried making board firstprivate or private, using taskwait inside and outside of the loop, using taskgroup inside the for loop, and so on. I need some advice to optimize this logic.
Note: putting a taskgroup in the for loop under the if clause also helps, but this is much slower than expected.
First of all, there is a huge issue in your code: solve_NQueens can submit tasks recursively and return before all the tasks are actually completed. You need to put a synchronization point before the return so that the value of SOLUTION_EXISTS is valid (using either a #pragma omp taskwait or a #pragma omp taskgroup).
In terms of performance, there are multiple issues.
The main problem is that too many tasks are created: you create a task in each recursive call. While creating a few tasks brings the needed parallelism, creating too many of them also introduces significant overhead. This overhead can be much higher than the execution of the tail calls. A cut-off strategy can be implemented to reduce the overhead: the general idea is to create tasks only for the first recursive calls. In your case, you can do it with a clause if(col < 3) at the end of the #pragma omp task. Please note that 3 is an arbitrary value; you may need to tune this threshold.
Moreover, board is copied (twice) during task creation, since new_board is a statically sized local array and the variables referenced by an OpenMP task are implicitly firstprivate, i.e. copied. Your additional copy is not needed, and the line board[i][col] = 0; is useless if the code is compiled with OpenMP support (otherwise the pragmas are ignored and this is not true). However, the additional overhead should not be critical if you fix the problem described above.
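Putting the two fixes together, a minimal sketch (illustrative only, not code from the answer; it assumes the same can_be_placed and copy helpers) might look like this:

bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        #pragma omp critical
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            int new_board[N][N];
            board[i][col] = 1;
            copy(board, new_board);
            /* cut-off: spawn tasks only for the first levels of the tree;
               deeper calls recurse sequentially (3 is an arbitrary threshold) */
            #pragma omp task firstprivate(col) if(col < 3)
            solve_NQueens(new_board, col + 1);
            board[i][col] = 0;
        }
    }
    /* synchronise before returning so the value of SOLUTION_EXISTS is valid */
    #pragma omp taskwait
    return SOLUTION_EXISTS;
}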

Is a function without loop parallelizable?

Considering the code below, can we consider it parallel even if there are no loops?
#include <omp.h>
int main(void) {
    #pragma omp parallel
    {
        int a = 1;
        a = 0;
    }
    return 0;
}
Direct Answer:
Yes. Here, this section of your code,
int a = 1;
a = 0;
runs in parallel, P times, where P is the number of threads in the team (by default, typically the number of cores on your machine).
For example, on a four-core machine, the following code (with the relevant includes),
int main(void) {
    #pragma omp parallel
    {
        printf("Thread number %d\n", omp_get_thread_num());
    }
    return 0;
}
would output:
Thread number 0
Thread number 1
Thread number 2
Thread number 3
Note that when running in parallel, there is no guarantee on the order of the output, so the output could just as likely be something like:
Thread number 1
Thread number 2
Thread number 0
Thread number 3
Additionally, if you wanted to specify the number of threads used in the parallel region, instead of #pragma omp parallel you could write, #pragma omp parallel num_threads(4).
Further Explanation:
If you are still confused, it may be helpful to better understand the difference between parallel for loops and parallel code regions.
#pragma omp parallel tells the compiler that the following code block may be executed in parallel. It guarantees that all code within the parallel region will have finished execution before continuing to subsequent code.
In the following (toy) example, the programmer is guaranteed that after the parallel region, the array will have all entries set to zero.
int *arr = malloc(sizeof(int) * 128);
const int P = omp_get_max_threads();
#pragma omp parallel num_threads(P)
{
    // each thread zeros its own contiguous slice of the array;
    // the last thread also takes any remainder
    int chunk = 128 / P;
    int local_start = omp_get_thread_num() * chunk;
    int local_end = (omp_get_thread_num() == P - 1) ? 128 : local_start + chunk;
    for (int i = local_start; i < local_end; ++i) {
        arr[i] = 0;
    }
}
// any code from here onward is guaranteed that arr contains all zeros!
Ignoring differences in scheduling, this task could equivalently be accomplished using a parallel for loop as follows:
int *arr = malloc(sizeof(int) * 128);
const int P = omp_get_max_threads();
#pragma omp parallel for num_threads(P)
for (int i = 0; i < 128; ++i) {
    arr[i] = 0;
}
// any code from here onward is guaranteed that arr contains all zeros!
Essentially, #pragma omp parallel enables you to describe regions of code that can execute in parallel - this can be much more flexible than a parallel for loop. In contrast, #pragma omp parallel for should generally be used to parallelize loops with independent iterations.
I can further elaborate on the differences in performance, if you would like.

Reordered output despite critical section

I'm trying to adapt this Pascal's triangle program to a parallel program using OpenMP. I used the for directive to parallelize the for loop in the printPas function, and put the conditional statements inside a critical section so that only one thread can print at a time, but it seems like I'm still getting a data race because my output is really inconsistent.
#include <stdio.h>
#ifndef N
#define N 2
#endif
unsigned int t1[2*N+1], t2[2*N+1];
unsigned int *e = t1, *r = t2;
int l = 0;
// the problem is here in this function
void printPas() {
    #pragma omp parallel for private(l)
    for (l = 0; l < 2*N+1; l++) {
        #pragma omp critical
        if (e[l] == 0)
            printf(" ");
        else
            printf("%6u", e[l]);
    }
    printf("\n");
}
void update() {
    r[0] = e[1];
    #pragma omp parallel for
    for (int u = 1; u < 2*N; u++)
        r[u] = e[u-1] + e[u+1];
    r[2*N] = e[2*N-1];
    unsigned int *tmp = e; e = r; r = tmp;
}
int main() {
    e[N] = 1;
    for (int i = 0; i < N; i++) {
        printPas();
        update();
    }
    printPas();
}
Your critical section causes the prints to run sequentially. Therefore, the code takes longer using 'critical' than it would if you didn't attempt to parallelise it.
Since different threads do the printing, you have no idea which one will reach the critical section first, so the for loop will not execute in the order you would hope.
I suggest either removing the parallel directive (#pragma omp parallel for private(l)), or removing the 'critical' and accepting that the prints will come out in a different order every time.
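Not part of the answer above, but for completeness: if ordered output from a parallel loop were genuinely required, OpenMP's ordered construct would enforce iteration order, at the cost of serialising the prints anyway. A minimal sketch:

void printPas() {
    // the ordered clause on the loop directive permits ordered regions;
    // each iteration's ordered region executes in iteration order
    #pragma omp parallel for ordered
    for (int l = 0; l < 2*N+1; l++) {
        #pragma omp ordered
        {
            if (e[l] == 0)
                printf(" ");
            else
                printf("%6u", e[l]);
        }
    }
    printf("\n");
}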

Execute for loop iterations in openmp in order with dynamic schedule

I'd like to run a for loop in openmp with dynamic schedule.
#pragma omp for schedule(dynamic, chunk) private(i) nowait
for (i = 0; i < n; i++) {
    // loop code here
}
and I'd like each thread to execute ordered chunks, such that
e.g. thread 1 -> iterations 0 to k
thread 2 -> iterations k+1 to k+chunk
etc.
A static schedule partly does what I want, but I'd like to dynamically load-balance the iterations.
Neither does the ordered clause, if I understood correctly what it does.
My question is: how do I make sure that the chunks assigned are ordered chunks?
I am using OpenMP 3.1 with gcc.
You can implement this yourself without resorting to omp for, which is considered a convenience feature by expert OpenMP programmers.
The following roughly illustrates what you might do. Please check the arithmetic carefully.
#pragma omp parallel
{
    int me = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int chunk = (n + nt - 1) / nt;  // divide n by nt, rounding up
    int start = me * chunk;
    int end = (me + 1) * chunk;
    if (end > n) end = n;
    for (int i = start; i < end; i++) {
        /* do work */
    }
} /* end parallel */
This does not do any dynamic load-balancing. You can do that yourself by assigning loop iterations unevenly to threads if you know the cost function a priori. You might also read up on the inspector-executor model.
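If dynamically assigned, ordered chunks are required, one hedged sketch (not from the answer above; next is an illustrative name, while n and chunk are as in the question) is a shared counter from which threads claim the next chunk: chunk k is always handed out before chunk k+1, yet whichever thread is free takes the next one, which gives dynamic balancing.

int next = 0; /* shared chunk counter */
#pragma omp parallel
{
    while (1) {
        int start;
        /* atomically claim the next chunk of iterations, in order */
        #pragma omp atomic capture
        { start = next; next += chunk; }
        if (start >= n) break;
        int end = start + chunk;
        if (end > n) end = n;
        for (int i = start; i < end; i++) {
            /* do work */
        }
    }
}

The atomic capture form used here is available from OpenMP 3.1, which matches the question's constraints.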

Counting does not work properly in OpenMP

I have the function
void collatz(int startNumber, int endNumber, int* iter, int nThreads)
{
    int i, n, counter;
    int isodd; /* 1 if n is odd, 0 if even */
    #pragma omp parallel for
    for (i = startNumber; i <= endNumber; i++)
    {
        counter = 0;
        n = i;
        omp_set_num_threads(nThreads);
        while (n > 1)
        {
            isodd = n % 2;
            if (isodd)
                n = 3*n + 1;
            else
                n /= 2;
            counter++;
        }
        iter[i - startNumber] = counter;
    }
}
It works as I wish when running serially (i.e. compiling without OpenMP, or commenting out #pragma omp parallel for and omp_set_num_threads(nThreads);). However, the parallel version produces the wrong result, and I think it is because the counter variable needs to be set to zero at the beginning of each loop iteration, and perhaps another thread can work with the non-zeroed counter value. But even if I use #pragma omp parallel for private(counter), the problem still occurs. What am I missing?
I compile the program as C89.
Inside your OpenMP parallel region, you are assigning values to the scalar variables counter, n and isodd. These therefore cannot simply be shared, as they are by default; you need to pay extra attention to them.
A quick analysis shows that their values are only meaningful inside the parallel region, and only for the current thread, so it becomes clear that they need to be declared private.
Adding a private(counter, n, isodd) clause to your #pragma omp parallel for directive should fix the issue.
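For concreteness, a minimal sketch of the corrected function (this also hoists omp_set_num_threads out of the loop body, a minor cleanup not covered by the answer):

void collatz(int startNumber, int endNumber, int* iter, int nThreads)
{
    int i, n, counter;
    int isodd; /* 1 if n is odd, 0 if even */
    omp_set_num_threads(nThreads); /* set the team size once, before the region */
    /* counter, n and isodd are private: each thread gets its own copies */
    #pragma omp parallel for private(counter, n, isodd)
    for (i = startNumber; i <= endNumber; i++)
    {
        counter = 0;
        n = i;
        while (n > 1)
        {
            isodd = n % 2;
            if (isodd)
                n = 3*n + 1;
            else
                n /= 2;
            counter++;
        }
        iter[i - startNumber] = counter; /* disjoint writes, safe to share iter */
    }
}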
