Sync while-loops in different tasks with OpenMP - C

I'm trying to use OpenMP in C, but I don't know how to synchronize two while-loops running in different tasks.
Each step in the while-loops depends on all previous steps. It is an implementation for simulating hardware, if anyone is wondering.
I need both loops to start each iteration at the same time, and whichever loop finishes its iteration first must wait for the other. I don't know whether OpenMP has something built in for this or how I should do it manually.
#pragma omp parallel
{
    #pragma omp master
    {
        #pragma omp task
        while(...) // will be run some hundreds of millions of times
        {
            ...
        }
        #pragma omp task
        while(...) // will be run some hundreds of millions of times
        {
            ...
        }
    }
}
Thank you
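One way to get that lockstep behaviour is to drop the tasks and give each loop to a dedicated thread, separating the iterations with an explicit barrier. A minimal sketch (not from the original post; loopA_step/loopB_step, simulate and n_steps are placeholder names), assuming both loops run the same number of iterations:

#include <omp.h>

/* Placeholder bodies for the two while-loops (hypothetical names). */
static void loopA_step(long step) { /* ... work of the first loop ... */ }
static void loopB_step(long step) { /* ... work of the second loop ... */ }

void simulate(long n_steps)
{
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long step = 0; step < n_steps; step++)
        {
            if (id == 0)
                loopA_step(step);   /* body of the first while-loop  */
            else
                loopB_step(step);   /* body of the second while-loop */

            /* Neither thread starts iteration step+1 until both have
               finished iteration step. */
            #pragma omp barrier
        }
    }
}

Note that this only works if both loops execute the same number of iterations, so that every thread encounters the barrier the same number of times.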

Related

Splitting OpenMP threads on unbalanced tree

I am trying to make tree operations, like summing up the numbers in all the leaves of a tree, work in parallel using OpenMP. The problem I encounter is that the tree I work on is unbalanced (the number of children varies, and so does the size of the branches).
I currently have recursive functions working on those trees. What I am trying to achieve is this:
1) Split the threads at the first possible opportunity, say at a node with 2 children
2) Continue splitting from both resulting threads for at least 2-3 levels so all the threads are at work
It would look like this:
if (node->depth <= 3) {
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < node->children_no; i++) {
            int local_sum;
            local_sum = sum_numbers(node->children[i]);
            #pragma omp critical
            {
                global_sum += local_sum;
            }
        }
    }
} else {
    /* run the for loop without a parallel region */
}
The problem here is that when I allow nested parallelism, OpenMP seems to create a lot of threads in new teams. What I would like to achieve is this:
1) Every thread creating a new team can't take more threads than MAX_THREADS
2) Once a for loop is over in one subtree, the for loops still working in bigger subtrees take over the now-idle threads to finish their job faster
That way I hope there are never more threads than necessary, but they are all working all the time as long as there are more unfinished tasks in all the for loops combined than created threads.
From the docs it looks like parallel for uses only threads already created in the parallel region. Is it possible to make it work as described, or do I need to change the implementation to list the tasks from the various branches first and then run a parallel for loop over that list?
Just for the record, I'll write an answer to this question based on High Performance Mark's comment (a comment with which I agree, too). Using OpenMP tasks here adds flexibility to the parallelism even if the tree is unbalanced, supports recursion and spawns enough work for all the threads (although you should verify this using tools such as Vampir, Paraver and/or HPCToolkit).
The resulting code could look like this:
if (node->depth <= 3) {
    #pragma omp parallel shared(global_sum)
    #pragma omp single
    {
        for (int i = 0; i < node->children_no; i++) {
            #pragma omp task
            {
                int local_sum = sum_numbers(node->children[i]);
                #pragma omp critical
                global_sum += local_sum;
            }
        }
    }
} else {
    /* run the for loop without a parallel region */
}
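To make the deeper levels of an unbalanced tree also generate work, sum_numbers itself can spawn tasks recursively. A hedged sketch of that idea (not from the answer; the node type, its field names and the depth cutoff are assumptions made for illustration), meant to be called from inside the parallel/single region above:

/* Hypothetical node type, assumed for illustration. */
typedef struct node {
    int value;
    int children_no;
    struct node **children;
    int depth;
} node;

int sum_numbers(node *n)
{
    int sum = n->value;
    for (int i = 0; i < n->children_no; i++) {
        if (n->depth <= 3) {
            /* Shallow levels: hand each child subtree to a task. */
            #pragma omp task shared(sum)
            {
                int child_sum = sum_numbers(n->children[i]);
                #pragma omp atomic
                sum += child_sum;
            }
        } else {
            /* Deep levels: recurse serially to keep tasks coarse enough. */
            sum += sum_numbers(n->children[i]);
        }
    }
    #pragma omp taskwait   /* wait for this node's child tasks before returning sum */
    return sum;
}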

OpenMP nested parallelism with sections

I have the following situation: I have a big outer for loop that essentially contains a function foo(). Within foo(), there are bar1() and bar2(), which can be carried out concurrently, and bar3(), which needs to be performed after bar1() and bar2() are done. I have parallelized the big outer loop, and put bar1() and bar2() in sections. I assume that each outer-loop thread will generate its own section threads, is this correct?
If the assumption above is correct, how do I get bar3() to run only after the threads carrying out bar1() and bar2() have finished? If I use critical, it will halt all threads, including the outer for loop. If I use single, there's no guarantee that bar1() and bar2() will have finished.
If the assumption above is not correct, how do I force the outer-loop threads to reuse threads for bar1() and bar2() and not generate new threads every time?
Note that temp is a variable whose init and clear are expensive, so I pull init and clear outside the for loop. It further complicates matters that both bar1() and bar2() need some kind of temp variable. Optimally, temp should be initialized and cleared for each thread that is created, but I'm not sure how to force that for the threads generated for the sections. (Without the sections pragma, it works fine in the parallel block.)
main() {
    #pragma omp parallel private(temp)
    {
        init(temp);
        #pragma omp for schedule(static)
        for (i = 0; i < 100000; i++) {
            foo(temp);
        }
        clear(temp);
    }
}
foo() {
    init(x); init(y);
    #pragma omp sections
    {
        #pragma omp section
        { bar1(x, temp); }
        #pragma omp section
        { bar2(y, temp); }
    }
    bar3(x, y, temp);
}
I believe that simply parallelizing the for loop should give you enough parallelism to saturate the CPU's resources. But if you really want to run the two functions in parallel, the following code should work.
main() {
    #pragma omp parallel private(temp)
    {
        init(temp);
        #pragma omp for schedule(static)
        for (i = 0; i < 100000; i++) {
            foo(temp);
        }
        clear(temp);
    }
}
foo() {
    init(x); init(y);
    #pragma omp task
    bar1(x, temp);
    bar2(y, temp);
    #pragma omp taskwait
    bar3(x, y, temp);
}

Task scheduling points of OpenMP tasks

I have the following code:
#pragma omp parallel
{
    #pragma omp single
    {
        for(node* p = head; p; p = p->next)
        {
            preprocess(p);
            #pragma omp task
            process(p);
        }
    }
}
I would like to know when the threads start computing the tasks. As soon as a task is created with #pragma omp task, or only after all tasks have been created?
Edit:
int* array = (int*)malloc...
#pragma omp parallel
{
    #pragma omp single
    {
        while(...){
            preprocess(array);
            #pragma omp task firstprivate(array)
            process(array);
        }
    }
}
In your example, the worker threads can start executing the created tasks as soon as they have been created. There's no need to wait for the completion of the creation of all tasks before the first task is executed.
So, basically, after the first task has been created by the producer, one worker will pick it up and start executing the task. However, be advised that the OpenMP runtime and compiler have certain freedom in this. They might defer execution a bit or even execute some of the tasks in place.
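One knob that influences this is the if clause on the task construct: when its expression evaluates to false, the task is undeferred and executed immediately by the creating thread. A hedged sketch (not from this answer; the work field, WORK_THRESHOLD cutoff and run() wrapper are made up for illustration):

#define WORK_THRESHOLD 1000   /* illustrative cutoff */

typedef struct node { struct node *next; int work; /* ... */ } node;

void preprocess(node *p);
void process(node *p);

void run(node *head)
{
    #pragma omp parallel
    #pragma omp single
    for (node *p = head; p; p = p->next) {
        preprocess(p);
        /* Cheap items run in place on the producer thread; expensive
           ones become deferred tasks picked up by the worker threads. */
        #pragma omp task if(p->work > WORK_THRESHOLD) firstprivate(p)
        process(p);
    }
}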
If you want to read up on the details, you will need to dig through the OpenMP specification at www.openmp.org. It's a bit hard to read, but it is the definitive source of information.
Cheers,
-michael

How to nest parallel loops in a sequential loop with OpenMP

I am currently working on a matrix computation with OpenMP. I have several loops in my code, and instead of calling #pragma omp parallel for [...] for each loop (which would create all the threads and destroy them right after), I would like to create all of them at the beginning and delete them at the end of the program in order to avoid overhead.
I want something like :
#pragma omp parallel
{
    #pragma omp for[...]
    for(...)
    #pragma omp for[...]
    for(...)
}
The problem is that I have some parts that have to be executed by only one thread, but inside a loop, which itself contains loops that have to be executed in parallel... This is how it looks:
//has to be executed by only one thread
int a = 0, b = 0, c = 0;
for (a = 0; a < 5; a++)
{
    //some stuff
    //loops which have to be parallelized
    #pragma omp parallel for private(b,c) schedule(static) collapse(2)
    for (b = 0; b < 8; b++)
        for (c = 0; c < 10; c++)
        {
            //some other stuff
        }
    //end of the parallel zone
    //stuff to be executed by only one thread
}
(The loop bounds are quite small in my example. In my program the number of iterations can go up to 20,000...)
One of my first ideas was to do something like this:
//has to be executed by only one thread
#pragma omp parallel //creating all the threads at the beginning
{
    #pragma omp master //or single
    {
        int a = 0, b = 0, c = 0;
        for (a = 0; a < 5; a++)
        {
            //some stuff
            //loops which have to be parallelized
            #pragma omp for private(b,c) schedule(static) collapse(2)
            for (b = 0; b < 8; b++)
                for (c = 0; c < 10; c++)
                {
                    //some other stuff
                }
            //end of the parallel zone
            //stuff to be executed by only one thread
        }
    }
} //deleting all the threads
It doesn't compile, I get this error from gcc: "work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region".
I know it surely comes from the "wrong" nesting, but I can't understand why it doesn't work. Do I need to add a barrier before the parallel zone? I am a bit lost and don't know how to solve it.
Thank you in advance for your help.
Cheers.
Most OpenMP runtimes don't "create all the threads and destroy them right after". The threads are created at the beginning of the first OpenMP section and destroyed when the program terminates (at least that's how Intel's OpenMP implementation does it). There's no performance advantage to using one big parallel region instead of several smaller ones.
Intel's runtime (which is open source and can be found here) has options to control what threads do when they run out of work. By default they'll spin for a while (in case the program immediately starts a new parallel section), then they'll put themselves to sleep. If they do sleep, it will take a bit longer to start them up for the next parallel section, but this depends on the time between regions, not the syntax.
In the last of your code outlines you declare a parallel region, inside that use a master directive to ensure that only the master thread executes a block, and inside the master block attempt to parallelise a loop across all threads. You claim to know that the compiler errors arise from incorrect nesting but wonder why it doesn't work.
It doesn't work because distributing work to multiple threads within a region of code which only one thread will execute doesn't make any sense.
Your first pseudo-code is better, but you probably want to extend it like this:
#pragma omp parallel
{
    #pragma omp for[...]
    for(...)
    #pragma omp single
    { ... }
    #pragma omp for[...]
    for(...)
}
The single directive ensures that the block of code it encloses is only executed by one thread. Unlike the master directive, single also implies a barrier at exit; you can change this behaviour with the nowait clause.
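Applied to the loop structure in the question, the whole computation can live inside one parallel region, with the serial parts in a single block and the b/c loops as a worksharing for. A sketch of that pattern (not from the answer; the original placeholder comments are kept):

#pragma omp parallel   /* one team for the whole computation */
{
    for (int a = 0; a < 5; a++)   /* every thread runs the outer loop redundantly */
    {
        #pragma omp single
        {
            /* stuff to be executed by only one thread */
        }   /* implicit barrier: the team waits for the single block */

        #pragma omp for schedule(static) collapse(2)
        for (int b = 0; b < 8; b++)
            for (int c = 0; c < 10; c++)
            {
                /* some other stuff, divided among the team */
            }
        /* implicit barrier at the end of the worksharing for */
    }
}

Whatever the single block computes for the following b/c loops has to be stored in shared variables, so the other threads see it after the implicit barrier.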

How do I ask OpenMP to create threads only once at each run of the program?

I am trying to parallelize a large program that is written by a third-party. I cannot disclose the code, but I will try and give the closest example of what I wish to do.
Consider the code below. As you can see, since the "parallel" clause is INSIDE the while loop, the threads are created and destroyed on each iteration, which is costly.
Assume that I cannot move the initializers, etc. outside the while loop.
--Base code
void funcPiece0()
{
    // many lines and branches of code
}
void funcPiece1()
{
    // also many lines and branches of code
}
void funcCore()
{
    funcInitThis();
    funcInitThat();
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                funcPiece0();
            }//omp section
            #pragma omp section
            {
                funcPiece1();
            }//omp section
        }//omp sections
    }//omp parallel
}
int main()
{
    funcInitThis();
    funcInitThat();
    #pragma omp parallel
    {
        while(1)
        {
            funcCore();
        }
    }
}
What I seek to do is to avoid the per-iteration creation/destruction and make it happen once at the start/end of the program. I tried many variations of where to place the "parallel" clause. What I ended up with, which basically has the same essence, is below (ONLY ONE thread creation/destruction per program run):
--What I tried, but it failed with "illegal access" in the initializing functions.
void funcPiece0()
{
    // many lines and branches of code
}
void funcPiece1()
{
    // also many lines and branches of code
}
void funcCore()
{
    funcInitThis();
    funcInitThat();
    //#pragma omp parallel
    //{
        #pragma omp sections
        {
            #pragma omp section
            {
                funcPiece0();
            }//omp section
            #pragma omp section
            {
                funcPiece1();
            }//omp section
        }//omp sections
    //}//omp parallel
}
int main()
{
    funcInitThis();
    funcInitThat();
    while(1)
    {
        funcCore();
    }
}
--
Any help would be highly appreciated!
Thanks!
OpenMP only creates the worker threads at start; the parallel pragma does not spawn new threads. How did you determine that threads are being spawned on every iteration?
This can be done! The key here is to move the loop inside one single parallel section and make sure that, whatever is used to decide whether to repeat or not, all threads make exactly the same decision. I've used shared variables and a synchronization just before the loop condition is checked.
So this code:
initialize();
while (some_condition) {
    #pragma omp parallel
    {
        some_parallel_work();
    }
}
can be transformed into something like this:
#pragma omp parallel
{
    #pragma omp single
    {
        initialize(); //if initialization cannot be parallelized
    }
    while (some_condition_using_shared_variable) {
        some_parallel_work();
        update_some_condition_using_shared_variable();
        #pragma omp flush
    }
}
The most important thing is to be sure that every thread makes the same decision at the same points in your code.
As a final thought: essentially what one is doing is trading the overhead of creating/destroying threads (every time a #pragma omp parallel section begins/ends) for synchronization overhead in the threads' decision making. I think the synchronization should be faster, but there are so many parameters at play here that this may not always be the case.
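A slightly more concrete sketch of the same transformation (not from the answer; keep_running and evaluate_condition() are illustrative names), where one thread updates the shared flag and the barriers guarantee every thread tests the same value:

int keep_running = 1;   /* shared loop-control flag (illustrative) */

#pragma omp parallel shared(keep_running)
{
    while (keep_running)
    {
        some_parallel_work();              /* may contain omp for / sections */

        #pragma omp barrier                /* everyone finished this iteration */
        #pragma omp single
        {
            keep_running = evaluate_condition();   /* one thread decides */
        }   /* implicit barrier + flush: all threads see the new value */
    }
}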
