How to fork a large number of threads in OpenMP? - c

For some reason I need to stress my processor, and I want to fork a lot of threads in OpenMP. In pthreads you can do this easily with a for loop, since forking a thread is just a function call. But in OpenMP you have to write something like this:
#pragma omp parallel sections
{
    #pragma omp section
    {
        // section 0
    }
    #pragma omp section
    {
        // section 1
    }
    // ... repeat omp section n times
}
I am just wondering if there is any easier way to fork a large number of threads in OpenMP?

You hardly need to do anything special. Just write code for a compute-intensive task, put it inside a parallel region, and request the number of threads you want. To do that, call omp_set_dynamic(0) to disable dynamic adjustment of the team size (this helps you get the number of threads you ask for, though it is still not guaranteed), then omp_set_num_threads(NUM_THREADS) to request that many threads.
Each thread will then run its own copy of the code in the parallel region. Simple as that.
const int NUM_THREADS = 100;
omp_set_dynamic(0);
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
    // How many threads did we really get? Let's write it once only.
    #pragma omp single
    {
        std::cout << "using " << omp_get_num_threads() << " threads." << std::endl;
    }
    // write some compute-intensive code here
    // (be sure to print the result at the end, so that
    // the compiler doesn't throw the computation away as useless)
}
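For illustration, a minimal stress kernel along these lines could fill that region; the arithmetic below is arbitrary busy-work (not part of the original answer), chosen only so the compiler cannot optimise the loop away, and the reduced checksum is printed at the end for the same reason:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int NUM_THREADS = 100;
    omp_set_dynamic(0);
    omp_set_num_threads(NUM_THREADS);

    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        double x = omp_get_thread_num() + 1.0;
        long i;
        for (i = 0; i < 100000000L; ++i)   /* arbitrary busy-work */
            x = x * 1.0000001 + 1e-9;
        total += x;
    }
    printf("checksum = %f\n", total);      /* printing keeps the work observable */
    return 0;
}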

To do what you want, you get the thread number and then do different things based on which thread you are.
// it's not guaranteed you will actually get this many threads
omp_set_num_threads( NUM_THREADS );
int actual_num_threads;
#pragma omp parallel
{
    #pragma omp single
    {
        actual_num_threads = omp_get_num_threads();
    }
    int me = omp_get_thread_num();
    if ( me < actual_num_threads / 2 ) {
        section1();
    }
    else {
        section2();
    }
}

Related

Using omp_get_num_threads() inside the parallel section

I would like to set the number of threads in OpenMP. When I use
omp_set_num_threads(2);
printf("nthread = %d\n", omp_get_num_threads());
#pragma omp parallel for
...
I see nthread = 1. As explained here, the call reports the number of threads in the serial part, which is always 1. However, when I move it to the line after the #pragma, I get a compilation error saying that a for loop is expected after the pragma. So how can I fix that?
Well, yes, omp parallel for expects a loop on the next line. You can call omp_get_num_threads inside that loop. Outside a parallel region, you can call omp_get_max_threads to get the maximum number of threads that would be spawned. That is what you are looking for.
int max_threads = omp_get_max_threads();
#pragma omp parallel for
for(...) {
    int current_threads = omp_get_num_threads();
    assert(current_threads == max_threads);
}

#pragma omp parallel
{
    int current_threads = omp_get_num_threads();
    #pragma omp for
    for(...) {
        ...
    }
}

How to run a static parallel for loop without the main thread

I want to execute a function with multiple threads, without using the main thread. So this is what I want:
#pragma omp parallel num_threads(9)
{
    // do something
    #pragma omp for schedule(static,1)
    for(int i = 0; i < 10; i++)
        func(i); // random stuff
}
So I want func() to be executed by just 8 threads, without the main thread. Is that possible somehow?
So I want func() to be executed by just 8 threads, without the main thread. Is that possible somehow?
Yes, you can do it. However, you will have to implement the functionality of
#pragma omp for schedule(static,1)
yourself, since using that clause explicitly makes the compiler divide the loop iterations among all the threads in the team, including the master thread of that team, which in your code example is also the main thread. The code could look like the following:
#pragma omp parallel num_threads(9)
{
    // do something
    int thread_id = omp_get_thread_num();
    int total_threads = omp_get_num_threads();
    if(thread_id != 0) // all threads but the master thread
    {
        thread_id--;                       // shift all the ids
        total_threads = total_threads - 1;
        for(int i = thread_id; i < 10; i += total_threads)
            func(i); // random stuff
    }
    #pragma omp barrier
}
First, we ensure that all threads except the master execute the loop to be parallelized (i.e., if(thread_id != 0)); then we divide the iterations of the loop among the remaining threads (i.e., for(int i = thread_id; i < 10; i += total_threads)); and finally we make all threads wait for each other at the end of the parallel region (i.e., #pragma omp barrier).
If it isn't important which thread doesn't do the loop, another option would be to combine sections with the loop. This means nesting parallelism, which one should be very careful with, but it should work:
#pragma omp parallel sections num_threads(2)
{
    #pragma omp section
    { /* work for one thread */ }
    #pragma omp section
    {
        #pragma omp parallel for num_threads(8) schedule(static, 1)
        for (int i = 0; i < N; ++i) { /* ... */ }
    }
}
The main problem here is that one of those sections will most likely take much longer than the other, meaning that in the worst case (the loop finishing faster than the first section) all but one thread are idle most of the time.
If you really need the master thread to be outside the parallel region this might work (not tested):
#pragma omp parallel num_threads(2)
{
    #pragma omp master
    { /* work for master thread, other thread is NOT waiting */ }
    #pragma omp single
    {
        #pragma omp parallel for num_threads(8) schedule(static, 1)
        for (int i = 0; i < N; ++i) { /* ... */ }
    }
}
There is no guarantee that the master thread won't end up executing the single region as well, but if your cores aren't over-occupied it should at least be unlikely. One could even argue that if the second thread from the outer parallel region doesn't reach the single region in time, it is better for the master thread to have a chance of going in there, even if that means the second thread gets nothing to do.
As the single region should only have an implicit barrier at its end, while the master region doesn't contain any implicit barriers, the two should be able to execute in parallel as long as the master region comes before the single region. This assumes the single construct is implemented such that every thread has a chance of executing it, which I don't think the standard guarantees.
EDIT:
These solutions require nested parallelism to work, which is disabled by default in most implementations. It can be activated via the environment variable OMP_NESTED or by calling omp_set_nested().
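As a small sketch, nested parallelism can be switched on either from the shell (export OMP_NESTED=true) or in code; note that omp_set_nested() is deprecated as of OpenMP 5.0 in favour of omp_set_max_active_levels():
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_nested(1);              /* legacy call, deprecated in OpenMP 5.0       */
    omp_set_max_active_levels(2);   /* newer way: allow two active parallel levels */

    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(4)
        {
            printf("outer thread %d, inner thread %d\n",
                   omp_get_ancestor_thread_num(1), omp_get_thread_num());
        }
    }
    return 0;
}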

OpenMP unequal load without for loop

I have an OpenMP code that looks like the following
while(counter < MAX) {
    #pragma omp parallel reduction(+:counter)
    {
        // do monte carlo stuff
        // if a certain condition is met, counter is incremented
    }
}
Hence, the idea is that the parallel section gets executed by the available threads as long as the counter is below a certain value. Depending on the scenario (I am doing MC stuff here, so it is random), some computations take longer than others, so there is an imbalance between the workers, which becomes apparent because of the implicit barrier at the end of the parallel section.
It seems like #pragma omp parallel for has ways to mitigate this (e.g. the nowait clause or dynamic scheduling), but I can't use it, as I don't know an upper bound on the iteration count for the for loop.
Any ideas/design patterns how to deal with such a situation?
Best regards!
Run everything in a single parallel section and access the counter atomically.
int counter = 0;
#pragma omp parallel
while(1) {
    int local_counter;
    #pragma omp atomic read
    local_counter = counter;
    if (local_counter >= MAX) {
        break;
    }
    // do monte carlo stuff
    // if a certain condition is met, counter is incremented
    if (certain_condition) {
        #pragma omp atomic update
        counter++;
    }
}
You can't check directly in the while condition, because of the atomic access.
Note that this code will overshoot, i.e. counter > MAX is possible after the parallel section. Keep in mind that counter is shared and read/updated by many threads.
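If the overshoot is unacceptable, one possible variant (a sketch, not part of the original answer) is to guard the increment with a named critical section so the counter is never pushed past MAX; rand_r() here merely stands in for the Monte Carlo acceptance test and assumes a POSIX libc:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX 1000

int main(void)
{
    int counter = 0;

    #pragma omp parallel
    {
        unsigned int seed = (unsigned int)omp_get_thread_num() + 1;
        while (1) {
            int local_counter;
            #pragma omp atomic read
            local_counter = counter;
            if (local_counter >= MAX)
                break;

            /* stand-in for the Monte Carlo work and its acceptance test */
            int certain_condition = (rand_r(&seed) & 1);

            if (certain_condition) {
                #pragma omp critical(counter_update)
                {
                    if (counter < MAX)   /* never push the counter past MAX */
                        counter++;
                }
            }
        }
    }
    printf("counter = %d\n", counter);   /* exactly MAX here */
    return 0;
}
The price is that the increment is serialised by the critical section, while the read in the loop header stays a cheap atomic read.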

Openmp: increase for loop iteration number

I have this parallel for loop
struct p
{
    int n;
    double *l;
};
#pragma omp parallel for default(none) private(i) shared(p)
for (i = 0; i < p.n; ++i)
{
    DoSomething(p, i);
}
Now, it is possible that inside DoSomething(), p.n is increased because new elements are added to p.l. I'd like to process these new elements in parallel as well. The OpenMP manual states that parallel for can't be used with lists, so DoSomething() currently adds the new elements of p.l to another list, which is processed sequentially and then joined back into p.l. I don't like this workaround. Does anyone know a cleaner way to do this?
A construct to support dynamic execution was added in OpenMP 3.0: the task construct. Tasks are added to a queue and then executed as concurrently as possible. A sample code would look like this:
#pragma omp parallel private(i)
{
    #pragma omp single
    for (i = 0; i < p.n; ++i)
    {
        #pragma omp task
        DoSomething(p, i);
    }
}
This will spawn a new parallel region. One of the threads will execute the for loop and create a new OpenMP task for each value of i. Each DoSomething() call is converted into a task that later executes on an idle thread. There is a problem though: if one of the tasks adds new values to p.l, it might happen after the creator thread has already exited the for loop. This can be fixed using task synchronisation constructs and an outer loop, like this:
#pragma omp single
{
    i = 0;
    while (i < p.n)
    {
        for (; i < p.n; ++i)
        {
            #pragma omp task
            DoSomething(p, i);
        }
        #pragma omp taskwait
        #pragma omp flush
    }
}
The taskwait construct makes the thread wait until all queued tasks have executed. If new elements were added to the list, the condition of the while becomes true again and a new round of task creation happens. The flush construct is supposed to synchronise the memory view between threads, e.g. to update register-cached variables with the value from shared storage.
OpenMP 3.0 is supported by all modern C compilers except MSVC, which is stuck at OpenMP 2.0.
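To see which specification your compiler actually implements, the _OPENMP macro expands to the release date of the supported version (200805 corresponds to OpenMP 3.0), so a quick probe like this sketch, built with e.g. gcc -fopenmp, tells you whether tasks are available:
#include <stdio.h>

int main(void)
{
#ifdef _OPENMP
    /* _OPENMP holds the yyyymm date of the supported spec:
       200805 means OpenMP 3.0, so tasks are available when it is >= 200805. */
    printf("_OPENMP = %d\n", _OPENMP);
#else
    printf("compiled without OpenMP support\n");
#endif
    return 0;
}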

How do I ask OpenMP to create threads only once at each run of the program?

I am trying to parallelize a large program written by a third party. I cannot disclose the code, but I will try to give the closest example of what I wish to do.
Based on the code below: as you can see, since the "parallel" clause is INSIDE the while loop, the threads are created and destroyed on every iteration, which is costly.
Note that I cannot move the initializers etc. outside the while loop.
--Base code
void funcPiece0()
{
    // many lines and branches of code
}
void funcPiece1()
{
    // also many lines and branches of code
}
void funcCore()
{
    funcInitThis();
    funcInitThat();
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                funcPiece0();
            }//omp section
            #pragma omp section
            {
                funcPiece1();
            }//omp section
        }//omp sections
    }//omp parallel
}
int main()
{
    funcInitThis();
    funcInitThat();
    #pragma omp parallel
    {
        while(1)
        {
            funcCore();
        }
    }
}
What I seek to do is to avoid the per-iteration creation/destruction and have it happen only once, at the start/end of the program. I tried many variations of placing the "parallel" clause. What I tried boils down to the code below (ONLY ONE thread creation/destruction per program run):
--What I tried, but it failed with an "illegal access" in the initializing functions.
void funcPiece0()
{
    // many lines and branches of code
}
void funcPiece1()
{
    // also many lines and branches of code
}
void funcCore()
{
    funcInitThis();
    funcInitThat();
    //#pragma omp parallel
    //{
    #pragma omp sections
    {
        #pragma omp section
        {
            funcPiece0();
        }//omp section
        #pragma omp section
        {
            funcPiece1();
        }//omp section
    }//omp sections
    //}//omp parallel
}
int main()
{
    funcInitThis();
    funcInitThat();
    while(1)
    {
        funcCore();
    }
}
--
Any help would be highly appreciated!
Thanks!
OpenMP typically creates its worker threads only once and reuses them afterwards; the parallel pragma does not necessarily spawn new threads each time. How did you determine that threads are being spawned on every iteration?
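One way to check this, as a rough sketch (it assumes a POSIX system where pthread_t can be printed as an integer), is to log the OS-level thread ids across consecutive parallel regions; with a typical runtime thread pool the same ids reappear:
#include <omp.h>
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    /* Print the OS-level thread ids seen in two consecutive parallel regions.
       With a thread pool, the same ids show up in both rounds. */
    for (int round = 0; round < 2; ++round) {
        #pragma omp parallel num_threads(4)
        {
            #pragma omp critical
            printf("round %d: OpenMP thread %d runs on pthread %lu\n",
                   round, omp_get_thread_num(), (unsigned long)pthread_self());
        }
    }
    return 0;
}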
This can be done! The key is to move the loop inside one single parallel section and make sure that, whatever is used to decide whether to repeat or not, all threads make exactly the same decision. Below, I use shared variables and synchronize just before the loop condition is checked.
So this code:
initialize();
while (some_condition) {
    #pragma omp parallel
    {
        some_parallel_work();
    }
}
can be transformed into something like this:
#pragma omp parallel
{
    #pragma omp single
    {
        initialize(); // if initialization cannot be parallelized
    }
    while (some_condition_using_shared_variable) {
        some_parallel_work();
        update_some_condition_using_shared_variable();
        #pragma omp flush
    }
}
The most important thing is to be sure that every thread makes the same decision at the same points in your code.
As a final thought: essentially what one is doing is trading the overhead of creating/destroying threads (every time a #pragma omp parallel region begins/ends) for the synchronization overhead of the threads' decision making. I expect the synchronization to be faster, but there are so many parameters at play here that it may not always be.
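As a concrete illustration of that pattern (the names keep_going, iteration and some_parallel_work are invented for this sketch), one way to make every thread take the same decision is to update the shared flag inside a single region, whose implicit barrier doubles as the synchronisation point:
#include <omp.h>
#include <stdio.h>

static int keep_going = 1;   /* shared loop condition (invented name) */
static int iteration  = 0;   /* shared iteration counter              */

static void some_parallel_work(int it)
{
    (void)it;                /* each thread's share of the work goes here */
}

int main(void)
{
    #pragma omp parallel
    {
        while (keep_going) {
            some_parallel_work(iteration);

            #pragma omp barrier          /* everyone has finished this iteration */
            #pragma omp single
            {
                iteration++;             /* one thread updates the shared state  */
                if (iteration >= 100)
                    keep_going = 0;
            }                            /* implicit barrier + flush: all threads
                                            see the same keep_going before
                                            re-testing the loop condition        */
        }
    }
    printf("ran %d iterations\n", iteration);
    return 0;
}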
