OpenMP - Overhead when Spawning and Terminating Threads in for-loop - c

I'm fairly new to OpenMP and I have some Monte Carlo code I am trying to parallelise.
I have a for-loop which must be run serially and which calls the new_value() function:
for(int i = 0; i < MAX_VAL; i++)
new_value();
This function opens a parallel region on each call:
void new_value()
{
    #pragma omp parallel default(shared)
    {
        int thread_rank = omp_get_thread_num();
        #pragma omp for schedule(static)
        for(int i = 0; i < N; i++)
            arr[i] = update(thread_rank);
    }
}
This works, but there is a significant amount of overhead associated with spawning and terminating the threads. I was wondering if anyone knows a way to spawn the threads (and obtain thread_rank) before entering the loop, without parallelising the outer loop itself?
There are several existing questions asking the same thing, but their answers are either wrong or missing; examples include:
This question asks a similar thing; the answer suggests creating a parallel region and then using #pragma omp single on the outer-most loop, but as 'Joe C' said in the answer's comments, this does not work. I can confirm that the program just hangs.
This question asks the exact same thing, but the (unticked) answer is just to parallelise the outer-most loop, running the loop 4000 * num_threads times, which is neither what the asker wanted nor what I want.

The answer to your second question is actually correct.
#pragma omp parallel
for(int i = 0; i < MAX_VAL; i++)
    new_value();

void new_value()
{
    int thread_rank = omp_get_thread_num();
    #pragma omp for schedule(static)
    for(int i = 0; i < N; i++)
        arr[i] = update(thread_rank);
}
This is correct and exactly what you want. It has the same semantics as the code in your question. The difference is that there is only one parallel region and that the outer loop variable i is now computed redundantly by every thread in the team. Note that the outer loop is not parallelized in a work-sharing manner (there is no omp parallel for on it).
So when this code is run, num_threads threads will execute the outer loop header once, call new_value, and reach the omp for, all with their private i == 0. They will share the work of the inner loop. Then they will wait at the implicit barrier until everyone has completed the inner loop, increment their private i, and repeat... I hope it is clear now that this gives the same behavior with respect to the inner loop as before, with less thread-management overhead.
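For reference, a complete, compilable sketch of this pattern might look like the following (MAX_VAL, N, arr and update() are placeholders standing in for the code in the question):
#include <omp.h>
#include <stdio.h>

#define MAX_VAL 1000
#define N 100000

double arr[N];

double update(int thread_rank) { return (double)thread_rank; }   /* dummy work */

void new_value(void)
{
    int thread_rank = omp_get_thread_num();
    #pragma omp for schedule(static)         /* orphaned worksharing construct */
    for (int i = 0; i < N; i++)
        arr[i] = update(thread_rank);
}

int main(void)
{
    #pragma omp parallel default(shared)     /* threads are created once here */
    for (int i = 0; i < MAX_VAL; i++)
        new_value();                         /* contains the orphaned omp for */

    printf("%f\n", arr[0]);
    return 0;
}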

Related

How to run a static parallel for loop without the main thread

I want to execute a function with multiple threads, without using the main thread. So this is what I want:
# pragma omp parallel num_threads(9)
{
    // do something
    # pragma omp for schedule(static,1)
    for(int i = 0; i < 10; i++)
        func(i); // random stuff
}
So I want func() to be executed by just 8 threads, without the main thread. Is that possible somehow?
Yes, you can do it. However, you will have to implement the functionality of
#pragma omp for schedule(static,1)
since explicitly using that directive will make the compiler automatically divide the iterations of the loop among the threads in the team, including the master thread of that team, which in your code example is also the main thread. The code could look like the following:
# pragma omp parallel num_threads(9)
{
    // do something
    int thread_id = omp_get_thread_num();
    int total_threads = omp_get_num_threads();
    if(thread_id != 0) // all threads but the master thread
    {
        thread_id--; // shift all the ids
        total_threads = total_threads - 1;
        for(int i = thread_id; i < 10; i += total_threads)
            func(i); // random stuff
    }
    #pragma omp barrier
}
First, we ensure that all threads except the master execute the loop to be parallelized (i.e., if(thread_id != 0)), then we divide the iterations of the loop among the remaining threads (i.e., for(int i = thread_id; i < 10; i += total_threads)), and finally we ensure that all threads wait for each other at the end of the parallel region (i.e., #pragma omp barrier).
If it isn't important which thread doesn't do the loop, another option would be to combine sections with the loop. This means nesting parallelism, which one should be very careful with, but it should work:
#pragma omp parallel sections num_threads(2)
{
    #pragma omp section
    { /* work for one thread */ }
    #pragma omp section
    {
        #pragma omp parallel for num_threads(8) schedule(static, 1)
        for (int i = 0; i < N; ++i) { /* ... */ }
    }
}
The main problem here is that one of those sections will most likely take much longer than the other, meaning that in the worst case (the loop finishing faster than the first section) all but one thread are doing nothing most of the time.
If you really need the master thread to be outside the parallel region this might work (not tested):
#pragma omp parallel num_threads(2)
{
    #pragma omp master
    { /* work for master thread, other thread is NOT waiting */ }
    #pragma omp single
    {
        #pragma omp parallel for num_threads(8) schedule(static, 1)
        for (int i = 0; i < N; ++i) { /* ... */ }
    }
}
There is no guarantee that the master thread won't be computing the single region as well, but if your cores aren't over-occupied it should at least be unlikely. One could even argue that if the second thread from the outer parallel region doesn't reach the single region in time, it is better that the master thread also has a chance of going in there, even if that means that the second thread doesn't get anything to do.
As the single region should only have an implicit barrier at its end, while the master region doesn't contain any implicit barriers, they should potentially be executed in parallel as long as the master region comes before the single region. This assumes that the single region is well implemented, such that every thread has a chance of computing it. This isn't guaranteed by the standard, I think.
EDIT:
These solutions require nested parallelism to work, which is disabled by default in most implementations. It can be activated via the environment variable OMP_NESTED or by calling omp_set_nested().
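For instance, a minimal sketch of enabling it in code (note that omp_set_nested() has since been deprecated in favour of omp_set_max_active_levels(); which call you want depends on your OpenMP version):
#include <omp.h>

int main(void)
{
    omp_set_nested(1);               /* or set OMP_NESTED=true in the environment */
    omp_set_max_active_levels(2);    /* allow two levels of active parallelism */
    /* ... outer parallel region containing an inner parallel for ... */
    return 0;
}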

Execute for loop iterations in openmp in order with dynamic schedule

I'd like to run a for loop in openmp with dynamic schedule.
#pragma omp for schedule(dynamic,chunk) private(i) nowait
for(i=0;i<n;i++){
//loop code here
}
and I'd like to have each thread executing ordered chunks such that
e.g. thread 1 -> iterations 0 to k
thread2 -> iterations k+1->k+chunk
etc..
Static schedule partly does what I want but I'd like to dynamically load balance the iterations.
Neither does the ordered clause do what I want, if I understood correctly what it does.
My question is how to make sure that the chunks assigned are ordered chunks?
I am using openmp 3.1 with gcc
You can implement this yourself without resorting to omp for, which many experienced OpenMP programmers regard as a convenience construct.
The following roughly illustrates what you might do. Please check the arithmetic carefully.
#pragma omp parallel
{
    int me = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int chunk = (n + nt - 1) / nt;   /* ceiling of n/nt, so all iterations are covered */
    int start = me * chunk;
    int end = (me + 1) * chunk;
    if (end > n) end = n;
    for (int i = start; i < end; i++) {
        /* do work */
    }
} /* end parallel */
This does not do any dynamic load-balancing. You can do that yourself by assigning loop iterations unevenly to threads if you know the cost function a priori. You might also read up on the inspector-executor model.
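If you do want dynamic load balancing while still handing chunks out in increasing order, one possible sketch (not part of the answer above; process() is a hypothetical per-iteration function, and the atomic capture form needs OpenMP 3.1, which the question already assumes) is a shared chunk counter:
#include <omp.h>

void process(int i);                 /* hypothetical per-iteration work */

void run_ordered_dynamic(int n, int chunk)
{
    int next = 0;                    /* shared counter for the next chunk start */

    #pragma omp parallel shared(next)
    {
        for (;;) {
            int start;
            #pragma omp atomic capture
            { start = next; next += chunk; }   /* claim the next chunk, in order */

            if (start >= n)
                break;

            int end = start + chunk;
            if (end > n) end = n;

            for (int i = start; i < end; i++)
                process(i);
        }
    }
}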

OpenMP average of an array

I'm trying to learn OpenMP for a program I'm writing. For part of it I'm trying to implement a function to find the average of a large array. Here is my code:
double mean(double* mean_array){
    double mean = 0;
    omp_set_num_threads( 4 );
    #pragma omp parallel for reduction(+:mean)
    for (int i=0; i<aSize; i++){
        mean = mean + mean_array[i];
    }
    printf("hello %d\n", omp_get_thread_num());
    mean = mean/aSize;
    return mean;
}
However if I run the code it runs slower than the sequential version. Also for the print statement I get:
hello 0
hello 0
This doesn't make much sense to me; shouldn't there be 4 hellos?
Any help would be appreciated.
First, the reason you are not seeing 4 "hello"s is that the only part of the program executed in parallel is the so-called parallel region enclosed within a #pragma omp parallel. In your code that region is the for loop (since the omp parallel for directive is attached to the for statement); the printf is in the sequential part of the program.
Rewriting the code as follows would do the trick:
double mean = 0;
#pragma omp parallel num_threads(4)
{
    #pragma omp for reduction(+:mean)
    for (int i=0; i<aSize; i++) {
        mean += mean_array[i];
    }
    printf("hello %d\n", omp_get_thread_num());
}
mean /= aSize;   /* divide once, outside the parallel region */
Second, the fact that your program runs slower than the sequential version can depend on multiple factors. First of all, you need to make sure the array is large enough so that the overhead of creating the threads (which usually happens when the parallel region is created) is negligible. Also, for small arrays you may run into false-sharing issues, in which threads compete for the same cache line, causing performance degradation.
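As a rough illustration of the first point, a hypothetical driver could look like the following (it assumes aSize is a global, as the question implies, and that the corrected loop above is wrapped back into the mean() function from the question):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int aSize;

double mean(double *mean_array);     /* the corrected function above */

int main(void)
{
    aSize = 10000000;                /* large enough that thread startup is negligible */
    double *mean_array = malloc(aSize * sizeof *mean_array);
    for (int i = 0; i < aSize; i++)
        mean_array[i] = i % 10;

    double t0 = omp_get_wtime();
    double m = mean(mean_array);
    double t1 = omp_get_wtime();
    printf("mean = %f, took %f s\n", m, t1 - t0);

    free(mean_array);
    return 0;
}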

Specify which positions in an array a thread access

I'm trying to create a program that creates an array and, with OpenMP, assigns values to each position in that array. That would be trivial, except that I want to specify which positions each thread is responsible for.
For example, if I have an array of length 80 and 8 threads, I want to make sure that thread 0 only writes to positions 0-9, thread 1 to 10-19 and so on.
I'm very new to OpenMP, so I tried the following:
#include <omp.h>
#include <stdio.h>
#define N 80

int main (int argc, char *argv[])
{
    int nthreads = 8, tid, i, base, a[N];
    #pragma omp parallel
    {
        tid = omp_get_thread_num();
        base = ((float)tid/(float)nthreads) * N;
        for (i = 0; i < N/nthreads; i++) {
            a[base + i] = 0;
            printf("%d %d\n", tid, base+i);
        }
    }
    return 0;
}
This program, however, doesn't access all positions, as I expected. The output is different every time I run it, and it might be for example:
4 40
5 51
5 52
5 53
5 54
5 55
5 56
5 57
5 58
5 59
5 50
4 40
6 60
6 60
3 30
0 0
1 10
I think I'm missing a directive, but I don't know which one it is.
The way to ensure that things work the way you want is to have a loop of just 8 iterations as the outer (parallel) loop, and have each thread execute an inner loop which accesses just the right elements:
#pragma omp parallel for private(j)
for(i = 0; i < 8; i++) {
    for(j = 0; j < 10; j++) {
        a[10*i+j] = 0;
        printf("thread %d updated element %d\n", omp_get_thread_num(), 10*i+j);
    }
}
I was unable to test this right now but I'm 90% sure this does exactly what you want (and you have "complete control" over how things work when you do it like this). However, it may not be the most efficient thing to do. For one thing, when you just want to set a bunch of elements to zero, you want to use a built-in function like memset, not a loop...
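For example (assuming a is the int a[N] from the question, so an all-zero byte pattern really does represent the integer 0):
#include <string.h>    /* for memset */

/* inside main(), instead of the zeroing loop: */
memset(a, 0, sizeof a);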
You're missing a fair bit. The directive
#pragma omp parallel
only tells the run time that the following block of code is to be executed in parallel, essentially by all threads. But it doesn't specify that the work is to be shared out across threads, just that all threads are to execute the block. To share the work your code will need another directive, something like this
#pragma omp parallel
{
#pragma omp for
...
It's the for directive which distributes the work across threads.
However, you are making a mistake in the design of your program which is even more serious than your unfamiliarity with the syntax of OpenMP. Manual decomposition of work across threads, as you propose, is just what OpenMP is designed to help programmers avoid. By trying to do the decomposition yourself you are programming against the grain of OpenMP and run two risks:
Of getting things wrong; in particular of getting wrong matters that the compiler and run-time will get right with no effort or thought on your part.
Of carefully crafting a parallel program which runs more slowly than its serial equivalent.
If you want some control over the allocation of work to threads investigate the schedule clause. I suggest that you start your parallel region something like this (note that I am fusing the two directives into one statement):
#pragma omp parallel for default(none) shared(a)
for (i = 0; i < N; i++) {
    a[i] = 0;
}
Note also that I have specified the accessibility of variables. This is a good practice especially when learning OpenMP. The compiler will make i private automatically.
As I have written it, the run-time will divide the iterations over i into chunks, one for each thread. The first thread will get i = 0 .. N/num_threads - 1, the second i = N/num_threads .. 2*N/num_threads - 1, and so on.
Later you can add a schedule clause explicitly to the directive. What I have written above is equivalent to
#pragma omp parallel for default(none) shared(a) schedule(static)
but you can also experiment with
#pragma omp parallel for default(none) shared(a) schedule(dynamic,chunk_size)
and a number of other options which are well documented in the usual places.
#pragma omp parallel is not enough for the for loop to be parallelized.
Ummm... I noticed that you actually try to distribute the work by hand. The reason it does not work is most probably because of race conditions when computing the parameters for the for loop.
If I recall correctly, any variables declared outside of the parallel region are shared among threads by default. So ALL threads write to i, tid and base at once. You could make it work with appropriate private/shared clauses.
However, a better way is to let OpenMP distribute the work.
This is sufficient:
#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < N; i++) {
        a[i] = 0;
        printf("%d %d\n", tid, i);
    }
}
Note that private(tid) makes a local copy of tid for each thread, so threads do not overwrite each other's result of omp_get_thread_num(). It would also be possible to write shared(a) explicitly, because we want every thread to work on the same array; that is already the default here. The iterator i of the loop associated with the omp for construct is made private automatically, even though it is declared outside the parallel region, although it does no harm to list it explicitly; the one thing you should not do is try to make it shared.
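For reference, a fully explicit variant (not in the original answer) that spells out every data-sharing attribute, dropped into the question's main(), could look like this:
#pragma omp parallel default(none) shared(a) private(tid, i)
{
    tid = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < N; i++) {
        a[i] = 0;
        printf("%d %d\n", tid, i);
    }
}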
EDIT: I noticed the original underlying problem, so I took out the irrelevant parts.

Openmp: increase for loop iteration number

I have this parallel for loop
struct p
{
    int n;
    double *l;
} p;

#pragma omp parallel for default(none) private(i) shared(p)
for (i = 0; i < p.n; ++i)
{
    DoSomething(p, i);
}
Now, it is possible that inside DoSomething(), p.n is increased because new elements are added to p.l. I'd like to process these new elements in a parallel fashion as well. The OpenMP manual states that parallel for can't be used with lists, so DoSomething() currently adds these new elements of p.l to another list, which is processed sequentially and then joined back with p.l. I don't like this workaround. Does anyone know a cleaner way to do this?
OpenMP 3.0 added a construct to support dynamic execution: the task construct. Tasks are added to a queue and then executed as concurrently as possible. Sample code would look like this:
#pragma omp parallel private(i)
{
    #pragma omp single
    for (i = 0; i < p.n; ++i)
    {
        #pragma omp task
        DoSomething(p, i);
    }
}
This will spawn a new parallel region. One of the threads will execute the for loop and create a new OpenMP task for each value of i. Each DoSomething() call will be converted to a task and will later execute inside an idle thread. There is a problem though: if one of the tasks adds new values to p.l, it might happen after the creator thread has already exited the for loop. This could be fixed using task synchronisation constructs and an outer loop like this:
#pragma omp single
{
    i = 0;
    while (i < p.n)
    {
        for (; i < p.n; ++i)
        {
            #pragma omp task
            DoSomething(p, i);
        }
        #pragma omp taskwait
        #pragma omp flush
    }
}
The taskwait construct makes the thread wait until all queued tasks have executed. If new elements were added to the list, the condition of the while becomes true again and a new round of task creation happens. The flush construct is supposed to synchronise the memory view between threads, e.g. to update optimised register variables with the values from the shared storage.
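One detail the snippets above leave open is that several tasks may try to grow p.l at the same time, so that growth has to be protected. A minimal sketch (append_element is a hypothetical helper, not part of the question or the answer) could be:
#include <stdlib.h>

struct p { int n; double *l; };      /* same layout as in the question */

/* called from inside a task whenever DoSomething() produces a new element;
   the named critical section serialises the growth of l and n */
void append_element(struct p *s, double value)
{
    #pragma omp critical(list_growth)
    {
        s->l = realloc(s->l, (s->n + 1) * sizeof *s->l);
        s->l[s->n] = value;
        s->n++;   /* the while (i < p.n) test above picks this up after the taskwait */
    }
}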
OpenMP 3.0 is supported by all modern C compilers except MSVC, which is stuck at OpenMP 2.0.
