Here is a simple recursive program that splits into two for every recursive call. As expected, the result is 2 + 4 + 8 calls to rec, but the number of threads reported is always the same: two, and the ids bounce back and forth between 0 and 1. I expected each recursive call to retain its id, and that 8 threads would exist in the end. What exactly is going on? Is there any issue with the code?
#include <stdio.h>
#include <omp.h>

void rec(int n) {
    if (n == 0)
        return;
    #pragma omp parallel num_threads(2)
    {
        printf("Currently at %d -- total %d\n", omp_get_thread_num(), omp_get_num_threads());
        rec(n - 1);
    }
}

int main() {
    omp_set_nested(1);
    rec(3);
}
Your code is working as expected by the OpenMP standard. In the OpenMP documentation you can find the following about omp_get_num_threads:
Summary: The omp_get_num_threads routine returns the number of threads
in the current team.
Binding: The binding region for an omp_get_num_threads region is the
innermost enclosing parallel region.
Effect: The omp_get_num_threads routine returns the number of threads
in the team that is executing the parallel region to which the routine
region binds. If called from the sequential part of a program, this
routine returns 1.
omp_get_thread_num has the same binding region:
The binding region for an omp_get_thread_num region is the innermost
enclosing parallel region.
It means that omp_get_num_threads and omp_get_thread_num bind to the innermost parallel region only, so it does not matter how many nested parallel regions are used. Each of your parallel regions is defined by #pragma omp parallel num_threads(2), therefore the return value of omp_get_num_threads is 2 (as long as you have enough threads available) and the return value of omp_get_thread_num is either 0 or 1.
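If you want to see the whole nesting picture rather than just the innermost team, the runtime also provides omp_get_level and omp_get_ancestor_thread_num. Here is a minimal sketch based on your program (only the printed line changes; treat it as an illustration rather than the only way to do this):

#include <stdio.h>
#include <omp.h>

void rec(int n) {
    if (n == 0)
        return;
    #pragma omp parallel num_threads(2)
    {
        // omp_get_thread_num() identifies a thread only within the innermost team;
        // the nesting level plus the ancestor ids recover the position in the
        // whole tree of nested regions.
        int level = omp_get_level();
        printf("level %d, id in innermost team %d, id of ancestor at level 1: %d\n",
               level, omp_get_thread_num(), omp_get_ancestor_thread_num(1));
        rec(n - 1);
    }
}

int main() {
    omp_set_nested(1); // deprecated since OpenMP 5.0 in favor of omp_set_max_active_levels()
    rec(3);
}

At the deepest level this prints eight lines, one per leaf thread, which matches the 8 threads you expected, even though each line still reports an innermost id of 0 or 1.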
Related
In the OpenMP Specification, the following restriction is posed for a barrier construct: (see p. 259, lines 30-31):
Each barrier region must be encountered by all threads in a team or by
none at all, unless cancellation has been requested for the innermost
enclosing parallel region.
Just for completeness, the definition of a region in the OpenMP specification is as follows (cf. p.5, lines 9 ff.):
region
All code encountered during a specific instance of
the execution of a given construct, structured block sequence or
OpenMP library routine. A region includes any code in called routines
as well as any implementation code. [...]
I came up with a very simple example and I am asking myself whether it is at all valid, because the barriers are placed inside if-conditions (and not every barrier is "seen" by each thread). Nevertheless, the number of barriers is identical for each thread and experiments with two compilers show that the code works as expected.
#include <stdio.h>
#include <unistd.h>
#include <stdarg.h>
#include <sys/time.h>
#include "omp.h"

double zerotime;

double gettime(void) {
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + t.tv_usec * 1e-6;
}

void print(const char *format, ...) {
    va_list args;
    va_start(args, format);
    #pragma omp critical
    {
        fprintf(stdout, "Time = %1.1lfs ", gettime() - zerotime);
        vfprintf(stdout, format, args);
    }
    va_end(args);
}

void barrier_test_1(void) {
    for (int i = 0; i < 5; i++) {
        if (omp_get_thread_num() % 2 == 0) {
            print("Path A: Thread %d waiting\n", omp_get_thread_num());
            #pragma omp barrier
        } else {
            print("Path B: Thread %d waiting\n", omp_get_thread_num());
            sleep(1);
            #pragma omp barrier
        }
    }
}

int main() {
    zerotime = gettime();
    #pragma omp parallel
    {
        barrier_test_1();
    }
    return 0;
}
For four threads I get the following output:
Time = 0.0s Path B: Thread 1 waiting
Time = 0.0s Path B: Thread 3 waiting
Time = 0.0s Path A: Thread 0 waiting
Time = 0.0s Path A: Thread 2 waiting
Time = 1.0s Path B: Thread 1 waiting
Time = 1.0s Path B: Thread 3 waiting
Time = 1.0s Path A: Thread 2 waiting
Time = 1.0s Path A: Thread 0 waiting
Time = 2.0s Path B: Thread 1 waiting
Time = 2.0s Path B: Thread 3 waiting
Time = 2.0s Path A: Thread 0 waiting
Time = 2.0s Path A: Thread 2 waiting
...
which shows that all the threads nicely wait for the slow Path B operation and pair up even though they are not placed in the same branch.
However, I am still confused by the specification as to whether my code is at all valid.
Contrast this e.g. with CUDA where the following statement is given regarding the related __syncthreads() routine:
__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block,
otherwise the code execution is likely to hang or produce unintended
side effects.
Thus, in CUDA, such code as written above in terms of __syncthreads() would be invalid, because the condition omp_get_thread_num() % 2 == 0 evaluates differently depending on the thread.
Follow-up Question:
While I am quite ok with the conclusion that the code above is not following the specification, a slight modification of the code could be as follows, where barrier_test_1() is replaced by barrier_test_2():
void call_barrier(void) {
    #pragma omp barrier
}

void barrier_test_2(void) {
    for (int i = 0; i < 5; i++) {
        if (omp_get_thread_num() % 2 == 0) {
            print("Path A: Thread %d waiting\n", omp_get_thread_num());
            call_barrier();
        } else {
            print("Path B: Thread %d waiting\n", omp_get_thread_num());
            sleep(1);
            call_barrier();
        }
    }
}
We recognize that we now have only a single barrier placed inside the code, and this one is visited by all threads in the team. While the above code would still be invalid in the CUDA case, I am still unsure about OpenMP. I think it boils down to the question of what actually constitutes the barrier region: is it just the line in the code, or is it all code that has been traversed between subsequent barriers? This is also the reason why I looked up the definition of a region in the specification. More precisely, as far as I can see there is no code encountered during a specific instance of the execution of <the barrier construct>, which is due to the statement about stand-alone directives in the spec (p.45, lines 3+5)
Stand-alone directives are executable directives that have no
associated user code.
and
Stand-alone directives do not have any associated executable user
code.
and since (p.258 line 9)
The barrier construct is a stand-alone directive.
Maybe the following part of the spec is also of interest (p.259, lines 32-33):
The sequence of worksharing regions and barrier regions encountered
must be the same for every thread in a team.
Preliminary Conclusion:
We can wrap a barrier into a single function as above and replace all barriers by a call to that wrapper function, which causes:
All threads either continue executing user code or wait at the barrier
If the wrapper is called by only a subset of threads, this will cause a deadlock but will not lead to undefined behavior
Between calls to the wrapper, the number of barriers met is identical among the threads
Essentially this means we can safely synchronize and cut across different execution paths by the use of such a wrapper
Am I correct?
In the OpenMP Specification, the following restriction is posed for a
barrier construct: (see p. 259, lines 30-31):
Each barrier region must be encountered by all threads in a team or by
none at all, unless cancellation has been requested for the innermost
enclosing parallel region.
That description is a bit problematic because barrier is a stand-alone directive. That means it has no associated code other than the directive itself, and therefore there is no such thing as a "barrier region".
Nevertheless, I think the intent is clear, both from the wording itself and from the conventional behavior of barrier implementations: absent any cancellation, if any thread in a team executing the innermost parallel region containing a given barrier construct reaches that barrier, then all threads in the team must reach that same barrier construct. Different barrier constructs represent different barriers, each requiring all threads to arrive before any proceed past.
However, I am still confused by the specification as to whether my code is at all valid.
I see that the behavior of your test code suggests that the two barriers are being treated as a single one. This is irrelevant to interpreting the specification, however, because your code indeed does not satisfy the requirement you asked about. The spec does not require the program to fail in any particular way in this case, but it certainly does not require the behavior you observe, either. You might well find that the program behaves differently with a different version of the compiler or a different OpenMP implementation. The compiler is entitled to assume that your OpenMP code conforms to the OpenMP spec.
Of course, in the case of your particular example, the solution is to replace the two barrier constructs in the different conditional branches with a single one immediately following the else block.
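For instance, a conforming variant of your barrier_test_1 (reusing your print and sleep helpers; barrier_test_fixed is just an illustrative name) could hoist the barrier out of the conditional:

void barrier_test_fixed(void) {
    for (int i = 0; i < 5; i++) {
        if (omp_get_thread_num() % 2 == 0) {
            print("Path A: Thread %d waiting\n", omp_get_thread_num());
        } else {
            print("Path B: Thread %d waiting\n", omp_get_thread_num());
            sleep(1);
        }
        // One barrier construct, reached exactly once per iteration by every
        // thread in the team, regardless of which branch it took.
        #pragma omp barrier
    }
}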
My code:
#include <cstdio>
#include "omp.h"

int main() {
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp parallel for
        for (int i = 0; i < 6; i++)
        {
            printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
        }
    }
    return 0;
}
The output that I am getting:
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 2, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 2, I am Thread 0
i = 2, I am Thread 0
i = 3, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 2, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
Adding "parallel" is the cause of the problem, but I don't know how to explain it.
My question is: why does only the main thread appear, and why does it run the for loop four times?
By default, nested parallelism is disabled. Nonetheless, you can explicitly enable nested parallelism, by either:
omp_set_nested(1);
or by setting the OMP_NESTED environment variable to true.
Also, from the OpenMP standard we know that:
When a thread encounters a parallel construct, a team of threads is
created to execute the parallel region.
The thread that encountered the parallel construct becomes
the master thread of the new team, with a thread number of zero for
the duration of the new parallel region. All threads in the new team,
including the master thread, execute the region. Once the team is
created, the number of threads in the team remains constant for the
duration of that parallel region.
From this source you can read the following:
OpenMP parallel regions can be nested inside each other. If nested
parallelism is disabled, then the new team created by a thread
encountering a parallel construct inside a parallel region consists
only of the encountering thread. If nested parallelism is enabled,
then the new team may consist of more than one thread.
This explains why, when you add the second parallel region, there is only one thread per team executing the enclosed code (i.e., the for loop). In other words, from the first parallel region, 4 threads are created; each of those threads, when encountering the second parallel region, creates a new team and becomes the master of that team (i.e., has ID=0 within the newly created team). However, because you did not explicitly enable nested parallelism, each of those teams is composed of a single thread. Hence, 4 teams with one thread each will execute the for loop. Consequently, you will have the following statement:
printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
being printed 6 x 4 = 24 times (i.e., the total number of loop iterations multiplied by the total number of threads across the 4 teams). The image below provides a visualization of that flow.
If you add a printf statement between the first and the second parallel region, as follows:
int main() {
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        printf("Before nested parallel region: I am Thread{%d}\n", omp_get_thread_num());
        #pragma omp parallel for // Adding "parallel" is the cause of the problem, but I don't know how to explain it.
        for (int i = 0; i < 6; i++)
        {
            printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
        }
    }
    return 0;
}
You would get something similar to the following output (bear in mind that the order in which the first 4 lines are printed is nondeterministic).
Before nested parallel region: I am Thread{1}
Before nested parallel region: I am Thread{0}
Before nested parallel region: I am Thread{2}
Before nested parallel region: I am Thread{3}
i = 0, I am Thread 0
i = 0, I am Thread 0
i = 0, I am Thread 0
(...)
i = 5, I am Thread 0
Meaning that within the first parallel region (but still outside of the second parallel region) there is a single team of 4 threads -- with IDs varying from 0 to 3 -- executing in parallel. Hence, each of those threads will execute the printf statement:
printf("I am Thread outside the nested region {%d}\n", omp_get_thread_num());
and display a different value for the omp_get_thread_num() method call.
As previously mentioned, nested parallelism is disabled. Thus, when each of those threads encounters the second parallel region, it creates a new team and becomes the master -- and the only member -- of that team (i.e., it has ID=0 within the newly created team). Hence the statement
printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
inside the loop always outputs (..) I am Thread 0, since omp_get_thread_num() in this context always returns 0. However, even though omp_get_thread_num() returns 0, it does not imply that the code is being executed sequentially (by a single thread with ID=0), but rather that the master of each of the 4 teams is reporting its own ID=0.
If you enable nested parallelism, you will have a flow like the one shown in the image below:
The execution of threads 1 to 3 was omitted for simplicity's sake; nonetheless, it would have been the same as for thread 0.
So, from the first parallel region, a team of 4 threads is created. On encountering the next parallel region, each thread from that team will create a new team of 4 threads, so at that moment we have a total of 16 threads across the 4 inner teams. Finally, each team will execute the entire for loop. However, because you have a #pragma omp parallel for construct, the iterations of the for loop will be divided among the threads within each team.
Bear in mind that in the image above I am assuming a certain static distribution of loop iterations among the threads; I am not implying that the loop iterations will always be divided like this across all implementations of the OpenMP standard.
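If you do want nested teams, a minimal sketch of your program with nesting enabled (using the deprecated but still widely supported omp_set_nested; the exact scheduling of iterations remains implementation-dependent) might look like this:

#include <cstdio>
#include "omp.h"

int main() {
    omp_set_num_threads(4);
    omp_set_nested(1); // or set the OMP_NESTED environment variable to true
    #pragma omp parallel
    {
        #pragma omp parallel for
        for (int i = 0; i < 6; i++)
        {
            // With nesting enabled, every outer thread forms its own inner team,
            // so the inner thread numbers are no longer all 0.
            printf("i = %d, outer thread %d, inner thread %d\n",
                   i, omp_get_ancestor_thread_num(1), omp_get_thread_num());
        }
    }
    return 0;
}

Each of the 4 inner teams then splits the 6 iterations among its own threads, again giving 24 printed lines in total.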
I am trying to learn OpenMP, and have stumbled upon the fact that threads do not retain their own data when executing tasks, but they rather have a copy of the data of the thread which has generated the task. Let me demonstrate it with an example:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main()
{
    #pragma omp parallel num_threads(4)
    {
        int thread_id = omp_get_thread_num();
        #pragma omp single
        {
            printf("Thread ID of the #single: %d\n", omp_get_thread_num());
            for (int i = 0; i < 10; i++) {
                #pragma omp task
                {
                    sleep(1);
                    printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
                }
            }
        }
    }
    return 0;
}
An example output of this code is as follows:
Thread ID of the #single: 1
thread_id, ID of the executing thread: 1, 2
thread_id, ID of the executing thread: 1, 0
thread_id, ID of the executing thread: 1, 3
thread_id, ID of the executing thread: 1, 1
...
It is evident that the thread_id within the task refers to a copy that is assigned to the thread_id of the thread that has created the task (i.e. the one running the single portion of the code).
What if I wanted to refer to the executing thread's own private variables then? Are they unrecoverably shadowed? Is there a clause to make this code instead output the same number twice at the end of each line?
I am trying to learn OpenMP, and have stumbled upon the fact that
threads do not retain their own data when executing tasks, but they
rather have a copy of the data of the thread which has generated the
task.
"[T]hreads do not retain their own data" is an odd way to describe it. Attributing data ownership to threads themselves instead of to the tasks they are performing is perhaps the key conceptual problem here. It is absolutely natural and to be expected that a thread performing a given task operates with and on the data environment of that task.
But if you're not accustomed to explicit tasks, then it is understandable that you've gotten away so far without appreciating the distinction here. The (many) constructs that give rise to implicit tasks are generally structured in ways that are not amenable to detecting the difference.
So with your example, yes,
the thread_id within the task refers to a copy that
is assigned to the thread_id of the thread that has created the task
(i.e. the one running the single portion of the code).
Although it may not be immediately obvious, that follows from the OMP specification:
When a thread encounters a task construct, an explicit task is
generated from the code for the associated structured-block. The data
environment of the task is created according to the data-sharing
attribute clauses on the task construct, per-data environment ICVs,
and any defaults that apply.
(OMP 5.0 Specification, section 2.10.1; emphasis added)
The only way that can be satisfied is if the task closes over any shared data from the context of its declaration, which is indeed what you observe. Moreover, this is typically what one wants -- the data on which a task is to operate should be established at the point of and by the context of its declaration, else how would one direct what a task is to do?
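To make the capture explicit, here is a sketch of how the default data-sharing rules apply to the question's thread_id (the clauses used are standard OpenMP; the interleaving of the output will of course vary):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        int thread_id = omp_get_thread_num(); // private to each implicit task
        #pragma omp single
        {
            // thread_id is private in the enclosing context, so the task
            // captures it firstprivate by default: a copy of the single
            // thread's value travels with the task.
            #pragma omp task
            printf("default (firstprivate) capture: %d, executed by %d\n",
                   thread_id, omp_get_thread_num());

            // shared(thread_id) instead refers back to the creating thread's
            // variable; it still does not yield the executing thread's own id.
            #pragma omp task shared(thread_id)
            printf("shared capture: %d, executed by %d\n",
                   thread_id, omp_get_thread_num());

            #pragma omp taskwait // wait for both tasks before leaving the single region
        }
    }
    return 0;
}

Either way, the first number printed is the id of the thread that created the tasks, which is exactly the behavior asked about.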
What if I wanted to refer to the executing thread's own private variables then?
Threads do not have variables, at least not in the terminology of OMP. Those belong to the "data environment" of whatever tasks they are executing at any given time.
Are they unrecoverably shadowed?
When a thread is executing a given task, it accesses the data environment of that task. That environment may include variables that are shared with other tasks, but only in that sense can it access the variables of another task. "Unrecoverably shadowed" is not the wording I would use to describe the situation, but it gets the idea across.
Is there a clause to make this code instead output the same number twice at the end of each line?
There are ways to restructure the code to achieve that, but none of them are as simple as just adding a clause to the omp task directive. In fact, I don't think any of them involve explicit tasks at all. The most natural way to get that would be with a parallel loop:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 10; i++) {
        int thread_id = omp_get_thread_num();
        sleep(1);
        printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
    }
    return 0;
}
Of course, that also simplifies it to the point where it seems trivial, but perhaps that helps drive home the point. A large part of the purpose of declaring an explicit task is that that task may be executed by a different thread than the one that created it, which is exactly what you need to avoid to achieve the behavior you are asking for.
The problem is that here you create four parallel threads:
#pragma omp parallel num_threads(4)
and here, you restrict the further execution to one single thread
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
From now on, only the context of this single thread is used, hence the same instance of the variable thread_id is used. Here
for (int i = 0; i < 10; i++) {
#pragma omp task
{
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
you indeed distribute the loop iterations (as tasks) over the four threads, but based on the state of the single task (together with the corresponding instance of thread_id) to which you restricted execution above. So a first measure is to end the single section directly after the printf (before the loop iterations start):
int thread_id = omp_get_thread_num();
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
}
// Now outside the "single"
for (int i = 0; i < 10; i++) {
...
Now, for each iteration in the for loop, a task is created immediately. And this is performed for each of the four threads. So, you now have 40 tasks pending with
10 x thread_id == 0
10 x thread_id == 1
10 x thread_id == 2
10 x thread_id == 3
These tasks are now distributed amongst the threads arbitrarily. This is where the association between thread_id and the omp thread number gets lost. There is not much you can do about it, except for removing the
#pragma omp task
which leads to a similar result (with matching omp thread ids and thread_id numbers), but works a bit differently internally (the dissociation of the tasks from the omp threads does not take place).
Trying to get a simple OpenMP loop going, but I keep getting weird outputs. It doesn't list from 1 to 1000 straight, but goes from 501 to 750, then 1 to 1000. I'm guessing there's a threading issue? I'm compiling and running on VS2013.
#include <stdio.h>
#include <math.h>

int main(void)
{
    int counter = 0;
    double root = 0;

    // OPEN MP SECTION
    printf("Open MP section: \n\n");
    getchar(); //Pause

    #pragma omp parallel for
    for (counter = 0; counter <= 1000; counter++)
    {
        root = sqrt(counter);
        printf("The root of %d is %.2f\n", counter, root);
    }
    return(0);
}
The whole point of OpenMP is to run things in parallel, distributing work to different execution engines.
Hence, it's likely that the individual iterations of your loop are done out of order because that is the very nature of multi-threading.
While it may make sense for the calculations to be done in parallel (and hence possibly out of order), that's not really what you want for the printing of results.
One way to ensure the results are printed in the correct order is to defer the printing until after the parallel execution is complete. In other words, parallelise the calculation but serialise the output.
That of course means being able to store the information in, for example, an array, while the parallel operations are running.
In other words, something like:
// Make array instead of single value.
double root[1001];

// Parallelise just the calculation bit.
#pragma omp parallel for
for (counter = 0; counter <= 1000; counter++)
    root[counter] = sqrt(counter);

// Leave the output as a serial operation,
// once all parallel operations are done.
for (counter = 0; counter <= 1000; counter++)
    printf("The root of %d is %.2f\n", counter, root[counter]);
Store the results in an array and get the printf out of the loop. It has to serialize to the display.
Your code will not be run sequentially.
OpenMP Parallel Pragma:
#pragma omp parallel
{
    // Code inside this region runs in parallel.
    printf("Hello!\n");
}
This code creates a team of threads, and each thread executes the same code. It prints the text "Hello!" followed by a newline, as many times as there are threads in the team created. For a dual-core system, it will output the text twice. (Note: It may also output something like "HeHlellolo", depending on system, because the printing happens in parallel.) At the }, the threads are joined back into one, as if in non-threaded program.
http://bisqwit.iki.fi/story/howto/openmp/#ParallelPragma
I am trying to create an OpenMP program that will sequentially iterate through a loop. I realize threads are not intended for sequential programs -- I'm trying to either get a little speedup compared to a single thread, or at least keep the execution time similar to a single-threaded program.
Inside my #pragma omp parallel section, each thread computes its own section of a large array and gets the sum of that portion. These all may run in parallel. Then I want the threads to run in order, and each sum is added to the TotalSum IN ORDER. So thread 1 has to wait for thread 0 to complete, and so on. I have this part inside a #pragma omp critical section. Everything runs fine, except that only thread 0 is completing and then the program exits. How can I ensure that the other threads will keep polling? I've tried sleep() and while loops, but it continues to exit after thread 0 completes.
I am not using #pragma omp parallel for because I need to keep track of the specific ranges of the master array that each thread accesses. Here is a shortened version of the code section in question:
//DONE and MasterArray are global arrays. DONE keeps track of all the threads that have completed
int Function()
{
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        variables: start, end, i, j, temp (array) (all are initialized here)
        j = 0;
        for (i = start; i < end; i++)
        {
            if (i != start)
                temp[j] = MasterArray[i];
            else
                temp[j] = temp[j-1] + MasterArray[i];
            j++;
        }
        #pragma omp critical
        {
            while (DONE[ID] == 0 && ERROR == 0) {
                int size = sizeof(temp) / sizeof(temp[0]);
                if (ID == 0) {
                    Sum = temp[size];
                    DONE[ID] = 1;
                    if (some situation)
                        ERROR = 1; //there's an error and we need to exit the function and program
                }
                else if (DONE[ID-1] == 1) {
                    Sum = temp[size];
                    DONE[ID] = 1;
                    if (some situation)
                        ERROR = 1; //there's an error and we need to exit the function and program
                }
            }
        }
    }
    if (ERROR == 1)
        return(-1);
    else
        return(0);
}
this function is called from main after initializing the number of threads. It seems to me that the parallel portion completes, then we check for an error. If an error is found, the loop terminates. I realize something is wrong here, but I can't figure out what it is, and now I'm just going in circles. Any help would be great. Again, my problem is that the function exits after only thread 0 executes, but no error has been flagged. I have it running in pthreads too, but that has been simpler to execute.
Thanks!
Your attempt at ordering threads with #pragma omp critical is totally incorrect. There can be just one thread in a critical section at any time, and the order in which the threads arrive at the critical section is not determined. So in your code it can happen that e.g. thread #2 enters the critical section first and never leaves it, waiting for thread #1 to complete, while thread #1 and the rest are waiting at #pragma omp critical. And even if some threads, e.g. thread #0, are lucky enough to complete the critical section in the right order, they will wait on an implicit barrier at the end of the parallel region. In other words, a deadlock is almost guaranteed in this code.
I suggest you do something much simpler and natural to order your threads, namely an ordered section. It should look like this:
#pragma omp parallel
{
    int ID = omp_get_thread_num();

    // Computations done by each thread

    #pragma omp for ordered schedule(static,1)
    for (int t = 0; t < omp_get_num_threads(); ++t)
    {
        assert(t == ID);
        #pragma omp ordered
        {
            // Do the stuff you want to be in order
        }
    }
}
So you create a parallel loop with the number of iterations equal to the number of threads in the region. The schedule(static,1) clause makes it explicit that the iterations are assigned one per thread in the order of thread IDs, and the ordered clause allows ordered sections to be used inside the loop. Now in the body of the loop you put an ordered section (the block following #pragma omp ordered), and it will be executed in the order of the iterations, which is also the order of thread IDs (as ensured by the assertion).
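Applied to the original problem, a usage sketch might look like the following (TotalSum and my_sum are illustrative placeholders for your global sum and the per-thread partial sums):

#include <assert.h>
#include <stdio.h>
#include <omp.h>

int main(void) {
    double TotalSum = 0.0;
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        double my_sum = ID + 1.0; // stand-in for the partial sum each thread computes

        #pragma omp for ordered schedule(static,1)
        for (int t = 0; t < omp_get_num_threads(); ++t)
        {
            assert(t == ID);
            #pragma omp ordered
            {
                // Executed strictly in thread-id order: 0, then 1, then 2, ...
                TotalSum += my_sum;
                printf("thread %d added %.1f, TotalSum is now %.1f\n", ID, my_sum, TotalSum);
            }
        }
    }
    return 0;
}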
For more information, you may look at this question: How does the omp ordered clause work?