Accessing the executing thread's private variables within a task in OpenMP - c

I am trying to learn OpenMP, and have stumbled upon the fact that threads do not retain their own data when executing tasks, but they rather have a copy of the data of the thread which has generated the task. Let me demonstrate it with an example:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main()
{
#pragma omp parallel num_threads(4)
{
int thread_id = omp_get_thread_num();
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
for (int i = 0; i < 10; i++) {
#pragma omp task
{
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
}
}
}
return 0;
}
An example output of this code is as follows:
Thread ID of the #single: 1
thread_id, ID of the executing thread: 1, 2
thread_id, ID of the executing thread: 1, 0
thread_id, ID of the executing thread: 1, 3
thread_id, ID of the executing thread: 1, 1
...
It is evident that the thread_id within the task refers to a copy that is assigned to the thread_id of the thread that has created the task (i.e. the one running the single portion of the code).
What if I wanted to refer to the executing thread's own private variables then? Are they unrecoverably shadowed? Is there a clause to make this code output number, same number instead at the end of each line?

I am trying to learn OpenMP, and have stumbled upon the fact that
threads do not retain their own data when executing tasks, but they
rather have a copy of the data of the thread which has generated the
task.
"[T]hreads do not retain their own data" is an odd way to describe it. Attributing data ownership to threads themselves instead of to the tasks they are performing is perhaps the key conceptual problem here. It is absolutely natural and to be expected that a thread performing a given task operates with and on the data environment of that task.
But if you're not accustomed to explicit tasks, then it is understandable that you've gotten away so far without appreciating the distinction here. The (many) constructs that give rise to implicit tasks are generally structured in ways that are not amenable to detecting the difference.
So with your example, yes,
the thread_id within the task refers to a copy that
is assigned to the thread_id of the thread that has created the task
(i.e. the one running the single portion of the code).
Although it may not be immediately obvious, that follows from the OMP specification:
When a thread encounters a task construct, an explicit task is
generated from the code for the associated structured-block. The data
environment of the task is created according to the data-sharing
attribute clauses on the task construct, per-data environment ICVs,
and any defaults that apply.
(OMP 5.0 Specification, section 2.10.1; emphasis added)
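The only way that can be satisfied is if the task captures its data from the context in which it is generated, which is indeed what you observe. In your example, thread_id is private within the enclosing parallel region, so by default it is firstprivate on the task construct: each task receives a copy initialized with the value the variable had in the generating thread's data environment when the task was generated. Moreover, this is typically what one wants -- the data on which a task is to operate should be established at the point of and by the context of its declaration, else how would one direct what a task is to do?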
What if I wanted to refer to the executing thread's own private variables
then?
Threads do not have variables, at least not in the terminology of OMP. Those belong to the "data environment" of whatever tasks they are executing at any given time.
Are they unrecoverably shadowed?
When a thread is executing a given task, it accesses the data environment of that task. That environment may include variables that are shared with other tasks, but only in that sense can it access the variables of another task. "Unrecoverably shadowed" is not the wording I would use to describe the situation, but it gets the idea across.
Is there a clause to make this
code output number, same number instead at the end of each line?
There are ways to restructure the code to achieve that, but none of them are as simple as just adding a clause to the omp task directive. In fact, I don't think any of them involve explicit tasks at all. The most natural way to get that would be with a parallel loop:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main(void) {
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 10; i++) {
int thread_id = omp_get_thread_num();
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
return 0;
}
Of course, that also simplifies it to the point where it seems trivial, but perhaps that helps drive home the point. A large part of the purpose of declaring an explicit task is that that task may be executed by a different thread than the one that created it, which is exactly what you need to avoid to achieve the behavior you are asking for.

The problem is that here you create a team of four threads:
#pragma omp parallel num_threads(4)
and here, you restrict the further execution to one single thread
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
From now on, only the context of this single thread is used, hence the same instance of the variable thread_id is used. Here
for (int i = 0; i < 10; i++) {
#pragma omp task
{
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
you do indeed distribute the loop iterations across four threads, but based on the state of the single task (together with the corresponding instance of thread_id) to which you restricted execution above. So a first measure is to end the single section directly after the printf (before the loop iterations start):
int thread_id = omp_get_thread_num();
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
}
// Now outside the "single"
for (int i = 0; i < 10; i++) {
...
Now, for each iteration in the for loop, a task is created immediately. And this is performed for each of the four threads. So, you now have 40 tasks pending with
10 x thread_id == 0
10 x thread_id == 1
10 x thread_id == 2
10 x thread_id == 3
These tasks are now distributed amongst the threads arbitrarily. This is where the association between thread_id and the omp thread number gets lost. There is not much you can do about it, except for removing the
#pragma omp task
which leads to a similar result (with corresponding omp thread id and thread_id numbers), but works a bit differently internally (the dissociation of the tasks and the omp threads does not take place).

Related

How do OpenMP thread ids work with recursion?

Here is a simple recursive program that splits into two for every recursive call. As expected, the result is 2 + 4 + 8 calls to rec, but the number of threads is always the same: two, and the ids bounce back and forth between 0 and 1. I expected each recursive call to retain the id, and that 8 threads would be made in the end. What exactly is going on? Is there any issue with the code?
#include <stdio.h>
#include <omp.h>
void rec(int n) {
if (n == 0)
return;
#pragma omp parallel num_threads(2)
{
printf("Currently at %d -- total %d\n", omp_get_thread_num(), omp_get_num_threads());
rec(n - 1);
}
}
int main() {
omp_set_nested(1);
rec(3);
}
Your code is working as expected by the OpenMP standard. In the OpenMP documentation you can find the following about omp_get_num_threads:
Summary: The omp_get_num_threads routine returns the number of threads
in the current team.
Binding: The binding region for an omp_get_num_threads region is the
innermost enclosing parallel region.
Effect: The omp_get_num_threads routine returns the number of threads
in the team that is executing the parallel region to which the routine
region binds. If called from the sequential part of a program, this
routine returns 1.
omp_get_thread_num has the same binding region:
The binding region for an omp_get_thread_num region is the innermost
enclosing parallel region.
It means that omp_get_num_threads and omp_get_thread_num bind to the innermost parallel region only, so it does not matter how many nested parallel regions are used. Each of your parallel regions is defined by #pragma omp parallel num_threads(2), therefore the return value of omp_get_num_threads is 2 (as long as you have enough threads available) and the return value of omp_get_thread_num is either 0 or 1.
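If a number that distinguishes threads across nesting levels is wanted, one possible approach is to combine the thread numbers of all enclosing parallel regions. The sketch below is only an illustration: nested_path_id is a hypothetical helper, not part of OpenMP, while omp_get_level, omp_get_team_size and omp_get_ancestor_thread_num are standard runtime routines. The serial fallbacks are an assumption added so that the sketch also builds without OpenMP.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP support. */
static int omp_get_level(void) { return 0; }
static int omp_get_team_size(int level) { (void)level; return 1; }
static int omp_get_ancestor_thread_num(int level) { (void)level; return 0; }
#endif

/* Folds the thread numbers of all enclosing parallel regions, outermost
   first, into one number. Returns 0 when called from sequential code. */
int nested_path_id(void)
{
    int id = 0;
    int depth = omp_get_level();
    for (int level = 1; level <= depth; level++) {
        int team = omp_get_team_size(level);
        int num = omp_get_ancestor_thread_num(level);
        assert(team >= 1 && num >= 0 && num < team);
        id = id * team + num;
    }
    return id;
}

/* Same recursion as in the question, printing the combined id instead. */
void rec(int n)
{
    if (n == 0)
        return;
    #pragma omp parallel num_threads(2)
    {
        printf("level %d, path id %d\n", omp_get_level(), nested_path_id());
        rec(n - 1);
    }
}
```

The combined number is unique only among threads at the same nesting level, so it is printed together with the level. How many threads the inner regions actually get still depends on whether nested parallelism is enabled in the implementation being used.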

OpenMP: Is a barrier inside conditional code valid?

In the OpenMP Specification, the following restriction is posed for a barrier construct: (see p. 259, lines 30-31):
Each barrier region must be encountered by all threads in a team or by
none at all, unless cancellation has been requested for the innermost
enclosing parallel region.
Just for completeness, the definition of a region by OpenMP specification is as follows (cf. p.5, lines 9 ff.):
region
All code encountered during a specific instance of
the execution of a given construct, structured block sequence or
OpenMP library routine. A region includes any code in called routines
as well as any implementation code. [...]
I came up with a very simple example and I am asking myself whether it is at all valid, because the barriers are placed inside if-conditions (and not every barrier is "seen" by each thread). Nevertheless, the number of barriers is identical for each thread and experiments with two compilers show that the code works as expected.
#include <stdio.h>
#include <unistd.h>
#include <stdarg.h>
#include <sys/time.h>
#include "omp.h"
double zerotime;
double gettime(void) {
struct timeval t;
gettimeofday(&t, NULL);
return t.tv_sec + t.tv_usec * 1e-6;
}
void print(const char *format, ...) {
va_list args;
va_start (args, format);
#pragma omp critical
{
fprintf(stdout, "Time = %1.1lfs ", gettime() - zerotime);
vfprintf (stdout, format, args);
}
va_end (args);
}
void barrier_test_1(void) {
for (int i = 0; i < 5; i++) {
if (omp_get_thread_num() % 2 == 0) {
print("Path A: Thread %d waiting\n", omp_get_thread_num());
#pragma omp barrier
} else {
print("Path B: Thread %d waiting\n", omp_get_thread_num());
sleep(1);
#pragma omp barrier
}
}
}
int main() {
zerotime = gettime();
#pragma omp parallel
{
barrier_test_1();
}
return 0;
}
For four threads I get the following output:
Time = 0.0s Path B: Thread 1 waiting
Time = 0.0s Path B: Thread 3 waiting
Time = 0.0s Path A: Thread 0 waiting
Time = 0.0s Path A: Thread 2 waiting
Time = 1.0s Path B: Thread 1 waiting
Time = 1.0s Path B: Thread 3 waiting
Time = 1.0s Path A: Thread 2 waiting
Time = 1.0s Path A: Thread 0 waiting
Time = 2.0s Path B: Thread 1 waiting
Time = 2.0s Path B: Thread 3 waiting
Time = 2.0s Path A: Thread 0 waiting
Time = 2.0s Path A: Thread 2 waiting
...
which shows that all the threads nicely wait for the slow Path B operation and pair up even though they are not placed in the same branch.
However, I am still confused from the specification, whether my code is at all valid.
Contrast this e.g. with CUDA where the following statement is given regarding the related __syncthreads() routine:
__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block,
otherwise the code execution is likely to hang or produce unintended
side effects.
Thus, in CUDA, such code as written above in terms of __syncthreads() would be invalid, because the condition omp_get_thread_num() % 2 == 0 evaluates differently depending on the thread.
Follow-up Question:
While I am quite ok with the conclusion that the code above is not following the specification, a slight modification of the code could be as follows, where barrier_test_1() is replaced by barrier_test_2():
void call_barrier(void) {
#pragma omp barrier
}
void barrier_test_2(void) {
for (int i = 0; i < 5; i++) {
if (omp_get_thread_num() % 2 == 0) {
print("Path A: Thread %d waiting\n", omp_get_thread_num());
call_barrier();
} else {
print("Path B: Thread %d waiting\n", omp_get_thread_num());
sleep(1);
call_barrier();
}
}
}
We recognize, that we have only a single barrier placed inside the code and this one is visited by all threads in the team. While the above code would be still invalid in the CUDA case, I am still unsure about OpenMP. I think it boils down to the question what actually constitutes the barrier region, is it just the line in the code or is it all code which has been traversed between subsequent barriers? This is also the reason, why I looked up the definition of a region in the specification. More precisely, as far as I can see there is no code encountered during a specific instance of the execution of <the barrier construct>, which is due to the statement about stand-alone directives in the spec (p.45, lines 3+5)
Stand-alone directives are executable directives that have no
associated user code.
and
Stand-alone directives do not have any associated executable user
code.
and since (p.258 line 9)
The barrier construct is a stand-alone directive.
Maybe the following part of the spec is also of interest (p.259, lines 32-33):
The sequence of worksharing regions and barrier regions encountered
must be the same for every thread in a team.
Preliminary Conclusion:
We can wrap a barrier into a single function as above and replace all barriers by a call to the wrapper function which causes:
All threads either continue executing user code or wait at the barrier
If we call the wrapper only by a subset of threads, this will cause a deadlock but will not lead to undefined behavior
Between calls to the wrapper, the number of met barriers is identical among the threads
Essentially this means, we can safely synchronize and cut through different execution paths by the use of such wrapper
Am I correct?
In the OpenMP Specification, the following restriction is posed for a
barrier construct: (see p. 259, lines 30-31):
Each barrier region must be encountered by all threads in a team or by
none at all, unless cancellation has been requested for the innermost
enclosing parallel region.
That description is a bit problematic because barrier is a stand-alone directive. That means it has no associated code other than the directive itself, and therefore there is no such thing as a "barrier region".
Nevertheless, I think the intent is clear, both from the wording itself and from the conventional behavior of barrier implementations: absent any cancellation, if any thread in a team executing the innermost parallel region containing a given barrier construct reaches that barrier, then all threads in the team must reach that same barrier construct. Different barrier constructs represent different barriers, each requiring all threads to arrive before any proceed past.
However, I am still confused from the specification, whether my code is at all valid.
I see that the behavior of your test code suggests that the two barriers are being treated as a single one. This is irrelevant to interpreting the specification, however, because your code indeed does not satisfy the requirement you asked about. The spec does not require the program to fail in any particular way in this case, but it certainly does not require the behavior you observe, either. You might well find that the program behaves differently with a different version of the compiler or a different OpenMP implementation. The compiler is entitled to assume that your OpenMP code conforms to the OpenMP spec.
Of course, in the case of your particular example, the solution is to replace the two barrier constructs in the different conditional branches with a single one immediately following the else block.
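A minimal sketch of that restructuring follows, with the timing and printing of the original test replaced by a self-check. barrier_rounds_ok, arrived and MAX_ROUNDS are hypothetical names introduced for illustration, and the serial fallbacks are an assumption so that the sketch also builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define MAX_ROUNDS 16

/* Runs `rounds` iterations in which threads take different branches but
   all meet at one barrier construct. Returns 1 if, in every round, each
   thread saw that the whole team had arrived once it passed the barrier. */
int barrier_rounds_ok(int rounds)
{
    int arrived[MAX_ROUNDS] = {0};
    int ok = 1;
    assert(rounds >= 0 && rounds <= MAX_ROUNDS);

    #pragma omp parallel shared(arrived, ok)
    {
        int me = omp_get_thread_num();
        int team = omp_get_num_threads();

        for (int i = 0; i < rounds; i++) {
            if (me % 2 == 0) {
                /* Path A: branch-specific work goes here. */
            } else {
                /* Path B: branch-specific work goes here. */
            }

            #pragma omp atomic update
            arrived[i]++;

            /* The only barrier construct; every thread of the team reaches it. */
            #pragma omp barrier

            int seen;
            #pragma omp atomic read
            seen = arrived[i];
            if (seen != team) {
                #pragma omp atomic write
                ok = 0;
            }
        }
    }
    return ok;
}
```

Because the barrier sits after the if/else rather than inside its branches, the same construct is encountered by every thread in every iteration, which is the situation the restriction describes.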

How to continue work with master while threads execute for loop iterations?

I am asked to write an OpenMP program in C such that the main thread distributes the work to other threads, and while they are working on their tasks, the main should periodically check whether they are done, and if not, it should increment a shared variable.
This is the function for the threads' task:
void work_together(int *a, int n, int number, int thread_count) {
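// i is declared in the loop header, so it is already private;
// listing it in a private clause here would not compile
# pragma omp parallel for num_threads(thread_count) \
shared(a, n, number) schedule(static, n/thread_count)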
for (long i=0; i<n; i++) {
// do a task, such as:
a[i] = a[i] * number;
}
}
And it gets called from main:
int main(int argc, char *argv[]) {
int n = atoi(argv[1]);
int arr[n];
initialize(arr, n);
// this will be the shared variable
int number = 2;
work_together(arr, n, number, thread_count);
//I want to write a function or an if to check whether threads are still working
/* if (threads_still_working()) {
number++;
sleep(100);
}
*/
printf("There are %d threads\n", omp_get_num_threads());
}
thread_count is initialized as 4, and I tried executing it for large n's (>10000), but the master thread will always wait for the other threads to finish executing the for loop, and will only continue the main when work_together() returns: the printf() will always print that there's only one thread running.
Now, what would be a way to check from the master thread whether the other threads are still running, and do some incrementing if they are?
From the OpenMP standard one can read:
When a thread encounters a parallel construct, a team of threads is
created to execute the parallel region. The thread that encountered
the parallel construct becomes the master thread of the new team, with
a thread number of zero for the duration of the new parallel region.
All threads in the new team, including the master thread, execute the
region. Once the team is created, the number of threads in the team
remains constant for the duration of that parallel region.
Consequently, with the directive #pragma omp parallel for num_threads(...), all threads in the team, including the master thread, will perform the parallel work (i.e., compute iterations of the loop), because the compiler automatically divides the iterations among all of them. That is not what you want. To get around this, you can implement part of the functionality of #pragma omp parallel for by hand and divide the iterations among the non-master threads yourself. The code would look like the following:
# pragma omp parallel num_threads(thread_count) shared(a, n, number)
{
int thread_id = omp_get_thread_num();
int total_threads = omp_get_num_threads();
if(thread_id != 0) // all threads but the master thread
{
thread_id--; // shift all the ids
total_threads = total_threads - 1;
for(long i = thread_id ; i < n; i += total_threads) {
// do a task, such as:
a[i] = a[i] * number;
}
}
}
First, we ensure that all threads except the master (i.e., if(thread_id != 0)) execute the loop to be parallelized, then we divide the iterations of the loop among the remaining threads (i.e., for(long i = thread_id ; i < n; i += total_threads)). I have chosen a static distribution with chunk=1; you can choose a different one, but you will have to adapt the loop accordingly.
Now you just need to add the logic to:
Now, what would be a way to check from the master thread whether the
other threads are still running, and do some incrementing if they are?
So that I do not give away too much, I will add pseudocode that you will have to convert to real code to make it work:
// declare a shared variable, count_thread_finished (initially 0),
// to count the number of threads that have finished working
# pragma omp parallel num_threads(thread_count) shared(a, n, number)
{
int thread_id = omp_get_thread_num();
int total_threads = omp_get_num_threads();
if(thread_id != 0) // all threads but the master thread
{
thread_id--; // shift all the ids
total_threads = total_threads - 1;
for(long i = thread_id ; i < n; i += total_threads) {
// do a task, such as:
a[i] = a[i] * number;
}
// count_thread_finished++
}
else{ // the master thread
while(count_thread_finished != total_threads -1){
// wait for a while....
}
}
}
Bear in mind, however, that since the variable count_thread_finished is shared among threads, you will need to ensure mutual exclusion (e.g., using omp atomic) on its updates, otherwise you will have a race-condition. This should give you enough to keep going.
Btw: schedule(static, n/thread_count) is mostly not needed since by default most OpenMP implementations already divide the iterations of the loop (among the threads) as contiguous chunks.
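For reference, one possible way to turn the pseudocode above into real code is sketched here. It rests on a few assumptions that are not in the original: the names scale_while_polling, finished and polls are hypothetical, the master counts its polls instead of modifying number (changing number while the workers read it would be a data race), and a team of one thread does the work itself so that the loop is never skipped. The serial fallbacks are likewise an assumption so that the sketch also builds without OpenMP.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Multiplies a[0..n-1] by factor using all threads except thread 0.
   Thread 0 polls until every worker is done. Returns the poll count. */
long scale_while_polling(int *a, long n, int factor, int thread_count)
{
    int finished = 0; /* number of workers that completed their share */
    long polls = 0;   /* written by thread 0 only */
    assert(a != NULL && thread_count >= 1);

    #pragma omp parallel num_threads(thread_count) shared(a, finished, polls)
    {
        int id = omp_get_thread_num();
        int workers = omp_get_num_threads() - 1;

        if (workers == 0) {
            /* Team of one: there is no worker, so do the work here. */
            for (long i = 0; i < n; i++)
                a[i] *= factor;
        } else if (id != 0) {
            /* Worker: cyclic distribution over the non-master threads. */
            for (long i = id - 1; i < n; i += workers)
                a[i] *= factor;
            #pragma omp atomic update
            finished++;
        } else {
            /* Master: poll the counter until all workers have reported. */
            int done;
            do {
                #pragma omp atomic read
                done = finished;
                polls++; /* placeholder for the periodic action */
            } while (done != workers);
        }
    }
    return polls;
}
```

The atomic update and atomic read keep the accesses to finished free of data races. The results in a are safe to read after the parallel region because of the implicit barrier at its end.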

Multithreaded program with mutex on mutual resource [duplicate]

This question already has an answer here:
Pthread_create() incorrect start routine parameter passing
(1 answer)
Closed 3 years ago.
I tried to build a program which should create threads and assign a Print function to each one of them, while the main process should use printf function directly.
Firstly, I made it without any synchronization means and expected to get a randomized output.
Later I tried to add a mutex to the Print function which was assigned to the threads and expected to get a chronological output, but it seems like the mutex had no effect on the output.
Should I use a mutex on the printf function in the main process as well?
Thanks in advance
My code:
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
pthread_t threadID[20];
pthread_mutex_t lock;
void* Print(void* _num);
int main(void)
{
int num = 20, indx = 0, k = 0;
if (pthread_mutex_init(&lock, NULL))
{
perror("err pthread_mutex_init\n");
return errno;
}
for (; indx < num; ++indx)
{
if (pthread_create(&threadID[indx], NULL, Print, &indx))
{
perror("err pthread_create\n");
return errno;
}
}
for (; k < num; ++k)
{
printf("%d from main\n", k);
}
indx = 0;
for (; indx < num; ++indx)
{
if (pthread_join(threadID[indx], NULL))
{
perror("err pthread_join\n");
return errno;
}
}
pthread_mutex_destroy(&lock);
return 0;
}
void* Print(void* _indx)
{
pthread_mutex_lock(&lock);
printf("%d from thread\n", *(int*)_indx);
pthread_mutex_unlock(&lock);
return NULL;
}
All questions of program bugs notwithstanding, pthreads mutexes provide only mutual exclusion, not any guarantee of scheduling order. This is typical of mutex implementations. Similarly, pthread_create() only creates and starts threads; it does not make any guarantee about scheduling order, such as would justify an assumption that the threads reach the pthread_mutex_lock() call in the same order that they were created.
Overall, if you want to order thread activities based on some characteristic of the threads, then you have to manage that yourself. You need to maintain a sense of which thread's turn it is, and provide a mechanism sufficient to make a thread notice when its turn arrives. In some circumstances, with some care, you can do this by using semaphores instead of mutexes. The more general solution, however, is to use a condition variable together with your mutex, and some shared variable that serves to indicate whose turn it currently is.
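A minimal sketch of the condition-variable approach, under the assumption that the threads should print in order of creation, could look as follows. The names turn, turn_changed, print_order and run_in_order are hypothetical, and each thread is given a pointer to its own array slot so that the index it reads cannot change underneath it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8

static pthread_mutex_t turn_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t turn_changed = PTHREAD_COND_INITIALIZER;
static int turn;                  /* index of the thread allowed to print next */
static int print_order[NTHREADS]; /* print_order[k]: index of the k-th printer */

static void *print_in_turn(void *arg)
{
    int me = *(const int *)arg;

    pthread_mutex_lock(&turn_lock);
    while (turn != me) /* loop guards against spurious wakeups */
        pthread_cond_wait(&turn_changed, &turn_lock);
    printf("%d from thread\n", me);
    print_order[turn] = me;
    turn++;
    pthread_cond_broadcast(&turn_changed); /* wake all; only the next one proceeds */
    pthread_mutex_unlock(&turn_lock);
    return NULL;
}

/* Creates NTHREADS threads that print in index order. Returns 0 on success. */
int run_in_order(void)
{
    pthread_t tid[NTHREADS];
    int index[NTHREADS]; /* one slot per thread, never modified afterwards */
    int created = 0, status = 0;

    turn = 0;
    for (; created < NTHREADS; created++) {
        index[created] = created;
        if (pthread_create(&tid[created], NULL, print_in_turn, &index[created]) != 0) {
            status = -1;
            break;
        }
    }
    for (int i = 0; i < created; i++) {
        int rc = pthread_join(tid[i], NULL);
        assert(rc == 0);
        (void)rc;
    }
    return status;
}
```

pthread_cond_broadcast is used rather than pthread_cond_signal because the thread whose turn comes next is only one of several waiters, and a single signal might wake a different one.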
The code passes the address of the same local variable to all threads. Meanwhile, this variable gets updated by the main thread.
Instead pass it by value cast to void*.
Fix:
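// needs #include <stdint.h> for intptr_t; the intermediate cast avoids
// int/pointer size-mismatch warnings
pthread_create(&threadID[indx], NULL, Print, (void*)(intptr_t)indx)
// ...
printf("%d from thread\n", (int)(intptr_t)_indx);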
Now, since there is no data shared between the threads, you can remove that mutex.
Each thread in the for loop is created at a different value of indx. Because of the operating system scheduler, you can never be sure which thread will run first. Therefore, the values are printed in an unpredictable order that depends on the scheduler's decisions. The second for loop, running in the parent thread, starts immediately after the child threads have been created. Again, the scheduler decides which thread runs next.
Major operating systems schedule threads preemptively. While the for loop in the parent thread is running, an interrupt may occur and hand control to the scheduler, which then decides which thread to run. Therefore, the lines printed by the parent loop are interleaved unpredictably with the lines printed by the child threads, because all threads run concurrently.
Joining a thread means waiting for that thread to finish. If you want to make sure all numbers in the parent for loop are printed in order, without child threads interleaving their output, then move that for loop to after the thread joining.

openmp ordering critical sections

I am trying to create an OpenMP program that will sequentially iterate through a loop. I realize threads are not intended for sequential programs -- I'm trying to either get a little speedup compared to a single thread, or at least keep the execution time similar to a single-threaded program.
Inside my #pragma omp parallel section, each thread computes its own section of a large array and gets the sum of that portion. These all may run in parallel. Then I want the threads to run in order, and each sum is added to the TotalSum IN ORDER. So thread 1 has to wait for thread 0 to complete, and so on. I have this part inside a #pragma omp critical section. Everything runs fine, except that only thread 0 is completing and then the program exits. How can I ensure that the other threads will keep polling? I've tried sleep() and while loops, but it continues to exit after thread 0 completes.
I am not using #pragma omp parallel for because I need to keep track of the specific ranges of the master array that each thread accesses. Here is a shortened version of the code section in question:
//DONE and MasterArray are global arrays. DONE keeps track of all the threads that have completed
int Function()
{
#pragma omp parallel
{
int ID = omp_get_thread_num();
variables: start,end,i,j,temp(array) (all are initialized here)
j = 0;
for (i = start; i < end; i++)
{
if(i != start)
temp[j] = MasterArray[i];
else
temp[j] = temp[j-1] + MasterArray[i];
j++;
}
#pragma omp critical
{
while(DONE[ID] == 0 && ERROR == 0) {
int size = sizeof(temp) / sizeof(temp[0]);
if (ID == 0) {
Sum = temp[size];
DONE[ID] = 1;
if (some situation)
ERROR = 1; //there's an error and we need to exit the function and program
}
else if (DONE[ID-1] == 1) {
Sum = temp[size];
DONE[ID] = 1;
if (some situation)
ERROR = 1; //there's an error and we need to exit the function and program
}
}
}
}
if (ERROR == 1)
return(-1);
else
return(0);
}
this function is called from main after initializing the number of threads. It seems to me that the parallel portion completes, then we check for an error. If an error is found, the loop terminates. I realize something is wrong here, but I can't figure out what it is, and now I'm just going in circles. Any help would be great. Again, my problem is that the function exits after only thread 0 executes, but no error has been flagged. I have it running in pthreads too, but that has been simpler to execute.
Thanks!
Your attempt at ordering threads with #pragma omp critical is totally incorrect. There can be just one thread in a critical section at any time, and the order in which the threads arrive at the critical section is not determined. So in your code it can happen that, e.g., thread #2 enters the critical section first and never leaves it, waiting for thread #1 to complete, while thread #1 and the rest are waiting at #pragma omp critical. And even if some threads, e.g. thread #0, are lucky enough to complete the critical section in the right order, they will wait at an implicit barrier at the end of the parallel region. In other words, deadlock is almost guaranteed in this code.
I suggest you do something much simpler and natural to order your threads, namely an ordered section. It should look like this:
#pragma omp parallel
{
int ID = omp_get_thread_num();
// Computations done by each thread
#pragma omp for ordered schedule(static,1)
for( int t=0; t<omp_get_num_threads(); ++t )
{
assert( t==ID );
#pragma omp ordered
{
// Do the stuff you want to be in order
}
}
}
So you create a parallel loop with the number of iterations equal to the number of threads in the region. The schedule(static,1) clause makes it explicit that the iterations are assigned one per thread in the order of thread IDs; and the ordered clause allows the use of ordered sections inside the loop. Now in the body of the loop you put an ordered section (the block following #pragma omp ordered), and it will be executed in the order of iterations, which is also the order of thread IDs (as ensured by the assertion).
For more information, you may look at this question: How does the omp ordered clause work?
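Applied to the summation described in the question, the pattern might be filled in roughly as follows. This is a sketch under assumptions, not the asker's actual code: ordered_sum, order and entered are hypothetical names, each thread's range is a simple contiguous split, and the serial fallbacks are there so that the sketch also builds without OpenMP.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Sums data[0..n-1]. Partial sums are computed in parallel and then added
   to the total in thread order. order[k] receives the number of the thread
   that was k-th to enter the ordered section (at most max_order entries are
   stored); *entered receives how many threads entered it. */
long ordered_sum(const int *data, int n, int *order, int max_order, int *entered)
{
    long total = 0;
    int count = 0;
    assert(data != NULL && order != NULL && entered != NULL);

    #pragma omp parallel shared(total, count)
    {
        int id = omp_get_thread_num();
        int team = omp_get_num_threads();

        /* Contiguous range owned by this thread. */
        long start = (long)n * id / team;
        long end = (long)n * (id + 1) / team;
        long partial = 0;
        for (long i = start; i < end; i++)
            partial += data[i];

        /* One iteration per thread; static,1 gives iteration t to thread t. */
        #pragma omp for ordered schedule(static, 1)
        for (int t = 0; t < team; t++) {
            #pragma omp ordered
            {
                total += partial;
                if (count < max_order)
                    order[count] = id;
                count++;
            }
        }
    }
    *entered = count;
    return total;
}
```

The partial sums still run concurrently; only the short ordered block is serialized, and it runs in thread-number order without any polling or flags.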
