OpenMP: Is a barrier inside conditional code valid?

OpenMP: Is a barrier inside conditional code valid? - c

In the OpenMP Specification, the following restriction is posed for a barrier construct: (see p. 259, lines 30-31):
Each barrier region must be encountered by all threads in a team or by
none at all, unless cancellation has been requested for the innermost
enclosing parallel region.
Just for completeness, the definition of a region by OpenMP specification is as follows (cf. p.5, lines 9 ff.):
region
All code encountered during a specific instance of
the execution of a given construct, structured block sequence or
OpenMP library routine. A region includes any code in called routines
as well as any implementation code. [...]
I came up with a very simple example and I am asking myself whether it is at all valid, because the barriers are placed inside if-conditions (and not every barrier is "seen" by each thread). Nevertheless, the number of barriers is identical for each thread and experiments with two compilers show that the code works as expected.
#include <stdio.h>
#include <unistd.h>
#include <stdarg.h>
#include <sys/time.h>
#include "omp.h"
double zerotime;
double gettime(void) {
struct timeval t;
gettimeofday(&t, NULL);
return t.tv_sec + t.tv_usec * 1e-6;
}
void print(const char *format, ...) {
va_list args;
va_start (args, format);
#pragma omp critical
{
fprintf(stdout, "Time = %1.1lfs ", gettime() - zerotime);
vfprintf (stdout, format, args);
}
va_end (args);
}
void barrier_test_1(void) {
for (int i = 0; i < 5; i++) {
if (omp_get_thread_num() % 2 == 0) {
print("Path A: Thread %d waiting\n", omp_get_thread_num());
#pragma omp barrier
} else {
print("Path B: Thread %d waiting\n", omp_get_thread_num());
sleep(1);
#pragma omp barrier
}
}
}
int main() {
zerotime = gettime();
#pragma omp parallel
{
barrier_test_1();
}
return 0;
}
For four threads I get the following output:
Time = 0.0s Path B: Thread 1 waiting
Time = 0.0s Path B: Thread 3 waiting
Time = 0.0s Path A: Thread 0 waiting
Time = 0.0s Path A: Thread 2 waiting
Time = 1.0s Path B: Thread 1 waiting
Time = 1.0s Path B: Thread 3 waiting
Time = 1.0s Path A: Thread 2 waiting
Time = 1.0s Path A: Thread 0 waiting
Time = 2.0s Path B: Thread 1 waiting
Time = 2.0s Path B: Thread 3 waiting
Time = 2.0s Path A: Thread 0 waiting
Time = 2.0s Path A: Thread 2 waiting
...
which shows that all the threads nicely wait for the slow Path B operation and pair up even though they are not placed in the same branch.
However, I am still confused from the specification, whether my code is at all valid.
Contrast this e.g. with CUDA where the following statement is given regarding the related __syncthreads() routine:
__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block,
otherwise the code execution is likely to hang or produce unintended
side effects.
Thus, in CUDA, such code as written above in terms of __syncthreads() would be invalid, because the condition omp_get_thread_num() % 2 == 0 evaluates differently depending on the thread.
Follow-up Question:
While I am quite ok with the conclusion that the code above is not following the specification, a slight modification of the code could be as follows, where barrier_test_1() is replaced by barrier_test_2():
void call_barrier(void) {
#pragma omp barrier
}
void barrier_test_2(void) {
for (int i = 0; i < 5; i++) {
if (omp_get_thread_num() % 2 == 0) {
print("Path A: Thread %d waiting\n", omp_get_thread_num());
call_barrier();
} else {
print("Path B: Thread %d waiting\n", omp_get_thread_num());
sleep(1);
call_barrier();
}
}
}
We recognize, that we have only a single barrier placed inside the code and this one is visited by all threads in the team. While the above code would be still invalid in the CUDA case, I am still unsure about OpenMP. I think it boils down to the question what actually constitutes the barrier region, is it just the line in the code or is it all code which has been traversed between subsequent barriers? This is also the reason, why I looked up the definition of a region in the specification. More precisely, as far as I can see there is no code encountered during a specific instance of the execution of <the barrier construct>, which is due to the statement about stand-alone directives in the spec (p.45, lines 3+5)
Stand-alone directives are executable directives that have no
associated user code.
and
Stand-alone directives do not have any associated executable user
code.
and since (p.258 line 9)
The barrier construct is a stand-alone directive.
Maybe the following part of the spec is also of interest (p.259, lines 32-33):
The sequence of worksharing regions and barrier regions encountered
must be the same for every thread in a team.
Preliminary Conclusion:
We can wrap a barrier into a single function as above and replace all barriers by a call to the wrapper function which causes:
All threads either continue executing user code or wait at the barrier
If we call the wrapper only by a subset of threads, this will cause a deadlock but will not lead to undefined behavior
Between calls to the wrapper, the number of met barriers is identical among the threads
Essentially this means, we can safely synchronize and cut through different execution paths by the use of such wrapper
Am I correct?

In the OpenMP Specification, the following restriction is posed for a
barrier construct: (see p. 259, lines 30-31):
Each barrier region must be encountered by all threads in a team or by
none at all, unless cancellation has been requested for the innermost
enclosing parallel region.
That description is a bit problematic because barrier is a stand-alone directive. That means it has no associated code other than the directive itself, and therefore there is no such thing as a "barrier region".
Nevertheless, I think the intent is clear, both from the wording itself and from the conventional behavior of barrier implementations: absent any cancellation, if any thread in a team executing the innermost parallel region containing a given barrier construct reaches that barrier, then all threads in the team must reach that same barrier construct. Different barrier constructs represent different barriers, each requiring all threads to arrive before any proceed past.
However, I am still confused from the specification, whether my code is at all valid.
I see that the behavior of your test code suggests that the two barriers are being treated as a single one. This is irrelevant to interpreting the specification, however, because your code indeed does not satisfy the requirement you asked about. The spec does not require the program to fail in any particular way in this case, but it certainly does not require the behavior you observe, either. You might well find that the program behaves differently with a different version of the compiler or a different OpenMP implementation. The compiler is entitled to assume that your OpenMP code conforms to the OpenMP spec.
Of course, in the case of your particular example, the solution is to replace the two barrier constructs in the different conditional branches with a single one immediately following the else block.

Related

Use while loop to make a thread wait till the lock variable is set to avoid race condition in C prgramming

#include <stdio.h>
#include <pthread.h>
long mails = 0;
int lock = 0;
void *routine()
{
printf("Thread Start\n");
for (long i = 0; i < 100000; i++)
{
while (lock)
{
}
lock = 1;
mails++;
lock = 0;
}
printf("Thread End\n");
}
int main(int argc, int *argv[])
{
pthread_t p1, p2;
if (pthread_create(&p1, NULL, &routine, NULL) != 0)
{
return 1;
}
if (pthread_create(&p2, NULL, &routine, NULL) != 0)
{
return 2;
}
if (pthread_join(p1, NULL) != 0)
{
return 3;
}
if (pthread_join(p2, NULL) != 0)
{
return 4;
}
printf("Number of mails: %ld \n", mails);
return 0;
}
In the above code each thread runs a for loop to increase the value
of mails by 100000.
To avoid race condition is used lock variable
along with while loop.
Using while loop in routine function does not
help to avoid race condition and give correct output for mails
variable.

In C, the compiler can safely assume a (global) variable is not modified by other threads unless in few cases (eg. volatile variable, atomic accesses). This means the compiler can assume lock is not modified and while (lock) {} can be replaced with an infinite loop. In fact, this kind of loop cause an undefined behaviour since it does not have any visible effect. This means the compiler can remove it (or generate a wrong code). The compiler can also remove the lock = 1 statement since it is followed by lock = 0. The resulting code is bogus. Note that even if the compiler would generate a correct code, some processor (eg. AFAIK ARM and PowerPC) can reorder instructions resulting in a bogus behaviour.
To make sure accesses between multiple threads are correct, you need at least atomic accesses on lock. The atomic access should be combined with proper memory barriers for relaxed atomic accesses. The thing is while (lock) {} will result in a spin lock. Spin locks are known to be a pretty bad solution in many cases unless you really know what you are doing and all the consequence (in doubt, don't use them).
Generally, it is better to uses mutexes, semaphores and wait conditions in this case. Mutexes are generally implemented using an atomic boolean flag internally (with right memory barriers so you do not need to care about that). When the flag is mark as locked, an OS sleeping function is called. The sleeping function wake up when the lock has been released by another thread. This is possible since the thread releasing a lock can send a wake up signal. For more information about this, please read this. In old C, you can use pthread for that. Since C11, you can do that directly using this standard API. For pthread, it is here (do not forget the initialization).

If you really want a spinlock, you need something like:
#include <stdatomic.h>
atomic_flag lock = ATOMIC_FLAG_INIT;
void *routine()
{
printf("Thread Start\n");
for (long i = 0; i < 100000; i++)
{
while (atomic_flag_test_and_set(&lock)) {}
mails++;
atomic_flag_clear(&lock);
}
printf("Thread End\n");
}
However, since you are already using pthreads, you're better off using a pthread_mutex

Jérôme Richard told you about ways in which the compiler could optimize the sense out of your code, but even if you turned all the optimizations off, you still would be left with a race condition. You wrote
while (lock) { }
lock=1;
...critical section...
lock=0;
The problem with that is, suppose lock==0. Two threads racing toward that critical section at the same time could both test lock, and they could both find that lock==0. Then they both would set lock=1, and they both would enter the critical section...
...at the same time.
In order to implement a spin lock,* you need some way for one thread to prevent other threads from accessing the lock variable in between when the first thread tests it, and when the first thread sets it. You need an atomic (i.e., indivisible) "test and set" operation.
Most computer architectures have some kind of specialized op-code that does what you want. It has names like "test and set," "compare and exchange," "load-linked and store-conditional," etc. Chris Dodd's answer shows you how to use a standard C library function that does the right thing on whatever CPU you happen to be using...
...But don't forget what Jérôme said.*
* Jérôme told you that spin locks are a bad idea.

How do OpenMP thread ids work with recursion?

Here is a simple recursive program that splits into two for every recursive call. As expected, the result is 2 + 4 + 8 calls to rec, but the number of threads is always the same: two, and the ids bounce back and forth between 0 and one. I expected each recursive call to retain the id, and that 8 threads would be made in the end. What exactly is going on? Is there any issue with the code?
#include <stdio.h>
#include <omp.h>
void rec(int n) {
if (n == 0)
return;
#pragma omp parallel num_threads(2)
{
printf("Currently at %d -- total %d\n", omp_get_thread_num(), omp_get_num_threads());
rec(n - 1);
}
}
int main() {
omp_set_nested(1);
rec(3);
}

Your code is working as expected by OpenMP standard. In OpenMP documentation you can find the following about omp_get_num_threads:
Summary: The omp_get_num_threads routine returns the number of threads
in the current team.
Binding: The binding region for an omp_get_num_threads region is the
innermost enclosing parallel region.
Effect: The omp_get_num_threads routine returns the number of threads
in the team that is executing the parallel region to which the routine
region binds. If called from the sequential part of a program, this
routine returns 1.
omp_get_thread_num has the same binding region:
The binding region for an omp_get_thread_num region is the innermost
enclosing parallel region.
It means that omp_get_num_threads and omp_get_thread_num bind to the innermost parallel region only, so it does not matter how many nested parallel regions are used. Each of your parallel regions is defined by #pragma omp parallel num_threads(2), therefore the return value of omp_get_num_threads is 2 (as long as you have enough threads available) and the return value of omp_get_thread_num is either 0 or 1.

Accessing the executing thread's private variables within a task in OpenMP

I am trying to learn OpenMP, and have stumbled upon the fact that threads do not retain their own data when executing tasks, but they rather have a copy of the data of the thread which has generated the task. Let me demonstrate it with an example:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main()
{
#pragma omp parallel num_threads(4)
{
int thread_id = omp_get_thread_num();
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
for (int i = 0; i < 10; i++) {
#pragma omp task
{
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
}
}
}
return 0;
}
An example output of this code is as follows:
Thread ID of the #single: 1
thread_id, ID of the executing thread: 1, 2
thread_id, ID of the executing thread: 1, 0
thread_id, ID of the executing thread: 1, 3
thread_id, ID of the executing thread: 1, 1
...
It is evident that the thread_id within the task refers to a copy that is assigned to the thread_id of the thread that has created the task (i.e. the one running the single portion of the code).
What if I wanted to refer the executing thread's own private variables then? Are they unrecoverably shadowed? Is there a clause to make this code output number, same number instead at the end of each line?

I am trying to learn OpenMP, and have stumbled upon the fact that
threads do not retain their own data when executing tasks, but they
rather have a copy of the data of the thread which has generated the
task.
"[T]hreads do not retain their own data" is an odd way to describe it. Attributing data ownership to threads themselves instead of to the tasks they are performing is perhaps the key conceptual problem here. It is absolutely natural and to be expected that a thread performing a given task operates with and on the data environment of that task.
But if you're not accustomed to explicit tasks, then it is understandable that you've gotten away so far without appreciating the distinction here. The (many) constructs that give rise to implicit tasks are generally structured in ways that are not amenable to detecting the difference.
So with your example, yes,
the thread_id within the task refers to a copy that
is assigned to the thread_id of the thread that has created the task
(i.e. the one running the single portion of the code).
Although it may not be immediately obvious, that follows from the OMP specification:
When a thread encounters a task construct, an explicit task is
generated from the code for the associated structured-block. The data
environment of the task is created according to the data-sharing
attribute clauses on the task construct, per-data environment ICVs,
and any defaults that apply.
(OMP 5.0 Specification, section 2.10.1; emphasis added)
The only way that can be satisfied is if the task closes over any shared data from the context of its declaration, which is indeed what you observe. Moreover, this is typically what one wants -- the data on which a task is to operate should be established at the point of and by the context of its declaration, else how would one direct what a task is to do?
What if I wanted to refer the executing thread's own private variables
then?
Threads do not have variables, at least not in the terminology of OMP. Those belong to the "data environment" of whatever tasks they are executing at any given time.
Are they unrecoverably shadowed?
When a thread is executing a given task, it accesses the data environment of that task. That environment may include variables that are shared with other tasks, but only in that sense can it access the variables of another task. "Unrecoverably shadowed" is not the wording I would use to describe the situation, but it gets the idea across.
Is there a clause to make this
code output number, same number instead at the end of each line?
There are ways to restructure the code to achieve that, but none of them are as simple as just adding a clause to the omp task directive. In fact, I don't think any of them involve explicit tasks at all. The most natural way to get that would be with a parallel loop:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main(void) {
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 10; i++) {
int thread_id = omp_get_thread_num();
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
return 0;
}
Of course, that also simplifies it to the point where it seems trivial, but perhaps that helps drive home the point. A large part of the purpose of declaring an explicit task is that that task may be executed by a different thread than the one that created it, which is exactly what you need to avoid to achieve the behavior you are asking for.

The problem is, that here you create four parallel threads:
#pragma omp parallel num_threads(4)
and here, you restrict the further execution to one single thread
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
From now on, only the context of this single thread is used, hence the same instance of the variable thread_id is used. Here
for (int i = 0; i < 10; i++) {
#pragma omp task
{
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
you indeed distribute the loop iteration on four threads, but based on the state of the single task (together with the corresponding instance of thread_id to which you restricted execution above. So a first measure is to end the single section directly after the printf (before the loop iterations start):
int thread_id = omp_get_thread_num();
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
}
// Now outside the "single"
for (int i = 0; i < 10; i++) {
...
Now, for each iteration in the for loop, a task is created immediately. And this is performed for each of the four threads. So, you now have 40 tasks pending with
10 x thread_id == 0
10 x thread_id == 1
10 x thread_id == 2
10 x thread_id == 3
These tasks are now distributed amongst the threads arbitrarily. This is where the association between thread_id and the omp thread number gets lost. There is not much you can do about it, except for removing the
#pragma omp task
which leads to a similar result (with corresponding omp thread id and thread_id numbers), but works a bit different internally (the dissociation of the tasks and the omp threads does not take place).

Multithreaded program with mutex on mutual resource [duplicate]

This question already has an answer here:
Pthread_create() incorrect start routine parameter passing
(1 answer)
Closed 3 years ago.
I tried to build a program which should create threads and assign a Print function to each one of them, while the main process should use printf function directly.
Firstly, I made it without any synchronization means and expected to get a randomized output.
Later I tried to add a mutex to the Print function which was assigned to the threads and expected to get a chronological output but it seems like the mutex had no effect about the output.
Should I use a mutex on the printf function in the main process as well?
Thanks in advance
My code:
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
pthread_t threadID[20];
pthread_mutex_t lock;
void* Print(void* _num);
int main(void)
{
int num = 20, indx = 0, k = 0;
if (pthread_mutex_init(&lock, NULL))
{
perror("err pthread_mutex_init\n");
return errno;
}
for (; indx < num; ++indx)
{
if (pthread_create(&threadID[indx], NULL, Print, &indx))
{
perror("err pthread_create\n");
return errno;
}
}
for (; k < num; ++k)
{
printf("%d from main\n", k);
}
indx = 0;
for (; indx < num; ++indx)
{
if (pthread_join(threadID[indx], NULL))
{
perror("err pthread_join\n");
return errno;
}
}
pthread_mutex_destroy(&lock);
return 0;
}
void* Print(void* _indx)
{
pthread_mutex_lock(&lock);
printf("%d from thread\n", *(int*)_indx);
pthread_mutex_unlock(&lock);
return NULL;
}

All questions of program bugs notwithstanding, pthreads mutexes provide only mutual exclusion, not any guarantee of scheduling order. This is typical of mutex implementations. Similarly, pthread_create() only creates and starts threads; it does not make any guarantee about scheduling order, such as would justify an assumption that the threads reach the pthread_mutex_lock() call in the same order that they were created.
Overall, if you want to order thread activities based on some characteristic of the threads, then you have to manage that yourself. You need to maintain a sense of which thread's turn it is, and provide a mechanism sufficient to make a thread notice when it's turn arrives. In some circumstances, with some care, you can do this by using semaphores instead of mutexes. The more general solution, however, is to use a condition variable together with your mutex, and some shared variable that serves as to indicate who's turn it currently is.

The code passes the address of the same local variable to all threads. Meanwhile, this variable gets updated by the main thread.
Instead pass it by value cast to void*.
Fix:
pthread_create(&threadID[indx], NULL, Print, (void*)indx)
// ...
printf("%d from thread\n", (int)_indx);
Now, since there is no data shared between the threads, you can remove that mutex.

All the threads created in the for loop have different value of indx. Because of the operating system scheduler, you can never be sure which thread will run. Therefore, the values printed are in random order depending on the randomness of the scheduler. The second for-loop running in the parent thread will run immediately after creating the child threads. Again, the scheduler decides the order of what thread should run next.
Every OS should have an interrupt (at least the major operating systems have). When running the for-loop in the parent thread, an interrupt might happen and leaves the scheduler to make a decision of which thread to run. Therefore, the numbers being printed in the parent for-loop are printed randomly, because all threads run "concurrently".
Joining a thread means waiting for a thread. If you want to make sure you print all numbers in the parent for loop in chronological order, without letting child thread interrupt it, then relocate the for-loop section to be after the thread joining.

Linux - force single-core execution and debug multi-threading with pthread

I'm debugging a multi-threaded problem with C, pthread and Linux. On my MacOS 10.5.8, C2D, is runs fine, on my Linux computers (2-4 cores) it produces undesired outputs.
I'm not experienced, therefore I attached my code. It's rather simple: each new thread creates two more threads until a maximum is reached. So no big deal... as I thought until a couple of days ago.
Can I force single-core execution to prevent my bugs from occuring?
I profiled the programm execution, instrumenting with Valgrind:
valgrind --tool=drd --read-var-info=yes --trace-mutex=no ./threads
I get a couple of conflicts in the BSS segment - which are caused by my global structs and thread counter variales. However I could mitigate these conflicts with forced signle-core execution because I think the concurrent sheduling of my 2-4 core test-systems are responsible for my errors.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define MAX_THR 12
#define NEW_THR 2
int wait_time = 0; // log global wait time
int num_threads = 0; // how many threads there are
pthread_t threads[MAX_THR]; // global array to collect threads
pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER; // sync
struct thread_data
{
int nr; // nr of thread, serves as id
int time; // wait time from rand()
};
struct thread_data thread_data_array[MAX_THR+1];
void
*PrintHello(void *threadarg)
{
if(num_threads < MAX_THR){
// using the argument
pthread_mutex_lock(&mut);
struct thread_data *my_data;
my_data = (struct thread_data *) threadarg;
// updates
my_data->nr = num_threads;
my_data->time= rand() % 10 + 1;
printf("Hello World! It's me, thread #%d and sleep time is %d!\n",
my_data->nr,
my_data->time);
pthread_mutex_unlock(&mut);
// counter
long t = 0;
for(t = 0; t < NEW_THR; t++){
pthread_mutex_lock(&mut);
num_threads++;
wait_time += my_data->time;
pthread_mutex_unlock(&mut);
pthread_create(&threads[num_threads], NULL, PrintHello, &thread_data_array[num_threads]);
sleep(1);
}
printf("Bye from %d thread\n", my_data->nr);
pthread_exit(NULL);
}
return 0;
}
int
main (int argc, char *argv[])
{
long t = 0;
// srand(time(NULL));
if(num_threads < MAX_THR){
for(t = 0; t < NEW_THR; t++){
// -> 2 threads entry point
pthread_mutex_lock(&mut);
// rand time
thread_data_array[num_threads].time = rand() % 10 + 1;
// update global wait time variable
wait_time += thread_data_array[num_threads].time;
num_threads++;
pthread_mutex_unlock(&mut);
pthread_create(&threads[num_threads], NULL, PrintHello, &thread_data_array[num_threads]);
pthread_mutex_lock(&mut);
printf("In main: creating initial thread #%ld\n", t);
pthread_mutex_unlock(&mut);
}
}
for(t = 0; t < MAX_THR; t++){
pthread_join(threads[t], NULL);
}
printf("Bye from program, wait was %d\n", wait_time);
pthread_exit(NULL);
}
I hope that code isn't too bad. I didn't do too much C for a rather long time. :) The problem is:
printf("Bye from %d thread\n", my_data->nr);
my_data->nr sometimes resolves high integer values:
In main: creating initial thread #0
Hello World! It's me, thread #2 and sleep time is 8!
In main: creating initial thread #1
[...]
Hello World! It's me, thread #11 and sleep time is 8!
Bye from 9 thread
Bye from 5 thread
Bye from -1376900240 thread
[...]
I don't now more ways to profile and debug this.
If I debug this, it works - sometimes. Sometimes it doesn't :(
Thanks for reading this long question. :) I hope I didn't share too much of my currently unresolveable confusion.

Since this program seems to be just an exercise in using threads, with no actual goal, it is difficult to suggest how treat your problem rather than treat the symptom. I believe can actually pin a process or thread to a processor in Linux, but doing so for all threads removes most of the benefit of using threads, and I don't actually remember how to do it. Instead I'm going to talk about some things wrong with your program.
C compilers often make a lot of assumptions when they are doing optimizations. One of the assumptions is that unless the current code being examined looks like it might change some variable that variable does not change (this is a very rough approximation to this, and a more accurate explanation would take a very long time).
In this program you have variables which are shared and changed by different threads. If a variable is only read by threads (either const or effectively const after threads that look at it are created) then you don't have much to worry about (and in "read by threads" I'm including the main original thread) because since the variable doesn't change if the compiler only generates code to read that variable once (remembering it in a local temporary variable) or if it generates code to read it over and over the value is always the same so that calculations based on it always come out the same.
To force the compiler not do this you can use the volatile keyword. It is affixed to variable declarations just like the const keyword, and tells the compiler that the value of that variable can change at any instant, so reread it every time its value is needed, and rewrite it every time a new value for it is assigned.
NOTE that for pthread_mutex_t (and similar) variables you do not need volatile. It if were needed on the type(s) that make up pthread_mutex_t on your system volatile would have been used within the definition of pthread_mutex_t. Additionally the functions that access this type take the address of it and are specially written to do the right thing.
I'm sure now you are thinking that you know how to fix your program, but it is not that simple. You are doing math on a shared variable. Doing math on a variable using code like:
x = x + 1;
requires that you know the old value to generate the new value. If x is global then you have to conceptually load x into a register, add 1 to that register, and then store that value back into x. On a RISC processor you actually have to do all 3 of those instructions, and being 3 instructions I'm sure you can see how another thread accessing the same variable at nearly the same time could end up storing a new value in x just after we have read our value -- making our value old, so our calculation and the value we store will be wrong.
If you know any x86 assembly then you probably know that it has instructions that can do math on values in RAM (both getting from and storing the result in the same location in RAM all in one instruction). You might think that this instruction could be used for this operation on x86 systems, and you would almost be right. The problem is that this instruction is still executed in the steps that the RISC instruction would be executed in, and there are several opportunities for another processor to change this variable at the same time as we are doing our math on it. To get around this on x86 there is a lock prefix that may be applied to some x86 instructions, and I believe that glibc header files include atomic macro functions to do this on architectures that can support it, but this can't be done on all architectures.
To work right on all architectures you are going to need to:
int local_thread_count;
int create_a_thread;
pthread_mutex_lock(&count_lock);
local_thread_count = num_threads;
if (local_thread_count < MAX_THR) {
num_threads = local_thread_count + 1;
pthread_mutex_unlock(&count_lock);
thread_data_array[local_thread_count].nr = local_thread_count;
/* moved this into the creator
* since getting it in the
* child will likely get the
* wrong value. */
pthread_create(&threads[local_thread_count], NULL, PrintHello,
&thread_data_array[local_thread_count]);
} else {
pthread_mutex_unlock(&count_lock);
}
Now, since you would have changed the num_threads to volatile you can atomically test and increment the thread count in all threads. At the end of this local_thread_count should be usable as an index into the array of threads. Note that I did not create but 1 thread in this code, while yours was supposed to create several. I did this to make the example more clear, but it should not be too difficult to change it to go ahead and add NEW_THR to num_threads, but if NEW_THR is 2 and MAX_THR - num_threads is 1 (somehow) then you have to handle that correctly somehow.
Now, all of that being said, there may be another way to accomplish similar things by using semaphores. Semaphores are like mutexes, but they have a count associated with them. You would not get a value to use as the index into the array of threads (the function to read a semaphore count won't really give you this), but I thought that it deserved to be mentioned since it is very similar.
man 3 semaphore.h
will tell you a little bit about it.

num_threads should at least be marked volatile, and preferably marked atomic too (although I believe that the int is practically fine), so that at least there is a higher chance that the different threads are seeing the same values. You might want to view the assembler output to see when the writes of num_thread to memory are actually supposedly taking place.

https://computing.llnl.gov/tutorials/pthreads/#PassingArguments
that seems to be the problem. you need to malloc the thread_data struct.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight