PCRE pcre_exec thread safe? - c

I have a C program that uses a PCRE regex to determine if a process in a cgroup should be added to one variable or another. I spawn a thread to read the cpuacct.stat file in each running cgroup, where the number of threads never exceeded the number of cores. These samples and results are then combined into one of two variables.
The relevant snippet of code is:
pcreExecRet = pcre_exec(reCompiled,
pcreExtra,
queue,
strlen(queue), // length of string
0, // Start looking at this point
0, // OPTIONS
subStrVec,
30); // Length of subStrVec
//CRITICAL SECTION?
pthread_mutex_lock(&t_lock); //lock mutex
while (sumFlag == 0) {
pthread_cond_wait(&ok_add, &t_lock); //wait on ok signal
}
if(pcreExecRet > 0) {
sumOne += loadavg;
} else if (pcreExecRet == PCRE_ERROR_NOMATCH){
sumTwo += loadavg;
} else {
perror("Could not determine sum!\n"); //if this fails
}
sumFlag = 1;
pthread_cond_signal(&ok_add); //signal that it is ok to add
pthread_mutex_unlock(&t_lock); //unlock mutex
My question is whether or not the pcre_exec() call is thread-safe? Should it be moved into the critical section? I know the compiled regex is thread safe, but I'm not sure about pcreExtra (const pcre_extra) or subStrVec (int *ovector). These variables are global for now.

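Yes, it is thread safe; all PCRE functions are, but you need to be careful under certain conditions. In particular, pcre_exec() writes the match offsets into ovector, so a global subStrVec must not be shared by threads that match concurrently; give each thread its own vector.
The following is from the manual pages for PCRE: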
MULTITHREADING
The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout and stack-checking functions pointed to by pcre_callout and
pcre_stack_guard, are shared by all threads.
The compiled form of a regular expression is not altered during matching,
so the same compiled pattern can safely be used by several threads
at once.
If the just-in-time optimization feature is being used, it needs separate
memory stack areas for each thread. See the pcrejit documentation
for more details.

Related

Use while loop to make a thread wait till the lock variable is set to avoid race condition in C programming

#include <stdio.h>
#include <pthread.h>
long mails = 0;
int lock = 0;
void *routine()
{
printf("Thread Start\n");
for (long i = 0; i < 100000; i++)
{
while (lock)
{
}
lock = 1;
mails++;
lock = 0;
}
printf("Thread End\n");
}
int main(int argc, char *argv[])
{
pthread_t p1, p2;
if (pthread_create(&p1, NULL, &routine, NULL) != 0)
{
return 1;
}
if (pthread_create(&p2, NULL, &routine, NULL) != 0)
{
return 2;
}
if (pthread_join(p1, NULL) != 0)
{
return 3;
}
if (pthread_join(p2, NULL) != 0)
{
return 4;
}
printf("Number of mails: %ld \n", mails);
return 0;
}
In the above code, each thread runs a for loop that increases the value of mails by 100000. To avoid a race condition, I used a lock variable along with a while loop. However, the while loop in the routine function does not prevent the race condition, and the program does not give the correct output for the mails variable.
In C, the compiler can safely assume a (global) variable is not modified by other threads except in a few cases (e.g. volatile variables, atomic accesses). This means the compiler can assume lock is not modified, and while (lock) {} can be replaced with an infinite loop. In fact, this kind of loop causes undefined behaviour since it has no visible effect. This means the compiler can remove it (or generate wrong code). The compiler can also remove the lock = 1 statement since it is followed by lock = 0. The resulting code is bogus. Note that even if the compiler generated correct code, some processors (e.g. AFAIK ARM and PowerPC) can reorder instructions, resulting in bogus behaviour.
To make sure accesses between multiple threads are correct, you need at least atomic accesses on lock. Relaxed atomic accesses should be combined with proper memory barriers. The thing is, while (lock) {} results in a spin lock. Spin locks are known to be a pretty bad solution in many cases unless you really know what you are doing and understand all the consequences (if in doubt, don't use them).
Generally, it is better to use mutexes, semaphores and condition variables in this case. Mutexes are generally implemented using an atomic boolean flag internally (with the right memory barriers, so you do not need to care about that). When the flag is marked as locked, an OS sleeping function is called. The sleeping function wakes up when the lock has been released by another thread. This is possible since the thread releasing a lock can send a wake-up signal. In old C, you can use pthreads for that (pthread_mutex_t; do not forget the initialization). Since C11, you can do that directly using the standard <threads.h> API.
If you really want a spinlock, you need something like:
#include <stdatomic.h>
atomic_flag lock = ATOMIC_FLAG_INIT;
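void *routine(void *arg)
{
printf("Thread Start\n");
for (long i = 0; i < 100000; i++)
{
// spin until this thread is the one that flips the flag from clear to set
while (atomic_flag_test_and_set(&lock)) {}
mails++;
atomic_flag_clear(&lock); // release the lock
}
printf("Thread End\n");
return NULL;
}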
However, since you are already using pthreads, you're better off using a pthread_mutex.
Jérôme Richard told you about ways in which the compiler could optimize the sense out of your code, but even if you turned all the optimizations off, you still would be left with a race condition. You wrote
while (lock) { }
lock=1;
...critical section...
lock=0;
The problem with that is, suppose lock==0. Two threads racing toward that critical section at the same time could both test lock, and they could both find that lock==0. Then they both would set lock=1, and they both would enter the critical section...
...at the same time.
In order to implement a spin lock,* you need some way for one thread to prevent other threads from accessing the lock variable in between when the first thread tests it, and when the first thread sets it. You need an atomic (i.e., indivisible) "test and set" operation.
Most computer architectures have some kind of specialized op-code that does what you want. It has names like "test and set," "compare and exchange," "load-linked and store-conditional," etc. Chris Dodd's answer shows you how to use a standard C library function that does the right thing on whatever CPU you happen to be using...
...But don't forget what Jérôme said.*
* Jérôme told you that spin locks are a bad idea.

How do I secure that all threads created in a function return before the very same function terminates?

Consider the following section of a C function:
for (int i = 0; i < n; ++i) {
thread_arg *arg = (thread_arg *) malloc(sizeof(thread_arg));
arg->random_value = random_value;
arg->message = &(message[i * 10]);
if (pthread_create(NULL, NULL, thread_start, (void *) &arg)) {
perror("pthread_create");
exit(EXIT_FAILURE);
}
}
In this for loop, I create n threads which all perform a common routine with different parameters. This for loop is part of a bigger function which returns a data structure which gets modified by all threads in parallel. Thus, it is important that this bigger function won't return before all threads are done.
I was hoping to find a simpler way than giving an individual ID to all these threads and joining afterwards with pthread_join. Is there any general approach to say to a function something like "hey, don't return until all threads you've created returned"?
There are at least two other ways:
Use pthread barriers. The name barrier is used in a completely different sense than you usually hear it when talking about concurrency. Here, it's a synchronization primitive that lets each of a set of threads (waiters on it) block until all of them have reached it, then unblocks them all together. You'd first initialize the barrier in some shared location with n+1 as the count, then have both the function itself and all the n threads it created call pthread_barrier_wait before finishing. Assuming you do it this way, after returning from the wait, the n threads can no longer access the shared state; they need to immediately return.
Create the same thing (or a simplified version of it) with a condvar and mutex. Have a count, protected by a mutex, of how many of the n threads are still working. The function that created them can then do:
pthread_mutex_lock(&cnt_mtx);
while (count > 0) pthread_cond_wait(&cnt_cv, &cnt_mtx);
pthread_mutex_unlock(&cnt_mtx);
Generally, though, I'd use pthread_join here. That's what it's for.

Multithreaded program with mutex on mutual resource [duplicate]

I tried to build a program which should create threads and assign a Print function to each one of them, while the main process should use printf function directly.
Firstly, I made it without any synchronization means and expected to get a randomized output.
Later I tried to add a mutex to the Print function which was assigned to the threads and expected to get a chronological output, but it seems like the mutex had no effect on the output.
Should I use a mutex on the printf function in the main process as well?
Thanks in advance
My code:
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
pthread_t threadID[20];
pthread_mutex_t lock;
void* Print(void* _num);
int main(void)
{
int num = 20, indx = 0, k = 0;
if (pthread_mutex_init(&lock, NULL))
{
perror("err pthread_mutex_init\n");
return errno;
}
for (; indx < num; ++indx)
{
if (pthread_create(&threadID[indx], NULL, Print, &indx))
{
perror("err pthread_create\n");
return errno;
}
}
for (; k < num; ++k)
{
printf("%d from main\n", k);
}
indx = 0;
for (; indx < num; ++indx)
{
if (pthread_join(threadID[indx], NULL))
{
perror("err pthread_join\n");
return errno;
}
}
pthread_mutex_destroy(&lock);
return 0;
}
void* Print(void* _indx)
{
pthread_mutex_lock(&lock);
printf("%d from thread\n", *(int*)_indx);
pthread_mutex_unlock(&lock);
return NULL;
}
All questions of program bugs notwithstanding, pthreads mutexes provide only mutual exclusion, not any guarantee of scheduling order. This is typical of mutex implementations. Similarly, pthread_create() only creates and starts threads; it does not make any guarantee about scheduling order, such as would justify an assumption that the threads reach the pthread_mutex_lock() call in the same order that they were created.
Overall, if you want to order thread activities based on some characteristic of the threads, then you have to manage that yourself. You need to maintain a sense of which thread's turn it is, and provide a mechanism sufficient to make a thread notice when its turn arrives. In some circumstances, with some care, you can do this by using semaphores instead of mutexes. The more general solution, however, is to use a condition variable together with your mutex, and some shared variable that serves to indicate whose turn it currently is.
The code passes the address of the same local variable to all threads. Meanwhile, this variable gets updated by the main thread.
Instead pass it by value cast to void*.
Fix:
// needs #include <stdint.h> for intptr_t
pthread_create(&threadID[indx], NULL, Print, (void*)(intptr_t)indx)
// ...
printf("%d from thread\n", (int)(intptr_t)_indx);
Now, since there is no data shared between the threads, you can remove that mutex.
All the threads created in the for loop have different value of indx. Because of the operating system scheduler, you can never be sure which thread will run. Therefore, the values printed are in random order depending on the randomness of the scheduler. The second for-loop running in the parent thread will run immediately after creating the child threads. Again, the scheduler decides the order of what thread should run next.
Every major operating system uses interrupts to preempt running threads. While the for-loop in the parent thread is running, an interrupt might occur and let the scheduler decide which thread to run next. Therefore, the numbers printed by the parent for-loop are interleaved with the children's output in an unpredictable order, because all threads run "concurrently".
Joining a thread means waiting for a thread. If you want to make sure you print all numbers in the parent for loop in chronological order, without letting child thread interrupt it, then relocate the for-loop section to be after the thread joining.

4 Process 4 way synchronization using semaphores (In a C Programming, UNIX environment)

I have a question about synchronizing 4 processes in a UNIX environment. It is very important that no process runs their main functionality without first waiting for the others to "be on the same page", so to speak.
Specifically, they should all not go into their loops without first synchronizing with each other. How do I synchronize 4 processes in a 4 way situation, so that none of them get into their first while loop without first waiting for the others? Note that this is mainly a logic problem, not a coding problem.
To keep things consistent between environments let's just say we have a pseudocode semaphore library with the operations semaphore_create(int systemID), semaphore_open(int semaID), semaphore_wait(int semaID), and semaphore_signal(int semaID).
Here is my attempt and subsequent thoughts:
Process1.c:
int main() {
//Synchronization area (relevant stuff):
int sem1 = semaphore_create(123456); //123456 is an arbitrary ID for the semaphore.
int sem2 = semaphore_create(78901); //78901 is an arbitrary ID for the semaphore.
semaphore_signal(sem1);
semaphore_wait(sem2);
while(true) {
//...do main functionality of process, etc (not really relevant)...
}
}
Process2.c:
int main() {
//Synchronization area (relevant stuff):
int sem1 = semaphore_open(123456);
int sem2 = semaphore_open(78901);
semaphore_signal(sem1);
semaphore_wait(sem2);
while(true) {
//...do main functionality of process etc...
}
}
Process3.c:
int main() {
//Synchronization area (relevant stuff):
int sem1 = semaphore_open(123456);
int sem2 = semaphore_open(78901);
semaphore_signal(sem1);
semaphore_wait(sem2);
while(true) {
//...do main functionality of process etc...
}
}
Process4.c:
int main() {
//Synchronization area (relevant stuff):
int sem1 = semaphore_open(123456);
int sem2 = semaphore_open(78901);
semaphore_signal(sem2);
semaphore_signal(sem2);
semaphore_signal(sem2);
semaphore_wait(sem1);
semaphore_wait(sem1);
semaphore_wait(sem1);
while(true) {
//...do main functionality of process etc...
}
}
We run Process1 first, and it creates all of the semaphores in system memory used in the other processes (the other processes simply call semaphore_open to gain access to those semaphores). Then, all 4 processes have a signal operation, and then a wait. The signal operation causes process1, process2, and process3 to increment the value of sem1 by 1, so its resultant maximum value is 3 (depending on what order the operating system decides to run these processes in). Process1, 2, and 3 are all waiting then on sem2, and process4 is waiting on sem1 as well. Process 4 then signals sem2 3 times to bring its value back up to 0, and waits on sem1 3 times. Since sem1 was a maximum of 3 from the signalling in the other processes (depending on what order they ran in, again), then it will bring its value back up to 0, and continue running. Thus, all processes will be synchronized.
So yea, not super confident on my answer. I feel that it depends heavily on what order the processes ran in, which is the whole point of synchronization -- that it shouldn't matter what order they run in, they all synchronize correctly. Also, I am doing a lot of work in Process4. Maybe it would be better to solve this using more than 2 semaphores? Wouldn't this also allow for more flexibility within the loops in each process, if I want to do further synchronization?
My question: Please explain why the above logic will or will not work, and/or a solution on how to solve this problem of 4 way synchronization. I'd imagine this is a very common thing to have to think about depending on the industry (eg. banking and synching up bank accounts). I know it is not very difficult, but I have never worked with semaphores before, so I'm kind of confused on how they work.
The precise semantics of your model semaphore library are not clear enough to answer your question definitively. However, if the difference between semaphore_create() and semaphore_open() is that the latter requires the specified semaphore to already exist, whereas the former requires it to not exist, then yes, the whole thing will fall down if process1 does not manage to create the needed semaphores before any of the other processes attempt to open them. (Probably it falls down in different ways if other semantics hold.)
That sort of issue can be avoided in a threading scenario because with threads there is necessarily an initial single-threaded segment wherein the synchronization structures can be initialized. There is also shared memory by which the various threads can communicate with one another. The answer @Dark referred to depends on those characteristics.
The essential problem with a barrier for multiple independent processes -- or for threads that cannot communicate via shared memory and that are not initially synchronized -- is that you cannot know which process needs to erect the barrier. It follows that each one needs to be prepared to do so. That can work in your model library if semaphore_create() can indicate to the caller which result was achieved, one of
semaphore successfully created
semaphore already exists
(or error)
In that case, all participating processes (whose number you must know) can execute the same procedure, maybe something like this:
void process_barrier(int process_count) {
sem_t *sem1, *sem2, *sem3;
int result = semaphore_create(123456, &sem1);
int counter;
switch (result) {
case SEM_SUCCESS:
/* I am the controlling process */
/* Finish setting up the barrier */
semaphore_create(78901, &sem2);
semaphore_create(23432, &sem3);
/* let (n - 1) other processes enter the barrier... */
for (counter = 1; counter < process_count; counter += 1) {
semaphore_signal(sem1);
}
/* ... and wait for those (n - 1) processes to do so */
for (counter = 1; counter < process_count; counter += 1) {
semaphore_wait(sem2);
}
/* let all the (n - 1) waiting processes loose */
for (counter = 1; counter < process_count; counter += 1) {
semaphore_signal(sem3);
}
/* and I get to continue, too */
break;
case SEM_EXISTS_ERROR:
/* I am NOT the controlling process */
semaphore_open(123456, &sem1);
/* wait, if necessary, for the barrier to be initialized */
semaphore_wait(sem1);
semaphore_open(78901, &sem2);
semaphore_open(23432, &sem3);
/* signal the controlling process that I have reached the barrier */
semaphore_signal(sem2);
/* wait for the controlling process to allow me to continue */
semaphore_wait(sem3);
break;
}
}
Obviously, I have taken some minor liberties with your library interface, and I have omitted error checks except where they bear directly on the barrier's operation.
The three semaphores involved in that example serve distinct, well-defined purposes. sem1 guards the initialization of the synchronization constructs and allows the processes to choose which among them takes responsibility for controlling the barrier. sem2 serves to count how many processes have reached the barrier. sem3 blocks the non-controlling processes that have reached the barrier until the controlling process releases them all.

How can barriers be destroyable as soon as pthread_barrier_wait returns?

This question is based on:
When is it safe to destroy a pthread barrier?
and the recent glibc bug report:
http://sourceware.org/bugzilla/show_bug.cgi?id=12674
I'm not sure about the semaphores issue reported in glibc, but presumably it's supposed to be valid to destroy a barrier as soon as pthread_barrier_wait returns, as per the above linked question. (Normally, the thread that got PTHREAD_BARRIER_SERIAL_THREAD, or a "special" thread that already considered itself "responsible" for the barrier object, would be the one to destroy it.) The main use case I can think of is when a barrier is used to synchronize a new thread's use of data on the creating thread's stack, preventing the creating thread from returning until the new thread gets to use the data; other barriers probably have a lifetime equal to that of the whole program, or controlled by some other synchronization object.
In any case, how can an implementation ensure that destruction of the barrier (and possibly even unmapping of the memory it resides in) is safe as soon as pthread_barrier_wait returns in any thread? It seems the other threads that have not yet returned would need to examine at least some part of the barrier object to finish their work and return, much like how, in the glibc bug report cited above, sem_post has to examine the waiters count after having adjusted the semaphore value.
I'm going to take another crack at this with an example implementation of pthread_barrier_wait() that uses mutex and condition variable functionality as might be provided by a pthreads implementation. Note that this example doesn't try to deal with performance considerations (specifically, when the waiting threads are unblocked, they are all re-serialized when exiting the wait). I think that using something like Linux Futex objects could help with the performance issues, but Futexes are still pretty much out of my experience.
Also, I doubt that this example handles signals or errors correctly (if at all in the case of signals). But I think proper support for those things can be added as an exercise for the reader.
My main fear is that the example may have a race condition or deadlock (the mutex handling is more complex than I like). Also note that it is an example that hasn't even been compiled. Treat it as pseudo-code. Also keep in mind that my experience is mainly in Windows - I'm tackling this more as an educational opportunity than anything else. So the quality of the pseudo-code may well be pretty low.
However, disclaimers aside, I think it may give an idea of how the problem asked in the question could be handled (ie., how can the pthread_barrier_wait() function allow the pthread_barrier_t object it uses to be destroyed by any of the released threads without danger of using the barrier object by one or more threads on their way out).
Here goes:
/*
* Since this is a part of the implementation of the pthread API, it uses
* reserved names that start with "__" for internal structures and functions
*
* Functions such as __mutex_lock() and __cond_wait() perform the same function
* as the corresponding pthread API.
*/
// struct __barrier_waitdata is intended to hold all the data
// that `pthread_barrier_wait()` will need after releasing
// waiting threads. This will allow the function to avoid
// touching the passed in pthread_barrier_t object after
// the wait is satisfied (since any of the released threads
// can destroy it)
struct __barrier_waitdata {
struct __mutex cond_mutex;
struct __cond cond;
unsigned waiter_count;
int wait_complete;
};
struct __barrier {
unsigned count;
struct __mutex waitdata_mutex;
struct __barrier_waitdata* pwaitdata;
};
typedef struct __barrier pthread_barrier_t;
int __barrier_waitdata_init( struct __barrier_waitdata* pwaitdata)
{
int rc;
pwaitdata->waiter_count = 0;
pwaitdata->wait_complete = 0;
rc = __mutex_init( &pwaitdata->cond_mutex, NULL);
if (rc) {
// non-zero means failure, as with the pthread API
return rc;
}
rc = __cond_init( &pwaitdata->cond, NULL);
if (rc) {
__mutex_destroy( &pwaitdata->cond_mutex);
return rc;
}
return 0;
}
int pthread_barrier_init(pthread_barrier_t *barrier, const pthread_barrierattr_t *attr, unsigned int count)
{
int rc;
rc = __mutex_init( &barrier->waitdata_mutex, NULL);
if (rc) return rc;
barrier->pwaitdata = NULL;
barrier->count = count;
//TODO: deal with attr
return 0;
}
int pthread_barrier_wait(pthread_barrier_t *barrier)
{
int rc;
struct __barrier_waitdata* pwaitdata;
unsigned target_count;
// potential waitdata block (only one thread's will actually be used)
struct __barrier_waitdata waitdata;
// nothing to do if we only need to wait for one thread...
if (barrier->count == 1) return PTHREAD_BARRIER_SERIAL_THREAD;
rc = __mutex_lock( &barrier->waitdata_mutex);
if (rc) return rc;
if (!barrier->pwaitdata) {
// no other thread has claimed the waitdata block yet -
// we'll use this thread's
rc = __barrier_waitdata_init( &waitdata);
if (rc) {
__mutex_unlock( &barrier->waitdata_mutex);
return rc;
}
barrier->pwaitdata = &waitdata;
}
pwaitdata = barrier->pwaitdata;
target_count = barrier->count;
// all data necessary for handling the return from a wait is pointed to
// by `pwaitdata`, and `pwaitdata` points to a block of data on the stack of
// one of the waiting threads. We have to make sure that the thread that owns
// that block waits until all others have finished with the information
// pointed to by `pwaitdata` before it returns. However, after the 'big' wait
// is completed, the `pthread_barrier_t` object that's passed into this
// function isn't used. The last operation done to `*barrier` is to set
// `barrier->pwaitdata = NULL` to satisfy the requirement that this function
// leaves `*barrier` in a state as if `pthread_barrier_init()` had been called - and
// that operation is done by the thread that signals the wait condition
// completion before the completion is signaled.
// note: we're still holding `barrier->waitdata_mutex`;
rc = __mutex_lock( &pwaitdata->cond_mutex);
pwaitdata->waiter_count += 1;
if (pwaitdata->waiter_count < target_count) {
// need to wait for other threads
__mutex_unlock( &barrier->waitdata_mutex);
do {
// TODO: handle the return code from `__cond_wait()` to break out of this
// if a signal makes that necessary
__cond_wait( &pwaitdata->cond, &pwaitdata->cond_mutex);
} while (!pwaitdata->wait_complete);
}
else {
// this thread satisfies the wait - unblock all the other waiters
pwaitdata->wait_complete = 1;
// 'release' our use of the passed in pthread_barrier_t object
barrier->pwaitdata = NULL;
// unlock the barrier's waitdata_mutex - the barrier is
// ready for use by another set of threads
__mutex_unlock( &barrier->waitdata_mutex);
// finally, unblock the waiting threads
__cond_broadcast( &pwaitdata->cond);
}
// at this point, barrier->waitdata_mutex is unlocked, the
// barrier->pwaitdata pointer has been cleared, and no further
// use of `*barrier` is permitted...
// however, each thread still has a valid `pwaitdata` pointer - the
// thread that owns that block needs to wait until all others have
// dropped the pwaitdata->waiter_count
// also, at this point the `pwaitdata->cond_mutex` is locked, so
// we're in a critical section
rc = 0;
pwaitdata->waiter_count--;
if (pwaitdata == &waitdata) {
// this thread owns the waitdata block - it needs to hang around until
// all other threads are done
// as a convenience, this thread will be the one that returns
// PTHREAD_BARRIER_SERIAL_THREAD
rc = PTHREAD_BARRIER_SERIAL_THREAD;
while (pwaitdata->waiter_count!= 0) {
__cond_wait( &pwaitdata->cond, &pwaitdata->cond_mutex);
};
__mutex_unlock( &pwaitdata->cond_mutex);
__cond_destroy( &pwaitdata->cond);
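__mutex_destroy( &pwaitdata->cond_mutex);
}
else {
// wake the owning thread once the last other thread is done
if (pwaitdata->waiter_count == 0) {
__cond_signal( &pwaitdata->cond);
}
// every non-owning thread must release the mutex on its way out
__mutex_unlock( &pwaitdata->cond_mutex);
}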
return rc;
}
17 July 2011: Update in response to a comment/question about process-shared barriers
I forgot completely about the situation with barriers that are shared between processes. And as you mention, the idea I outlined will fail horribly in that case. I don't really have experience with POSIX shared memory use, so any suggestions I make should be tempered with scepticism.
To summarize (for my benefit, if no one else's):
When any of the threads gets control after pthread_barrier_wait() returns, the barrier object needs to be in the 'init' state (however the most recent pthread_barrier_init() on that object set it). Also implied by the API is that once any of the threads returns, one or more of the following things could occur:
another call to pthread_barrier_wait() to start a new round of synchronization of threads
pthread_barrier_destroy() on the barrier object
the memory allocated for the barrier object could be freed or unshared if it's in a shared memory region.
These things mean that before the pthread_barrier_wait() call allows any thread to return, it pretty much needs to ensure that all waiting threads are no longer using the barrier object in the context of that call. My first answer addressed this by creating a 'local' set of synchronization objects (a mutex and an associated condition variable) outside of the barrier object that would block all the threads. These local synchronization objects were allocated on the stack of the thread that happened to call pthread_barrier_wait() first.
I think that something similar would need to be done for barriers that are process-shared. However, in that case simply allocating those sync objects on a thread's stack isn't adequate (since the other processes would have no access). For a process-shared barrier, those objects would have to be allocated in process-shared memory. I think the technique I listed above could be applied similarly:
the waitdata_mutex that controls the 'allocation' of the local sync variables (the waitdata block) would be in process-shared memory already by virtue of it being in the barrier struct. Of course, when the barrier is set to PTHREAD_PROCESS_SHARED, that attribute would also need to be applied to the waitdata_mutex
when __barrier_waitdata_init() is called to initialize the local mutex & condition variable, it would have to allocate those objects in shared memory instead of simply using the stack-based waitdata variable.
when the 'cleanup' thread destroys the mutex and the condition variable in the waitdata block, it would also need to clean up the process-shared memory allocation for the block.
in the case where shared memory is used, there needs to be some mechanism to ensure that the shared memory object is opened at least once in each process, and closed the correct number of times in each process (but not closed entirely before every thread in the process is finished using it). I haven't thought through exactly how that would be done...
I think these changes would allow the scheme to operate with process-shared barriers. The last bullet point above is a key item to figure out. Another is how to construct a name for the shared memory object that will hold the 'local' process-shared waitdata. There are certain attributes you'd want for that name:
you'd want the storage for the name to reside in the struct pthread_barrier_t structure so all processes have access to it; that means a known limit to the length of the name
you'd want the name to be unique to each 'instance' of a set of calls to pthread_barrier_wait() because it might be possible for a second round of waiting to start before all threads have gotten all the way out of the first round waiting (so the process-shared memory block set up for the waitdata might not have been freed yet). So the name probably has to be based on things like process id, thread id, address of the barrier object, and an atomic counter.
I don't know whether or not there are security implications to having the name be 'guessable'. If so, some randomization needs to be added - no idea how much. Maybe you'd also need to hash the data mentioned above along with the random bits. Like I said, I really have no idea if this is important or not.
As far as I can see there is no need for pthread_barrier_destroy to be an immediate operation. You could have it wait until all threads that are still in their wakeup phase are woken up.
E.g. you could have an atomic counter awakening that is initially set to the number of threads that are woken up. Then it would be decremented as the last action before pthread_barrier_wait returns. pthread_barrier_destroy could then just spin until that counter falls to 0.

Resources