Single producer/consumer circular buffer which only blocks consumer

I'd like to implement a buffer with a single producer and a single consumer, where only the consumer may be blocked. An important detail here is that the producer can drop the update if the queue is full.
I've considered adapting a wait-free implementation, but at first glance there seems to be no easy way to notify the consumer that new data has arrived without losing notifications. So I settled on the very simple approach below, using a counting semaphore (some error-handling details omitted for clarity):
Object ar[SIZE];
int head = 0, tail = 0;
sem_t semItems; // initialized to 0

void enqueue(Object o) {
    int val;
    sem_getvalue(&semItems, &val);
    if (val < SIZE - 1) {
        ar[head] = o;
        head = (head + 1) % SIZE;
        sem_post(&semItems);
    }
    else {
        // dropped
    }
}

Object dequeue(void) {
    sem_wait(&semItems);
    Object o = ar[tail];
    tail = (tail + 1) % SIZE;
    return o;
}
Are there any safety issues with this code? I was surprised not to see an implementation like it anywhere in the popular literature. An additional question is whether sem_post() could ever block (it calls futex_wake() under the hood on Linux). Simpler solutions are also welcome, of course.
Edit: edited the code to leave a space between reader and writer (see Mayurk's response).

I can see one problem in this implementation. Consider the following sequence.
Assume the buffer is full but the consumer has not yet started, so (head=0, tail=0, sem_val=SIZE).
dequeue() is called from the consumer thread. sem_wait() succeeds, so at that instant (head=0, tail=0, sem_val=SIZE-1). The consumer starts reading ar[0].
Now there is a thread switch. enqueue() is called from the producer thread. sem_getvalue() would return SIZE-1, so the producer writes to ar[0] while the consumer may still be reading it.
Basically, I think you need mutex protection around the read and write operations. But adding a mutex might block the threads, so I am not sure whether you would get the expected behavior from this logic.
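One way to close this window without a mutex is to count free slots with a second semaphore, so the producer claims a slot before writing instead of peeking at the count with sem_getvalue(). A minimal sketch reusing the question's names (initialize semItems to 0 and semSpace to SIZE; error handling omitted):

Object ar[SIZE];
int head = 0, tail = 0;
sem_t semItems;  // counts filled slots, initialized to 0
sem_t semSpace;  // counts free slots, initialized to SIZE

void enqueue(Object o) {
    // sem_trywait() atomically claims a free slot or fails immediately,
    // so the producer still never blocks
    if (sem_trywait(&semSpace) == 0) {
        ar[head] = o;
        head = (head + 1) % SIZE;
        sem_post(&semItems);  // publish the item
    }
    else {
        // queue full: drop the update
    }
}

Object dequeue(void) {
    sem_wait(&semItems);      // blocks only the consumer, only when empty
    Object o = ar[tail];
    tail = (tail + 1) % SIZE;
    sem_post(&semSpace);      // release the slot only after reading it
    return o;
}

Because the consumer posts semSpace only after it has finished reading the slot, the producer can never overwrite an element that is still being read. As for the side question: POSIX requires sem_post() to be async-signal-safe, so it cannot block; on Linux the underlying futex_wake() merely wakes waiters.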

Is it possible to make a thread-safe queue with simultaneous push and pop?

This is more of a theoretical question than a useful one, but I cannot think of a way to do this. To be clear, the goal is to make a queue which allows multiple threads to both push and pop simultaneously without blocking each other.
An example of the goal is:
Thread A pushes twice. Then thread A and thread B call push while thread C and thread D call pop. Thread A gets queue location 2, thread B gets queue location 3, thread C gets queue location 0, and thread D gets queue location 1. All threads are able to read from/write to their respective locations at the same time.
These push and pop functions can do this, but are not thread-safe:
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
...

void push(void* item) {
    sem_wait(push_availability);
    int slot = atomic_fetch_add(tail, 1) % queue_size;
    pthread_mutex_lock(access_queue[slot]);
    queue[slot] = item;
    pthread_mutex_unlock(access_queue[slot]);
    sem_post(pop_availability);
}

void* pop() {
    sem_wait(pop_availability);
    int slot = atomic_fetch_add(head, 1) % queue_size;
    pthread_mutex_lock(access_queue[slot]);
    void* item = queue[slot];
    pthread_mutex_unlock(access_queue[slot]);
    sem_post(push_availability);
    return item;
}
head and tail are both initialized to 0, push_availability is initialized to the size of the queue, and pop_availability is initialized to 0. These are the only two functions that modify these variables. (I know that there is overflow potential with head/tail and that storing pointers in the array makes the thread-safety of the queue unimportant, but those problems are not important for this question.)
An example of the problem is:
Suppose that thread A and thread B call push and thread C calls pop. Thread A gets slot 0 but does not lock it, then thread B gets slot 1 and writes to it and posts that there is a space written to. Thread C wakes and gets slot 0, and attempts to read from it, but thread A has not written to it yet.
I could fix this problem by incrementing head/tail and writing/reading inside a mutex that prevented access to the whole queue by any other thread, but I want to know if it is possible to do this in a way that allows multiple threads to be writing to and reading from the queue at the same time.
This is a solution I am using:
void push(void* item) {
    sem_wait(push_availability);
    int slot = atomic_fetch_add(tail, 1) % queue_size;
    while (slot_full[slot] != false);  // wait for the previous reader of this slot
    queue[slot] = item;
    slot_full[slot] = true;
    sem_post(pop_availability);
}

void* pop() {
    sem_wait(pop_availability);
    int slot = atomic_fetch_add(head, 1) % queue_size;
    while (slot_full[slot] != true);   // wait for the writer of this slot
    void* item = queue[slot];
    slot_full[slot] = false;
    sem_post(push_availability);
    return item;
}
(slot_full[i] is initialized to false)
The while loop does not use a significant amount of time because the semaphores mostly handle waiting for the slot to open. (In testing, the while-loop condition was never true.) If the copy operation were expensive, then using a condition variable would be the way to go, but I think that would probably be more expensive on average than this solution.
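One caveat with the sketch above: if slot_full is a plain bool array, the spin loops and the flag stores form a data race under the C11 memory model, and the compiler is even free to hoist the load out of the loop. A hedged version of the same scheme using C11 atomics (QUEUE_SIZE and the declarations are assumptions, since the original declarations were elided):

#include <semaphore.h>
#include <stdatomic.h>

#define QUEUE_SIZE 64

void* queue[QUEUE_SIZE];
atomic_bool slot_full[QUEUE_SIZE];  // all initialized to false
atomic_int head, tail;              // both initialized to 0
sem_t push_availability;            // initialized to QUEUE_SIZE
sem_t pop_availability;             // initialized to 0

void push(void* item) {
    sem_wait(&push_availability);
    int slot = atomic_fetch_add(&tail, 1) % QUEUE_SIZE;
    // acquire: wait until the previous reader of this slot is done
    while (atomic_load_explicit(&slot_full[slot], memory_order_acquire))
        ;
    queue[slot] = item;
    // release: the item is published before the flag becomes visible
    atomic_store_explicit(&slot_full[slot], true, memory_order_release);
    sem_post(&pop_availability);
}

void* pop(void) {
    sem_wait(&pop_availability);
    int slot = atomic_fetch_add(&head, 1) % QUEUE_SIZE;
    while (!atomic_load_explicit(&slot_full[slot], memory_order_acquire))
        ;
    void* item = queue[slot];
    atomic_store_explicit(&slot_full[slot], false, memory_order_release);
    sem_post(&push_availability);
    return item;
}

The acquire/release pairing on slot_full guarantees that a reader that sees the flag set also sees the item stored into queue[slot].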
This paper was a useful reference for lock-free queues, along with mpoeter's comment.

How to rewrite GCD code as POSIX in C

This question is a follow-up to a question which turned out to be more complex than I had initially thought it would be. In the program I'm writing, the main thread takes care of GUI-driven data updates, a producer thread (with a number of sub-threads, because the producer task is "embarrassingly parallel") writes to the circular buffer, and the real-time consumer thread reads from it. The original development platform was OS X/Darwin, but I'd like to make the code more portable, i.e. UNIX source compatible. Everything can easily be written in POSIX, except for the following OS X-specific GCD call, for which I can't come up with a POSIX equivalent, if any exists. It launches the producer thread, from which its subthreads are launched programmatically, depending on the number of available logical CPU cores:
void dproducer (bool on, int cpuNum, uData* data)
{
    if (on == true)
    {
        data->state = starting;
        dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
            producerSum(on, cpuNum, data);
        });
    }
    return;
}
[Block diagram of the program omitted.]
For clarity I'm also adding the producerSum code. It's an infinite loop whose execution can either wait for the consumer thread to do the work, or get interrupted by changing data->state, which has global scope:
void producerSum(bool on, int cpuNum, uData* data)
{
    int rc;
    pthread_t threads[cpuNum]; // subthreads
    tData thread_args[cpuNum];
    void* resulT;
    static float frames[4096];

    while (on) {
        memset(frames, 0, 4096 * sizeof(float));
        if ((fbuffW = (float**)calloc(cpuNum + 1, sizeof(float*))) != NULL)
            for (int i = 0; i < cpuNum; ++i) {
                fbuffW[i] = (float*)calloc(data->frames, sizeof(float));
                thread_args[i].tid = i;           // ord. number of thread
                thread_args[i].cpuCount = cpuNum; // counter increment step
                thread_args[i].data = data;
                rc = pthread_create(&threads[i], NULL, producerTN, (void *)&thread_args[i]);
                if (rc != 0) printf("rc = %s\n", strerror(rc));
            }
        for (int i = 0; i < cpuNum; ++i) rc = pthread_join(threads[i], &resulT);

        // each subthread writes to fbuffW[i] and results get summed
        for (UInt32 samp = 0; samp < data->frames; samp++)
            for (int i = 0; i < cpuNum; i++)
                frames[samp] += fbuffW[i][samp];

        switch (data->state) { ... } // graceful interruption, resuming and termination mechanism
        { … } // simple code for copying frames[] to the circular buffer

        pthread_cond_wait(&cond, &mutex); // wait for the real-time consumer

        for (int i = 0; i < cpuNum; i++) free(fbuffW[i]);
        free(fbuffW);
    } // end while(on)
    return;
}
The syncing inside the producer thread is handled successfully by pthread_create() and pthread_join(), while the necessary coordination between the producer and consumer threads is handled successfully by a pthread_mutex_t variable and a pthread_cond_t variable (with the corresponding locking, unlocking, broadcasting and waiting calls). uData is a program-defined struct (or class instance). Any direction on where to look would help indeed.
Thanks for reading this post!
A dispatch queue is just what it sounds like: a queue, as in the standard FIFO list data structure. It holds tasks. Those tasks can be represented by Objective-C Blocks as in your code or by function pointers and context pointer values. You'll presumably need to avoid Blocks if you're aiming for cross-platform compatibility. In fact, since you only ever dispatch one task, your tasks can just encapsulate the parameters (on, cpuNum, and data) and not the code (the call to producerSum()).
The queues are serviced by threads from a thread pool. GCD manages the threads and pool. At least on OS X, there's integration with the OS such that the pool's size is governed by overall system load, which you won't be able to reproduce in a cross-platform manner.
Operations on a dispatch queue are thread-safe. This includes adding tasks to them and the worker threads removing tasks from them.
You're going to have to implement all of this. It's definitely possible, but it will be a bother. In many ways, the queues and the thread pool constitute a producer-consumer architecture. Basically, your GCD-based solution was a bit of a cheat because you just used a producer-consumer API to implement your producer-consumer design. Now, you're going to have to really implement a producer-consumer design without the crutch.
There's basically no more to it than the thread-creation and POSIX condition variables you're already using.
dispatch_async() is basically just locking the mutex for the queue of tasks, adding the task to the queue, signalling the condition variable, and unlocking the mutex. Each worker thread will just wait on the condition variable and, when it wakes, lock the mutex, pop a task off the queue if there's one, unlock the mutex, and run the task if it got one. You probably also need a mechanism to signal the worker thread that it's time to gracefully terminate.
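A minimal sketch of that mechanism, assuming a single global queue and illustrative names (error handling, pool sizing, and worker startup/join are omitted):

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct task {
    void (*fn)(void *);   // replaces the Block: the code to run...
    void *arg;            // ...plus its captured context
    struct task *next;
} task_t;

static task_t *q_head = NULL, *q_tail = NULL;
static bool shutting_down = false;
static pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond  = PTHREAD_COND_INITIALIZER;

// the POSIX counterpart of dispatch_async(): enqueue and signal
void queue_async(void (*fn)(void *), void *arg) {
    task_t *t = malloc(sizeof *t);
    t->fn = fn; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&q_mutex);
    if (q_tail) q_tail->next = t; else q_head = t;
    q_tail = t;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_mutex);
}

// each pool thread runs this: wait, pop, run, repeat
void *worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_mutex);
        while (q_head == NULL && !shutting_down)
            pthread_cond_wait(&q_cond, &q_mutex);
        if (q_head == NULL) {             // woken only to terminate
            pthread_mutex_unlock(&q_mutex);
            return NULL;
        }
        task_t *t = q_head;
        q_head = t->next;
        if (q_head == NULL) q_tail = NULL;
        pthread_mutex_unlock(&q_mutex);
        t->fn(t->arg);                    // run the task outside the lock
        free(t);
    }
}

To shut down gracefully, set shutting_down to true under the mutex and pthread_cond_broadcast() the condition variable; each worker then drains out once the queue is empty.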

Signalling to threads waiting at a lock that the lock has become irrelevant

I have a hash table implementation in C where each location in the table is a linked list (to handle collisions). These linked lists are inherently thread-safe, so no additional thread-safety code needs to be written at the hash table level if the table is a constant size - the hash table is thread-safe.
However, I would like the hash table to dynamically expand as values are added, so as to maintain a reasonable access time. For the table to expand, though, it needs additional thread-safety.
For the purposes of this question, procedures which can safely occur concurrently are 'benign' and the table-resizing procedure (which cannot occur concurrently) is 'critical'. Threads currently using the list are known as 'users'.
My first solution was to add 'preamble' and 'postamble' code to all the critical functions, which locks a mutex and then waits until there are no current users before proceeding. Then I added preamble and postamble code to the benign functions to check whether a critical function was waiting, and if so to wait at the same mutex until the critical section is done.
In pseudocode the pre/post-amble functions SHOULD look like:
benignPreamble(table) {
    if (table->criticalIsRunning) {
        waitUntilSignal;
    }
    incrementUserCount(table);
}

benignPostamble(table) {
    decrementUserCount(table);
}

criticalPreamble(table) {
    table->criticalIsRunning = YES;
    waitUntilZero(table->users);
}

criticalPostamble(table) {
    table->criticalIsRunning = NO;
    signalCriticalDone();
}
My actual code is shown at the bottom of this question and uses (perhaps unnecessarily) caf's PriorityLock from this SO question. My implementation, quite frankly, smells awful. What is a better way to handle this situation? At the moment I'm looking for a way to signal to a mutex that it is irrelevant and 'unlock all waiting threads' simultaneously, but I keep thinking there must be a simpler way. I am trying to code it in such a way that any thread-safety mechanisms are 'ignored' if the critical process is not running.
Current Code
void startBenign(HashTable *table) {
    // Ignores if critical process can't be running (users >= 1)
    if (table->users == 0) {
        // Blocks if critical process is running
        PriorityLockLockLow(&(table->lock));
        PriorityLockUnlockLow(&(table->lock));
    }
    __sync_add_and_fetch(&(table->users), 1);
}

void endBenign(HashTable *table) {
    // Decrement user count (baseline is 1)
    __sync_sub_and_fetch(&(table->users), 1);
}

int startCritical(HashTable *table) {
    // Get the lock
    PriorityLockLockHigh(&(table->lock));
    // Decrement user count BELOW baseline (1) to hit zero eventually
    __sync_sub_and_fetch(&(table->users), 1);
    // Wait for all concurrent threads to finish
    while (table->users != 0) {
        usleep(1);
    }
    // Once we have zero users (any new ones will be
    // held at the lock) we can proceed.
    return 0;
}

void endCritical(HashTable *table) {
    // Increment back to baseline of 1
    __sync_add_and_fetch(&(table->users), 1);
    // Unlock
    PriorityLockUnlockHigh(&(table->lock));
}
It looks like you're trying to reinvent the reader-writer lock, which I believe pthreads provides as a primitive. Have you tried using that?
More specifically, your benign functions should be taking a "reader" lock, while your critical functions need a "writer" lock. The end result will be that as many benign functions can execute as desired, but when a critical function starts executing it will wait until no benign functions are in process, and will block additional benign functions until it has finished. I think this is what you want.
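A hedged sketch of what that looks like, assuming the lock is simply embedded in the table (field names are illustrative):

#include <pthread.h>

typedef struct {
    pthread_rwlock_t lock;
    /* ... buckets, counts, etc. ... */
} HashTable;

void startBenign(HashTable *table) {
    pthread_rwlock_rdlock(&table->lock);  // any number of readers may hold this
}

void endBenign(HashTable *table) {
    pthread_rwlock_unlock(&table->lock);
}

int startCritical(HashTable *table) {
    // blocks until all current readers leave; readers arriving
    // afterwards are blocked until the writer unlocks
    return pthread_rwlock_wrlock(&table->lock);
}

void endCritical(HashTable *table) {
    pthread_rwlock_unlock(&table->lock);
}

This removes the user counter, the busy-wait, and the PriorityLock entirely. Whether readers can starve a waiting writer is implementation-defined, though glibc offers a nonportable attribute (pthread_rwlockattr_setkind_np()) to request writer preference.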

Concurrent Queue, C

So, I am trying to implement a concurrent queue in C. I have split the methods into "read methods" and "write methods". So, when accessing the write methods, like push() and pop(), I acquire a writer lock. And the same for the read methods. Also, we can have several readers but only one writer.
In order to get this to work in code, I have a mutex lock for the entire queue, and two condition variables - one for the writer and the other for the reader. I also have two integers keeping track of the number of readers and writers currently using the queue.
So my main question is - how to implement several readers accessing the read methods at the same time?
At the moment this is my general read-method code (in pseudocode - not C; I am actually using pthreads):
mutex.lock();
while (nwriter > 0) {
    wait(&reader);
    mutex.unlock();
}
nreader++;
// Critical code
nreader--;
if (nreader == 0) {
    signal(&writer);
}
mutex.unlock();
So, imagine we have a reader which holds the mutex. Now any other reader which comes along, and tries to get the mutex, would not be able to. Wouldn't it block? Then how are many readers accessing the read methods at the same time?
Is my reasoning correct? If yes, how to solve the problem?
If this is not for an exercise, use the read-write lock from pthreads (the pthread_rwlock_* functions).
Also note that protecting individual calls with a lock still might not provide the necessary correctness guarantees. For example, typical code for popping an element from an STL queue is
if( !queue.empty() ) {
    data = queue.front();
    queue.pop();
}
And this will fail in concurrent code even if locks are used inside the queue methods, because conceptually this code must be an atomic transaction, but the implementation does not provide such guarantees. A thread may pop a different element than the one it read with front(), or attempt to pop from an empty queue, etc.
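The usual fix is to make the test and the removal a single critical section. In C terms, against a hypothetical queue API (Queue, q_empty(), q_front(), and q_pop() are illustrative, not from any real library):

#include <pthread.h>
#include <stdbool.h>

// pops the front element into *out if the queue is non-empty;
// the emptiness test and the removal happen under one lock, so no
// other thread can interleave between them
bool try_dequeue(Queue *q, void **out) {
    bool ok = false;
    pthread_mutex_lock(&q->mutex);
    if (!q_empty(q)) {
        *out = q_front(q);
        q_pop(q);
        ok = true;
    }
    pthread_mutex_unlock(&q->mutex);
    return ok;
}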
Please find the following read/write functions.
In my functions, I used canRead and canWrite mutexes, and nReaders for the number of readers:
Write function:

lock(canWrite)       // Wait if mutex is not free
// Write
unlock(canWrite)

Read function:

lock(canRead)        // This mutex protects nReaders
nReaders++           // Initial value should be 0 (no readers)
if (nReaders == 1)   // No other readers
{
    lock(canWrite)   // No writers can enter the critical section
}
unlock(canRead)

// Read

lock(canRead)
nReaders--;
if (nReaders == 0)   // No more readers
{
    unlock(canWrite) // Writer can enter the critical section
}
unlock(canRead)
A classic solution is multiple-readers, single-writer.
A data structure begins with no readers and no writers.
You permit any number of concurrent readers.
When a writer comes along, you block him till all current readers complete; then you let him go (any new readers and writers that come along while the writer is blocked queue up behind him, in order).
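pthread_rwlock_t gives you exactly this pattern, although the reader/writer scheduling policy is implementation-defined by default. With glibc, the described behavior - new readers queue behind a waiting writer - can be requested through a nonportable attribute; a sketch:

#define _GNU_SOURCE
#include <pthread.h>

pthread_rwlock_t rw;

void init_rw(void) {
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    // glibc extension: once a writer is waiting, new readers queue up
    // behind it instead of overtaking it, so writers cannot starve
    // (note: PTHREAD_RWLOCK_PREFER_WRITER_NP does NOT provide this)
    pthread_rwlockattr_setkind_np(&attr,
        PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&rw, &attr);
    pthread_rwlockattr_destroy(&attr);
}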
You may try this library; it is written in native C, lock-free, and suitable for cross-platform use: lfqueue.
For example:
int* int_data;
lfqueue_t my_queue;

if (lfqueue_init(&my_queue) == -1)
    return -1;

/** Wrap this scope in other threads **/
int_data = (int*) malloc(sizeof(int));
assert(int_data != NULL);
*int_data = i++;
/* Enqueue */
while (lfqueue_enq(&my_queue, int_data) == -1) {
    printf("ENQ Full ?\n");
}

/** Wrap this scope in other threads **/
/* Dequeue */
while ( (int_data = lfqueue_deq(&my_queue)) == NULL) {
    printf("DEQ EMPTY ..\n");
}
// printf("%d\n", *(int*) int_data );
free(int_data);
/** End **/

lfqueue_destroy(&my_queue);

How can barriers be destroyable as soon as pthread_barrier_wait returns?

This question is based on:
When is it safe to destroy a pthread barrier?
and the recent glibc bug report:
http://sourceware.org/bugzilla/show_bug.cgi?id=12674
I'm not sure about the semaphores issue reported in glibc, but presumably it's supposed to be valid to destroy a barrier as soon as pthread_barrier_wait returns, as per the above linked question. (Normally, the thread that got PTHREAD_BARRIER_SERIAL_THREAD, or a "special" thread that already considered itself "responsible" for the barrier object, would be the one to destroy it.) The main use case I can think of is when a barrier is used to synchronize a new thread's use of data on the creating thread's stack, preventing the creating thread from returning until the new thread gets to use the data; other barriers probably have a lifetime equal to that of the whole program, or controlled by some other synchronization object.
In any case, how can an implementation ensure that destruction of the barrier (and possibly even unmapping of the memory it resides in) is safe as soon as pthread_barrier_wait returns in any thread? It seems the other threads that have not yet returned would need to examine at least some part of the barrier object to finish their work and return, much like how, in the glibc bug report cited above, sem_post has to examine the waiters count after having adjusted the semaphore value.
I'm going to take another crack at this with an example implementation of pthread_barrier_wait() that uses mutex and condition variable functionality as might be provided by a pthreads implementation. Note that this example doesn't try to deal with performance considerations (specifically, when the waiting threads are unblocked, they are all re-serialized when exiting the wait). I think that using something like Linux Futex objects could help with the performance issues, but Futexes are still pretty much out of my experience.
Also, I doubt that this example handles signals or errors correctly (if at all in the case of signals). But I think proper support for those things can be added as an exercise for the reader.
My main fear is that the example may have a race condition or deadlock (the mutex handling is more complex than I like). Also note that it is an example that hasn't even been compiled. Treat it as pseudo-code. Also keep in mind that my experience is mainly in Windows - I'm tackling this more as an educational opportunity than anything else. So the quality of the pseudo-code may well be pretty low.
However, disclaimers aside, I think it may give an idea of how the problem asked in the question could be handled (i.e., how the pthread_barrier_wait() function can allow the pthread_barrier_t object it uses to be destroyed by any of the released threads without danger of one or more threads using the barrier object on their way out).
Here goes:
/*
* Since this is a part of the implementation of the pthread API, it uses
* reserved names that start with "__" for internal structures and functions
*
* Functions such as __mutex_lock() and __cond_wait() perform the same function
* as the corresponding pthread API.
*/
// struct __barrier_waitdata is intended to hold all the data
// that `pthread_barrier_wait()` will need after releasing
// waiting threads. This will allow the function to avoid
// touching the passed-in pthread_barrier_t object after
// the wait is satisfied (since any of the released threads
// can destroy it)
struct __barrier_waitdata {
    struct __mutex cond_mutex;
    struct __cond cond;
    unsigned waiter_count;
    int wait_complete;
};

struct __barrier {
    unsigned count;
    struct __mutex waitdata_mutex;
    struct __barrier_waitdata* pwaitdata;
};

typedef struct __barrier pthread_barrier_t;
int __barrier_waitdata_init( struct __barrier_waitdata* pwaitdata)
{
    int rc;

    pwaitdata->waiter_count = 0;
    pwaitdata->wait_complete = 0;

    rc = __mutex_init( &pwaitdata->cond_mutex, NULL);
    if (rc) {
        return rc;
    }

    rc = __cond_init( &pwaitdata->cond, NULL);
    if (rc) {
        __mutex_destroy( &pwaitdata->cond_mutex);
        return rc;
    }

    return 0;
}
int pthread_barrier_init(pthread_barrier_t *barrier, const pthread_barrierattr_t *attr, unsigned int count)
{
    int rc;

    rc = __mutex_init( &barrier->waitdata_mutex, NULL);
    if (rc) return rc;

    barrier->pwaitdata = NULL;
    barrier->count = count;

    //TODO: deal with attr

    return 0;
}
int pthread_barrier_wait(pthread_barrier_t *barrier)
{
    int rc;
    struct __barrier_waitdata* pwaitdata;
    unsigned target_count;

    // potential waitdata block (only one thread's will actually be used)
    struct __barrier_waitdata waitdata;

    // nothing to do if we only need to wait for one thread...
    if (barrier->count == 1) return PTHREAD_BARRIER_SERIAL_THREAD;

    rc = __mutex_lock( &barrier->waitdata_mutex);
    if (rc) return rc;

    if (!barrier->pwaitdata) {
        // no other thread has claimed the waitdata block yet -
        // we'll use this thread's
        rc = __barrier_waitdata_init( &waitdata);
        if (rc) {
            __mutex_unlock( &barrier->waitdata_mutex);
            return rc;
        }

        barrier->pwaitdata = &waitdata;
    }

    pwaitdata = barrier->pwaitdata;
    target_count = barrier->count;

    // all data necessary for handling the return from a wait is pointed to
    // by `pwaitdata`, and `pwaitdata` points to a block of data on the stack of
    // one of the waiting threads. We have to make sure that the thread that owns
    // that block waits until all others have finished with the information
    // pointed to by `pwaitdata` before it returns. However, after the 'big' wait
    // is completed, the `pthread_barrier_t` object that's passed into this
    // function isn't used. The last operation done to `*barrier` is to set
    // `barrier->pwaitdata = NULL` to satisfy the requirement that this function
    // leaves `*barrier` in a state as if `pthread_barrier_init()` had been
    // called - and that operation is done by the thread that signals the wait
    // condition completion before the completion is signaled.

    // note: we're still holding `barrier->waitdata_mutex`;

    rc = __mutex_lock( &pwaitdata->cond_mutex);

    pwaitdata->waiter_count += 1;

    if (pwaitdata->waiter_count < target_count) {
        // need to wait for other threads
        __mutex_unlock( &barrier->waitdata_mutex);

        do {
            // TODO: handle the return code from `__cond_wait()` to break out of this
            // if a signal makes that necessary
            __cond_wait( &pwaitdata->cond, &pwaitdata->cond_mutex);
        } while (!pwaitdata->wait_complete);
    }
    else {
        // this thread satisfies the wait - unblock all the other waiters
        pwaitdata->wait_complete = 1;

        // 'release' our use of the passed in pthread_barrier_t object
        barrier->pwaitdata = NULL;

        // unlock the barrier's waitdata_mutex - the barrier is
        // ready for use by another set of threads
        __mutex_unlock( &barrier->waitdata_mutex);

        // finally, unblock the waiting threads
        __cond_broadcast( &pwaitdata->cond);
    }

    // at this point, barrier->waitdata_mutex is unlocked, the
    // barrier->pwaitdata pointer has been cleared, and no further
    // use of `*barrier` is permitted...

    // however, each thread still has a valid `pwaitdata` pointer - the
    // thread that owns that block needs to wait until all others have
    // dropped the pwaitdata->waiter_count

    // also, at this point the `pwaitdata->cond_mutex` is locked, so
    // we're in a critical section

    rc = 0;
    pwaitdata->waiter_count--;
    if (pwaitdata == &waitdata) {
        // this thread owns the waitdata block - it needs to hang around until
        // all other threads are done

        // as a convenience, this thread will be the one that returns
        // PTHREAD_BARRIER_SERIAL_THREAD
        rc = PTHREAD_BARRIER_SERIAL_THREAD;

        while (pwaitdata->waiter_count != 0) {
            __cond_wait( &pwaitdata->cond, &pwaitdata->cond_mutex);
        }

        __mutex_unlock( &pwaitdata->cond_mutex);
        __cond_destroy( &pwaitdata->cond);
        __mutex_destroy( &pwaitdata->cond_mutex);
    }
    else {
        // every non-owner must release the mutex on its way out; only the
        // last one out needs to wake the owner
        if (pwaitdata->waiter_count == 0) {
            __cond_signal( &pwaitdata->cond);
        }
        __mutex_unlock( &pwaitdata->cond_mutex);
    }

    return rc;
}
17 July 2011: Update in response to a comment/question about process-shared barriers
I forgot completely about the situation with barriers that are shared between processes. And as you mention, the idea I outlined will fail horribly in that case. I don't really have experience with POSIX shared memory use, so any suggestions I make should be tempered with scepticism.
To summarize (for my benefit, if no one else's):
When any of the threads gets control after pthread_barrier_wait() returns, the barrier object needs to be in the 'init' state (however the most recent pthread_barrier_init() on that object set it). Also implied by the API is that once any of the threads return, one or more of the following things could occur:
another call to pthread_barrier_wait() to start a new round of synchronization of threads
pthread_barrier_destroy() on the barrier object
the memory allocated for the barrier object could be freed or unshared if it's in a shared memory region.
These things mean that before the pthread_barrier_wait() call allows any thread to return, it pretty much needs to ensure that all waiting threads are no longer using the barrier object in the context of that call. My first answer addressed this by creating a 'local' set of synchronization objects (a mutex and an associated condition variable) outside of the barrier object that would block all the threads. These local synchronization objects were allocated on the stack of the thread that happened to call pthread_barrier_wait() first.
I think that something similar would need to be done for barriers that are process-shared. However, in that case simply allocating those sync objects on a thread's stack isn't adequate (since the other processes would have no access). For a process-shared barrier, those objects would have to be allocated in process-shared memory. I think the technique I listed above could be applied similarly:
the waitdata_mutex that controls the 'allocation' of the local sync variables (the waitdata block) would be in process-shared memory already, by virtue of it being in the barrier struct. Of course, when the barrier is set to PTHREAD_PROCESS_SHARED, that attribute would also need to be applied to the waitdata_mutex
when __barrier_waitdata_init() is called to initialize the local mutex & condition variable, it would have to allocate those objects in shared memory instead of simply using the stack-based waitdata variable.
when the 'cleanup' thread destroys the mutex and the condition variable in the waitdata block, it would also need to clean up the process-shared memory allocation for the block.
in the case where shared memory is used, there needs to be some mechanism to ensure that the shared memory object is opened at least once in each process, and closed the correct number of times in each process (but not closed entirely before every thread in the process is finished using it). I haven't thought through exactly how that would be done...
I think these changes would allow the scheme to operate with process-shared barriers. The last bullet point above is a key item to figure out. Another is how to construct a name for the shared memory object that will hold the 'local' process-shared waitdata. There are certain attributes you'd want for that name:
you'd want the storage for the name to reside in the struct pthread_barrier_t structure so all processes have access to it; that means a known limit to the length of the name
you'd want the name to be unique to each 'instance' of a set of calls to pthread_barrier_wait(), because it might be possible for a second round of waiting to start before all threads have gotten all the way out of the first round of waiting (so the process-shared memory block set up for the waitdata might not have been freed yet). So the name probably has to be based on things like process id, thread id, address of the barrier object, and an atomic counter.
I don't know whether or not there are security implications to having the name be 'guessable'. If so, some randomization needs to be added - no idea how much. Maybe you'd also need to hash the data mentioned above along with the random bits. Like I said, I really have no idea whether this is important or not.
As far as I can see, there is no need for pthread_barrier_destroy to be an immediate operation. You could have it wait until all threads that are still in their wakeup phase have woken up.
E.g., you could have an atomic counter awakening that is initially set to the number of threads to be woken up. It would then be decremented as the last action before pthread_barrier_wait returns. pthread_barrier_destroy could then just spin until that counter falls to 0.
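A minimal sketch of that counter with C11 atomics (the types and names are illustrative, not the real pthread internals):

#include <stdatomic.h>
#include <sched.h>

typedef struct {
    unsigned count;        // number of threads the barrier waits for
    atomic_uint awakening; // threads still inside their wakeup phase
    /* ... the rest of the barrier state ... */
} barrier_t;

// inside barrier_wait(), the releasing thread would do
//     atomic_store(&b->awakening, b->count);
// before waking everyone, and each woken thread's very last action
// before returning is:
static inline void wakeup_done(barrier_t *b) {
    atomic_fetch_sub_explicit(&b->awakening, 1, memory_order_release);
}

int barrier_destroy(barrier_t *b) {
    // defer destruction until every waiter has performed its final
    // access to *b; after that, tearing it down is safe
    while (atomic_load_explicit(&b->awakening, memory_order_acquire) != 0)
        sched_yield();  // a futex wait could replace this spin
    /* release the barrier's real resources here */
    return 0;
}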
