Why is notify required inside a critical section? - c

I'm reading this book here (official link, it's free) to understand threads and parallel programming.
Here's the question.
Why does the book say that pthread_cond_signal must be done with a lock held to prevent data race? I wasn't sure, so I referred to this question (and this question too), which basically said "no, it's not required". Why would a race condition occur?
What and where is the race condition being described?
The code and passage in question is as follows.
...
The code to wake a thread, which would run in some other thread, looks like this:
pthread_mutex_lock(&lock);
ready = 1;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);
A few things to note about this code sequence. First, when signaling (as well as when modifying the global variable ready), we always make sure to have the lock held. This ensures that we don’t accidentally introduce a race condition into our code.
...
(please refer to the free, official pdf to get context.)
I couldn't comment with a small question in the link-2, so here is a full question.
Edit 1: I understand the lock is to control access to the ready variable. I am wondering why there's a race condition associated with the signaling. Specifically,
First, when signaling [...] we always make sure to have the lock held. This ensures that we don’t accidentally introduce a race condition into our code
Edit 2: I've seen resources and comments (from links commented below and during my own research), sometimes within the same page, that say either that it doesn't matter or that you must signal within the lock for Predictable Behavior™ (it would be nice if this could be touched upon too, i.e. whether the behavior can be anything other than spurious wakeups). Which should I follow?
Edit 3: I'm looking for more of a 'theoretical' answer, not an implementation-specific one, so that I can understand the core idea. I understand that answers to these questions can be platform specific, but I would prefer an answer that focuses on the core ideas of locks, mutexes, and condition variables, since all implementations must follow these semantics, perhaps adding their own little quirks. For example, wait() can wake up spuriously, and given bad timing of signaling, this can happen on 'pure' implementations too. Mentioning such things would help.
My apologies for so many edits, but my dearth of in-depth knowledge in this field is confusing the heck outta me.
Any insight would be really helpful, thanks. Also, please feel free to point me to books where I can read these concepts in detail, and where I can learn C++ with these concepts too. Thanks.

Why does the book say that pthread_cond_signal must be done with a lock held to prevent data race? I wasn't sure, so I referred to this
question (and this question too), which basically said "no, it's not
required". Why would a race condition occur?
The book not presenting a complete example, my best guess as to the intended meaning is that there can be a data race with the CV itself if it is signaled without the associated mutex being held. That may be the case for some CV implementations, but the book is talking specifically about pthreads, and pthreads CVs are not subject to such a limitation. Neither is C++ std::condition_variable, which is what the two other SO questions you referred to are talking about. So in that sense, the book is just wrong.
It is true that one can compose examples of poor CV use in which signaling under protection of the associated mutex largely protects against data races, whereas signaling without such protection is susceptible to them. But in such a case, the fault is not with the signaling itself, but with the waiting, and if that's what the book means then it is deceptively worded. And probably still wrong.
What and where is the race condition being described?
One can only guess what the author had in mind.
For the record, the proper usage of condition variables involves firstly determining what condition one wants to ensure holds before execution proceeds. That condition will necessarily involve shared variables, else there is no reason to expect that anything another thread does could change whether the condition is satisfied. That being the case, all access to the shared variables involved needs to be protected by a mutex if more than one thread is alive.
That mutex should then, secondly, also be the one associated with the CV, and threads must wait on the CV only while the mutex is held. This is a requirement of every CV implementation I know, and it protects against signals being missed and possible deadlock resulting from that. Consider this faulty, and somewhat contrived, example:
// BAD
int temp;
result = pthread_mutex_lock(m);
// handle failure results ...
temp = shared;
result = pthread_mutex_unlock(m);
// handle failure results ...
if (temp == 0) {
    result = pthread_cond_wait(cv, m);
    // handle failure results ...
}
// do something ...
Suppose that it was allowed to wait on the CV without holding the mutex, as that code does. That code supposes that at some point in the future, some other thread (T2) will update shared (under protection of the mutex) and then signal the CV to tell the waiting one (T1) that it can proceed. But what if T2 does that between when T1 unlocks the mutex and when it begins its wait? It doesn't matter whether T2 signals the CV under protection of the mutex or not -- T1 will begin a wait for a signal that has already been delivered. And CV signals do not queue.
So suppose that T1 only waits under protection of the mutex, as is in fact required. That's not enough. Consider this:
// ALSO BAD
result = pthread_mutex_lock(m);
// handle failure results ...
if (shared == 0) {
    result = pthread_cond_wait(cv, m);
    // handle failure results ...
}
result = pthread_mutex_unlock(m);
// handle failure results ...
// do something ...
This is still wrong, because it does not reliably prevent T1 from proceeding past the wait when the condition of interest is unsatisfied. Such a scenario can arise from
the signal being legitimately sent and received even though the particular condition of interest to T1 is not satisfied
the signal being legitimately sent and received, and the condition being satisfied when the signal is sent, but T2 or another thread modifying the shared variable again before T1 returns from its wait.
spurious return from the wait, which is very rare, but does occasionally happen in many real-world implementations.
None of that depends on T2 sending the signal without mutex protection.
The correct way to wait on a condition variable is to check the condition of interest before waiting, and afterward to loop back and check again before proceeding:
// OK
result = pthread_mutex_lock(m);
// handle failure results ...
while (shared == 0) { // <-- 'while', not 'if'
    result = pthread_cond_wait(cv, m);
    // handle failure results ...
}
// typically, the waiter resets the condition at this point (e.g. shared = 0)
result = pthread_mutex_unlock(m);
// handle failure results ...
// do something ...
It may sometimes be the case that thread T1 executing that code will return from its wait when the condition is not satisfied, but if ever it does then it will simply return to waiting instead of proceeding when it shouldn't. If other threads signal only under protection of the mutex then that should be rare, but still possible. If other threads signal without mutex protection then T1 may wake more often than strictly needed, but there is no data race involved, and no inherent risk of misbehavior.
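For reference, the waiting pattern above can be paired with a signaling thread in a small self-contained program. This is only a sketch under some assumptions: the names waiter, signaler, and run_demo are made up for illustration, the value 42 is arbitrary, and error handling is reduced to a couple of assertions. The signaler deliberately signals after unlocking, to show that the predicate loop alone is enough to avoid a missed wakeup.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int shared = 0;  /* the condition of interest is: shared != 0 */

/* T1: waits until shared becomes nonzero, then consumes it. */
static void *waiter(void *arg)
{
    int *seen = arg;
    int rc = pthread_mutex_lock(&m);
    assert(rc == 0);
    while (shared == 0) {           /* re-check after every wakeup */
        rc = pthread_cond_wait(&cv, &m);
        assert(rc == 0);
    }
    *seen = shared;
    shared = 0;                     /* reset the condition */
    pthread_mutex_unlock(&m);
    return NULL;
}

/* T2: updates shared under the mutex, then signals after unlocking. */
static void *signaler(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    shared = 42;
    pthread_mutex_unlock(&m);
    pthread_cond_signal(&cv);       /* POSIX allows this without the mutex */
    return NULL;
}

/* Runs one waiter and one signaler; returns the value the waiter saw. */
int run_demo(void)
{
    pthread_t t1, t2;
    int seen = 0;
    pthread_create(&t1, NULL, waiter, &seen);
    pthread_create(&t2, NULL, signaler, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return seen;
}
```

Whichever thread runs first, the waiter either sees shared already set and skips the wait, or is already blocked in pthread_cond_wait when the signal arrives, because it holds the mutex from its check until the wait begins.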

Why does the book say that pthread_cond_signal must be done with a lock held to prevent data race? I wasn't sure, so I referred to this question (and this question too), which basically said "no, it's not required". Why would a race condition occur?
Yes, condition variable notification should generally be performed with the corresponding mutex locked. The reason is not so much to avoid a race condition but to avoid a missed or superfluous notification.
Consider the following piece of code:
std::queue< int > events;
std::mutex mutex;
std::condition_variable cond;
// Thread 1
void consume_events()
{
    std::unique_lock< std::mutex > lock(mutex); // #1
    while (true)
    {
        if (events.empty()) // #2
        {
            cond.wait(lock); // #3
            continue;
        }
        // Process an event
        events.pop();
    }
}
// Thread 2
void produce_event(int event)
{
    {
        std::unique_lock< std::mutex > lock(mutex); // #4
        events.push(event); // #5
    } // #6
    cond.notify_one(); // #7
}
This is a classical example of one producer/one consumer queue of data.
In line #1 the consumer (Thread 1) locks the mutex. Then, in line #2, it tests whether there are any events in the queue and, if there are none, in line #3 it unlocks the mutex and blocks. When a notification on the condition variable happens, the thread unblocks, re-acquires the mutex and continues execution past line #3 (that is, it goes back to line #2).
In line #4 the producer (Thread 2) locks the mutex and in line #5 it enqueues a new event. Because the mutex is locked, the event queue modification is safe (line #5 cannot be executed concurrently with line #2), so there is no data race. Then, in line #6, the mutex is unlocked and in line #7 the condition variable is notified.
It is possible that the following happens:
Thread 2 acquires the mutex in line #4.
Thread 1 attempts to acquire the mutex in line #1 or #3 (upon being unblocked by a previous notification). Since the mutex is locked by Thread 2, Thread 1 blocks.
Thread 2 enqueues the event in line #5 and unlocks the mutex in line #6.
Thread 1 unblocks and acquires the mutex. In line #2 it sees that the event queue is not empty and processes the event. On the next loop iteration the queue is empty and the thread blocks in line #3.
Thread 2 notifies Thread 1 in line #7. But there are no queued events, and Thread 1 wakes up in vain.
Though in this particular example, the extra wake up is benign, depending on the loop contents, it may be detrimental. The correct code should call notify_one before unlocking the mutex.
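As a rough illustration of notifying before unlocking, here is a pthreads sketch in C of the same producer/consumer idea. The fixed-size ring buffer, QUEUE_CAP, consume_event, consumer, and run_queue_demo are assumptions made for the example and are not part of the code above; a full queue is simply treated as a programming error.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 16

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int events[QUEUE_CAP];       /* ring buffer of pending events */
static size_t head = 0, count = 0;

void produce_event(int event)
{
    pthread_mutex_lock(&mutex);
    assert(count < QUEUE_CAP);      /* sketch: a full queue is not handled */
    events[(head + count) % QUEUE_CAP] = event;
    count++;
    pthread_cond_signal(&cond);     /* notify while still holding the mutex */
    pthread_mutex_unlock(&mutex);
}

int consume_event(void)
{
    pthread_mutex_lock(&mutex);
    while (count == 0)              /* predicate loop */
        pthread_cond_wait(&cond, &mutex);
    int event = events[head];
    head = (head + 1) % QUEUE_CAP;
    count--;
    pthread_mutex_unlock(&mutex);
    return event;
}

/* Consumes three events and adds them up. */
static void *consumer(void *arg)
{
    long *sum = arg;
    for (int i = 0; i < 3; i++)
        *sum += consume_event();
    return NULL;
}

long run_queue_demo(void)
{
    pthread_t t;
    long sum = 0;
    pthread_create(&t, NULL, consumer, &sum);
    produce_event(1);
    produce_event(2);
    produce_event(3);
    pthread_join(t, NULL);
    return sum;
}
```

With the signal issued inside the critical section, the consumer cannot drain the queue between the push and the notification, which is the interleaving that produced the wakeup in vain.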
Another example is when one thread is used to initiate some work in the other thread without an explicit queue of events:
std::mutex mutex;
std::condition_variable cond;
// Thread 1
void process_work()
{
    std::unique_lock< std::mutex > lock(mutex); // #1
    while (true)
    {
        cond.wait(lock); // #2
        // Do some processing // #3
    }
}
// Thread 2
void initiate_work_processing()
{
    cond.notify_one(); // #4
}
In this case Thread 1 waits until it is time to perform some activity (e.g. render a frame in a video game). Thread 2 periodically initiates that activity by notifying Thread 1 via condition variable.
The problem is that the condition variable does not buffer notifications and acts only on the threads that are actually blocked on it at the point of notification. If there are no threads blocked then the notification does nothing. This means that the following sequence of events is possible:
Thread 1 acquires the mutex in line #1 and blocks in line #2.
Thread 2 decides it is time to perform the periodic activity and notifies Thread 1 in line #4.
Thread 1 unblocks and goes to perform the activities (e.g. render a frame).
It turns out that this frame is a lot of work, and when Thread 2 comes to notify Thread 1 about the next frame in line #4, Thread 1 is still busy with the previous one. This notification gets missed.
Thread 1 is finally done with the frame and blocks in line #2. The user observes a frame dropped.
The above wouldn't have happened if Thread 2 locked the mutex before notifying Thread 1 in line #4. If Thread 1 is still busy rendering a frame (and therefore still holding the mutex), Thread 2 would block until Thread 1 is done and is waiting again in line #2, and only then issue the notification.
However, the correct solution for the above task is to introduce a flag or some other data protected by the mutex that Thread 2 can use to signal Thread 1 that it is time to perform its activities. Aside from fixing the missed notification problem, this also takes care of spurious wakeups.
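A minimal sketch of that flag-based approach, written with pthreads in C, might look as follows. The pending counter, the stop flag, request_stop, and run_work_demo are hypothetical names introduced for the example. Unlike the code above, the worker releases the mutex while it does its processing, which is safe here because requests are recorded in pending rather than in the notification itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int pending = 0;     /* work requests not yet picked up */
static int stop = 0;        /* set when no more requests will come */
static int processed = 0;   /* written only by the worker thread */

/* Thread 2: record the request under the mutex, then notify. */
void initiate_work_processing(void)
{
    pthread_mutex_lock(&mutex);
    pending++;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
}

static void request_stop(void)
{
    pthread_mutex_lock(&mutex);
    stop = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
}

/* Thread 1: handles every recorded request, then exits on stop. */
static void *process_work(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mutex);
    for (;;) {
        while (pending == 0 && !stop)   /* also absorbs spurious wakeups */
            pthread_cond_wait(&cond, &mutex);
        if (pending == 0)
            break;                      /* stop requested, nothing left */
        pending--;
        pthread_mutex_unlock(&mutex);
        processed++;                    /* stand-in for the real work */
        pthread_mutex_lock(&mutex);
    }
    pthread_mutex_unlock(&mutex);
    return NULL;
}

/* Issues n requests and returns how many the worker handled. */
int run_work_demo(int n)
{
    pthread_t worker;
    pending = 0; stop = 0; processed = 0;
    pthread_create(&worker, NULL, process_work, NULL);
    for (int i = 0; i < n; i++)
        initiate_work_processing();
    request_stop();
    pthread_join(worker, NULL);
    return processed;
}
```

A request that arrives while the worker is busy is no longer lost: it stays in pending until the worker next checks the predicate.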
What and where is the race condition being described?
Definition of a data race depends on the memory model used in the particular environment. This means primarily your programming language memory model and may include the underlying hardware memory model (if the programming language relies on the hardware memory model, which is the case with e.g. Assembler).
C++ defines data races as follows:
When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the expressions are said to conflict. A program that has two conflicting evaluations has a data race unless
both evaluations execute on the same thread or in the same signal handler, or
both conflicting evaluations are atomic operations (see std::atomic), or
one of the conflicting evaluations happens-before another (see std::memory_order)
If a data race occurs, the behavior of the program is undefined.
So basically, when multiple threads access the same memory location concurrently (by means other than std::atomic) and at least one of the threads is modifying the data at that location, that is a data race.
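C11 has an analogous rule, so the same idea can be shown in C. The sketch below is only an illustration with made-up names (worker, run_counter_demo, INCREMENTS): two threads update one counter through atomic operations and another under a mutex, which are two of the ways to keep conflicting accesses from forming a data race.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define INCREMENTS 100000

static atomic_long atomic_counter = 0;  /* accessed only via atomic ops */
static long locked_counter = 0;         /* accessed only under counter_mutex */
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        /* Conflicting accesses, but both are atomic: no data race. */
        atomic_fetch_add(&atomic_counter, 1);

        /* Conflicting accesses ordered by the mutex: no data race. */
        pthread_mutex_lock(&counter_mutex);
        locked_counter++;
        pthread_mutex_unlock(&counter_mutex);
    }
    return NULL;
}

/* Runs two workers and returns the sum of both counters. */
long run_counter_demo(void)
{
    pthread_t a, b;
    atomic_store(&atomic_counter, 0);
    locked_counter = 0;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&atomic_counter) + locked_counter;
}
```

Incrementing locked_counter from both threads without the mutex would be a data race in the sense quoted above, and the result would no longer be predictable.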

Related

How to implement a barrier using semaphores

I have the following problem to solve:
Consider an application where there are three types of threads: Calculus-A, Calculus-B and Finalization. Whenever a thread of type Calculus-A ends, it calls the routine endA(), which returns immediately. Whenever a thread of type Calculus-B ends, it calls the routine endB(), which returns immediately. Threads of type Finalization call the routine wait(), which returns only once two Calculus-A threads and two Calculus-B threads have completed. In other words, for exactly 2 completions of Calculus-A and 2 completions of Calculus-B, one Finalization thread is allowed to continue.
There is an undetermined number of threads of the 3 types. It is not known the order of the routines called by threads. Threads Completion are answered in the order of arrival.
Implement routines endA(), endB() and wait() using semaphores. Besides the variables initialization, the only possible operations are P and V. Solutions with busy-waiting are not acceptable.
Here is my solution:
semaphore calcA = 2;
semaphore calcB = 2;
semaphore wait = -3;
void endA()
{
P(calcA);
V(wait);
}
void endB()
{
P(calcB);
V(wait);
}
void wait()
{
P(wait);
P(wait);
P(wait);
P(wait);
V(calcA);
V(calcA);
V(calcB);
V(calcB);
}
I believe that there will be a deadlock due to the initialization of the wait semaphore if wait() executes before endA() and endB(). Is there any other solution for this?
I tend to view semaphore problems as problems where one must identify "sources of waiting" and define for each a semaphore and a protocol for their access.
With that in mind, the "sources of waiting" are
Completions of CalcA
Completions of CalcB
Maybe, if I understood this right, a wait on whole completion groups, consisting of two CalcAs and two CalcBs. I say maybe because I'm not sure what "Threads Completion are answered in the order of arrival." means.
Completions of CalcA and CalcB should therefore increment their respective counters. At the other end, one Finalization thread gains exclusive access to the counters and waits in any order for the needed number of completions to constitute a completion group. It then unlocks access to the next group.
My code is below, although since I'm unfamiliar with the Dutch V and P I will use take()/give().
semaphore calcA = 0;
semaphore calcB = 0;
semaphore groupSem = 1;
void endA(){
give(calcA);
}
void endB(){
give(calcB);
}
void wait(){
take(groupSem);
take(calcA);
take(calcA);
take(calcB);
take(calcB);
give(groupSem);
}
The groupSem semaphore ensures all-or-nothing: the thread that enters the critical section will get the next two completions of each of CalcA and CalcB. If groupSem weren't there, the first thread to enter wait could take two As and block, then be overtaken by another thread that grabs two As and two Bs and then runs away.
A worse problem that exists if the groupSem isn't there is if this second thread takes two As, one B and then blocks, and then the first thread grabs the second B. If somehow the result of the finalization allows more runs of CalculationA and CalculationB, then you may have a deadlock, because there may be no more opportunity for instances of calculation A and B to complete, therefore leaving the finalization threads hanging, unable to produce more calculation instances.
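For anyone who wants to try this out, the take()/give() pseudocode above maps fairly directly onto POSIX unnamed semaphores. The following is a sketch under a few assumptions: it targets a platform where sem_init works (for example Linux), the routine is called wait_for_group rather than wait() to avoid clashing with the wait() declared in <sys/wait.h>, and group_init, finalization, and run_group_demo are helper names invented for the demo.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static sem_t calcA, calcB, groupSem;

/* take() = P: retry if the wait is interrupted by a signal handler. */
static void take(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR) { }
}

/* give() = V */
static void give(sem_t *s)
{
    int rc = sem_post(s);
    assert(rc == 0);
    (void)rc;
}

void group_init(void)
{
    sem_init(&calcA, 0, 0);
    sem_init(&calcB, 0, 0);
    sem_init(&groupSem, 0, 1);
}

void endA(void) { give(&calcA); }
void endB(void) { give(&calcB); }

void wait_for_group(void)
{
    take(&groupSem);    /* only one Finalization thread collects at a time */
    take(&calcA);
    take(&calcA);
    take(&calcB);
    take(&calcB);
    give(&groupSem);
}

static void *finalization(void *arg)
{
    int *done = arg;
    wait_for_group();
    *done = 1;
    return NULL;
}

/* Two Finalization threads, four completions of each kind. */
int run_group_demo(void)
{
    pthread_t f[2];
    int done[2] = {0, 0};
    group_init();
    for (int i = 0; i < 2; i++)
        pthread_create(&f[i], NULL, finalization, &done[i]);
    for (int i = 0; i < 4; i++) {
        endA();
        endB();
    }
    for (int i = 0; i < 2; i++)
        pthread_join(f[i], NULL);
    return done[0] + done[1];
}
```

Four completions of each kind form exactly two groups, so both Finalization threads in the demo are released and no completions are left over.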

What would happen if pthread_cond_wait was not atomic?

Scenario 1: release mutex then wait
Scenario 2: wait and then release mutex
Trying to understand conceptually what it does.
If the mutex were released before the calling thread is considered "blocked" on the condition variable, then another thread could lock the mutex, change the state that the predicate is based on, and call pthread_cond_signal without the waiting thread ever waking up (since it's not yet blocked). That's the problem.
Scenario 2, waiting then releasing the mutex, is internally how any real-world implementation has to work, since there's no such thing as an atomic implementation of the necessary behavior. But from the application's perspective, there's no way to observe the thread being part of the blocked set without the mutex also being released, so in the sense of the "abstract machine", it's atomic.
Edit: To go into more detail, the real-world implementation of a condition variable wait generally looks like:
1. Modify some internal state of the condition variable object such that the caller is considered to be part of the blocked set for it.
2. Unlock the mutex.
3. Perform a blocking wait operation, with the special property that it will return immediately if the state of the condition variable object from step 1 has changed due to a signal from any other thread.
Thus, the act of "blocking" is split between two steps, one of which happens before the mutex is unlocked (gaining membership in the blocked set) and the other of which happens after the mutex is unlocked (possibly sleeping and yielding control to other threads). It's this split that's able to make the "condition wait" operation "atomic" in the abstract machine.
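To make the three steps concrete, here is a deliberately simplified toy condition variable in C, built from a small internal mutex, a waiter count, and a POSIX semaphore. It is not how any real pthreads implementation works (those typically use futexes or similar), it supports signal only, and it is not fair between waiters. All names (toy_cond, toy_cond_wait, toy_cond_signal, run_toy_demo) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t internal;   /* protects waiters */
    int waiters;                /* size of the "blocked set" */
    sem_t wakeups;              /* one post per waiter released by a signal */
} toy_cond;

void toy_cond_init(toy_cond *c)
{
    pthread_mutex_init(&c->internal, NULL);
    c->waiters = 0;
    sem_init(&c->wakeups, 0, 0);
}

/* The caller must hold m, as with pthread_cond_wait. */
void toy_cond_wait(toy_cond *c, pthread_mutex_t *m)
{
    /* Step 1: join the blocked set while m is still held. */
    pthread_mutex_lock(&c->internal);
    c->waiters++;
    pthread_mutex_unlock(&c->internal);

    /* Step 2: unlock the mutex. */
    pthread_mutex_unlock(m);

    /* Step 3: block; returns at once if a signal was already posted. */
    while (sem_wait(&c->wakeups) == -1 && errno == EINTR) { }

    pthread_mutex_lock(m);
}

void toy_cond_signal(toy_cond *c)
{
    pthread_mutex_lock(&c->internal);
    if (c->waiters > 0) {       /* with no waiters, the signal is dropped */
        c->waiters--;
        sem_post(&c->wakeups);
    }
    pthread_mutex_unlock(&c->internal);
}

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static toy_cond c;
static int ready = 0;

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    while (!ready)
        toy_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
    return NULL;
}

int run_toy_demo(void)
{
    pthread_t t;
    toy_cond_init(&c);
    ready = 0;
    pthread_create(&t, NULL, waiter, NULL);
    pthread_mutex_lock(&m);
    ready = 1;
    toy_cond_signal(&c);
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return ready;
}
```

Because the waiter increments waiters before it releases m, a signaler that acquires m afterwards is guaranteed to see it in the blocked set, even if the waiter has not yet reached sem_wait. That is the sense in which the wait looks atomic from the application's point of view.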

Is this usage of condition variables ALWAYS subject to a lost-signal race?

Suppose a condition variable is used in a situation where the signaling thread modifies the state affecting the truth value of the predicate and calls pthread_cond_signal without holding the mutex associated with the condition variable? Is it true that this type of usage is always subject to race conditions where the signal may be missed?
To me, there seems to always be an obvious race:
Waiter evaluates the predicate as false, but before it can begin waiting...
Another thread changes state in a way that makes the predicate true.
That other thread calls pthread_cond_signal, which does nothing because there are no waiters yet.
The waiter thread enters pthread_cond_wait, unaware that the predicate is now true, and waits indefinitely.
But does this same kind of race condition always exist if the situation is changed so that either (A) the mutex is held while calling pthread_cond_signal, just not while changing the state, or (B) so that the mutex is held while changing the state, just not while calling pthread_cond_signal?
I'm asking from a standpoint of wanting to know if there are any valid uses of the above not-best-practices usages, i.e. whether a correct condition-variable implementation needs to account for such usages in avoiding race conditions itself, or whether it can ignore them because they're already inherently racy.
The fundamental race here looks like this:
THREAD A                THREAD B
Mutex lock
Check state
                        Change state
                        Signal
cvar wait
(never awakens)
If we take a lock EITHER on the state change OR the signal, OR both, then we avoid this; it's not possible for both the state-change and the signal to occur while thread A is in its critical section and holding the lock.
If we consider the reverse case, where thread A interleaves into thread B, there's no problem:
THREAD A                THREAD B
                        Change state
Mutex lock
Check state
( no need to wait )
Mutex unlock
                        Signal (nobody cares)
So there's no particular need for thread B to hold a mutex over the entire operation; it just needs to hold the mutex for some, possibly infinitesimally small, interval between the state change and the signal. Of course, if the state itself requires locking for safe manipulation, then the lock must be held over the state change as well.
Finally, note that dropping the mutex early is unlikely to be a performance improvement in most cases. Requiring the mutex to be held reduces contention over the internal locks in the condition variable, and in modern pthreads implementations, the system can 'move' the waiting thread from waiting on the cvar to waiting on the mutex without waking it up (thus avoiding it waking up only to immediately block on the mutex).
As pointed out in the comments, dropping the mutex may improve performance in some cases, by reducing the number of syscalls needed. Then again it could also lead to extra contention on the condition variable's internal mutex. Hard to say. It's probably not worth worrying about in any case.
Note that the applicable standards require that pthread_cond_signal be safely callable without holding the mutex:
The pthread_cond_signal() or pthread_cond_broadcast() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits [...]
This usually means that condition variables have an internal lock over their internal data structures, or otherwise use some very careful lock-free algorithm.
The state must be modified inside a mutex, if for no other reason than the possibility of spurious wake-ups, which would lead to the reader reading the state while the writer is in the middle of writing it.
You can call pthread_cond_signal anytime after the state is changed. It doesn't have to be inside the mutex. POSIX guarantees that at least one waiter will awaken to check the new state. More to the point:
Calling pthread_cond_signal doesn't guarantee that a reader will acquire the mutex first. Another writer might get in before a reader gets a chance to check the new status. Condition variables don't guarantee that readers immediately follow writers (After all, what if there are no readers?)
Calling it after releasing the lock is actually better, since you don't risk having the just-awoken reader immediately going back to sleep trying to acquire the lock that the writer is still holding.
EDIT: @DietrichEpp makes a good point in the comments. The writer must change the state in such a way that the reader can never access an inconsistent state. It can do so either by acquiring the mutex used in the condition-variable, as I indicate above, or by ensuring that all state-changes are atomic.
The answer is, there is a race, and to eliminate that race, you must do this:
/* atomic op outside of mutex, and then: */
pthread_mutex_lock(&m);
pthread_mutex_unlock(&m);
pthread_cond_signal(&c);
The protection of the data doesn't matter, because you don't hold the mutex when calling pthread_cond_signal anyway.
See, by locking and unlocking the mutex, you have created a barrier. During that brief moment when the signaler has the mutex, there is a certainty: no other thread has the mutex. This means no other thread is executing any critical regions.
This means that all threads are either about to get the mutex to discover the change you have posted, or else they have already found that change and run off with it (releasing the mutex), or else they have not found what they are looking for and have atomically given up the mutex to go to sleep (and are guaranteed to be waiting nicely on the condition).
Without the mutex lock/unlock, you have no synchronization. The signal will sometimes fire as threads which didn't see the changed atomic value are transitioning to their atomic sleep to wait for it.
So this is what the mutex does from the point of view of a thread which is signaling. You can get the atomicity of access from something else, but not the synchronization.
P.S. I have implemented this logic before. The situation was in the Linux kernel (using my own mutexes and condition variables).
In my situation, it was impossible for the signaler to hold the mutex for the atomic operation on shared data. Why? Because the signaler did the operation in user space, inside a buffer shared between the kernel and user, and then (in some situations) made a system call into the kernel to wake up a thread. User space simply made some modifications to the buffer, and then if some conditions were satisfied, it would perform an ioctl.
So in the ioctl call I did the mutex lock/unlock thing, and then hit the condition variable. This ensured that the thread would not miss the wake up related to that latest modification posted by user space.
At first I just had the condition variable signal, but it looked wrong without the involvement of the mutex, so I reasoned about the situation a little bit and realized that the mutex must simply be locked and unlocked to conform to the synchronization ritual which eliminates the lost wakeup.
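In user-space terms, the lock/unlock-then-signal ritual from this answer could be sketched with pthreads and C11 atomics as below. This is an illustration rather than the kernel code described above; posted, post_and_signal, and run_barrier_demo are invented names, and the shared state is reduced to a single atomic flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
static atomic_int posted = 0;   /* state changed without holding m */

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    while (atomic_load(&posted) == 0)   /* check and wait both under m */
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
    return NULL;
}

void post_and_signal(void)
{
    /* atomic op outside of the mutex */
    atomic_store(&posted, 1);

    /* Empty critical section: once the lock is acquired, no waiter can
     * be between its predicate check and its pthread_cond_wait. */
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);

    pthread_cond_signal(&c);
}

/* Returns the flag value after the waiter has finished. */
int run_barrier_demo(void)
{
    pthread_t t;
    atomic_store(&posted, 0);
    pthread_create(&t, NULL, waiter, NULL);
    post_and_signal();
    pthread_join(t, NULL);
    return atomic_load(&posted);
}
```

If the store lands after the waiter's check but before its wait, the signaler's pthread_mutex_lock cannot succeed until the waiter has released m inside pthread_cond_wait, so the signal that follows finds the waiter already blocked.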

Deferred bcast wakeup for condition variables - is it valid?

I'm implementing pthread condition variables (based on Linux futexes) and I have an idea for avoiding the "stampede effect" on pthread_cond_broadcast with process-shared condition variables. For non-process-shared cond vars, futex requeue operations are traditionally (i.e. by NPTL) used to requeue waiters from the cond var's futex to the mutex's futex without waking them up, but this is in general impossible for process-shared cond vars, because pthread_cond_broadcast might not have a valid pointer to the associated mutex. In a worst case scenario, the mutex might not even be mapped in its memory space.
My idea for overcoming this issue is to have pthread_cond_broadcast only directly wake one waiter, and have that waiter perform the requeue operation when it wakes up, since it does have the needed pointer to the mutex.
Naturally there are a lot of ugly race conditions to consider if I pursue this approach, but if they can be overcome, are there any other reasons such an implementation would be invalid or undesirable? One potential issue I can think of that might not be able to be overcome is the race where the waiter (a separate process) responsible for the requeue gets killed before it can act, but it might be possible to overcome even this by putting the condvar futex in the robust mutex list so that the kernel performs a wake on it when the process dies.
There may be waiters belonging to multiple address spaces, each of which has mapped the mutex associated with the futex at a different address in memory. I'm not sure if FUTEX_REQUEUE is safe to use when the requeue point may not be mapped at the same address in all waiters; if it is, then this isn't a problem.
There are other problems that won't be detected by robust futexes; for example, if your chosen waiter is busy in a signal handler, you could be kept waiting an arbitrarily long time. [As discussed in the comments, these are not an issue]
Note that with robust futexes, you must set the value of the futex & 0x3FFFFFFF to be the TID of the thread to be woken up; you must also set bit FUTEX_WAITERS on if you want a wakeup. This means that you must choose which thread to awaken from the broadcasting thread, or you will be unable to deal with thread death immediately after the FUTEX_WAKE. You'll also need to deal with the possibility of the thread dying immediately before the waker thread writes its TID into the state variable - perhaps having a 'pending master' field that is also registered in the robust mutex system would be a good idea.
I see no reason why this can't work, then, as long as you make sure to deal with the thread exit issues carefully. That said, it may be best to simply define in the kernel an extension to FUTEX_WAIT that takes a requeue point and comparison value as an argument, and let the kernel handle this in a simple, race-free manner.
I just don't see why you assume that the corresponding mutex might not be known. It is clearly stated
The effect of using more than one mutex for concurrent pthread_cond_timedwait() or pthread_cond_wait() operations on the same condition variable is undefined; that is, a condition variable becomes bound to a unique mutex when a thread waits on the condition variable, and this (dynamic) binding shall end when the wait returns.
So even for process shared mutexes and conditions this must hold, and any user space process must always have mapped the same and unique mutex that is associated to the condition.
Allowing users to associate different mutexes to a condition at the same time is nothing that I would support.

Can a correct fail-safe process-shared barrier be implemented on Linux?

In a past question, I asked about implementing pthread barriers without destruction races:
How can barriers be destroyable as soon as pthread_barrier_wait returns?
and received from Michael Burr a perfect solution for process-local barriers, but one which fails for process-shared barriers. We later worked through some ideas, but never reached a satisfactory conclusion, and didn't even begin to get into resource failure cases.
Is it possible on Linux to make a barrier that meets these conditions:
Process-shared (can be created in any shared memory).
Safe to unmap or destroy the barrier from any thread immediately after the barrier wait function returns.
Cannot fail due to resource allocation failure.
Michael's attempt at solving the process-shared case (see the linked question) has the unfortunate property that some kind of system resource must be allocated at wait time, meaning the wait can fail. And it's unclear what a caller could reasonably do when a barrier wait fails, since the whole point of the barrier is that it's unsafe to proceed until the remaining N-1 threads have reached it...
A kernel-space solution might be the only way, but even that's difficult due to the possibility of a signal interrupting the wait with no reliable way to resume it...
This is not possible with the Linux futex API, and I think this can be proven as well.
We have here essentially a scenario in which N processes must be reliably awoken by one final process, and further no process may touch any shared memory after the final awakening (as it may be destroyed or reused asynchronously). While we can awaken all processes easily enough, the fundamental race condition is between the wakeup and the wait; if we issue the wakeup before the wait, the straggler never wakes up.
The usual solution to something like this is to have the straggler check a status variable atomically with the wait; this allows it to avoid sleeping at all if the wakeup has already occurred. However, we cannot do this here - as soon as the wakeup becomes possible, it is unsafe to touch shared memory!
One other approach is to actually check if all processes have gone to sleep yet. However, this is not possible with the Linux futex API; the only indication of number of waiters is the return value from FUTEX_WAKE; if it returns less than the number of waiters you expected, you know some weren't asleep yet. However, even if we find out we haven't woken enough waiters, it's too late to do anything - one of the processes that did wake up may have destroyed the barrier already!
So, unfortunately, this kind of immediately-destroyable primitive cannot be constructed with the Linux futex API.
Note that in the specific case of one waiter, one waker, it may be possible to work around the problem; if FUTEX_WAKE returns zero, we know nobody has actually been awoken yet, so you have a chance to recover. Making this into an efficient algorithm, however, is quite tricky.
It's tricky to add a robust extension to the futex model that would fix this. The basic problem is, we need to know when N threads have successfully entered their wait, and atomically awaken them all. However, any of those threads may leave the wait to run a signal handler at any time - indeed, the waker thread may also leave the wait for signal handlers as well.
One possible way that may work, however, is an extension to the keyed event model in the NT API. With keyed events, threads are released from the lock in pairs; if you have a 'release' without a 'wait', the 'release' call blocks for the 'wait'.
This in itself isn't enough due to the issues with signal handlers; however, if we allow for the 'release' call to specify a number of threads to be awoken atomically, this works. You simply have each thread in the barrier decrement a count, then 'wait' on a keyed event on that address. The last thread 'releases' N - 1 threads. The kernel doesn't allow any wake event to be processed until all N-1 threads have entered this keyed event state; if any thread leaves the futex call due to signals (including the releasing thread), this prevents any wakeups at all until all threads are back.
After a long discussion with bdonlan on SO chat, I think I have a solution. Basically, we break the problem down into the two self-synchronized deallocation issues: the destroy operation and unmapping.
Handling destruction is easy: Simply make the pthread_barrier_destroy function wait for all waiters to stop inspecting the barrier. This can be done by having a usage count in the barrier, atomically incremented/decremented on entry/exit to the wait function, and having the destroy function spin waiting for the count to reach zero. (It's also possible to use a futex here, rather than just spinning, if you stick a waiter flag in the high bit of the usage count or similar.)
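A minimal sketch of the usage-count idea, using C11 atomics. The struct layout and the `my_barrier_*` names are hypothetical; a real barrier would carry more state, and the spin could be replaced by a futex wait as mentioned above.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical barrier; only the usage count is shown. */
struct my_barrier {
    atomic_uint users;   /* threads currently inside the wait function */
};

/* Called first thing on entry to the wait function. */
static void my_barrier_enter(struct my_barrier *b) {
    atomic_fetch_add(&b->users, 1);
}

/* Called last thing on exit from the wait function. After this call the
 * thread must not touch *b again. Returns 1 if it was the last user. */
static int my_barrier_exit(struct my_barrier *b) {
    unsigned prev = atomic_fetch_sub(&b->users, 1);
    assert(prev > 0);
    return prev == 1;
}

/* Spin until every waiter has stopped inspecting the barrier. */
static void my_barrier_destroy(struct my_barrier *b) {
    while (atomic_load(&b->users) != 0)
        sched_yield();
    /* From here on it is safe to tear down or reuse the memory. */
}
```

The decrement in `my_barrier_exit` is the waiter's final access to the barrier, which is what lets the destroy function treat a count of zero as "nobody is looking any more".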
Handling unmapping is also easy, but non-local: ensure that munmap or mmap with the MAP_FIXED flag cannot occur while barrier waiters are in the process of exiting, by adding locking to the syscall wrappers. This requires a specialized sort of reader-writer lock. The last waiter to reach the barrier should grab a read lock on the munmap rw-lock, which will be released when the final waiter exits (when decrementing the usage count results in a count of 0). munmap and mmap can be made reentrant (as some programs might expect, even though POSIX doesn't require it) by making the writer lock recursive. Actually, a sort of lock where readers and writers are entirely symmetric, and each type of lock excludes the opposite type of lock but not the same type, should work best.
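The "entirely symmetric" lock from the last sentence might look roughly like the sketch below. It is built on a pthread mutex and condition variable purely for illustration (a libc-internal version would presumably use lower-level primitives), all names are hypothetical, and fairness or starvation is not addressed.

```c
#include <assert.h>
#include <pthread.h>

/* Two sides, e.g. SIDE_A = exiting barrier waiters, SIDE_B = munmap callers.
 * Any number of holders on one side; never holders on both sides at once. */
enum side { SIDE_A = 0, SIDE_B = 1 };

struct sym_lock {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    unsigned holders[2];   /* current holders per side */
};

#define SYM_LOCK_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, { 0, 0 } }

/* Block until the opposite side has no holders, then join our side. */
static void sym_lock_acquire(struct sym_lock *l, enum side s) {
    pthread_mutex_lock(&l->mtx);
    while (l->holders[1 - s] != 0)
        pthread_cond_wait(&l->cv, &l->mtx);
    l->holders[s]++;
    pthread_mutex_unlock(&l->mtx);
}

/* Non-blocking variant: returns 1 on success, 0 if the other side holds it. */
static int sym_lock_tryacquire(struct sym_lock *l, enum side s) {
    pthread_mutex_lock(&l->mtx);
    int ok = (l->holders[1 - s] == 0);
    if (ok)
        l->holders[s]++;
    pthread_mutex_unlock(&l->mtx);
    return ok;
}

/* Ownership is counted, not tracked per thread, so the releasing thread
 * may differ from the acquiring one. */
static void sym_lock_release(struct sym_lock *l, enum side s) {
    pthread_mutex_lock(&l->mtx);
    assert(l->holders[s] > 0);
    if (--l->holders[s] == 0)
        pthread_cond_broadcast(&l->cv);   /* let the other side in */
    pthread_mutex_unlock(&l->mtx);
}
```

Counting holders rather than recording an owner matters here, because in the scheme above the lock is taken by the last waiter to arrive and released by whichever waiter happens to exit last.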
Well, I think I can do it with a clumsy approach...
Have the "barrier" be its own process listening on a socket. Implement barrier_wait as:
open connection to barrier process
send message telling barrier process I am waiting
block in read() waiting for reply
Once N threads are waiting, the barrier process tells all of them to proceed. Each waiter then closes its connection to the barrier process and continues.
Implement barrier_destroy as:
open connection to barrier process
send message telling barrier process to go away
close connection
Once all connections are closed and the barrier process has been told to go away, it exits.
[Edit: Granted, this allocates and destroys a socket as part of the wait and release operations. But I think you can implement the same protocol without doing so; see below.]
First question: Does this protocol actually work? I think it does, but maybe I do not understand the requirements.
Second question: If it does work, can it be simulated without the overhead of an extra process?
I believe the answer is "yes". You can have each thread "take the role of" the barrier process at the appropriate time. You just need a master mutex, held by whichever thread is currently "taking the role" of the barrier process. Details, details... OK, so the barrier_wait might look like:
lock(master_mutex);
++waiter_count;
if (waiter_count < N)
    cond_wait(master_condition_variable, master_mutex);
else
    cond_broadcast(master_condition_variable);
--waiter_count;
bool do_release = time_to_die && waiter_count == 0;
unlock(master_mutex);
if (do_release)
    release_resources();
Here master_mutex (a mutex), master_condition_variable (a condition variable), waiter_count (an unsigned integer), N (another unsigned integer), and time_to_die (a Boolean) are all shared state allocated and initialized by barrier_init. waiter_count is initialized to zero, time_to_die to false, and N to the number of threads the barrier is waiting for.
Then barrier_destroy would be:
lock(master_mutex);
time_to_die = true;
bool do_release = waiter_count == 0;
unlock(master_mutex);
if (do_release)
    release_resources();
Not sure about all the details concerning signal handling etc... But the basic idea of "last one out turns off the lights" is workable, I think.
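For reference, here is one way the "last one out turns off the lights" idea could be written with real pthreads calls. It is a sketch, not the code above made literal: all `lo_*` names are hypothetical, the barrier is heap-allocated so that releasing resources is just free(), and a `round` counter is added (an assumption of this sketch) so that a spurious wakeup from pthread_cond_wait cannot let a thread through early.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct lo_barrier {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    unsigned n;         /* threads per round */
    unsigned arrived;   /* arrivals in the current round */
    unsigned inside;    /* threads that entered wait and have not left yet */
    unsigned round;     /* bumped when a round completes */
    bool time_to_die;
};

static struct lo_barrier *lo_barrier_create(unsigned n) {
    struct lo_barrier *b = calloc(1, sizeof *b);
    assert(b != NULL && n > 0);
    pthread_mutex_init(&b->mtx, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->n = n;
    return b;
}

static void lo_release(struct lo_barrier *b) {
    pthread_cond_destroy(&b->cv);
    pthread_mutex_destroy(&b->mtx);
    free(b);
}

static void lo_barrier_wait(struct lo_barrier *b) {
    pthread_mutex_lock(&b->mtx);
    unsigned my_round = b->round;
    b->inside++;
    if (++b->arrived < b->n) {
        while (b->round == my_round)        /* loop guards spurious wakeups */
            pthread_cond_wait(&b->cv, &b->mtx);
    } else {
        b->arrived = 0;
        b->round++;
        pthread_cond_broadcast(&b->cv);
    }
    /* Last one out releases, but only if destroy was already requested. */
    bool do_release = (--b->inside == 0) && b->time_to_die;
    pthread_mutex_unlock(&b->mtx);
    if (do_release)
        lo_release(b);
}

static void lo_barrier_destroy(struct lo_barrier *b) {
    pthread_mutex_lock(&b->mtx);
    b->time_to_die = true;
    bool do_release = (b->inside == 0);     /* nobody left: release now */
    pthread_mutex_unlock(&b->mtx);
    if (do_release)
        lo_release(b);
}

static atomic_int lo_passed;

static void *lo_worker(void *arg) {
    lo_barrier_wait(arg);
    atomic_fetch_add(&lo_passed, 1);
    return NULL;
}

/* Usage: n threads meet once; the caller destroys the barrier right after
 * its own wait returns. Returns how many worker threads got through. */
static int lo_demo(unsigned n) {
    pthread_t t[16];
    assert(n >= 1 && n <= 17);
    struct lo_barrier *b = lo_barrier_create(n);
    atomic_store(&lo_passed, 0);
    for (unsigned i = 0; i + 1 < n; i++)
        pthread_create(&t[i], NULL, lo_worker, b);
    lo_barrier_wait(b);
    lo_barrier_destroy(b);   /* other waiters may still be on their way out */
    for (unsigned i = 0; i + 1 < n; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&lo_passed);
}
```

As with the pseudocode, exactly one thread ends up calling the release function: either the destroyer (if everyone has already left) or the last waiter to leave after time_to_die was set. Signal handling and process-shared use are not covered.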

Resources