How to pthread_barrier_destroy() without waiting for pthread_barrier_wait()

I have a multithreaded application that uses barriers to synchronise worker threads.
At the end of function compute(), threads are cancelled:
...
for (int i = 0; i < p; i++) {
    printf("Thread %lu completed in %d passes\n",
           threads[i], find_tstat(threads[i])->count);
    pthread_cancel(threads[i]);
}
printf("================================================================\n");
return a;
Threads are interrupted in the middle of computation, so they may be between barriers. This is likely what causes pthread_barrier_destroy() to hang: some pthread_barrier_wait() call has not returned yet.
The question is: how can I still destroy the barrier even if a wait() hasn't returned?

The answer to your question is: you can't.
man pthread_barrier_destroy
The results are undefined if pthread_barrier_destroy() is called when any thread is blocked on the barrier
man pthread_cancel
On Linux, cancellation is implemented using signals.
man pthread_barrier_wait
If a signal is delivered to a thread blocked on a barrier, upon return from the signal handler the thread shall resume waiting at the barrier if the barrier wait has not completed (that is, if the required number of threads have not arrived at the barrier during the execution of the signal handler); otherwise, the thread shall continue as normal from the completed barrier wait. Until the thread in the signal handler returns from it, it is unspecified whether other threads may proceed past the barrier once they have all reached it.
A thread that has blocked on a barrier shall not prevent any unblocked thread that is eligible to use the same processing resources from eventually making forward progress in its execution. Eligibility for processing resources shall be determined by the scheduling policy.

As the question is posed:
The question is: how can I still destroy the barrier even if a wait() hasn't returned?
the answer is "you can't", as the other answer explains.
However, with good enough record keeping, you can launch just enough extra threads specifically to wait at the barrier in order to let any other threads already waiting pass through. This would likely be tied together with code and data intended to provide for your threads to be shut down cleanly instead of being canceled, which is also something you should do.
On the other hand, it's pretty easy to roll your own barrier with use of a condition variable and mutex, and the result is more flexible. You still should not be canceling threads, but you can make waits at a hand-rolled barrier such as I describe soft-cancelable. This would be my recommendation.
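For illustration, here is a minimal sketch of such a hand-rolled barrier; the soft_barrier_* names and the cancel-flag design are mine, not from any library. A generation counter lets the barrier be reused, and the cancel flag releases every waiter without requiring the full complement of threads to arrive:

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical hand-rolled barrier; initialize the mutex and condvar
   with PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER or the
   corresponding init functions. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    unsigned threshold;   /* threads per barrier cycle */
    unsigned waiting;     /* threads currently blocked */
    unsigned generation;  /* bumped each time the barrier trips */
    bool     canceled;    /* set to release all waiters early */
} soft_barrier_t;

/* Returns 0 on a normal trip, -1 if the barrier was canceled. */
int soft_barrier_wait(soft_barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    unsigned gen = b->generation;
    if (++b->waiting == b->threshold) {
        /* Last arrival: start a new generation and wake everyone. */
        b->waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->cv);
        pthread_mutex_unlock(&b->lock);
        return 0;
    }
    /* Sleep until the generation changes or the barrier is canceled. */
    while (gen == b->generation && !b->canceled)
        pthread_cond_wait(&b->cv, &b->lock);
    int ret = b->canceled ? -1 : 0;
    pthread_mutex_unlock(&b->lock);
    return ret;
}

/* Soft-cancel: release every waiter without waiting for arrivals. */
void soft_barrier_cancel(soft_barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    b->canceled = true;
    pthread_cond_broadcast(&b->cv);
    pthread_mutex_unlock(&b->lock);
}

A thread that sees -1 can release its resources and return from its start routine on its own, which avoids pthread_cancel entirely.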

Related

Does sem_post wake up a random process

Suppose 10 processes are waiting on a semaphore using sem_wait(), and an 11th process calls sem_post() on that semaphore. Which of the 10 processes will enter the critical section?
Is it random? Do all the processes wake up and compete for the lock, with the CPU granting it to one of them while the rest go back to waiting?
The POSIX standard doesn't specify which thread will be woken up. Moreover, without artificial delays it's impossible for threads to start waiting on a semaphore in a well-defined order.
In practice, it's likely to be the thread which has been waiting the longest, as a queue structure is used to record threads waiting on a synchronization object. It definitely won't be a 'random' thread. But it's also not something you should depend on for the correctness of your code.
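As a quick illustration with threads (a minimal sketch; which waiter wins is unspecified and may vary from run to run):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 10

static sem_t sem;

static void *waiter(void *arg)
{
    long id = (long)arg;
    sem_wait(&sem);                  /* all 10 threads block here */
    printf("thread %ld woke up\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    sem_init(&sem, 0, 0);            /* semaphore starts at 0 */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, waiter, (void *)i);
    sleep(1);                        /* crude: let all threads block */
    sem_post(&sem);                  /* exactly one waiter is released */
    sleep(1);                        /* give it time to print */
    return 0;                        /* the other nine are still blocked */
}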

Unlock mutex required on cancellation point cleanup while waiting for condition variable?

In the pthread library there is the concept of cancellation points. Most system functions that may block execution for a longer time (or wait on some resources) can be aborted at pthread cancellation points.
Suppose there is some data protected by a condition variable, used in a thread like in the pseudocode below. The thread has set up a cleanup procedure that is called in case a cancellation request is made to this thread.
THREAD_CLEANUP_PROC {
    UNLOCK(mutex) // Is this unlock required?
}

THREAD_PROC {
    SET THREAD_CLEANUP = THREAD_CLEANUP_PROC
    LOOP {
        LOCK(mutex)
        WHILE (condition == false) {
            condition.WAIT(mutex) // wait interrupted, cancellation point reached
        }
        // ... we have the lock
        UNLOCK(mutex)
        condition.NOTIFY_ALL()
        read(descriptor) // wait for data on a file descriptor while the lock is not held
    }
}
If someone cancels the thread (pthread_cancel()) while it is waiting on the condition variable, the documentation for pthread_cond_wait says that the thread is unblocked, reacquires the lock, and starts executing the cleanup handler before it terminates.
Am I right that the cleanup handler is now responsible for unlocking that lock (mutex)? What if, as in my example, another blocking call such as read() blocks while waiting for data but without holding the lock? In that case the read() is also unblocked and the cleanup handler is called as before, only this time it must not unlock the mutex. Am I correct? If so, what is the best way to handle this situation? Are there common patterns that should be followed?
Thread cancellation is messy. Generally speaking, you should not do it.
In the pthread library there is the concept of cancellation points.
Yes.
Most system functions that may block execution for a longer time (or wait on some resources) can be aborted at pthread cancellation points.
Not exactly. Many functions such as you describe are cancellation points. A thread with "deferred" cancellation type will abort when it calls a function that is a cancellation point if it is currently cancellable and has a cancellation request pending. That does not imply that such a function can be interrupted by thread cancellation. Threads with "asynchronous" cancellation can be canceled at more or less any time, including when blocking on a long-running task, but cancellation points are irrelevant in that case.
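For reference, the cancellation type is selected per thread; a minimal sketch of the deferred case (the default):

#include <pthread.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    int old;
    /* Deferred: a pending cancel is acted on only when the thread
       reaches a cancellation point, such as sleep(). */
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old);
    for (;;)
        sleep(1);                    /* cancellation point */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_cancel(t);               /* request cancellation */
    pthread_join(t, NULL);           /* thread exits with PTHREAD_CANCELED */
    return 0;
}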
If someone cancels the thread (pthread_cancel()) while it is waiting on the condition variable, the documentation for pthread_cond_wait says that the thread is unblocked, reacquires the lock, and starts executing the cleanup handler before it terminates.
Yes, provided that the thread has "deferred" cancellation type.
Am I right that the cleanup handler is now responsible for unlocking that lock (mutex)?
Yes. In this case, the thread holds the mutex locked when it commences its cancellation procedure. If it does not unlock the mutex before it terminates then at minimum you're in for a big hassle. Some types of mutexes (supported by pthreads) may provide for a way to recover from this situation, but you would do well to avoid it.
What if, as in my example, another blocking call such as read() blocks while waiting for data but without holding the lock? In that case the read() is also unblocked and the cleanup handler is called as before, only this time it must not unlock the mutex. Am I correct?
Again, there are various types of mutex, and the situation may differ depending on which you use, but by far the best choice is to carefully avoid any thread trying to unlock a mutex that it does not hold locked.
If so, what is the best way to handle this situation?
The best way to handle the situation is to avoid it in the first place. Do not use thread cancellation, especially for threads susceptible to such issues, which are in fact common.
Instead, write multithreaded programs carefully to afford yourself alternative means to shut down threads or the whole program in a timely manner. There is a whole host of such techniques, more than I could reasonably summarize in an SO answer.
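As one example of such a technique (a sketch, not the only option): protect a stop flag with the same mutex and check it in every wait loop, so a shutdown request wakes and releases each thread cleanly:

#include <pthread.h>
#include <stdbool.h>

/* Illustrative shared state; the names are mine. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static bool condition = false;
static bool stopping  = false;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        /* Wake up for real work or for a shutdown request. */
        while (!condition && !stopping)
            pthread_cond_wait(&cv, &lock);
        if (stopping)
            break;
        condition = false;
        /* ... do the protected work here ... */
    }
    pthread_mutex_unlock(&lock);     /* released exactly once, by us */
    return NULL;
}

/* Request shutdown; safe to call from any thread. */
static void shut_down_workers(void)
{
    pthread_mutex_lock(&lock);
    stopping = true;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&lock);
}

Each worker then exits of its own accord, so no cleanup handler or mutex-recovery logic is needed.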
Your code should be edited to look like this:
THREAD_CLEANUP_PROC {
    UNLOCK(mutex) // Is this unlock required? YES
}

THREAD_PROC {
    LOOP {
        LOCK(mutex)
        SET THREAD_CLEANUP_PUSH = THREAD_CLEANUP_PROC // after acquiring the lock
        WHILE (condition == false) {
            condition.WAIT(mutex) // wait interrupted, cancellation point reached
        }
        // ... we have the lock
        THREAD_CLEANUP_POP(1) // this unlocks the mutex and removes the cleanup handler
        // UNLOCK(mutex)
        condition.NOTIFY_ALL()
        read(descriptor) // wait for data on a file descriptor while the lock is not held
    }
}
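In real pthreads C, that pattern maps onto pthread_cleanup_push()/pthread_cleanup_pop(). A minimal sketch under the same assumptions as the pseudocode (the shared state and descriptor are illustrative):

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static bool condition = false;
static int  descriptor;              /* assumed to be a valid fd */

/* Runs if the thread is canceled while it holds the mutex. */
static void unlock_mutex(void *m)
{
    pthread_mutex_unlock(m);
}

static void *thread_proc(void *arg)
{
    (void)arg;
    char buf[128];
    for (;;) {
        pthread_mutex_lock(&lock);
        /* Register the handler only while the mutex is actually held. */
        pthread_cleanup_push(unlock_mutex, &lock);
        while (!condition)
            pthread_cond_wait(&cv, &lock);   /* cancellation point */
        condition = false;
        /* Pop with execute=1: unlocks the mutex and removes the handler. */
        pthread_cleanup_pop(1);
        pthread_cond_broadcast(&cv);
        /* Blocking read without the lock: also a cancellation point, but
           no handler is registered now, so nothing is wrongly unlocked. */
        read(descriptor, buf, sizeof buf);
    }
    return NULL;
}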

pthread_exit() in signal handler

(This question might be somewhat related to pthread_exit in signal handler causes segmentation fault.) I'm writing a deadlock-prevention library, in which a checking thread does the graph bookkeeping and checks whether there is a deadlock; if so, it signals one of the conflicting threads. When that thread catches the signal, it releases all the mutexes it owns and exits. There are multiple resource mutexes (obviously) and one critical-region mutex; all calls that acquire or release a resource lock, or do graph calculations, must obtain this lock first.

Now here is the problem. With 2 competing threads (not counting the checking thread), the program sometimes deadlocks after one thread gets killed. In gdb, the dead thread owns the critical-region lock but never released it. After adding a breakpoint in the signal handler and stepping through, it appears that the lock belongs to someone else (as expected) right before pthread_exit(), but ownership magically passes to this thread after pthread_exit().

The only guess I can come up with is that the thread to be killed was blocked in pthread_mutex_lock(), trying to gain the critical-region lock (because it wanted another resource mutex), when the signal came and interrupted the pthread_mutex_lock(). Since that call is not signal-safe, something weird happened? Like the signal handler returned and the thread then got the lock and exited? Any insight is appreciated!
pthread_exit is not async-signal-safe, and thus the only way you can call it from a signal handler is if you ensure that the signal is not interrupting any non-async-signal-safe function.
As a general principle, using signals as a method of communication with threads is usually a really bad idea. You end up mixing two issues that are already difficult enough on their own: thread-safety (proper synchronization between threads) and reentrancy within a single thread.
If your goal with signals is just to instruct a thread to terminate, a better mechanism might be pthread_cancel. To use this safely, however, the thread that will be cancelled must set up cancellation handlers at the proper points and/or disable cancellation temporarily when it's not safe (with pthread_setcancelstate). Also, be aware that pthread_mutex_lock is not a cancellation point. There's no safe way to interrupt a thread that's blocked waiting to obtain a mutex, so if you need interruptibility like this, you probably need either a more elaborate synchronization setup with condition variables (condvar waits are cancellable), or you could use semaphores instead of mutexes.
Edit: If you really do need a way to terminate threads waiting for mutexes, you could replace calls to pthread_mutex_lock with calls to your own function that loops calling pthread_mutex_timedlock and checking for an exit flag on each timeout.
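A sketch of that idea; the function name and the exit flag are illustrative, not part of any library:

#include <pthread.h>
#include <stdatomic.h>
#include <errno.h>
#include <time.h>

/* Set from elsewhere to ask lock waiters to give up. */
static atomic_bool exit_requested;

/* Like pthread_mutex_lock(), but polls the exit flag every 10 ms.
   Returns 0 with the mutex held, or ECANCELED if asked to exit. */
static int interruptible_lock(pthread_mutex_t *m)
{
    for (;;) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 10 * 1000 * 1000;    /* +10 ms */
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_nsec -= 1000000000L;
            deadline.tv_sec += 1;
        }
        int rc = pthread_mutex_timedlock(m, &deadline);
        if (rc == 0)
            return 0;                            /* got the lock */
        if (rc != ETIMEDOUT)
            return rc;                           /* real error */
        if (atomic_load(&exit_requested))
            return ECANCELED;                    /* asked to quit */
    }
}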

Can a correct fail-safe process-shared barrier be implemented on Linux?

In a past question, I asked about implementing pthread barriers without destruction races:
How can barriers be destroyable as soon as pthread_barrier_wait returns?
and received a perfect solution from Michael Burr for process-local barriers, one which fails, however, for process-shared barriers. We later worked through some ideas, but never reached a satisfactory conclusion, and didn't even begin to get into resource-failure cases.
Is it possible on Linux to make a barrier that meets these conditions:
Process-shared (can be created in any shared memory).
Safe to unmap or destroy the barrier from any thread immediately after the barrier wait function returns.
Cannot fail due to resource allocation failure.
Michael's attempt at solving the process-shared case (see the linked question) has the unfortunate property that some kind of system resource must be allocated at wait time, meaning the wait can fail. And it's unclear what a caller could reasonably do when a barrier wait fails, since the whole point of the barrier is that it's unsafe to proceed until the remaining N-1 threads have reached it...
A kernel-space solution might be the only way, but even that's difficult due to the possibility of a signal interrupting the wait with no reliable way to resume it...
This is not possible with the Linux futex API, and I think this can be proven as well.
We have here essentially a scenario in which N processes must be reliably awoken by one final process, and further no process may touch any shared memory after the final awakening (as it may be destroyed or reused asynchronously). While we can awaken all processes easily enough, the fundamental race condition is between the wakeup and the wait; if we issue the wakeup before the wait, the straggler never wakes up.
The usual solution to something like this is to have the straggler check a status variable atomically with the wait; this allows it to avoid sleeping at all if the wakeup has already occurred. However, we cannot do this here - as soon as the wakeup becomes possible, it is unsafe to touch shared memory!
One other approach is to actually check if all processes have gone to sleep yet. However, this is not possible with the Linux futex API; the only indication of number of waiters is the return value from FUTEX_WAKE; if it returns less than the number of waiters you expected, you know some weren't asleep yet. However, even if we find out we haven't woken enough waiters, it's too late to do anything - one of the processes that did wake up may have destroyed the barrier already!
So, unfortunately, this kind of immediately-destroyable primitive cannot be constructed with the Linux futex API.
Note that in the specific case of one waiter, one waker, it may be possible to work around the problem; if FUTEX_WAKE returns zero, we know nobody has actually been awoken yet, so you have a chance to recover. Making this into an efficient algorithm, however, is quite tricky.
It's tricky to add a robust extension to the futex model that would fix this. The basic problem is, we need to know when N threads have successfully entered their wait, and atomically awaken them all. However, any of those threads may leave the wait to run a signal handler at any time - indeed, the waker thread may also leave the wait for signal handlers as well.
One possible way that may work, however, is an extension to the keyed event model in the NT API. With keyed events, threads are released from the lock in pairs; if you have a 'release' without a 'wait', the 'release' call blocks for the 'wait'.
This in itself isn't enough due to the issues with signal handlers; however, if we allow for the 'release' call to specify a number of threads to be awoken atomically, this works. You simply have each thread in the barrier decrement a count, then 'wait' on a keyed event on that address. The last thread 'releases' N - 1 threads. The kernel doesn't allow any wake event to be processed until all N-1 threads have entered this keyed event state; if any thread leaves the futex call due to signals (including the releasing thread), this prevents any wakeups at all until all threads are back.
After a long discussion with bdonlan on SO chat, I think I have a solution. Basically, we break the problem down into the two self-synchronized deallocation issues: the destroy operation and unmapping.
Handling destruction is easy: Simply make the pthread_barrier_destroy function wait for all waiters to stop inspecting the barrier. This can be done by having a usage count in the barrier, atomically incremented/decremented on entry/exit to the wait function, and having the destroy function spin waiting for the count to reach zero. (It's also possible to use a futex here, rather than just spinning, if you stick a waiter flag in the high bit of the usage count or similar.)
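That destroy-side bookkeeping might look roughly like this (a simplified sketch with a plain spin, omitting the futex-based wait mentioned above):

#include <stdatomic.h>
#include <sched.h>

/* Illustrative wrapper; only the destroy synchronization is shown. */
typedef struct {
    atomic_uint users;   /* threads currently inside barrier_wait() */
    /* ... the actual barrier state would live here ... */
} my_barrier_t;

static void my_barrier_wait(my_barrier_t *b)
{
    atomic_fetch_add(&b->users, 1);
    /* ... the real wait algorithm goes here ... */
    atomic_fetch_sub(&b->users, 1);  /* last touch of the barrier */
}

static void my_barrier_destroy(my_barrier_t *b)
{
    /* Spin until no waiter is still inspecting the barrier. */
    while (atomic_load(&b->users) != 0)
        sched_yield();
    /* ... now it is safe to tear down the barrier state ... */
}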
Handling unmapping is also easy, but non-local: ensure that munmap or mmap with the MAP_FIXED flag cannot occur while barrier waiters are in the process of exiting, by adding locking to the syscall wrappers. This requires a specialized sort of reader-writer lock. The last waiter to reach the barrier should grab a read lock on the munmap rw-lock, which will be released when the final waiter exits (when decrementing the user count results in a count of 0). munmap and mmap can be made reentrant (as some programs might expect, even though POSIX doesn't require it) by making the writer lock recursive. Actually, a sort of lock where readers and writers are entirely symmetric, and each type of lock excludes the opposite type of lock but not the same type, should work best.
Well, I think I can do it with a clumsy approach...
Have the "barrier" be its own process listening on a socket. Implement barrier_wait as:
open connection to barrier process
send message telling barrier process I am waiting
block in read() waiting for reply
Once N threads are waiting, the barrier process tells all of them to proceed. Each waiter then closes its connection to the barrier process and continues.
Implement barrier_destroy as:
open connection to barrier process
send message telling barrier process to go away
close connection
Once all connections are closed and the barrier process has been told to go away, it exits.
[Edit: Granted, this allocates and destroys a socket as part of the wait and release operations. But I think you can implement the same protocol without doing so; see below.]
First question: Does this protocol actually work? I think it does, but maybe I do not understand the requirements.
Second question: If it does work, can it be simulated without the overhead of an extra process?
I believe the answer is "yes". You can have each thread "take the role of" the barrier process at the appropriate time. You just need a master mutex, held by whichever thread is currently "taking the role" of the barrier process. Details, details... OK, so the barrier_wait might look like:
lock(master_mutex);
++waiter_count;
if (waiter_count < N)
    cond_wait(master_condition_variable, master_mutex);
else
    cond_broadcast(master_condition_variable);
--waiter_count;
bool do_release = time_to_die && waiter_count == 0;
unlock(master_mutex);
if (do_release)
    release_resources();
Here master_mutex (a mutex), master_condition_variable (a condition variable), waiter_count (an unsigned integer), N (another unsigned integer), and time_to_die (a Boolean) are all shared state allocated and initialized by barrier_init. waiter_count is initialized to zero, time_to_die to false, and N to the number of threads the barrier is waiting for.
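For the process-shared case, that state could be set up roughly as follows (a sketch; note that a production version would also need to guard the cond_wait above against spurious wakeups, e.g. with a generation counter):

#include <pthread.h>
#include <stdbool.h>

/* Illustrative shared-memory layout for the pseudocode above. */
typedef struct {
    pthread_mutex_t master_mutex;
    pthread_cond_t  master_condition_variable;
    unsigned        waiter_count;
    unsigned        N;
    bool            time_to_die;
} my_barrier_t;

/* b must point into memory shared by all participating processes. */
static int barrier_init(my_barrier_t *b, unsigned n)
{
    pthread_mutexattr_t ma;
    pthread_condattr_t  ca;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    int rc = pthread_mutex_init(&b->master_mutex, &ma);
    if (rc == 0)
        rc = pthread_cond_init(&b->master_condition_variable, &ca);
    pthread_mutexattr_destroy(&ma);
    pthread_condattr_destroy(&ca);
    b->waiter_count = 0;
    b->N = n;
    b->time_to_die = false;
    return rc;
}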
Then barrier_destroy would be:
lock(master_mutex);
time_to_die = true;
bool do_release = waiter_count == 0;
unlock(master_mutex);
if (do_release)
    release_resources();
Not sure about all the details concerning signal handling etc... But the basic idea of "last one out turns off the lights" is workable, I think.

Signalling all threads in a process

Without keeping a list of current threads, I'm trying to see that a realtime signal gets delivered to all threads in my process. My idea is to go about it like this:
Initially the signal handler is installed and the signal is unblocked in all threads.
When one thread wants to send the 'broadcast' signal, it acquires a mutex and sets a global flag that the broadcast is taking place.
The sender blocks the signal (using pthread_sigmask) for itself, and enters a loop repeatedly calling raise(sig) until sigpending indicates that the signal is pending (meaning no threads remained with the signal unblocked).
As threads receive the signal, they act on it but wait in the signal handler for the broadcast flag to be cleared, so that the signal will remain masked.
The sender finishes the loop by unblocking the signal (in order to get its own delivery).
When the sender handles its own signal, it clears the global flag so that all the other threads can continue with their business.
The problem I'm running into is that pthread_sigmask is not being respected. Everything works right if I run the test program under strace (presumably due to different scheduling timing), but as soon as I run it alone, the sender receives its own signal (despite having blocked it..?) and none of the other threads ever get scheduled.
Any ideas what might be wrong? I've tried using sigqueue instead of raise, probing the signal mask, adding sleep all over the place to make sure the threads are patiently waiting for their signals, etc. and now I'm at a loss.
Edit: Thanks to psmears' answer, I think I understand the problem. Here's a potential solution. Feedback would be great:
At any given time, I can know the number of threads running, and I can prevent all thread creation and exiting during the broadcast signal if I need to.
The thread that wants to do the broadcast signal acquires a lock (so no other thread can do it at the same time), then blocks the signal for itself, and sends num_threads signals to the process, then unblocks the signal for itself.
The signal handler atomically increments a counter, and each instance of the signal handler waits until that counter is equal to num_threads to return.
The thread that did the broadcast also waits for the counter to reach num_threads, then it releases the lock.
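A rough sketch of the handler side of that proposal (the names are mine; the sender-side sigqueue loop and the reset of the counter between broadcasts are omitted):

#include <signal.h>
#include <stdatomic.h>

static atomic_int handlers_entered;
static atomic_int num_threads;       /* maintained by the thread registry */

static void broadcast_handler(int sig)
{
    (void)sig;
    atomic_fetch_add(&handlers_entered, 1);
    /* Park every thread in the handler until all have arrived, so the
       signal stays masked and no thread can handle it twice. */
    while (atomic_load(&handlers_entered) < atomic_load(&num_threads))
        ;                            /* spin */
    /* ... act on the broadcast, then return ... */
}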
One possible concern is that the signals will not get queued if the kernel is out of memory (Linux seems to have that issue). Do you know if sigqueue reliably informs the caller when it's unable to queue the signal (in which case I would loop until it succeeds), or could signals possibly be silently lost?
Edit 2: It seems to be working now. According to its documentation, sigqueue fails with EAGAIN when it cannot queue the signal. But for robustness, I decided to just keep calling sigqueue until num_threads-1 signal handlers are running, interleaving calls to sched_yield after I've sent num_threads-1 signals.
There was a race condition at thread creation time, counting new threads, but I solved it with a strange (ab)use of read-write locks. Thread creation is "reading" and the broadcast signal is "writing", so unless there's a thread trying to broadcast, it doesn't create any contention at thread-creation.
raise() sends the signal to the current thread (only), so other threads won't receive it. I suspect that the fact that strace makes things work is a bug in strace (due to the way it works it ends up intercepting all signals sent to the process and re-raising them, so it may be re-raising them in the wrong way...).
You can probably get round that using kill(getpid(), <signal>) to send the signal to the current process as a whole.
However, another potential issue you might see is that sigpending() can indicate that the signal is pending on the process before all threads have received it - all that means is that there is at least one such signal pending for the process, and no CPU has yet become available to run a thread to deliver it...
Can you describe more details of what you're aiming to achieve? And how portable you want it to be? There's almost certainly a better way of doing it (signals are almost always a major headache, especially when mixed with threads...)
In a multithreaded program, raise(sig) is equivalent to pthread_kill(pthread_self(), sig).
Try kill(getpid(), sig) instead.
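In other words (a minimal sketch of the distinction):

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

void send_to_self_only(int sig)
{
    raise(sig);            /* same as pthread_kill(pthread_self(), sig) */
}

void send_to_process(int sig)
{
    kill(getpid(), sig);   /* delivered to some one thread in the process
                              that currently has sig unblocked */
}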
Given that you can apparently lock thread creation and destruction, could you not just have the "broadcasting" thread post the required updates to thread-local-state in a per-thread queue, which each thread checks whenever it goes to use the thread-local-state? If there's outstanding update(s), it first applies them.
You are trying to synchronize a set of threads.
From a design-pattern point of view, the pthreads-native solution to your problem would be a pthread barrier.
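For reference, a minimal example of the standard pthread barrier API:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld: before barrier\n", id);
    /* Exactly one (unspecified) thread gets the serial return value. */
    if (pthread_barrier_wait(&barrier) == PTHREAD_BARRIER_SERIAL_THREAD)
        printf("thread %ld: serial thread\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);   /* safe: every wait has returned */
    return 0;
}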
