pthread_exit() in signal handler - c

(This question might be somewhat related to "pthread_exit in signal handler causes segmentation fault".)
I'm writing a deadlock prevention library in which a checking thread continuously does graph computations and checks whether there is a deadlock; if so, it signals one of the conflicting threads. When that thread catches the signal, it releases all the mutexes it owns and exits. There are multiple resource mutexes (obviously) and one critical region mutex; all calls that acquire or release a resource lock or do graph calculations must obtain this lock first.
Now here is the problem. With 2 competing threads (not counting the checking thread), the program sometimes deadlocks after one thread gets killed. gdb says the dead thread owns the critical region lock but never released it. After adding a breakpoint in the signal handler and stepping through, it appears that the lock belongs to someone else (as expected) right before pthread_exit(), but the ownership magically goes to this thread after pthread_exit().
The only guess I can think of is that the thread to be killed was blocked in pthread_mutex_lock while trying to gain the critical region lock (because it wanted another resource mutex), and then the signal came, interrupting pthread_mutex_lock. Since this call is not signal-proof, something weird happened? Like the signal handler might have returned and that thread got the lock and then exited? I don't know. Any insight is appreciated!

pthread_exit is not async-signal-safe, and thus the only way you can call it from a signal handler is if you ensure that the signal is not interrupting any non-async-signal-safe function.
As a general principle, using signals as a method of communication with threads is usually a really bad idea. You end up mixing two issues that are already difficult enough on their own: thread-safety (proper synchronization between threads) and reentrancy within a single thread.
If your goal with signals is just to instruct a thread to terminate, a better mechanism might be pthread_cancel. To use this safely, however, the thread that will be cancelled must set up cancellation cleanup handlers at the proper points and/or disable cancellation temporarily when it's not safe (with pthread_setcancelstate). Also, be aware that pthread_mutex_lock is not a cancellation point. There's no safe way to interrupt a thread that's blocked waiting to obtain a mutex, so if you need interruptibility like this, you probably need either a more elaborate synchronization setup with condition variables (condvar waits are cancellable), or you could use semaphores instead of mutexes.
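To make the cancellation route concrete, here is a minimal sketch of a thread that waits on a condition variable with a cleanup handler in place. The names mtx, cnd, ready, unlock_mutex, worker and cancel_waiting_worker are made up for illustration and are not from the question's code. It assumes the default deferred cancellation type, so the cancel request is only acted on inside pthread_cond_wait.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state for this sketch. */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cnd = PTHREAD_COND_INITIALIZER;
static int ready = 0;   /* never set here, so the worker keeps waiting */

/* Cleanup handler: runs if the thread is cancelled between push and pop. */
static void unlock_mutex(void *arg)
{
    pthread_mutex_unlock(arg);
}

static void *worker(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&mtx);                  /* not a cancellation point */
    pthread_cleanup_push(unlock_mutex, &mtx);
    while (!ready)
        pthread_cond_wait(&cnd, &mtx);         /* cancellation point */
    pthread_cleanup_pop(1);                    /* normal path: unlock as well */
    return NULL;
}

/* Returns 0 if the worker was cancelled and left mtx unlocked. */
int cancel_waiting_worker(void)
{
    pthread_t t;
    void *res;

    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_cancel(t) != 0)
        return -1;
    /* pthread_cancel only requests cancellation; join to wait for it. */
    if (pthread_join(t, &res) != 0 || res != PTHREAD_CANCELED)
        return -1;
    /* The cleanup handler released mtx, so it can be taken again. */
    if (pthread_mutex_trylock(&mtx) != 0)
        return -1;
    pthread_mutex_unlock(&mtx);
    return 0;
}
```

When a thread is cancelled inside pthread_cond_wait, the mutex is re-acquired before the cleanup handlers run, which is why the handler can unlock it unconditionally.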
Edit: If you really do need a way to terminate threads waiting for mutexes, you could replace calls to pthread_mutex_lock with calls to your own function that loops calling pthread_mutex_timedlock and checking for an exit flag on each timeout.
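A rough sketch of such a replacement function follows. The names lock_or_exit and exit_requested, the 50 ms slice, and the choice of ECANCELED as the "asked to exit" return value are all assumptions made for this example, not anything from the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical exit flag, set by whichever thread wants this one to stop. */
static atomic_int exit_requested = 0;

/*
 * Lock m, but give up once exit_requested becomes nonzero.
 * Returns 0 with the mutex held, ECANCELED if asked to exit,
 * or another error code from pthread_mutex_timedlock.
 */
int lock_or_exit(pthread_mutex_t *m)
{
    for (;;) {
        struct timespec deadline;
        int rc;

        if (atomic_load(&exit_requested))
            return ECANCELED;

        /* pthread_mutex_timedlock takes an absolute CLOCK_REALTIME time. */
        if (clock_gettime(CLOCK_REALTIME, &deadline) != 0)
            return errno;
        deadline.tv_nsec += 50L * 1000 * 1000;   /* 50 ms slice, arbitrary */
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        rc = pthread_mutex_timedlock(m, &deadline);
        if (rc != ETIMEDOUT)
            return rc;   /* 0 on success, or a real error */
        /* Timed out: loop around and re-check the flag. */
    }
}
```

A caller that gets ECANCELED back does not hold the mutex, so it can release whatever else it owns and return from its thread function normally instead of calling pthread_exit from a signal handler.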

Related

How to pthread_barrier_destroy() without waiting for pthread_barrier_wait()

I have a multithreaded application that uses barriers to synchronise worker threads.
At the end of function compute(), threads are cancelled:
...
for (int i = 0; i < p; i++) {
    printf("Thread %lu completed in %d passes\n", threads[i], find_tstat(threads[i])->count);
    pthread_cancel(threads[i]);
}
printf("================================================================\n");
return a;
Threads are interrupted in the middle of computation, so they may be in between barriers. This is likely what's causing pthread_barrier_destroy() to hang: some barrier_wait() has not returned yet.
The question is; how can I still destroy even if a wait() hasn't returned?
The answer to your question is: you can't.
man pthread_barrier_destroy
The results are undefined if pthread_barrier_destroy() is called when any thread is blocked on the barrier
man pthread_cancel
On Linux, cancellation is implemented using signals.
man pthread_barrier_wait
If a signal is delivered to a thread blocked on a barrier, upon return from the signal handler the thread shall resume waiting at the barrier if the barrier wait has not completed (that is, if the required number of threads have not arrived at the barrier during the execution of the signal handler); otherwise, the thread shall continue as normal from the completed barrier wait. Until the thread in the signal handler returns from it, it is unspecified whether other threads may proceed past the barrier once they have all reached it.
A thread that has blocked on a barrier shall not prevent any unblocked thread that is eligible to use the same processing resources from eventually making forward progress in its execution. Eligibility for processing resources shall be determined by the scheduling policy.
As the question is posed:
The question is; how can I still destroy even if a wait() hasn't returned?
the answer is "you can't", as your other answer explains.
However, with good enough record keeping, you can launch just enough extra threads specifically to wait at the barrier in order to let any other threads already waiting pass through. This would likely be tied together with code and data intended to provide for your threads to be shut down cleanly instead of being canceled, which is also something you should do.
On the other hand, it's pretty easy to roll your own barrier with use of a condition variable and mutex, and the result is more flexible. You still should not be canceling threads, but you can make waits at a hand-rolled barrier such as I describe soft-cancelable. This would be my recommendation.
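As a rough illustration of what such a hand-rolled barrier could look like, here is a sketch built from one mutex and one condition variable. The soft_barrier type and all of its functions are invented for this example (none of them are POSIX), and "soft-cancelable" is interpreted here as "a shutdown call wakes every waiter and makes the wait return -1".

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical barrier type for this sketch; none of these names are POSIX. */
struct soft_barrier {
    pthread_mutex_t mtx;
    pthread_cond_t  cnd;
    unsigned        needed;      /* arrivals required to release the barrier */
    unsigned        waiting;     /* threads currently blocked in a wait */
    unsigned long   generation;  /* incremented on every release */
    int             shutdown;    /* nonzero: wake everyone, refuse new waits */
};

int soft_barrier_init(struct soft_barrier *b, unsigned needed)
{
    if (needed == 0)
        return -1;
    b->needed = needed;
    b->waiting = 0;
    b->generation = 0;
    b->shutdown = 0;
    if (pthread_mutex_init(&b->mtx, NULL) != 0)
        return -1;
    if (pthread_cond_init(&b->cnd, NULL) != 0) {
        pthread_mutex_destroy(&b->mtx);
        return -1;
    }
    return 0;
}

/* Returns 0 after a normal release, -1 if the barrier was shut down. */
int soft_barrier_wait(struct soft_barrier *b)
{
    int rc = 0;

    pthread_mutex_lock(&b->mtx);
    if (b->shutdown) {
        rc = -1;
    } else if (++b->waiting == b->needed) {
        /* Last arrival: start a new generation and release the others. */
        b->waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->cnd);
    } else {
        unsigned long gen = b->generation;
        while (gen == b->generation && !b->shutdown)
            pthread_cond_wait(&b->cnd, &b->mtx);
        if (gen == b->generation) {   /* woken by shutdown, not a release */
            b->waiting--;
            rc = -1;
        }
    }
    pthread_mutex_unlock(&b->mtx);
    return rc;
}

/* Ask every current and future waiter to give up. */
void soft_barrier_shutdown(struct soft_barrier *b)
{
    pthread_mutex_lock(&b->mtx);
    b->shutdown = 1;
    pthread_cond_broadcast(&b->cnd);
    pthread_mutex_unlock(&b->mtx);
}

/* Call only after every thread that used the barrier has been joined. */
void soft_barrier_destroy(struct soft_barrier *b)
{
    pthread_cond_destroy(&b->cnd);
    pthread_mutex_destroy(&b->mtx);
}

/* Example thread body: passes the wait result back to the joiner. */
void *soft_barrier_waiter(void *arg)
{
    return (void *)(intptr_t)soft_barrier_wait(arg);
}
```

With this shape, the end of compute() would call soft_barrier_shutdown(), join the workers (each of which sees -1 from its wait and returns), and only then call soft_barrier_destroy(), so nothing is ever destroyed while a thread is still blocked on it.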

Why do I get EBUSY when trying to pthread_mutex_destroy?

Right before exiting, I make the following calls from main(), in this order:
pthread_cancel() on the other threads that use mtx and are "waiting" (they are waiting on another condition variable and mutex; maybe that's the problem?)
pthread_cond_destroy(&cnd) (which is "coupled" with mtx)
pthread_mutex_unlock(&mtx)
pthread_mutex_destroy(&mtx)
However, the last function returns EBUSY. Each time another thread uses the mutex, it releases it almost immediately. Also, as mentioned, I kill all those threads before trying to destroy the mutex.
Why is it happening?
As per man pthread_mutex_destroy:
The pthread_mutex_destroy() function may fail if:
EBUSY
The implementation has detected an attempt to destroy the object referenced by mutex while it is locked or referenced (for example, while being used in a pthread_cond_timedwait() or pthread_cond_wait()) by another thread.
Check that the mutex is not in use by another thread when you try to destroy it.
pthread_cancel() on the other threads that use mtx and are "waiting" (they are waiting on another condition variable and mutex ...
Cancellation runs asynchronously with respect to the cancelling thread; that is, pthread_cancel() might very well return before the thread to be cancelled has ended.
As a result, resources (mutexes, conditions, ...) used by the thread to be cancelled may still be in use when pthread_mutex_destroy() is called immediately afterwards.
The only way to test whether cancellation succeeded is to call pthread_join() on the cancelled thread and expect it to return PTHREAD_CANCELED. This implies that the thread to be cancelled wasn't detached.
Here you see one possible issue with cancelling threads. There are others. Simply avoid all this by not using pthread_cancel(); instead, implement a proper design that ends all threads in a well-defined manner.
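One possible shape for such a design, as a sketch only: the waiting threads also watch a stop flag, and main() sets the flag, wakes them, joins them, and destroys the objects last. The names stop, worker and shutdown_cleanly, and the choice of two workers, are assumptions for this example; mtx and cnd stand in for the objects in the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cnd = PTHREAD_COND_INITIALIZER;
static int stop = 0;   /* hypothetical stop flag, protected by mtx */

static void *worker(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&mtx);
    while (!stop)                       /* real code would also check for work */
        pthread_cond_wait(&cnd, &mtx);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Returns 0 if both objects were destroyed without EBUSY or another error. */
int shutdown_cleanly(void)
{
    pthread_t t[2];
    int started = 0, failed = 0, i, rc;

    for (i = 0; i < 2; i++) {
        if (pthread_create(&t[i], NULL, worker, NULL) != 0) {
            failed = 1;
            break;
        }
        started++;
    }

    /* Tell every worker to stop and wake those blocked on the condvar. */
    pthread_mutex_lock(&mtx);
    stop = 1;
    pthread_cond_broadcast(&cnd);
    pthread_mutex_unlock(&mtx);

    /* Only after the joins is it certain that nobody references mtx or cnd. */
    for (i = 0; i < started; i++)
        pthread_join(t[i], NULL);

    rc = pthread_cond_destroy(&cnd);
    if (rc == 0)
        rc = pthread_mutex_destroy(&mtx);
    return failed ? -1 : rc;
}
```

The ordering is the point here: the destroy calls come after pthread_join(), not right after the request to stop.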

sem_wait and signal handler

Why can't sem_wait be used inside a signal handler (particularly for the SIGSEGV signal, which is per thread)? Can someone give me an example scenario where it will crash the application? I guess sem_wait is both reentrant and thread safe, so what is the problem here? Why is it not async safe?
Async safe is a much stricter requirement than thread safe. You can write thread safe code using primitives to protect global data with critical sections. Signal handlers can't rely on this. For example, you could be inside a critical section within sem_wait, and simultaneously do something that causes a segfault. This would break the thread-safe protections of sem_wait.
sem_wait cannot be used in a signal handler for this reason:
Thread A calls sem_wait on sem1. When thread A is done, it posts to sem1. However, before it can finish, the signal is received and the handler is entered, calling sem_wait on sem1. Since A is the one that would post to sem1, the handler will never return and you will have deadlock. This is why it is a good rule to never wait on anything in a signal handler. The problem, AFAIK, has more to do with deadlock than crashing.
Also, this violates the ideal purpose of a signal handler, which is to handle an external interrupt and then get back to what you were doing quickly.
Lastly, isn't it a better goal to rid yourself of the SIGSEGV instead of handling it?
What if the application receives a signal while the value of the semaphore is zero, and the thread that receives the signal happens to be the one which is supposed to increment the semaphore value (sem_post)? If you then call sem_wait in the signal handler, the process will deadlock, no?
Another argument could of course be that if sem_wait is not on the list of async-signal-safe functions, the implementation is free to invoke nasal demons.
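For contrast, sem_post is on the POSIX list of async-signal-safe functions, so the usual arrangement is the reverse one: the handler posts, and ordinary code does the waiting. A small sketch, where event_sem, on_signal, install_notifier and wait_for_event are names invented for this example:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>

static sem_t event_sem;   /* hypothetical semaphore for this sketch */

/* Handler: only posts, never waits. */
static void on_signal(int signo)
{
    int saved_errno = errno;   /* sem_post may change errno */
    (void)signo;
    sem_post(&event_sem);
    errno = saved_errno;
}

/* Initialise the semaphore and install the handler for signo. */
int install_notifier(int signo)
{
    struct sigaction sa;

    if (sem_init(&event_sem, 0, 0) != 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Called from normal (non-handler) code: block until the handler posts. */
int wait_for_event(void)
{
    while (sem_wait(&event_sem) != 0) {
        if (errno != EINTR)    /* retry if a signal interrupted the wait */
            return -1;
    }
    return 0;
}
```

Because the handler never blocks, it cannot end up waiting for a post that only the interrupted thread could have made.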

Is it possible to terminate only the one thread on receiving a SIGSEGV?

I have an application which starts multiple threads.
I am using a signal handler to catch the signals.
I don't want my application to quit on SIGSEGV; I want to terminate only the thread that incurred the signal, and to continue the flow of the full application in the other threads.
Is it possible?
If a SIGSEGV happens, it indicates that your program has already invoked undefined behavior, i.e. the state of the entire program is undefined/indeterminate/invalid. In practice it's possible that you may be able to recover and keep running, but there's no guarantee, and it could be dangerous.
As asveikau mentioned, you could longjmp out of the signal handler and try to clean up, but this could make an even worse mess if the crash happened in the middle of malloc, free, printf, or any function modifying the state of global data or data that's shared with other threads or that will be accessed in the cleanup code at the longjmp destination. The state may be corrupt/inconsistent, and/or locks may be held and left permanently unreleasable.
If you can ensure this won't happen - for example if the misbehaving thread never calls any async-signal-unsafe functions - then it may be safe to longjmp out of the signal handler then call pthread_exit.
An alternative might be to permanently freeze the thread in the signal handler, by adding all signals to the sa_mask for SIGSEGV and then writing for (;;) pause(); in the signal handler. This is 100% "safe", but may leave the process in a deadlocked state if any locks were held by the crashing thread. This is perhaps "less bad" than exposing corrupt state to other threads and further clobbering your data to hell...
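The freezing variant could be set up roughly like this. The function names freeze_thread and install_freeze_handler are made up for the sketch; the calls themselves (sigfillset, sigaction, pause) are standard POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Handler: park the faulting thread forever instead of returning. */
static void freeze_thread(int signo)
{
    (void)signo;
    for (;;)
        pause();   /* every blockable signal is masked while we are here */
}

/* Install the freezing handler for SIGSEGV; returns 0 on success. */
int install_freeze_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = freeze_thread;
    sigfillset(&sa.sa_mask);   /* block all signals during the handler */
    return sigaction(SIGSEGV, &sa, NULL);
}
```

The frozen thread never terminates, so it cannot be joined, and any lock it held at the time of the fault stays held.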

Kill Thread in Pthread Library

I create a thread with pthread_create(&thread1, &attrs, //... , //...); and, if some condition occurs, I need to kill this thread. How do I kill it?
First store the thread id
pthread_create(&thr, ...)
then later call
pthread_cancel(thr)
However, this is not a recommended programming practice! It's better to use an inter-thread communication mechanism like semaphores or messages to communicate to the thread that it should stop execution.
Note that pthread_kill(...) does not actually terminate the receiving thread, but instead delivers a signal to it, and it depends on the signal and signal handlers what happens.
There are two approaches to this problem.
Use a signal: The thread installs a signal handler using sigaction() which sets a flag, and the thread periodically checks the flag to see whether it must terminate. When the thread must terminate, issue the signal to it using pthread_kill() and wait for its termination with pthread_join(). This approach requires pre-synchronization between the parent thread and the child thread, to guarantee that the child thread has already installed the signal handler before it is able to handle the termination signal;
Use a cancellation point: The thread terminates whenever a cancellation function is executed. When the thread must terminate, execute pthread_cancel() and wait for its termination with pthread_join(). This approach requires detailed usage of pthread_cleanup_push() and pthread_cleanup_pop() to avoid resource leakage. These last two calls might mess with the lexical scope of the code (since they may be macros yielding { and } tokens) and are very difficult to maintain properly.
(Note that if you have already detached the thread using pthread_detach(), you cannot join it again using pthread_join().)
Both approaches can be very tricky, but either might be especially useful in a given situation.
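A minimal sketch of the signal-based approach, with several assumptions: SIGUSR1 is used as the termination signal, the flag is a single process-wide variable, and the names stop_flag, on_stop, worker and stop_worker_with_signal are invented for the example. Installing the handler before creating the thread is one simple way to get the pre-synchronization mentioned above, since signal handlers are shared by the whole process.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Hypothetical termination flag; sig_atomic_t is safe to set in a handler. */
static volatile sig_atomic_t stop_flag = 0;

static void on_stop(int signo)
{
    (void)signo;
    stop_flag = 1;
}

static void *worker(void *unused)
{
    struct timespec nap = { 0, 10 * 1000 * 1000 };   /* 10 ms */

    (void)unused;
    while (!stop_flag)
        nanosleep(&nap, NULL);   /* stands in for one unit of real work */
    return NULL;
}

/* Start a worker, ask it to stop via a signal, and wait for it to finish. */
int stop_worker_with_signal(void)
{
    struct sigaction sa;
    pthread_t t;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_stop;
    sigemptyset(&sa.sa_mask);
    /* Installed before pthread_create, so the worker can never miss it. */
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_kill(t, SIGUSR1) != 0)
        return -1;
    return pthread_join(t, NULL);
}
```

The handler does nothing except set the flag; the thread itself decides when it is safe to leave its loop and release whatever it holds.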
I agree with Antti, better practice would be to implement some checkpoint(s) where the thread checks if it should terminate. These checkpoints can be implemented in a number of ways e.g.: a shared variable with lock or an event that the thread checks if it is set (the thread can opt to wait zero time).
Take a look at the pthread_kill() function.
pthread_exit(0)
This terminates the calling thread, so it has to be called from within the thread that should exit.
