How to properly terminate a thread in a signal handler?

How to properly terminate a thread in a signal handler? - c

I want to set up a signal handler for SIGSEGV, SIGILL and possibly a few other signals that, rather than terminating the whole process, just terminates the offending thread and perhaps sets a flag somewhere so that a monitoring thread can complain and start another thread. I'm not sure there is a safe way to do this. Pthreads seems to provide functions for exiting the current thread, as well as canceling another thread, but these potentially call a bunch of at-exit handlers. Even if they don't, it seems as though there are many situations in which they are not async-signal-safe, although it is possible that those situations are avoidable. Is there a lower-level function I can call that just destroys the thread? Assuming I modify my own data structures in an async-signal-safe way, and acquire no mutexes, are there pthread/other global data structures that could be left in an inconsistent state simply by a thread terminating at a SIGSEGV? malloc comes to mind, but malloc itself shouldn't SIGSEGV/SIGILL unless the libc is buggy. I realize that POSIX is very conservative here, and makes no guarantees. As long as there's a way to do this in practice I'm happy. Forking is not an option, btw.

If the SIGSEGV/SIGILL/etc. happens in your own code, the signal handler will not run in an async-signal context (it's fundamentally a synchronous signal, but would still be an AS context if it happened inside a standard library function), so you can legally call pthread_exit from the signal handler. However, there are still issues that make this practice dubious:
SIGSEGV/SIGILL/etc. never occur in a program whose behavior is defined unless you generate them via raise, kill, pthread_kill, sigqueue, etc. (and in some of these special cases, they would be asynchronous signals). Otherwise, they're indicative of a program having undefined behavior. If the program has invoked undefined behavior, all bets are off. UB is not isolated to a particular thread or a particular sequence in time. If the program has UB, its entire output/behavior is meaningless.
If the program's state is corrupted (e.g. due to access-after-free, use of invalid pointers, buffer overflows, ...) it's very possible that the first faulting access will happen inside part of the standard library (e.g. inside malloc) rather than in your code. In this case, the signal handler runs in an AS-safe context and cannot call pthread_exit. Of course the program already has UB anyway (see the above point), but even if you wanted to pretend that's not an issue, you'd still be in trouble.
If your program is experiencing these kinds of crashes, you need to find the cause and fix it, not try to patch around it with signal handlers. Valgrind is your friend. If that's not possible, your best bet is to isolate the crashing code into separate processes where you can reason about what happens if they crash asynchronously, rather than having the crashing code in the same process (where any further reasoning about the code's behavior is invalid once you know it crashes).

Related

C difference between main thread and other threads

Is there a difference between the first thread and other threads created during runtime. Because I have a program where to abort longjmp is used and a thread should be able to terminate the program (exit or abort don't work in my case). Could I safely use pthread_kill_other_threads_np and then longjmp?

I'm not sure what platform you're talking about, but pthread_kill_other_threads_np is not a standard function and not a remotely reasonable operation anymore than free_all_malloced_memory would be. Process termination inherently involves the termination of all threads atomically with respect to each other (they don't see each other terminate).
As for longjmp, while there is nothing wrong with longjmp, you cannot use it to jump to a context in a different thread.
It sounds like you have an XY problem here; you've asked about whether you can use (or how to use) particular tools that are not the right tool for whatever it is you want, without actually explaining what your constraints are.

Shared library unloading and atexit handling - what order?

I'm debugging a sort of weird issue where it looks like a thread that's killed in an atexit handler is accessing a shared library, and is segfaulting because that shared library is unloaded before the exit handler runs. I'm not sure this is actually the issue, but it's a hunch on what might be happening.
When a process terminates (main exits or exit() is called), is the atexit handler the immediate next thing to run? My mind says so, but the segfault I'm seeing seems to say otherwise.
Is there any difference (with regards to exit handling) between main returning (either end of function or with return) and calling exit() directly?

When a process terminates (main exits or exit() is called), is the atexit handler the immediate next thing to run? My mind says so, but the segfault I'm seeing seems to say otherwise.
Not necessarily, you're guaranteed that atexit handlers will run. But that's it, atexit handlers may even be called concurrently with other things. While you're using c remember that c++ may be in the same process. C++ says that atexit may be called concurrently to destructors being run for objects of static duration. This means that atexit is very dangerous and you need to ensure you're being very careful what you call.
Is there any difference (with regards to exit handling) between main returning (either end of function or with return) and calling exit() directly?
According to the documentation: No.
The safest thing to do when tearing down the house: Nothing. Just get out and let the house be torn down. Closing the drapes on the way out isn't worth the time so to speak.

Race conditions can also occur in traditional, single-threaded programs - Clarity

I have read few books on parallel programming over the past few months and I decided to close it off with learning about the posix thread.
I am reading "PThreads programming - A Posix standard for better multiprocessing nutshell-handbook". In chapter 5 ( Pthreads and Unix ) the author talks about handling signals in multi-threaded programs. In the "Threadsafe Library Functions and System Calls" section, the author made a statement that I have not seen in most books that I have read on parallel programming. The statement was:
Race conditions can also occur in traditional, single-threaded programs that use signal handlers or that call routines recursively. A single-threaded program of this kind may have the same routine in progress in various call frames on its process stack.
I find it a little bit tedious to decipher this statement. Does the race condition in the recursive function occur when the recursive function keeps an internal structure by using the static storage type?
I would also love to know how signal handlers can cause RACE CONDITION IN SINGLE THREADED PROGRAMS
Note: Am not a computer science student , i would really appreciate simplified terms

I don't think one can call it a race condition in the classical meaning. Race conditions have a somewhat stochastic behavior, depending on the scheduler policy and timings.
The author is probably talking about bugs that can arise when the same object/resource is accessed from multiple recursive calls. But this behavior is completely deterministic and manageable.
Signals on the other hand is a different story as they occur asynchronously and can apparently interrupt some data processing in the middle and trigger some other processing on that data, corrupting it when returned to the interrupted task.

A signal handler can be called at any time without warning, and it potentially can access any global state in the program.
So, suppose your program has some global flag, that the signal handler sets in response to,... I don't know,... SIGINT. And your program checks the flag before each call to f(x).
if (! flag) {
f(x);
}
That's a data race. There is no guarantee that f(x) will not be called after the signal happens because the signal could sneak in at any time, including right after the "main" program tests the flag.

First it is important to understand what a race condition is. The definition given by Wikipedia is:
Race conditions arise in software when an application depends on the sequence or timing of processes or threads for it to operate properly.
The important thing to note is that a program can behave both properly and improperly based on timing or ordering of execution.
We can fairly easily create "dummy" race conditions in single threaded programs under this definition.
bool isnow(time_t then) {
time_t now = time(0);
return now == then;
}
The above function is a very dumb example and while mostly it will not work, sometimes it will give the correct answer. The correct vs. incorrect behavior depends entirely on timing and so represents a race condition on a single thread.
Taking it a step further we can write another dummy program.
bool printHello() {
sleep(10);
printf("Hello\n");
}
The expected behavior of the above program is to print "Hello" after waiting 10 seconds.
If we send a SIGINT signal 11 seconds after calling our function, everything behaves as expected. If we send a SIGINT signal 3 seconds after calling our function, the program behaves improperly and does not print "Hello".
The only difference between the correct and incorrect behavior was the timing of the SIGINT signal. Thus, a race condition was introduced by signal handling.

I'm going to give a more general answer than you asked for. And this is my own, personal, pragmatic answer, not necessarily one that hews to any official, formal definition of the term "race condition".
Me, I hate race conditions. They lead to huge classes of nasty bugs that are hard to think about, hard to find, and sometimes hard to fix. So I don't like doing programming that's susceptible to race conditions. So I don't do much classically multithreaded programming.
But even though I don't do much multithreaded programming, I'm still confronted by certain classes of what feel to me like race conditions from time to time. Here are the three I try to keep in mind:
The one you mentioned: signal handlers. Receipt of a signal, and calling of a signal handler, is a truly asynchronous event. If you have a data structure of some kind, and you're in the middle of modifying it when a signal occurs, and if your signal handler also tries to modify that same data structure, you've got a race condition. If the code that was interrupted was in the middle of doing something that left the data structure in an inconsistent state, the code in the signal handler might be confused. Note, too, that it's not necessarily code right in the signal handler, but any function called by the signal handler, or called by a function that's called by the signal handler, etc.
Shared OS resources, typically in the filesystem: If your program accesses (or modifies) a file or directory in the filesystem that's also being accessed or modified by another process, you've got a big potential for race conditions. (This is not surprising, because in a computer science sense, multiple processes are multiple threads. They may have separate address spaces meaning they can't interfere with each other that way, but obviously the filesystem is a shared resource where they still can interfere with each other.)
Non-reentrant functions like strtok. If a function maintains internal, static state, you can't have a second call to that function if another instance is active. This is not a "race condition" in the formal sense at all, but it has many of the same symptoms, and also some of the same fixes: don't use static data; do try to write your functions so that they're reentrant.

The author of the book in which you found seems to be defining the term "race condition" in an unusual manner, or maybe he's just used the wrong term.
By the usual definition, no, recursion does not create race conditions in single-threaded programs because the term is defined with respect to the respective actions of multiple threads of execution. It is possible, however, for a recursion to produce exposure to non-reentrancy of some of the functions involved. It's also possible for a single thread to deadlock against itself. These do not reflect race conditions, but perhaps one or both of them is what the author meant.
Alternatively, maybe what you read is the result of a bad editing job. The text you quoted groups functions that employ signal handling together with recursive functions, and signal handlers indeed can produce data races, just as a multiple threads can do, because execution of a signal handler has the relevant characteristics of execution of a separate thread.

Race conditions absolutely happen in single-threaded programs once you have signal handlers. Look at the Unix manual page for pselect().
One way it happens is like this: You have a signal handler that sets some global flag. You check your global flag and because it is clear you make a system call that suspends, confident that when the signal arrives the system call will exit early. But the signal arrives just after you check the global flag and just before the system call takes place. So now you're hung in a system call waiting for a signal that has already arrived. In this case, the race is between your single-threaded code and an external signal.

Well, consider the following code:
#include <pthread.h>
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int num = 2;
void lock_and_call_again() {
pthread_mutex_lock(&mutex);
if(num > 0) {
--num;
lock_and_call_again();
}
}
int main(int argc, char** argv) {
lock_and_call_again();
}
(Compile with gcc -pthread thread-test.c if you safe the code as thread-test.c)
This is clearly single-threaded, isn't it?
Never the less, it will enter a dead-lock, because you try to lock an already locked mutex.
That's basically what is meant within the paragraph you cited, IMHO:
It does not matter whether it is done in several threads or one single thread, if you try to lock an already locked mutex, your program will end in an dead-lock.
If a function calls itself, like lock_and_call above, it what is called a recursive call .
Just as james large explains, a signal can occur any time, and if a signal handler is registered with this signal, it will called at unpredictable times, if no measures are taken, even while the same handler is already being executed - yielding some kind of implicit recursive execution of the signal handler.
If this handler aquires some kind of a lock, you end up in a dead-lock, even without a function calling itself explicitly.
Consider the following function:
pthread_mutex_t mutex;
void my_handler(int s) {
pthread_mutex_lock(&mutex);
sleep(10);
pthread_mutex_unnlock(&mutex);
}
Now if you register this function for a particular signal, it will be called whenever the signal is caught by your program. If the handler has been called and sleeps, it might get interrupted, the handler called again, and the handler try to lock the mutex that is already locked.
Regarding the wording of the citation:
"A single-threaded program of this kind may have the same routine in progress in various call frames on its process stack."
When a function gets called, some information is stored on the process's stack - e.g. the return address. This information is called a call frame. If you call a function recursively, like in the example above, this information gets stored on the stack several times - several call frames are stored.
It's stated a littlebit clumsy, I admit...

Is pthread_kill() dangerous to use?

I have read that TerminateThread() in WinAPI is dangerous to use.
Is pthread_kill() in Linux also dangerous to use?
Edit: Sorry I meant pthread_kill() and not pthread_exit().

To quote Sir Humphrey Appleby, the answer is "yes, and no".
In and of itself calling pthread_exit() is not dangerous and is called implicitly when your thread exits its method. However, there are a few "gotchas" if you call it manually.
All cleanup handlers are called when this is called. So if you call this method, then access some memory that the cleanup handlers have cleaned up, you get a memory error.
Similarly, after this is called, any local and thread-local variables for the thread become invalid. So if a reference is made to them you can get a memory error.
If this has already been called for the thread (implicitly or explicitly) calling it again has an undefined behaviour, and
If this is the last thread in your process, this will cause the process to exit.
If you are careful of the above (i.e. if you are careful to not reference anything about the thread after you have called pthread_exit) then it is safe to call call manually. However, if you are using C++ instead of C I would highly recommend using the std::thread class rather than doing it manually. It is easier to read, involves less code, and ensures that you are not breaking any of the above.
For more information type "man pthread_exit" which will essentially tell you the above.

Since the question has now been changed, I will write a new answer. My answer still remains "yes and no" but the reasons have changed.
pthread_kill is somewhat dangerous in that it shares the potential timing risks that is inherent in all signal handling systems. In addition there are complexities in dealing with it, specifically you have to setup a signal handler within the thread. However one could argue that it is less dangerous than the Windows function you mention. Here is why:
The Windows function essentially stops the thread, possibly bypassing the proper cleanup. It is intended as a last resort option. pthread_kill, on the other hand, does not terminate the thread at all. It simply sends a signal to the thread that the thread can respond to.
In order for this to do something you need to have registered in the thread what signals you want it to handle. If your goal is to use pthread_kill to terminate the thread, you can use this by having your signal handler set a flag that the thread can access, and having the thread check the flag and exit when it gets set. You may be able to call pthread_exit from the signal handler (I've never tried that) but it strikes me as being a bad idea since the signal comes asynchronously, and your thread is not guaranteed to still be running. The flag option I mention solves this provided that the flag is not local to the thread, allowing the signal handler to set it even if the target thread has already exited. Of course if you are doing this, you don't really need pthread_kill at all, as you can simply have your main thread set the flag at the appropriate time.
There is another option for stopping another thread - the pthread_cancel method. This method will place a cancel request on the target thread and, if the thread has been configured to be cancellable (you generally do this in the pthread_create, but you can also do it after the fact), then the next time the thread reaches a potential cancellation point (specified by pthread_testcancel but also automatically handled by many system routines such as the IO calls), it will exit. This is also safer than what Windows does as it is not violently stopping the thread - it only stops at well defined points. But it is more work than the Windows version as you have to configure the thread properly.
The Wikipedia page for "posix threads" describes some of this (but not much), but it has a pretty good "See also" and "References" section that will give you more details.

Glib hash table issues with signal handling code

I've got some system level code that fires timers every once in a while, and has a signal handler that manages these signals when they arrive. This works fine and seems completely reasonable. There are also two separate threads running alongside the main program, but they do not share any variables, but use glib's async queues to pass messages in one direction only.
The same code uses glib's GHashTable to store, well, key/value pairs. When the signal code is commented out of the system, the hash table appears to operate fine. When it is enabled, however, there is a strange race condition where the call to g_hash_table_lookup actually returns NULL (meaning that there is no entry with the key used to look it up), when indeed the entry is actually there (yes I made sure by printing the whole list of key/value pairs with g_hash_table_foreach). Why would this occur most of the time? Is GLib's hash table implementation buggy? Sometimes the lookup call is successful.
It's a very particular situation, and I can clarify further if it didn't make sense, but I'm hoping I am doing something wrong so that this can actually be fixed.
More info: The code segments that are not within the signal handler scope but access the g_hash_table variable are surrounded by signal blocking calls so that the signal handler does not access these variables when the process was originally accessing them too.

Generally, signal handlers can only set flags and make system calls
As it happens, there are severe restrictions in ISO C regarding what signal handlers can do, and most library entry points and most API's are not even remotely 100% multi-thread-safe and approximately 0.0% of them are signal-handler-safe. That is, there is an absolute prohibition against calling almost anything from a signal handler.
In particular, for GHashTable, g_hash_table_ref() and g_hash_table_unref() are the only API elements that are even thread-safe, and none of them are signal-handler safe. Actually, ISO-C only allows signal handlers to modify objects declared with volatile sig_atomic_t and only a couple of library routines may be called.
Some of us consider threaded systems to be intrinsically dangerous, practically radioactive sources of subtle bugs. A good place to start worrying is The Problem with Threads. (And note that signal handlers themselves are much worse. No one thinks an API is safe there...)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight