Question regarding multiple threads and segfaults - c

What happens when two threads of the same process, running on different logical CPUs, both hit a segfault?

Default action is for the process to exit. If you handle the segfault, I suppose you could try to arrange for just the thread where it happened to terminate. However, since the only things which cause a segfault to occur naturally (as opposed to raise or kill) stem from undefined behavior, the program is in an indeterminate state and you can't rely on being able to recover anything.

Normal handling of a Segmentation Fault involves the termination of the process. That means that both of them are terminated.

I think the default action on all major OSes is to terminate the process. However, you could conceivably install (e.g., using signal() or sigaction()) an alternate handler that only terminates the thread. Of course, once you have a segmentation fault, behavior typically becomes undefined, and attempting to continue is risky.

Signals generated by illegal execution are handled synchronously by the kernel. So even if both threads generate a segfault at the same time, only one of them gets through.

The segfault handler is called for the process. If you don't do anything special, the OS will kill the process and reclaim its resources.
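To see the default behavior described in these answers, here is a minimal sketch, assuming a POSIX system with pthreads. The helper name crash_in_thread_status() is made up for illustration. It forks a child, starts two threads in it, and has one of them call raise(SIGSEGV) (used instead of a bad pointer access so the behavior stays defined). The parent then checks how the child ended.

```c
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread body: deliver SIGSEGV to the calling thread. */
static void *faulting_thread(void *arg)
{
    (void)arg;
    raise(SIGSEGV);
    return NULL;
}

/* Thread body: an innocent bystander that just sleeps. */
static void *idle_thread(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL;
}

/* Hypothetical helper: returns the number of the signal that terminated
 * the child process, 0 if it exited normally, or -1 on error. */
int crash_in_thread_status(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        pthread_t idle, bad;
        if (pthread_create(&idle, NULL, idle_thread, NULL) != 0 ||
            pthread_create(&bad, NULL, faulting_thread, NULL) != 0)
            _exit(2);
        pthread_join(bad, NULL);
        /* Only reached if the signal did not terminate the process. */
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

With no handler installed, the parent should see the whole child process terminated by SIGSEGV, idle thread included, even though only one thread received the signal.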

Related

When is it preferable to cause a segfault in a watchdog thread versus exiting normally to stop a process?

I am wondering if there is ever a good reason to exit a watchdog thread in the manner depicted, versus exiting with exit(). In the code I came across that brought this question to mind, a segfault was caused by de-referencing a null pointer with the strange line *(char **)0 = "watchdog timeout";.
Unless I'm mistaken, a thread calling exit() terminates the entire process. I interpret a segfault as an error, and not intended behavior, but perhaps there are times when it is desired.
    void *watchdog_loop(void *arg) {
        time_t now;
        while(foo) {
            sleep(1);
            now = current_time();
            if (watchdog_timeout && now - bar > watchdog_timeout) {
                raise(SIGSEGV); //something went wrong
            }
        }
        return NULL;
    }
Is there ever a time that it would be more desirable to have a watchdog loop segfault intentionally, versus exiting nonzero?
It is never desirable to elicit undefined behavior, which is what the example code does. In particular, note well that that code is not required to cause a segfault to be delivered to the process, though it might reliably do so on certain systems.
However, one might indeed prefer to kill a process via a signal instead of by calling exit(), so as to achieve termination without executing any application or library cleanup code. This is a plausible goal for a watchdog. Even in that event, however,
- Either the raise() or the abort() function would definedly cause a signal to be delivered to the process.
- SIGSEGV seems an odd choice of signal. Any of SIGABRT, SIGTERM, or SIGKILL would make more sense to me. Of those,
  - SIGKILL is not specified by the C language spec, but rather by POSIX (and maybe others). On a POSIX system, SIGKILL cannot be blocked or caught, so it is a very good candidate for a signal to terminate the process as quickly and surely as possible.
  - SIGABRT is used by the abort() function, which also goes to some pains to try to overcome program resistance to being terminated that way. This is the most natural standard function to use to trigger an intentional abnormal program termination.
  - SIGTERM can be caught and / or blocked, but unlike SIGKILL, it is defined by the C language specification, and therefore is more portable. But I don't really see any advantage over SIGABRT, unless you intend to allow it to be handled.
Another alternative would be _exit() (POSIX) or _Exit() (C99 or later). These perform a cleaner shutdown than you can expect from termination via a signal, but without executing most cleanup code. Open files will be closed, and the parent process will observe the process to terminate normally with a failure status instead of terminating by being killed by a signal.
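The difference between the two styles of termination can be observed from the parent. The following is a minimal sketch, assuming a POSIX system. The names die_by_abort(), die_by_exit() and run_and_wait() are hypothetical, and the exit status 3 is an arbitrary choice.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Two ways a watchdog might stop the process without running atexit()
 * handlers. */
void die_by_abort(void) { abort(); }   /* raises SIGABRT */
void die_by_exit(void)  { _Exit(3); }  /* "normal" exit, status 3 */

/* Hypothetical helper: run `action` in a child process and return the raw
 * wait status, or -1 if fork() or waitpid() failed. */
int run_and_wait(void (*action)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        action();
        _exit(0); /* only reached if action() returns */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

For die_by_abort the parent should find WIFSIGNALED(status) true with WTERMSIG(status) == SIGABRT. For die_by_exit it should find WIFEXITED(status) true with WEXITSTATUS(status) == 3.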

How can I keep a thread crash from crashing the whole process?

My project is written in C and runs as a single process with multiple threads.
One of the threads keeps crashing, which causes the whole process to crash.
The stability of the process is very very important. So I want to know whether it is possible to "isolate" this thread. For example, if we can intercept the SIGSEGV signal, we can just restart this thread.
If one thread does something that could cause a crash it might be affecting another thread since they share the same memory space, so there's no way to isolate them.
You need to fix your program so it doesn't crash in the first place. Start by using a memory checker such as valgrind.
"The stability of the process is very very important."
Then it is very very important that you find and fix the cause of the SIGSEGV rather than "papering over the cracks" by trying to recover crashed threads.
Why?
Because the root cause of the SIGSEGV could be a bug that will be retriggered in the restarted thread, leading to endless crashing/restarting of threads.
Because the root cause of the SIGSEGV could be a problem in a different thread ... which could continue triggering the problem.
Because in the execution steps leading up to the SIGSEGV, the thread could have corrupted shared data structures or done other things that may cause other threads to crash, get stuck or behave incorrectly in other ways.
It depends on the cause of the crash. It could be memory-related, resulting in corrupt data and thus leading to the process crash.
If the program does crash and it is critical, use Monit to restart it.
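If you go the restart route, the restart has to happen at the process level, not the thread level. Below is a minimal sketch of what a supervisor does, assuming a POSIX system. It is not how Monit itself is implemented; supervise() and flaky_worker() are hypothetical names, and the worker is rigged to crash on its first two attempts purely for illustration.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker: crashes on its first two attempts, then completes. */
void flaky_worker(int attempt)
{
    if (attempt < 2)
        abort();
}

/* Run `worker` in a child process and start a fresh one each time it is
 * killed by a signal or exits nonzero. Returns the number of attempts
 * needed for a clean exit, or -1 if max_attempts was reached or
 * fork()/waitpid() failed. */
int supervise(void (*worker)(int attempt), int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            worker(attempt);
            _exit(0);
        }
        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return attempt + 1;
        /* Crashed or failed: loop around and start a new process. */
    }
    return -1;
}
```

Because every attempt runs in its own address space, whatever the crashed worker corrupted is discarded along with it. The cap on attempts guards against the endless crash/restart loop mentioned above.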

Can I kill another process from SIGSEGV handler?

Background: I'm fuzzing a long-lived process with afl-fuzz by passing to it the filename to process from a stub that afl-fuzz runs for each sample.
When the long-lived process crashes via SIGSEGV, I want the stub to also generate a SIGSEGV, so that afl-fuzz will mark the sample as interesting.
Will calling kill(stub_pid, SIGSEGV) from the long-lived process's SIGSEGV handler work?
If a process ends up in a SIGSEGV-handler something very bad happened, which might include a completely destroyed stack and/or memory management.
At this point it is not a good idea to rely on anything except that the process goes down.
Trying to invoke any functionality beyond this point is likely to fail, that is, it is unreliable.
A much safer approach would be to have the calling process monitor its child, and if the child happens to terminate unexpectedly (typically via SIGSEGV), start the appropriate actions.
Have a look at signal handling inside shell scripts (search key: "trap"), as such a script might be the parent of the process you want to monitor.
It is not recommended to do this from a SIGSEGV handler, but you can do it if you have the proper permissions.
Instead of wondering how to cause a segmentation fault in your program so that AFL would notice something odd, just call abort(). SIGABRT is caught by AFL as well and is much easier to trigger.

How to properly terminate a thread in a signal handler?

I want to set up a signal handler for SIGSEGV, SIGILL and possibly a few other signals that, rather than terminating the whole process, just terminates the offending thread and perhaps sets a flag somewhere so that a monitoring thread can complain and start another thread. I'm not sure there is a safe way to do this. Pthreads seems to provide functions for exiting the current thread, as well as canceling another thread, but these potentially call a bunch of at-exit handlers. Even if they don't, it seems as though there are many situations in which they are not async-signal-safe, although it is possible that those situations are avoidable. Is there a lower-level function I can call that just destroys the thread? Assuming I modify my own data structures in an async-signal-safe way, and acquire no mutexes, are there pthread/other global data structures that could be left in an inconsistent state simply by a thread terminating at a SIGSEGV? malloc comes to mind, but malloc itself shouldn't SIGSEGV/SIGILL unless the libc is buggy. I realize that POSIX is very conservative here, and makes no guarantees. As long as there's a way to do this in practice I'm happy. Forking is not an option, btw.
If the SIGSEGV/SIGILL/etc. happens in your own code, the signal handler will not run in an async-signal context (it's fundamentally a synchronous signal, but would still be an AS context if it happened inside a standard library function), so you can legally call pthread_exit from the signal handler. However, there are still issues that make this practice dubious:
SIGSEGV/SIGILL/etc. never occur in a program whose behavior is defined unless you generate them via raise, kill, pthread_kill, sigqueue, etc. (and in some of these special cases, they would be asynchronous signals). Otherwise, they're indicative of a program having undefined behavior. If the program has invoked undefined behavior, all bets are off. UB is not isolated to a particular thread or a particular sequence in time. If the program has UB, its entire output/behavior is meaningless.
If the program's state is corrupted (e.g. due to access-after-free, use of invalid pointers, buffer overflows, ...) it's very possible that the first faulting access will happen inside part of the standard library (e.g. inside malloc) rather than in your code. In this case, the signal handler runs in an async-signal context, where only async-signal-safe functions may be called, so it cannot call pthread_exit. Of course the program already has UB anyway (see the above point), but even if you wanted to pretend that's not an issue, you'd still be in trouble.
If your program is experiencing these kinds of crashes, you need to find the cause and fix it, not try to patch around it with signal handlers. Valgrind is your friend. If that's not possible, your best bet is to isolate the crashing code into separate processes where you can reason about what happens if they crash asynchronously, rather than having the crashing code in the same process (where any further reasoning about the code's behavior is invalid once you know it crashes).

Should segmentation fault handlers be written at all by a application programmer in C?

If someone is an operating system programmer or is writing system-level library code, it makes sense to write a segmentation fault handler. For example, an OS programmer would write the code that sends the SIGSEGV signal to the application process. Or a systems library programmer might handle SIGSEGV and undo the operations in the library code that caused the segmentation fault. But why would an application programmer in C need to write a segmentation fault handler? By the time the handler runs, the program has already corrupted some parts of memory. Can you give an instance where an application programmer would handle a segmentation fault and continue execution of the program?
AFAIK, the segmentation handler can be written at the application level, to output some debugging information (like memory dump, value of registers and other application specific information) and then exit the application.
Please note that, since the segmentation fault might have corrupted memory, the handler may or may not get all the correct information to dump.
I am not aware of any situation where the execution of the program can be continued after a segmentation fault. Maybe other esteemed users of SO will be able to throw some light on this.
Handling SIGSEGV, etc., may allow saving state and taking corrective actions. Mr 32 (and others) are correct that you cannot simply restart the main line code. Instead, you can use longjmp() or siglongjmp(); this allows a restart of the main line. Also, you have to be very careful to call only async-safe functions. This is very tricky. However, some applications are:
Health/safety - to ensure a catastrophic condition doesn't happen.
Financial - loss of transaction data that can result in a loss of money.
Control system - example titration software for chemists.
Diagnostics - Crash conditions may be logged to improve future software. As per Jay
Calling exit() is probably not good and _exit() would be better. The difference being atexit() calls.
See also: Cert async safe, Glibc async-safe list, Similar question, longjmp() and signals not portable, async-safe
These vary from OS to OS. Any advice will be system dependent!
Additional Issues
Some libraries used by the program may catch SIGSEGV. Some versions of the Empress Database definitely hook it. You have to know what your libraries are using and chain to/from them.
Stack and heap (malloc, etc.) can be corrupted, including the jmp_buf, so your error handling must be especially paranoid.
There are many other alternate solutions, such as defer critical portions to another task that is much simpler.
longjmp() called from a signal handler is undefined according to the C99 standard, but it will work well on most systems. siglongjmp() can be used if you are more pedantic. It would be fine for diagnostic logging, but I wouldn't use it for the other uses listed (safety, etc.). Notifying a watchdog task may be more appropriate.
You can catch any signal except SIGKILL and SIGSTOP. Thus you can catch SIGSEGV, but if you then decide not to exit, the behavior will be unpredictable.
"library programmer might handle that signal SIGSEGV and may undo the operations caused by the library code for creating segmentation"
A segmentation fault means that the thread or process will die.
You cannot undo the code that caused the segmentation fault. Rather, you can restart that component.
A segmentation fault is caused by the program accessing (reading or writing) a portion of memory it is not supposed to. The application developer does not write code to handle this; they write code to avoid it. This is why you bounds-check when accessing memory.
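As a minimal sketch of that kind of bounds check (checked_store() is a hypothetical helper, not a standard function):

```c
#include <stddef.h>

/* Store `value` at buf[idx] only if idx is inside the buffer.
 * Returns 0 on success, -1 if buf is NULL or idx is out of range. */
int checked_store(int *buf, size_t len, size_t idx, int value)
{
    if (buf == NULL || idx >= len)
        return -1;
    buf[idx] = value;
    return 0;
}
```

The caller gets an error code it can act on instead of an out-of-bounds write that may or may not fault.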
