My project is written in C and runs as a single process with multiple threads.
One of the threads keeps crashing, and that brings down the whole process.
The stability of the process is very, very important, so I want to know whether it is possible to "isolate" this thread. For example, if we can intercept the SIGSEGV signal, can we just restart this thread?
If one thread does something that could cause a crash it might be affecting another thread since they share the same memory space, so there's no way to isolate them.
You need to fix your program so it doesn't crash in the first place. Start by using a memory checker such as valgrind.
The stability of the process is very very important.
Then it is very very important that you find and fix the cause of the SIGSEGV rather than "papering over the cracks" by trying to recover crashed threads.
Why?
Because the root cause of the SIGSEGV could be a bug that will be retriggered in the restarted thread, leading to endless crashing and restarting of threads.
Because the root cause of the SIGSEGV could be a problem in a different thread ... which could continue triggering the problem.
Because in the execution steps leading up to the SIGSEGV, the thread could have corrupted shared data structures or done other things that may cause other threads to crash, get stuck or behave incorrectly in other ways.
It depends on the cause of the crash. It could be memory-related, resulting in corrupt data that then leads to the process crash.
If the program does crash and it is critical, use Monit to restart it.
It is expected that threads, on which pthread_detach() was not called, should be pthread_join()ed before the main thread returns from main() or calls exit().
However, what happens when this requirement is not met? What happens when a process terminates when it still contains unjoined and not detached threads?
I would find it odd to learn that these other threads’ resources will not be reclaimed until system reboot. However, if these resources will be reclaimed, then there may be little need to bother about joining or detaching, mightn’t it?
It is up to the operating system. Typical modern operating systems will indeed reclaim the memory and descriptors (handles) used by abandoned threads. This is similar to how dynamically allocated memory works: typical modern systems will reclaim it when a process exits, even if the process never explicitly freed the memory. For certain unusual programs, this can be a meaningful performance optimization, because freeing lots of small resources takes time and the OS may be able to do it more quickly.
However, what happens when this requirement is not met? What happens when a process terminates when it still contains unjoined and not detached threads?
On any system with POSIX threads that is not ancient, the non-joined threads simply "evaporate" when the main thread performs the process-terminating system call (exit_group on Linux, rather than SYS_exit, which ends only the calling thread).
I would find it odd to learn that these other threads’ resources will not be reclaimed until system reboot.
They will be.
However, if these resources will be reclaimed, then there may be little need to bother about joining or detaching, mightn’t it?
It depends on what these threads do. The danger is at-exit data races.
In C++, global variables are destructed (usually via atexit or an equivalent registration mechanism), FILE streams are flushed and closed, and so on.
If a non-joined thread tries to access any such resource, it will likely crash with SIGSEGV, possibly producing a core dump and an unclean process exit code, both of which are often quite undesirable.
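As a minimal sketch of avoiding that at-exit race, the function below joins every thread it started before returning, so nothing is still running when the caller reaches exit(). The names `worker`, `run_workers_and_join` and `NUM_WORKERS` are hypothetical and only serve the illustration.

```c
#include <pthread.h>

#define NUM_WORKERS 4            /* hypothetical worker count */

static int results[NUM_WORKERS]; /* one slot per worker, no sharing */

/* Hypothetical worker: stores the square of its index. */
static void *worker(void *arg)
{
    int idx = *(int *)arg;
    results[idx] = idx * idx;
    return NULL;
}

/* Starts the workers and joins every one that was started before
 * returning. Returns the sum of the results, or -1 if a thread could
 * not be created or joined. */
int run_workers_and_join(void)
{
    pthread_t tids[NUM_WORKERS];
    int ids[NUM_WORKERS];
    int started = 0;
    int failed = 0;

    for (int i = 0; i < NUM_WORKERS; i++) {
        ids[i] = i;
        if (pthread_create(&tids[i], NULL, worker, &ids[i]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }

    /* Join whatever was started, even on partial failure. */
    for (int i = 0; i < started; i++) {
        if (pthread_join(tids[i], NULL) != 0)
            failed = 1;
    }
    if (failed)
        return -1;

    int sum = 0;
    for (int i = 0; i < NUM_WORKERS; i++)
        sum += results[i];
    return sum;
}
```

Calling `run_workers_and_join()` from main() and only then returning means the exit-time cleanup runs with no other thread alive. Build with `-pthread`.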
Background: I'm fuzzing a long-lived process with afl-fuzz by passing to it the filename to process from a stub that afl-fuzz runs for each sample.
When the long-lived process crashes via SIGSEGV, I want the stub to also generate a SIGSEGV, so that afl-fuzz will mark the sample as interesting.
Will calling kill(stub_pid, SIGSEGV) from the long-lived process's SIGSEGV handler work ?
If a process ends up in a SIGSEGV-handler something very bad happened, which might include a completely destroyed stack and/or memory management.
At this point it is not a good idea to rely on anything other than the process going down.
Trying to invoke any functionality beyond this point is unreliable and likely to fail.
A much safer approach is to have the calling process monitor its child and, if the child terminates unexpectedly (typically via SIGSEGV), take the appropriate action.
Have a look at signal handling inside shell scripts (search key: "trap"), as such a script might be the parent of the process you want to monitor.
Doing this through SIGSEGV is not recommended, but you can do it if you have the proper permissions.
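The "parent monitors its child" idea can be sketched in C with fork() and waitpid(). This assumes the monitoring process is the direct parent of the process that crashes, which is not quite the setup in the question. `run_and_report_signal`, `crashing_child` and `healthy_child` are hypothetical names.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs child_fn in a forked child and waits for it to finish.
 * Returns the number of the signal that killed the child, 0 if the
 * child terminated in any other way, or -1 if fork/waitpid failed. */
int run_and_report_signal(void (*child_fn)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        child_fn();
        _exit(0);               /* child finished without crashing */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)     /* retry only when interrupted */
            return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Hypothetical stand-in for a workload that crashes. */
void crashing_child(void)
{
    signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
    raise(SIGSEGV);
}

/* Hypothetical stand-in for a workload that completes. */
void healthy_child(void)
{
}
```

A stub built this way could, on a nonzero result, reset that signal to SIG_DFL and raise() it on itself, so the fuzzer observes the same termination signal without the crashing process having to do anything in its own handler.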
Instead of wondering how to cause a segmentation fault in your program so that AFL would notice something odd, just call abort(). SIGABRT is caught by AFL as well and is much easier to trigger.
On Linux, with pthreads (Linux threads),
what happens to the running threads when returning from main (before the threads have finished)?
When returning from main, the memory is deallocated, so the threads would be accessing unallocated memory. Does this cause the threads to exit?
I'm sure the threads are killed, but how does this actually happen?
Returning from main is the same as calling exit(). This means that handlers established by atexit() and any system cleanup handlers are run. Finally, the kernel is asked to terminate the entire process (i.e. all threads).
(Note that this might cause issues if you have other threads running at that point, e.g. another thread accessing a global C++ object right after the runtime calls its destructor.)
Well, threads operate under the process of the main application (or another process, but I assume you do not create another process, just threads). They share memory with it and are part of the same process, so if the system kills the process it automatically kills all threads. There is nothing more to it. A thread cannot exist without a process, so there is no question of it accessing deallocated memory: it just stops executing, and the memory is cleaned up at the process clean-up level.
How it happens is obviously system dependent. E.g. Windows 95 did not free memory after a process ended, so if an application had a memory leak, killing it didn't help. This has changed since then. Every system can handle it differently.
If someone is an operating system programmer or is writing system-level library code, it makes sense to write a segmentation fault handler. For example, an OS programmer would write the code that sends the SIGSEGV signal to the application process, or a systems library programmer might handle SIGSEGV and undo the operations by which the library code caused the segmentation fault. But why would an application programmer in C need to write a segmentation fault handler? By the time the handler runs, the program has already corrupted some part of memory. Can you give an instance where an application programmer would handle a segmentation fault and continue execution of the program?
AFAIK, a segmentation fault handler can be written at the application level to output some debugging information (such as a memory dump, register values and other application-specific information) and then exit the application.
Please note that, since the segmentation fault might have corrupted memory, the handler may or may not be able to dump correct information.
I am not aware of any situation where the execution of the program can be continued after a segmentation fault. Maybe other esteemed users of SO will be able to throw some light on this.
Handling SIGSEGV, etc., may allow saving state and taking corrective action. Mr 32 (and others) are correct that you cannot simply restart the main-line code. Instead you can use longjmp() or siglongjmp(); this allows a restart of the main line. Also, you have to be very careful to call only async-signal-safe functions. This is very tricky. However, some applications where it may be warranted are:
Health/safety - to ensure a catastrophic condition doesn't happen.
Financial - loss of transaction data that can result in a loss of money.
Control system - for example, titration software for chemists.
Diagnostics - crash conditions may be logged to improve future software. As per Jay
Calling exit() is probably not good and _exit() would be better. The difference being atexit() calls.
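The difference can be observed with a small experiment, sketched here under the assumption of a POSIX system: a forked child registers an atexit() handler that writes one byte to a pipe, then terminates with either exit() or _exit(). `atexit_handler_ran` and `on_exit_marker` are hypothetical names.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;      /* write end of the pipe, child only */

/* atexit() handler: leaves a one-byte marker in the pipe. */
static void on_exit_marker(void)
{
    char c = 'x';
    ssize_t n = write(report_fd, &c, 1);
    (void)n;
}

/* Forks a child that registers on_exit_marker and then terminates with
 * _exit() if use_underscore_exit is nonzero, otherwise with exit().
 * Returns 1 if the handler ran, 0 if it did not, -1 on error. */
int atexit_handler_ran(int use_underscore_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        report_fd = fds[1];
        atexit(on_exit_marker);
        if (use_underscore_exit)
            _exit(0);           /* skips atexit() handlers */
        exit(0);                /* runs atexit() handlers first */
    }

    close(fds[1]);
    char c;
    ssize_t n = read(fds[0], &c, 1); /* 0 bytes means the pipe closed empty */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : (n == 0 ? 0 : -1);
}
```

Inside a SIGSEGV handler the point is the same: exit() would run cleanup code against state that may already be corrupted, while _exit() goes straight to process termination.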
See also: Cert async safe, Glibc async-safe list, Similar question, longjmp() and signals not portable, async-safe
These vary from OS to OS. Any advice will be system dependent!
Additional Issues
Some libraries used by the program may catch SIGSEGV. Versions of the Empress Database definitely hook it. You have to know what your libraries are doing and chain to/from them.
The stack and heap (malloc, etc.) can be corrupted, including the jmp_buf, so your error handling must be especially paranoid.
There are many alternative solutions, such as deferring critical portions to another task that is much simpler.
longjmp() called from a signal handler is undefined according to the C99 standard, but it will work well on most systems. siglongjmp() can be used if you are more pedantic. It would be fine for diagnostic logging, but I wouldn't use it for the other uses listed (safety, etc.). Notifying a watchdog task may be more appropriate.
You can catch any signal except SIGKILL and SIGSTOP. Thus you can catch SIGSEGV, but if you then decide not to exit, the behavior will be unpredictable.
library programmer might handle that signal SIGSEGV and may
undo the operations caused by the library code for creating segmentation
When a segmentation fault occurs, the thread or process is going to die.
You cannot undo what the code that caused the segmentation fault did. Rather, you can restart that component.
A segmentation fault is caused by the program accessing (reading or writing) a portion of memory it is not supposed to. The application developer does not write code to handle this; they write code to avoid it. This is why you bounds-check when writing to memory.
What happens when two threads of the same process running on different logical cpu hit a seg fault?
Default action is for the process to exit. If you handle the segfault, I suppose you could try to arrange for just the thread where it happened to terminate. However, since the only things which cause a segfault to occur naturally (as opposed to raise or kill) stem from undefined behavior, the program is in an indeterminate state and you can't rely on being able to recover anything.
Normal handling of a Segmentation Fault involves the termination of the process. That means that both of them are terminated.
I think the default action on all major OSes is to terminate the process. However, you could conceivably install (e.g. using signal()) an alternate handler that only terminates the thread. Of course, once you have a segmentation fault, behavior typically becomes undefined, and attempting to continue is risky.
Signals generated by illegal execution are synchronous and are delivered to the thread that faulted. With the default action, the first fault to be processed terminates the whole process, so even if both threads fault at the same time, only one gets through.
The segfault handler is called for the process. If you don't do anything special, the OS will kill the process and reclaim its resources.