Debugging signal handling in a multithreaded application (C)

I have a multithreaded application using pthreads. My threads wait for signals using sigwait. I want to debug the application: see which thread receives which signal at what time, and then debug from there. Is there any way I can do this? If I run my program directly, signals are generated rapidly and handled by my handler threads; I want to see which handler wakes up from the sigwait call and processes the signal.

The handy strace utility can print out a huge amount of useful information regarding system calls and signals, and can also be used to log timing information or collect statistics about signal usage.
If instead you are interested in getting a breakpoint inside of an event triggered by a specific signal, you could consider stashing enough relevant information to identify the event in a variable and setting a conditional breakpoint.
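If you control the source, another low-tech option is to log directly around the sigwait() call, so each handler thread reports which signal it picked up and when. A minimal sketch; the signal names and handler loop are illustrative (SIGUSR1/SIGUSR2 stand in for whatever signals you actually handle):
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <time.h>

/* NOTE: for sigwait() to receive these signals reliably, the set must be
   blocked in every thread (pthread_sigmask in main before spawning). */
static void *handler_thread(void *arg)
{
    sigset_t set;
    int sig;
    struct timespec ts;

    (void)arg;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigaddset(&set, SIGUSR2);

    for (;;) {
        if (sigwait(&set, &sig) != 0)
            break;
        clock_gettime(CLOCK_REALTIME, &ts);
        /* pthread_t is opaque; casting to unsigned long works on Linux */
        fprintf(stderr, "[%ld.%09ld] thread %lu got signal %d\n",
                (long)ts.tv_sec, ts.tv_nsec,
                (unsigned long)pthread_self(), sig);
        /* ... process the signal ... */
    }
    return NULL;
}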

One of the things you may try with gdb is to set breakpoints by thread (e.g. just after the return from sigwait), so you know which thread wakes up:
break file.c:line thread thread_nr
Don't forget to tell gdb to pass signals to your program e.g.:
handle SIGINT pass
You may want to put all of this into your .gdbinit file to save yourself a lot of typing.
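For example, a .gdbinit along these lines (the file name, line number, signal names, and thread numbers are all placeholders for your own):
handle SIGUSR1 pass noprint
handle SIGUSR2 pass noprint
break handlers.c:42 thread 2
break handlers.c:42 thread 3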
Steven Schlansker is definitely right: if the debugger significantly changes the timing patterns of your program (so that it behaves completely differently under the debugger than "in the wild"), then strace and logging are your last hope.
I hope that helps.

Related

Catching signal from Python child process using GLib

I'm trying to control the way my cursor looks during certain points of my program execution. To be specific, I want it to be a "spinner" when a Python script is executing, and then a standard pointer when it's done executing. Right now, I have a leave-event-notify callback in Glade that changes the spinner when it leaves a certain area, but this is non-ideal since the user might not know to move the cursor and the cursor doesn't accurately represent the state of the program.
I have my Python program send SIGUSR1 at the end of its execution. I am spawning the Python script from a C file using GLib's g_spawn_async_with_pipes. Is there any way to catch a signal from the child process that this creates? Thanks!
Pass the G_SPAWN_DO_NOT_REAP_CHILD flag to g_spawn_async_with_pipes() and then call g_child_watch_add() to get a notification when your Python subprocess exits. You don’t need to bother with SIGUSR1 if the process exits when it’s done.
It’s a bit hard to provide a more specific answer unless you post a minimal reproducible example of your code.
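That said, the general shape would be something like the following minimal sketch ("python"/"script.py" and the cursor calls are placeholders, and a running GLib main loop is assumed):
#include <glib.h>

/* Called by the main loop when the child exits. */
static void child_exited(GPid pid, gint status, gpointer user_data)
{
    /* restore the normal pointer cursor here */
    g_spawn_close_pid(pid);
}

static void spawn_script(void)
{
    gchar *argv[] = { "python", "script.py", NULL };  /* placeholder command */
    GPid pid;
    GError *error = NULL;

    if (g_spawn_async_with_pipes(NULL, argv, NULL,
            G_SPAWN_SEARCH_PATH | G_SPAWN_DO_NOT_REAP_CHILD,
            NULL, NULL, &pid,
            NULL, NULL, NULL, &error)) {
        /* set the spinner cursor here, then watch for exit */
        g_child_watch_add(pid, child_exited, NULL);
    } else {
        g_warning("spawn failed: %s", error->message);
        g_error_free(error);
    }
}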

Check if pthread is still alive in Linux C

I know similar questions have been asked, but I think my situation is a little different. I need to check whether a child thread is alive and, if it's not, print an error message. The child thread is supposed to run all the time. So basically I just need a non-blocking pthread_join, and in my case there are no race conditions. The child thread can be killed, so I can't set some kind of shared variable from the child thread when it completes, because it will not be set in that case.
Killing of child thread can be done like this:
kill -9 child_pid
EDIT: alright, this example is wrong, but I'm still sure there is some way to kill a specific thread.
EDIT: my motivation for this is to implement another layer of security in my application which requires this check. Even though this check can be bypassed but that is another story.
EDIT: let's say my application is intended as a demo for reverse-engineering students, and their task is to hack it. But I placed some anti-hacking/anti-debugging obstacles in the child thread, and I wanted to be sure that this child thread is kept alive. As mentioned in some comments, it's probably not that easy to kill the child without messing up the parent, so maybe this check is unnecessary. Security checks are present in the main thread too, but this time I needed to add them in another thread to keep the main thread responsive.
Killed by what? And why can't that thing indicate that the thread is dead? Even then, this sounds fishy.
It's almost universally a design error if you need to check whether a thread/process is alive; the logic in the code should implicitly handle this.
In your edit it seems you want to do something about the possibility of a thread getting killed by something completely external.
Well, good news. There is no way to do that without bringing the whole process down. All means of non-voluntary death of a thread kill all threads in the process, apart from cancellation, but cancellation can only be triggered by something else in the same process.
The kill(1) command does not send signals to a particular thread, but to an entire process. Read signal(7) and pthreads(7) carefully.
Signals and threads don't mix well together. As a rule of thumb, you don't want to use both.
BTW, using kill -KILL or kill -9 is a mistake. The receiving process doesn't get the opportunity to handle the SIGKILL signal. You should use SIGTERM ...
If you want to handle SIGTERM in a multi-threaded application, read signal-safety(7) and consider setting some pipe(7) to self (and use poll(2) in some event loop) which the signal handler would write(2). That well-known trick is well explained in Qt documentation. You could also consider the signalfd(2) Linux specific syscall.
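A minimal sketch of that self-pipe trick (the pipe creation and sigaction registration are shown only in comments; error handling is elided):
#include <poll.h>
#include <signal.h>
#include <unistd.h>

static int pipefd[2];   /* created once at startup with pipe(pipefd) */

static void term_handler(int sig)
{
    char byte = (char)sig;
    /* write(2) is async-signal-safe; the result is deliberately ignored */
    (void)write(pipefd[1], &byte, 1);
}

/* registered with sigaction(SIGTERM, ...) pointing at term_handler */

static void event_loop(void)
{
    struct pollfd fds[1];
    fds[0].fd = pipefd[0];
    fds[0].events = POLLIN;

    for (;;) {
        if (poll(fds, 1, -1) > 0 && (fds[0].revents & POLLIN)) {
            char byte;
            (void)read(pipefd[0], &byte, 1);
            /* SIGTERM arrived: leave the loop and clean up */
            break;
        }
    }
}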
If you think of using pthread_kill(3), you probably should not in your case (however, using it with a 0 signal is a valid but crude way to check that the thread exists). Read some Pthread tutorial. Don't forget to pthread_join(3) or pthread_detach(3).
Child thread is supposed to run all the time.
This is the wrong approach. You should know when and how a child thread terminates because you are coding the function passed to pthread_create(3) and you should handle all error cases there and add relevant cleanup code (and perhaps synchronization). So the child thread should run as long as you want it to run and should do appropriate cleanup actions when ending.
Consider also some other inter-process communication mechanism (like socket(7), fifo(7) ...); they are generally more suitable than signals, notably for multi-threaded applications. For example you might design your application as some specialized web or HTTP server (using libonion or some other HTTP server library). You'll then use your web browser, or some HTTP client command (like curl) or HTTP client library like libcurl to drive your multi-threaded application. Or add some RPC ability into your application, perhaps using JSONRPC.
(your putative usage of signals smells very bad and is likely to be some XY problem; consider strongly using something better)
my motivation for this is to implement another layer of security in my application
I don't understand that at all. How can signals and threads add security? I'm guessing you are decreasing the security of your software.
I wanted to be sure that this child thread is kept alive.
You can't be sure, other than by coding well and avoiding bugs (but be aware of Rice's theorem and the halting problem: there cannot be a reliable and sound static source-code analysis that checks this). If something else (e.g. some other thread, or even bad code in your own one) is arbitrarily modifying the call stack of your thread, you've got undefined behavior and you can just be very scared.
In practice tools like the gdb debugger, address and thread sanitizers, other compiler instrumentation options, valgrind, can help to find most such bugs, but there is No Silver Bullet.
Maybe you want to take advantage of process isolation, but then you should give up your multi-threading approach and consider some multi-processing approach. By definition, threads share a lot of resources (notably their virtual address space) with the other threads of the same process. So the security checks mentioned in your question don't make much sense: I guess they add more code but just decrease security (since you'll have more bugs).
Reading a textbook like Operating Systems: Three Easy Pieces should be worthwhile.
You can use pthread_kill() to check if a thread exists.
SYNOPSIS
#include <signal.h>
int pthread_kill(pthread_t thread, int sig);
DESCRIPTION
The pthread_kill() function shall request that a signal be delivered
to the specified thread.
As in kill(), if sig is zero, error checking shall be performed
but no signal shall actually be sent.
Something like
int rc = pthread_kill(thread_id, 0);
if (rc != 0)
{
    /* rc is ESRCH: no thread with that ID exists (any longer) */
}
It's not very useful, though, as stated by others elsewhere, and it's really weak as any type of security measure. Anything with permissions to kill a thread will be able to stop it from running without killing it, or make it run arbitrary code so that it doesn't do what you want.

Can I catch SIGSEGV and other signals in a multi-threaded (pthreads) app and print a backtrace of the thread that caused it, or all threads?

I saw Getting a backtrace of other thread but it didn't contain a lot of practical information.
What I want is to be able to catch SIGSEGV in a multi-threaded C app using POSIX threads running on Linux (CentOS, 2.6 kernel) and print the stack trace of the thread that caused it. If the thread that caused it can't be determined, it's Good Enough For Me (tm) for the thread that caught the signal to enumerate all the threads and just print the stack trace of each of them.
It was noted over there that perhaps libunwind can be used for this, but its documentation is rather lacking and I couldn't find a good example of how to go about using it for this purpose. Also, I wondered if it has any significant performance overhead or other impact, and whether it is battle-tested and used in production code, or if it's mostly only used in debugging and development, and not in production systems.
Does anyone have sample code using libunwind or another reasonably straightforward (like not writing it in assembly) way to do this?
Getting the backtrace of the thread that caused the exception is easy, more or less:
Pass the -rdynamic flag to the linker
Then, register a signal handler in your code; in the handler, extract the EIP of the fault from the signal handler parameters and use it together with the backtrace() function to get an array of addresses.
Find some way to pass the data in the array outside your app (to a different process over a pipe, for example) and there you can use backtrace_symbols() to translate the backtrace to symbol names.
Make sure not to use any non-async-signal-safe function in the signal handler: don't take any locks, allocate memory, or call any function that does.
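Putting those steps together, a minimal sketch (link with -rdynamic; this variant simply calls backtrace() from inside the handler rather than digging the EIP out of the context, and backtrace() is not formally async-signal-safe, so treat it as best-effort crash reporting):
#include <execinfo.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    /* backtrace_symbols_fd() writes straight to the fd and avoids malloc */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(EXIT_FAILURE);  /* do not return into the faulting instruction */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    volatile int *p = NULL;
    *p = 42;   /* deliberately trigger SIGSEGV */
    return 0;
}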
Here are the slides to a presentation I gave on the subject: http://www.scribd.com/doc/3726406/Crash-N-Burn-Writing-Linux-application-fault-handlers
A video of the talk is also available somewhere, but I can't find it now...
Extending this to get the backtrace of multiple threads is possible but quite tricky: you need to keep tabs on your various threads and send signals to them in the event of a crash.

How do unix signals work?

How do signals work in unix? I went through W.R. Stevens but was unable to understand. Please help me.
The explanation below is not exact, and several aspects of how this works differ between different systems (and maybe even the same OS on different hardware for some portions), but I think that it is generally good enough for you to satisfy your curiosity enough to use them. Most people start using signals in programming without even this level of understanding, but before I got comfortable using them I wanted to understand them.
signal delivery
The OS kernel has a data structure called a process control block for each running process, which holds data about that process. It can be looked up by the process id (PID) and includes a table of signal actions and pending signals.
When a signal is sent to a process, the OS kernel looks up that process's process control block and examines the signal action table to locate the action for the particular signal being sent. If the signal action value is SIG_IGN, the new signal is forgotten about by the kernel. If the signal action value is SIG_DFL, the kernel looks up the default signal handling action for that signal in another table and performs that action. If the value is anything else, it is assumed to be a function address within the process that the signal is being sent to, which should be called. The values for SIG_IGN and SIG_DFL are numbers cast to function pointers whose values are not valid addresses within a process's address space (such as 0 and 1, which are both in page 0, which is never mapped into a process).
If a signal-handling function was registered by the process (the signal action value was neither SIG_IGN nor SIG_DFL), then an entry is made in the pending signal table for that signal, and the process is marked as ready to RUN (it may have been waiting on something, like data becoming available for a call to read, waiting for a signal, or several other things).
Now the next time the process is run, the OS kernel will first add some data to the stack and change the instruction pointer for that process so that it looks almost as if the process itself has just called the signal handler. This is not entirely correct, and actually deviates enough from what really happens that I'll talk about it more in a little bit.
The signal handler function can do whatever it does (it is part of the process it was called on behalf of, so it was written with knowledge of what that program should do with that signal). When the signal handler returns, the regular code of the process begins executing again. (Again, not accurate, but more on that next.)
Ok, the above should have given you a pretty good idea of how signals are delivered to a process. I think that this pretty good idea version is needed before you can grasp the full idea, which includes some more complicated stuff.
Very often the OS kernel needs to know when a signal handler returns. This is because signal handlers take an argument (which may require stack space), you can block the same signal from being delivered twice during the execution of the signal handler, and/or have system calls restarted after a signal is delivered. To accomplish this, a little more than stack and instruction pointer changes is needed.
What has to happen is that the kernel needs the process to tell it that it has finished executing the signal handler function. This may be done by mapping a section of RAM into the process's address space which contains code to make this system call, and making the return address for the signal handler function (the top value on the stack when the function starts running) be the address of this code. I think that this is how it is done in Linux (at least in newer versions). Another way to accomplish this (I don't know if it is done this way, but it could be) would be to make the return address for the signal handler function an invalid address (such as NULL), which would cause an interrupt on most systems and give the OS kernel control again. It doesn't matter a whole lot how this happens, but the kernel has to get control again to fix up the stack and to know that the signal handler has completed.
While looking into another question, I learned that the Linux kernel does map a page into the process for this, but that the actual system call for registering signal handlers (what sigaction calls) takes an sa_restorer parameter, which is an address to be used as the return address from the signal handler; the kernel just makes sure that it is put there. The code at this address issues the "I'm done" system call (sigreturn), and that is how the kernel knows the signal handler has finished.
signal generation
I'm mostly assuming that you know how signals are generated in the first place. The OS can generate them on behalf of a process due to something happening, like a timer expiring, a child process dying, the process accessing memory that it should not be accessing, or issuing an instruction that it should not (either an instruction that does not exist or one that is privileged), or many other things. The timer case is functionally a little different from the others because it may occur when the process is not running, and so is more like the signals sent with the kill system call. The non-timer-related signals sent on behalf of the current process are generated when an interrupt occurs because the current process is doing something wrong. This interrupt gives the kernel control (just like a system call), and the kernel generates the signal to be delivered to the current process.
Some issues not addressed in all of the above statements are multi-core systems, running in kernel space while receiving a signal, sleeping in kernel space while receiving a signal, system call restarting, and signal handler latency.
Here are a couple of issues to consider:
What if the kernel knows that a signal needs to be delivered to process X, which is running on CPU_X, but the kernel learns about it while running on CPU_Y (CPU_X != CPU_Y)? The kernel needs to stop the process running on the other core.
What if the process is running in kernel space while receiving a signal? Every time a process makes a system call it enters kernel space and tinkers with data structures and memory allocations in kernel space. Does all of this hacking take place in kernel space too?
What if the process is sleeping in kernel space waiting for some other event? (read, write, signal, poll, mutex are just some options).
Answers:
If the process is running on another CPU, the kernel, via cross-CPU communication, will deliver an interrupt and a message to that CPU. The other CPU will, in hardware, save state and jump into the kernel, which will then deliver the signal on that CPU. This is all part of trying not to execute the signal handler of the process on a different CPU, which would break cache locality.
If the process is running in kernel space it is not interrupted. Instead it is recorded that this process has received a signal. When the process exits kernel space (at the end of each system call), the kernel will set up the trampoline to execute the signal handler.
If the process, while running in kernel space after having received a signal, reaches a sleep function, then that sleep function (and this is common to all sleep functions within the kernel) will check whether the process has a signal pending. If so, it will not put the process to sleep; instead it will cancel all that has been done while coming down into the kernel and will exit to user space, setting up a trampoline to execute the signal handler and then restart the system call. You can control which signals interrupt system calls and which do not using siginterrupt(3). You can decide whether you want system calls restarted for a certain signal when you register it using sigaction(2) with the SA_RESTART flag. If a system call is cut off by a signal and is not restarted automatically, you will get an EINTR (interrupted) return value, and you must handle that value. You can also look at the restart_syscall(2) system call for more details.
If the process is already sleeping/waiting in kernel space (actually, all sleeping/waiting happens in kernel space), it is woken from the sleep, the kernel code cleans up after itself and jumps to the signal handler on return to user space, after which the system call is automatically restarted if the user so desired (very similar to the previous explanation of what happens if the process is running in kernel space).
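A concrete user-space consequence of the above: with SA_RESTART unset, a blocking call such as read() fails with EINTR when a signal is delivered, and portable code retries it explicitly. A minimal sketch:
#include <errno.h>
#include <unistd.h>

/* Retry a read() that may be cut short by signal delivery. */
static ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n;
}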
A few notes about why all of this is so complex:
You cannot just stop a process running in kernel space, since the kernel code allocates memory, manipulates data structures, and more. If you just take control away, you will corrupt the kernel state and hang the machine. The kernel code must be notified in a controlled way that it must stop running, return to user space, and allow user space to handle the signal. This is done via the return value of all (well, almost all) sleeping functions in the kernel, and kernel programmers are expected to treat those return values with respect and act accordingly.
Signals are asynchronous. This means that they should be delivered as soon as possible. Imagine a process that has only one thread, went to sleep for an hour, and is delivered a signal. The sleep is inside the kernel. So you expect the kernel code to wake up, clean up after itself, return to user space, and execute the signal handler, possibly restarting the system call after the signal handler finishes. You certainly do not expect the process to execute the signal handler only an hour later. Then you expect the sleep to resume. Great trouble is taken by the user-space and kernel people to allow just that.
All in all, signals are like interrupt handlers, but for user space. The analogy is good but not perfect: while interrupts are generated by hardware, only some signals originate in hardware; most are pure software (a signal about a child process dying, a signal from another process using the kill(2) syscall, and more).
So what is the latency of signal handling?
If some other process is running when you get a signal, it is up to the kernel scheduler to decide whether to let the other process finish its time slice before delivering the signal. On a regular Linux/Unix system this means you could be delayed by one or more time slices before you get the signal (milliseconds, which are equivalent to an eternity).
When you get a signal, if your process is high-priority or other processes have already had their time slice, you will get the signal quite fast. If you are running in user space you will get it "immediately"; if you are running in kernel space you will shortly reach a sleep function or return from the kernel, at which point your signal handler will be called on the return to user space. That is usually a short time, since not a lot of time is spent in the kernel.
If you are sleeping in the kernel, and nothing else is above your priority or needs to run, the kernel thread handling your system call is woken up, cleans up after all the stuff it did on the way down into the kernel, goes back to user space, and executes your signal handler. This doesn't take too long (we're talking microseconds here).
If you are running a real-time version of Linux and your process has the highest real-time priority, you will get the signal very soon after it is triggered. We're talking 50 microseconds or even better (it depends on other factors that I cannot go into).
Think of the signal facility as interrupts, implemented by the OS (instead of in hardware).
As your program merrily traverses its locus of execution rooted in main(), these interrupts can occur, cause the program to be dispatched to a vector (handler), run the code there, and then return to the location where it got interrupted.
These interrupts (signals) can originate from a variety of sources e.g. hardware errors like accessing bad or misaligned addresses, death of a child process, user generated signals using the kill command, or from other processes using the kill system call. The way you consume signals is by designating handlers for them, which are dispatched by the OS when the signals occur. Note that some of these signals cannot be handled, and result in the process simply dying.
But those that can be handled, can be quite useful. You can use them for inter process communication i.e. one process sends a signal to another process, which handles it, and in the handler does something useful. Many daemons will do useful things like reread the configuration file if you send them the right signal.
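The classic shape of such a daemon is a handler that only sets a flag, which the main loop then checks. A minimal sketch (reread_config() is a hypothetical function standing in for the real reload logic):
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t reload_requested = 0;

static void hup_handler(int sig)
{
    (void)sig;
    reload_requested = 1;   /* only async-signal-safe work in a handler */
}

int main(void)
{
    signal(SIGHUP, hup_handler);
    for (;;) {
        pause();   /* sleep until some signal arrives */
        if (reload_requested) {
            reload_requested = 0;
            /* reread_config();  hypothetical reload routine */
        }
    }
}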
Signals are nothing but interruptions in the execution of a process. A process can signal itself, or it can cause a signal to be passed to another process. A parent can send a signal to its child in order to terminate it, etc.
Check the following links to understand:
https://unix.stackexchange.com/questions/80044/how-signals-work-internally
http://www.linuxjournal.com/article/3985
http://www.linuxprogrammingblog.com/all-about-linux-signals?page=show

detect program termination (C, Windows)

I have a program that has to perform certain tasks before it finishes. The problem is that sometimes the program crashes with an exception (like the database cannot be reached, etc.).
Now, is there any way to detect an abnormal termination and execute some code before it dies?
Thanks.
code is appreciated.
1. Win32
The Win32 API contains a way to do this via the SetUnhandledExceptionFilter function, as follows:
#include <windows.h>
#include <stdio.h>

/* Top-level filter called for any unhandled SEH exception. */
LONG WINAPI myFunc(LPEXCEPTION_POINTERS p)
{
    printf("Exception!!!\n");
    return EXCEPTION_EXECUTE_HANDLER;
}

int main()
{
    SetUnhandledExceptionFilter(myFunc);
    /* generate an exception! (volatile keeps the compiler from folding it away) */
    volatile int x = 0;
    int y = 1 / x;
    return y;
}
2. POSIX/Linux
I usually do this via the signal() function, then handle the SIGSEGV signal appropriately. You can also handle SIGTERM and SIGINT, but not SIGKILL (by design). You can use backtrace() to get a backtrace and see what caused the signal.
There are sysinternals forum threads about protecting against end-process attempts by hooking NT Internals, but what you really want is either a watchdog or peer process (reasonable approach) or some method of intercepting catastrophic events (pretty dicey).
Edit: There are reasons why they make this difficult, but it's possible to intercept or block attempts to kill your process. I know you're just trying to clean up before exiting, but as soon as someone releases a process that can't be immediately killed, someone will ask for a method to kill it immediately, and so on. Anyhow, to go down this road, see the thread linked above and search for some keywords you find in there, e.g. hook OR filter NtTerminateProcess. We're talking about kernel code, device drivers, anti-virus, security, malware, rootkit stuff here. Some books to help in this area are Windows NT/2000 Native API, Undocumented Windows 2000 Secrets: A Programmer's Cookbook, Rootkits: Subverting the Windows Kernel, and, of course, Windows Internals: Fifth Edition. This stuff is not too tough to code, but pretty touchy to get just right, and you may be introducing unexpected side effects.
Perhaps Application Recovery and Restart Functions could be of use? Supported by Vista and Server 2008 and above.
ApplicationRecoveryCallback: an application-defined callback function used to save data and application state information in the event the application encounters an unhandled exception or becomes unresponsive.
On using SetUnhandledExceptionFilter: an MSDN Social discussion advises that, to make this work reliably, patching that method in memory is the only way to be sure your filter gets called, and advises wrapping with __try/__except instead. Regardless, there is some sample code and discussion of filtering calls to SetUnhandledExceptionFilter in the article "SetUnhandledExceptionFilter" and VC8.
Also, see Windows SEH Revisited at The Awesome Factor for some sample code of AddVectoredExceptionHandler.
It depends on what you do with your "exceptions". If you handle them properly and exit from the program, you can register a function to be called on exit using atexit().
It won't work in case of a real abnormal termination, like a segfault.
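A minimal sketch of the atexit() approach:
#include <stdio.h>
#include <stdlib.h>

static void cleanup(void)
{
    /* flush logs, close the database connection, etc. */
    fputs("cleaning up\n", stderr);
}

int main(void)
{
    atexit(cleanup);   /* runs on normal exit; not on a segfault or kill -9 */
    /* ... program logic ... */
    return 0;
}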
Don't know about Windows, but on a POSIX-compliant OS you can install a signal handler that will catch different signals and do something about them. Of course you cannot catch SIGKILL and SIGSTOP.
The signal API has been part of ANSI C since C89, so Windows probably supports it. See the signal() documentation for details.
If it's Windows-only, then you can use SEH (SetUnhandledExceptionFilter), or VEH (AddVectoredExceptionHandler, but it's only for XP/2003 and up)
Sorry, not a Windows programmer. But maybe _onexit(), which registers a function to be called when the program terminates:
http://msdn.microsoft.com/en-us/library/aa298513%28VS.60%29.aspx
First, though this is fairly obvious: you can never have a completely robust solution; someone can always just pull the power cable to terminate your process. So you need a compromise, and you need to carefully lay out the details of that compromise.
One of the more robust solutions is putting the relevant code in a wrapper program. The wrapper program invokes your "real" program, waits for its process to terminate, and then -- unless your "real" program specifically signals that it has completed normally -- runs the cleanup code. This is fairly common for things like test harnesses, where the test program is likely to crash or abort or otherwise die in unexpected ways.
That still leaves the difficulty of what happens if someone does a TerminateProcess on your wrapper process, if that's something you need to worry about. If necessary, you could get around that by setting it up as a Windows service and using the operating system's features to restart it if it dies. (This just changes things a little; someone could still just stop the service.) At this point, you are probably at the point where you need to signal successful completion with something persistent, like creating a file.
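A minimal sketch of that wrapper on Windows (the command line and the cleanup step are placeholders; here a nonzero exit code is taken to mean the real program did not complete normally):
#include <windows.h>

int main(void)
{
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi = { 0 };
    char cmdline[] = "real_program.exe";   /* placeholder command line */

    if (!CreateProcessA(NULL, cmdline, NULL, NULL, FALSE, 0,
                        NULL, NULL, &si, &pi))
        return 1;

    WaitForSingleObject(pi.hProcess, INFINITE);

    DWORD code = 0;
    GetExitCodeProcess(pi.hProcess, &code);
    if (code != 0) {
        /* the real program died abnormally: run the cleanup tasks here */
    }

    CloseHandle(pi.hProcess);
    CloseHandle(pi.hThread);
    return code ? 2 : 0;
}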
I published an article at ddj.com about "post mortem debugging" some years ago.
It includes sources for Windows and Unix/Linux to detect abnormal termination. In my experience, though, a Windows handler installed using SetUnhandledExceptionFilter is not always called. In many cases it is called, but I receive quite a few log files from customers that do not include a report from the installed handlers where, for example, an ACCESS VIOLATION was the cause.
http://www.ddj.com/development-tools/185300443
