Automatically attaching to process on SEGV and other fatal signals (panic_action)

Background
Code to support a 'panic_action' was recently added to the FreeRADIUS v3.0.x, v2.0.x and master branches.
When radiusd (the main FreeRADIUS process) receives a fatal signal (SIGFPE, SIGABRT, SIGSEGV, etc.), the signal handler executes a predefined 'panic_action', which is a snippet of shell code passed to system(). The signal handler performs basic substitution for %e and %p, writing in the current binary name and the current PID.
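To make the %e / %p substitution concrete, here is a minimal sketch of how such an expansion could be written. It is not the FreeRADIUS implementation: the helper name expand_panic_action and its signature are hypothetical, and it uses snprintf, which is not guaranteed to be async-signal-safe, so treat it as an illustration of the idea only.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Expand %e (executable path) and %p (PID) in a panic_action template.
 * Returns 0 on success, -1 if the result does not fit in out.
 * Hypothetical helper: name and signature are not taken from FreeRADIUS.
 */
static int expand_panic_action(const char *tmpl, const char *exe, pid_t pid,
                               char *out, size_t outlen)
{
    size_t used = 0;

    if (outlen == 0) return -1;

    for (const char *p = tmpl; *p != '\0'; p++) {
        char pidbuf[32];
        const char *piece;
        size_t len;

        if (p[0] == '%' && p[1] == 'e') {
            piece = exe;                 /* substitute the binary name */
            len = strlen(exe);
            p++;                         /* skip the 'e' */
        } else if (p[0] == '%' && p[1] == 'p') {
            snprintf(pidbuf, sizeof(pidbuf), "%ld", (long)pid);
            piece = pidbuf;              /* substitute the PID */
            len = strlen(pidbuf);
            p++;                         /* skip the 'p' */
        } else {
            piece = p;                   /* copy one literal character */
            len = 1;
        }

        if (used + len >= outlen) return -1;  /* keep room for the NUL */
        memcpy(out + used, piece, len);
        used += len;
    }

    out[used] = '\0';
    return 0;
}
```

With the template "lldb -f %e -p %p", the executable path /usr/local/freeradius/sbin/radiusd and PID 1397, this produces the same command line that appears after "Calling:" in the output further down. The expanded string is what would then be handed to system().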
This should in theory allow a debugger like gdb or lldb to attach to the process (panic_action = lldb -f %e -p %p), either to perform interactive debugging, or to automate collection of a backtrace. This actually works well on my system OSX 10.9.2 with lldb, but only for SIGABRT.
Problem
This doesn't seem to work for other signals like SIGSEGV. The mini backtrace from execinfo is valid, but when lldb or gdb attach to the process, they only get the backtrace for the signal handler.
There doesn't seem to be a way in lldb to switch to an arbitrary frame address.
Does anyone know if there's any way of forcing the signal handler to execute on the same stack as the thread that received the signal? Or why, when lldb attaches, the backtraces don't show the full stack?
The actual output looks like:
FATAL SIGNAL: Segmentation fault: 11
Backtrace of last 12 frames:
0 libfreeradius-radius.dylib 0x000000010cf1f00f fr_fault + 127
1 libsystem_platform.dylib 0x00007fff8b03e5aa _sigtramp + 26
2 radiusd 0x000000010ce7617f do_compile_modsingle + 3103
3 libfreeradius-server.dylib 0x000000010cef3780 fr_condition_walk + 48
4 radiusd 0x000000010ce7710f modcall_pass2 + 191
5 radiusd 0x000000010ce7713f modcall_pass2 + 239
6 radiusd 0x000000010ce7078d virtual_servers_load + 685
7 radiusd 0x000000010ce71df1 setup_modules + 1633
8 radiusd 0x000000010ce6daae read_mainconfig + 2526
9 radiusd 0x000000010ce78fe6 main + 1798
10 libdyld.dylib 0x00007fff8580a5fd start + 1
11 ??? 0x0000000000000002 0x0 + 2
Calling: lldb -f /usr/local/freeradius/sbin/radiusd -p 1397
Current executable set to '/usr/local/freeradius/sbin/radiusd' (x86_64).
Attaching to process with:
process attach -p 1397
Process 1397 stopped
(lldb) bt
error: libfreeradius-radius.dylib debug map object file '/Users/arr2036/Documents/Repositories/freeradius-server-fork/build/objs//Users/arr2036/Documents/Repositories/freeradius-server-master/src/lib/debug.o' has changed (actual time is 0x530f3d21, debug map time is 0x530f37a5) since this executable was linked, file will be ignored
* thread #1: tid = 0x8d824, 0x00007fff867fee38 libsystem_kernel.dylib`wait4 + 8, queue = 'com.apple.main-thread, stop reason = signal SIGSTOP
frame #0: 0x00007fff867fee38 libsystem_kernel.dylib`wait4 + 8
frame #1: 0x00007fff82869090 libsystem_c.dylib`system + 425
frame #2: 0x000000010cf1f2e1 libfreeradius-radius.dylib`fr_fault + 849
frame #3: 0x00007fff8b03e5aa libsystem_platform.dylib`_sigtramp + 26
(lldb)
Code
The relevant code for fr_fault() is here: https://github.com/FreeRADIUS/freeradius-server/blob/b7ec8c37c7204accbce4be4de5013397ab662ea3/src/lib/debug.c#L227
and fr_set_signal(), the function used to set up signal handlers, is here: https://github.com/FreeRADIUS/freeradius-server/blob/0cf0e88704228e8eac2948086e2ba2f4d17a5171/src/lib/misc.c#L61
As the links contain commit hashes, the code should be static.
EDIT
Finally, with version lldb-330.0.48 on OSX 10.10.4, lldb can now go past _sigtramp.
frame #2: 0x000000010b96c5f7 libfreeradius-radius.dylib`fr_fault(sig=11) + 983 at debug.c:735
732 FR_FAULT_LOG("Temporarily setting PR_DUMPABLE to 1");
733 }
734
-> 735 code = system(cmd);
736
737 /*
738 * We only want to error out here, if dumpable was originally disabled
(lldb)
frame #3: 0x00007fff8df77f1a libsystem_platform.dylib`_sigtramp + 26
libsystem_platform.dylib`_sigtramp:
0x7fff8df77f1a <+26>: decl -0x16f33a50(%rip)
0x7fff8df77f20 <+32>: movq %rbx, %rdi
0x7fff8df77f23 <+35>: movl $0x1e, %esi
0x7fff8df77f28 <+40>: callq 0x7fff8df794d8 ; symbol stub for: __sigreturn
(lldb)
frame #4: 0x000000010bccb027 rlm_json.dylib`_json_map_proc_get_value(ctx=0x00007ffefa62dbe0, out=0x00007fff543534b8, request=0x00007ffefa62da30, map=0x00007ffefa62aaf0, uctx=0x00007fff54353688) + 391 at rlm_json.c:191
188 }
189 vp = map->op;
190
-> 191 if (value_data_steal(vp, &vp->data, vp->da->type, value) < 0) {
192 REDEBUG("Copying data to attribute failed: %s", fr_strerror());
193 talloc_free(vp);
194 goto error;

This is a bug in lldb related to backtracing through _sigtramp, the asynchronous signal handler in user processes. Unfortunately I can't suggest a workaround for this problem. It has been fixed in the top of tree sources for lldb at http://lldb.llvm.org/ if you're willing to build from source (see the "Source" and "Build" sidebars). But Xcode 5.0 and the next dot release are going to have real problems backtracing past _sigtramp.

Related

Why does fork() fail on MacOs Big Sur if the executable that runs it is deleted?

If a running process's executable is deleted, I've noticed that fork appears to fail: the child process is never executed.
For example, consider the code below:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main(void) {
    sleep(5);
    pid_t forkResult;
    forkResult = fork();
    printf("after fork %d \n", forkResult);
    return 0;
}
If I compile this and delete the resulting executable before fork is called, I never see fork return a pid of 0, meaning the child process never starts. I only have a Mac running Big Sur, so I'm not sure if this reproduces on other operating systems.
Does anyone know why this would be? My understanding is an executable should work just fine even if it's deleted while still running.
The expectation that the process should continue even if the binary was deleted is correct in general, but not entirely correct on macOS. The example is tripping on a side effect of the System Integrity Protection (SIP) mechanism inside the macOS kernel. Before explaining exactly what is going on, we need to run several experiments that will help us understand the whole scenario.
Modified example to better demonstrate the issue
To demonstrate what is going on, I modified the example to count to 9 and then do the fork. After the fork, the child prints the message "I am done", waits 1 second and exits after printing 0 as the PID. The parent continues counting to 14 and prints the child PID. The code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main(void) {
    for (int i = 0; i < 10; i++) {
        sleep(1);
        printf("%i ", i);
    }
    pid_t forkResult;
    forkResult = fork();
    if (forkResult != 0) {
        for (int i = 10; i < 15; i++) {
            sleep(1);
            printf("%i ", i);
        }
    } else {
        sleep(1);
        printf("I am done ");
    }
    printf("after fork %d \n", forkResult);
    return 0;
}
After compiling it, I ran the normal scenario:
╰> ./a.out
0 1 2 3 4 5 6 7 8 9 I am done after fork 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 after fork 4385
So, the normal scenario works as expected. We see the count from 0 to 9 twice because the stdout buffer, which had not been flushed yet, was copied into the child by the fork call.
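The buffer duplication can be shown in isolation. The sketch below is an assumption-laden illustration, not part of the original test: it uses a temporary file instead of stdout so the result can be measured, and the function name bytes_written_across_fork is made up. Both processes flush their own copy of the stdio buffer, so unflushed data ends up in the file twice.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Write 6 bytes to a fully buffered stream, fork, and let parent and child
 * each flush their copy of the buffer. Returns the resulting file size in
 * bytes, or -1 on error. If flush_before_fork is non-zero the buffer is
 * emptied before fork(), so the child inherits nothing to write.
 */
static long bytes_written_across_fork(int flush_before_fork)
{
    FILE *fp = tmpfile();
    long size;
    pid_t pid;

    if (fp == NULL) return -1;

    fputs("0 1 2 ", fp);                  /* stays in the stdio buffer */
    if (flush_before_fork) fflush(fp);

    pid = fork();
    if (pid < 0) {
        fclose(fp);
        return -1;
    }
    if (pid == 0) {
        fclose(fp);                       /* child flushes its inherited copy */
        _exit(0);                         /* skip flushing any other streams */
    }

    waitpid(pid, NULL, 0);
    fflush(fp);                           /* parent flushes its own copy */
    fseek(fp, 0, SEEK_END);
    size = ftell(fp);
    fclose(fp);
    return size;
}
```

Without the flush the file ends up with 12 bytes (the 6 bytes written twice); with fflush() before fork() it ends up with 6. Calling fflush(stdout) right before fork() in the test program would remove the duplicated "0 1 2 ... 9" in the same way.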
Tracing the failing example
Now it is time for the negative scenario: we wait 5 seconds after the start and then remove the binary.
╰> ./a.out & (sleep 5 && rm a.out)
[4] 8555
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 after fork 8677
[4] 8555 done ./a.out
We see output only from the parent. The parent counted to 14 and shows a valid PID for the child, but the child never printed anything. So the child failed after fork() had already succeeded; otherwise fork() would have returned an error instead of a valid PID. Traces from ktrace reveal that the child was created under that PID and was woken up:
test5-ko.txt:2021-04-07 13:34:26.623783 +04 0.3 MACH_DISPATCH 1bc 0 84 4 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623783 +04 0.2 TMR_TimerCallEnter 9931ba49ead1bd17 0 330e7e4e9a59 41 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623783 +04 0.0(0.0) TMR_TimerCallEnter 9931ba49ead1bd17 0 330e7e4e9a59 0 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623783 +04 0.0 TMR_TimerCallEnter 9931ba49ead1bd17 0 330e7e4e9a59 0 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623854 +04 0.0 imp_thread_qos_and_relprio 88775d 20000 20200 6 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623854 +04 0.0 imp_update_thread 88775d 811200 140000100 1f 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623855 +04 0.1(0.8) imp_update_thread 88775d c15200 140000100 25 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623855 +04 0.0(1.1) imp_thread_qos_and_relprio 88775d 30000 20200 40 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623855 +04 0.0 imp_thread_qos_workq_override 88775d 30000 20200 0 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623855 +04 0.0 imp_update_thread 88775d c15200 140000100 25 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623855 +04 0.1(0.1) imp_update_thread 88775d c15200 140000100 25 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623855 +04 0.0(0.2) imp_thread_qos_workq_override 88775d 30000 20200 40 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623857 +04 1.3 TURNSTILE_turnstile_added_to_thread_heap 88775d 9931ba6049ddcc77 0 0 888065 2 a.out(8677)
test5-ko.txt:2021-04-07 13:34:26.623858 +04 1.0 MACH_MKRUNNABLE 88775d 25 0 5 888065 2 a.out(8677)
So the child process was dispatched with MACH_DISPATCH and made runnable with MACH_MKRUNNABLE. This is the reason the parent got a valid PID from fork().
Furthermore, the ktrace for the normal scenario shows that the process issued BSC_exit and that imp_task_terminated occurred, which is the normal way for a process to exit. In the second scenario, where we deleted the file, the trace doesn't show BSC_exit. This means that the child was terminated by the kernel rather than exiting normally. We also know that the termination happened after the child was created properly, since the parent received a valid PID and the child was made runnable.
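The same distinction (normal exit versus termination from outside) can also be observed from the parent with waitpid(). The following is a minimal sketch with hypothetical names (child_fate, fork_and_classify and the two sample child bodies); I have not verified which status macOS actually reports in the deleted-binary case, so this only shows how one would check.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum { CHILD_EXITED, CHILD_SIGNALED, CHILD_OTHER, CHILD_ERROR } child_fate;

/* Sample child bodies, used only to exercise both outcomes. */
static void child_exits_normally(void) { _exit(3); }
static void child_is_killed(void) { kill(getpid(), SIGKILL); pause(); }

/*
 * Fork, run child_body in the child, and report how the child ended.
 * On CHILD_EXITED *detail is the exit code, on CHILD_SIGNALED it is the
 * number of the signal that terminated the child.
 */
static child_fate fork_and_classify(void (*child_body)(void), int *detail)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) return CHILD_ERROR;
    if (pid == 0) {
        child_body();
        _exit(0);                         /* fallback if the body returns */
    }

    if (waitpid(pid, &status, 0) != pid) return CHILD_ERROR;

    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);    /* child called exit/_exit */
        return CHILD_EXITED;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);       /* child was killed by a signal */
        return CHILD_SIGNALED;
    }
    return CHILD_OTHER;
}
```

Adding a waitpid() call like this to the parent in the test program would show directly whether the missing child exited on its own or was killed.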
This brings us closer to understanding what is going on here. But before we reach a conclusion, let's look at another, even more "twisted" example.
Even more strange example
What if we replace the binary on the filesystem after we started the process?
Here is the test to answer this question: we start the process, remove the binary, and create an empty file with the same name in its place with touch.
╰> ./a.out & (sleep 5 && rm a.out; touch a.out)
[1] 6264
0 1 2 3 4 5 6 7 8 9 I am done after fork 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 after fork 6851
[1] + 6722 done ./a.out
Wait a minute, this works!? What is going on here?
This strange example gives us an important clue that will help explain what is going on.
The root-cause of the issue
The reason why the third example works while the second one fails reveals a lot about what is going on here. As mentioned at the beginning, we are tripping on a side effect of SIP, more precisely on the runtime protection mechanism.
To protect system integrity, SIP examines running processes for system protection and special entitlements. From the Apple documentation: ...When a process is started, the kernel checks to see whether the main executable is protected on disk or is signed with an special system entitlement. If either is true, then a flag is set to denote that it is protected against modification. Any attempt to attach to a protected process is denied by the kernel...
When we removed the binary from the filesystem, the protection mechanism was not able to identify the type of process for the child, nor its special system entitlements, since the binary file was missing from the disk. This caused the protection mechanism to treat the process as an intruder in the system and terminate it, hence we did not see BSC_exit for the child process.
In the third example, when we created a dummy entry on the filesystem with touch, SIP was able to determine that this is not a special process and that it has no special entitlements, and allowed the process to continue. This is a very solid indication that we were tripping on the SIP runtime protection mechanism.
To prove that this is the case, I disabled SIP (which requires a restart in recovery mode) and executed the test:
╰> csrutil status
System Integrity Protection status: disabled.
╰> ./a.out & (sleep 5 && rm a.out)
[1] 1504
0 1 2 3 4 5 6 7 8 9 I am done after fork 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 after fork 1626
Conclusion
So, the whole issue was caused by System Integrity Protection. More details can be found in the documentation.
All SIP needed was a file on the filesystem with the process name, so that the mechanism could run its verification and decide to allow the child to continue execution. This shows that we are observing a side effect rather than designed behavior, since the empty file was not even a valid executable, yet execution proceeded.

GDB pid changes when connected automatically and quit

I am trying to attach to an existing process, run some commands and print the required information. However, when I do, I see that the PID of the process has changed, and a "Killed" message is displayed for the original process.
Code
1 #include <iostream>
2 #include <unistd.h>
3 using namespace std;
4
5
6 int main()
7 {
8 do
9 {
10 static int s = 100;
11 s = s+1;
12 sleep (3);
13 } while(1);
14 return 0;
15 }
16
GDB commands
> cat /tmp/command.txt
set pagination off
set logging file /home/testgrp/gdb.txt
set logging on
b sample.cc:11
commands 1
p s
end
run 1
quit
Output
root#198.18.81.198:/desktop/user1/workspace# ps -eaf | grep out
root 16724 8877 0 08:25 pts/1 00:00:00 grep --color=auto out
root#198.18.81.198:/desktop/user1/workspace# cat /home/testgrp/gdb.txt
cat: /home/testgrp/gdb.txt: No such file or directory
root#198.18.81.198:/desktop/user1/workspace# ./a.out &
[1] 16762
root#198.18.81.198:/desktop/user1/workspace# gdb --batch-silent -x=/tmp/command.txt -p 16762
[1]+ Killed ./a.out
root#198.18.81.198:/desktop/user1/workspace# ps -eaf | grep out
root 16805 1 0 08:25 pts/1 00:00:00 /desktop/user1/workspace/a.out 1
root 16823 8877 0 08:25 pts/1 00:00:00 grep --color=auto out
root#198.18.81.198:/desktop/user1/workspace# cat /home/testgrp/gdb.txt
Breakpoint 1 at 0x400711: file sample.cc, line 11.
$1 = 100
A debugging session is active.
Inferior 1 [process 16805] will be detached.
Quit anyway? (y or n) [answered Y; input not from terminal]
Question
How do I get the required information without changing the PID of the process?
More importantly, why does the PID change, and why is the previous PID killed?
Appendix
GDB version
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2aka8.0.1) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
The problem is in the last two lines of your gdb script:
run 1
quit
"run 1" restarts the program being debugged, with the argument "1". By default gdb asks you to confirm the restart, but you passed "--batch-silent" when starting gdb, so your process is restarted without any message. That is why the original PID is killed and a new process (a.out 1) appears.
Delete "--batch-silent" and the last two lines of your gdb script; then you can break and debug.
With "b sample.cc:11" you may find the process stopped inside the system function sleep; you can change the breakpoint to another line if that is confusing. (I recommend working through some simpler demos before using gdb this way; the number of commands can be confusing for a beginner.)

PC=0x00000000, corrupt stack in gdb, but FreeRTOS thread is still running fine on STM32

I'm developing a multithreaded application with FreeRTOS on an STM32.
When I try to debug it with OpenOCD and gdb, I can do so with all threads except my main loop.
>>> info threads
Id Target Id Frame
6 Thread 536892936 (cli) vTaskSuspend (xTaskToSuspend=<optimized out>) at /home/user1273684/dev/firmware/module/FreeRTOS/Source/tasks.c:1620
5 Thread 536888728 (wifi_loop) vTaskSuspend (xTaskToSuspend=<optimized out>) at /home/user1273684/dev/firmware/module/FreeRTOS/Source/tasks.c:1620
4 Thread 536884824 (Tmr Svc) xTaskResumeAll () at /home/user1273684/dev/firmware/module/FreeRTOS/Source/tasks.c:2126
3 Thread 536905240 (main_loop) 0x00000000 in ?? ()
2 Thread 536879832 (wifi_watchdog) xTaskResumeAll () at /home/user1273684/dev/firmware/module/FreeRTOS/Source/tasks.c:2126
* 1 Thread 536882960 (IDLE : : Running) prvIdleTask (pvParameters=<optimized out>) at /home/user1273684/dev/firmware/module/FreeRTOS/Source/tasks.c:3145
>>> thread 3
[Switching to thread 3 (Thread 536905240)]
#0 0x00000000 in ?? ()
>>> bt
#0 0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
I tried increasing the stack size (vTaskList() says there is plenty of space left, configCHECK_FOR_STACK_OVERFLOW is set to 2 and vApplicationStackOverflowHook() is never triggered) but without any change.
2017-06-20 10:34:34,217 - INFO # cli R 1 922 5
2017-06-20 10:34:34,217 - INFO # IDLE R 0 235 2
2017-06-20 10:34:34,218 - INFO # wifi_watchdog B 1 231 8
2017-06-20 10:34:34,218 - INFO # main_loop B 2 2879 6
2017-06-20 10:34:34,218 - INFO # Tmr Svc S 4 320 3
2017-06-20 10:34:34,218 - INFO # wifi_loop S 3 627 4
What is going on here?

How can I see what library owns symbols for an lldb backtrace

I am trying to debug an EXC_BAD_ACCESS but can't tell what was executed using backtrace with lldb. Of course I am missing the debug symbols for those particular frames, but I don't know how to figure out which library owns the address. I tried image list --address with the address of the stack frame, but that doesn't return anything. Any pointers (no pun intended) would be greatly appreciated. My end goal is to see the line of code where the segfault happened. I am doing this from the command line and not from Xcode, by the way.
Here is a snapshot of my stack trace with the missing symbols, in case my explanation doesn't make sense.
frame #0: 0x0000000103f7e2dc
frame #1: 0x0000000103f5c3d0
frame #2: 0x0000000103f5c2b3
frame #3: 0x0000000103f5c2b3
frame #4: 0x0000000103f5c2b3
frame #5: 0x0000000103f5c2b3
frame #6: 0x0000000103f5c0d8
frame #7: 0x0000000103f564e7
frame #8: 0x00000001036d6d90 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*) + 554
frame #9: 0x00000001036d6b60 libjvm.dylib`JavaCalls::call(JavaValue*, methodHandle, JavaCallArguments*, Thread*) + 40
frame #10: 0x0000000103860580 libjvm.dylib`Reflection::invoke(instanceKlassHandle, methodHandle, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 2556
frame #11: 0x00000001038609e6 libjvm.dylib`Reflection::invoke_method(oopDesc*, Handle, objArrayHandle, Thread*) + 366
frame #12: 0x00000001037236d7 libjvm.dylib`JVM_InvokeMethod + 358
frame #13: 0x0000000103f6e4b9
frame #14: 0x0000000103f5c2b3
frame #15: 0x0000000103f5c2b3
frame #16: 0x0000000103f5c961
frame #17: 0x0000000103f5c2b3
frame #18: 0x0000000103f5c2b3
frame #19: 0x0000000103f5c2b3
frame #20: 0x0000000103f5c2b3
frame #21: 0x0000000103f5c0d8
frame #22: 0x0000000103f5c0d8
Normally, the name of the library stands next to the address. Since your backtrace shows libjvm, I guess that the frames without further information are JIT-compiled Java code.

Segmentation fault as soon as the binary launches

How can I debug a segmentation fault that occurs when launching a binary on Linux?
No source code is available for the binary.
How can I find out which system calls made by the binary led to the segfault? Is there any debugging utility that might help?
In addition to what's been suggested you can also do the following:
Run ulimit -c unlimited to enable core dumping, then run your app.
At the point of segfaulting it should do a core dump.
Then you can run gdb your_app core and inside gdb run backtrace. Maybe it's been compiled with debugging symbols so you actually get quite a bit of information out.
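If the binary is started from a small launcher program rather than from a shell, the effect of ulimit -c can be approximated in C with setrlimit(). This is a minimal sketch under that assumption; the function name enable_core_dumps is made up, and it only raises the soft limit as far as the hard limit allows, which may be less than "unlimited". Whether a core file is actually written also depends on system settings outside the process.

```c
#include <sys/resource.h>

/*
 * Raise the soft core-file size limit to the hard limit for the current
 * process. The limit is inherited by programs started afterwards with
 * fork()/exec(). Returns 0 on success, -1 on failure.
 */
static int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) return -1;

    lim.rlim_cur = lim.rlim_max;          /* soft limit up to the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

A launcher would call enable_core_dumps() and then exec the crashing binary, after which the core file can be inspected with gdb as described above.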
Does strace your-program help you? It will print a list of all system calls called by your program.
Sample Output
% strace true
.
2 2 [main] true (2064) **********************************************
83 85 [main] true (2064) Program name: C:\cygwin\bin\true.exe (windows pid 2064)
44 129 [main] true (2064) OS version: Windows NT-6.1
36 165 [main] true (2064) **********************************************
145 310 [main] true (2064) sigprocmask: 0 = sigprocmask (0, 0x6123D468, 0x610FBA10)
183 493 [main] true 2064 open_shared: name shared.5, n 5, shared 0x60FF0000 (wanted 0x60FF0000), h 0x70, *m 6
27 520 [main] true 2064 heap_init: heap base 0x20000000, heap top 0x20000000, heap size 0x18000000 (402653184)
30 550 [main] true 2064 open_shared: name foo, n 1, shared 0x60FE0000 (wanted 0x60FE0000), h 0x68, *m 6
18 568 [main] true 2064 user_info::create: opening user shared for 'foo' at 0x60FE0000
17 585 [main] true 2064 user_info::create: user shared version 6467403B
36 621 [main] true 2064 fhandler_pipe::create: name \\.\dir\cygwin-c5e39b7a9d22bafb-2064-sigwait, size 164, mode PIPE_TYPE_MESSAGE
51 672 [main] true 2064 fhandler_pipe::create: pipe read handle 0x84
16 688 [main] true 2064 fhandler_pipe::create: CreateFile: name \\.\dir\cygwin-c5e39b7a9d22bafb-2064-sigwait
35 723 [main] true 2064 fhandler_pipe::create: pipe write handle 0x88
23 746 [main] true 2064 dll_crt0_0: finished dll_crt0_0 initialization
