Is there any way to find where the signal that interrupted a call to sleep() came from?
I have a ginormous amount of code, and I get this stacktrace from gdb:
#0 0x00418422 in __kernel_vsyscall ()
#1 0x001adfc6 in nanosleep () from /lib/libc.so.6
#2 0x001adde1 in sleep () from /lib/libc.so.6
#3 0x080a3cbd in MRT::setUp (this=0x9c679d8) at /code/Core/exec/mrt.cc:50
#4 0x080a1efc in main (argc=13, argv=0xbfcb6934) at /code/Core/exec/rpn.cc:211
I'm not entirely sure what all the code does, but I think this is what is going on:
Program 1 starts
Calls program 2 for shared memory allocation
Waits predetermined amount of time for allocation to complete
Program 1 continues
Find what interrupts sleep
At the time you attached GDB to the program, the sleep was in fact not interrupted by anything -- your stack trace indicates that your program is still blocked in the sleep system call.
Do you know what the sleep address is inside setup()? For example, sleep(&variable). Look for all callers of wakeup(&variable); one of them is the sleep breaker. If there are too many, I would add a trace array that records each wakeup that was issued, i.e. store the PC from which wakeup was called. You can then read that array in the core file.
If you are sure that the sleep is interruptible and that it was actually interrupted, then I would do what another poster suggested: catch the signal in a signal handler, capture the signal info, and re-arm it with the same signal.
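The handler approach can be sketched as follows. This is a minimal sketch, not the original poster's code: install_signal_probe, record_signal and the last_* variables are hypothetical names, and it assumes a POSIX system where SA_SIGINFO handlers receive a siginfo_t. Note that si_pid is only meaningful for signals sent with kill() or sigqueue().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Details of the last signal seen by the handler (hypothetical names). */
static volatile sig_atomic_t last_signo;
static volatile sig_atomic_t last_code;
static volatile sig_atomic_t last_sender;   /* si_pid of the sending process */

/* Handler: only stores the fields of interest, nothing else. */
static void record_signal(int signo, siginfo_t *info, void *ucontext)
{
    (void)ucontext;
    last_signo = signo;
    last_code = info->si_code;
    last_sender = (sig_atomic_t)info->si_pid;
}

/* Install the recording handler for signo. Returns 0 on success. */
int install_signal_probe(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record_signal;
    sa.sa_flags = SA_SIGINFO;   /* no SA_RESTART, so sleep() stays interruptible */
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

int last_signal_number(void) { return (int)last_signo; }
int last_signal_code(void) { return (int)last_code; }
pid_t last_signal_sender(void) { return (pid_t)last_sender; }
```

After install_signal_probe(SIGUSR1), a non-zero return value from sleep() indicates that a handled signal cut the sleep short, and last_signal_sender() reports which process sent it.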
If you are attaching to a running process, the process is interrupted by GDB itself to allow you to debug. The stack trace you observe is simply the stack of the running process at the time you attached to it. sleep() would not be an unreasonable system call for the process to be in when you are attaching to a process that appears to be idle.
If you are debugging a core file that shows the stack trace in sleep(), then when you start GDB to load a core file, it will display the top of the current stack frame of the core file. But just above that, it shows the signal that caused the core file. I wrote a test program, and this is what it showed when I loaded the core file into GDB:
Core was generated by `./a.out'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000400458 in main ()
(gdb)
A core file is just a process snapshot; it is not always the result of an internal error in the code. Sometimes it is generated by a signal delivered from an external program or the shell. Sometimes it is generated by executing the command generate-core-file from within GDB. In these cases, your core file may not actually point to anything wrong, but just show the state the program was in at the time the core file was created.
According to the ptrace manual page:
Syscall-enter-stop and syscall-exit-stop are indistinguishable from
each other by the tracer. The tracer needs to keep track of the
sequence of ptrace-stops in order to not misinterpret syscall-enter-
stop as syscall-exit-stop or vice versa.
When I attach to a process using PTRACE_ATTACH, how do I know whether the tracee is currently in a syscall or not? Put differently, if I restart the tracee using PTRACE_SYSCALL, how do I know whether the next syscall-stop is a syscall-enter-stop or a syscall-exit-stop?
On x86-64, when the traced process stops at a system call entry, the rax register contains -ENOSYS and orig_rax holds the number of that system call.
The following code sample demonstrates this.
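/* registers is a struct user_regs_struct filled in by PTRACE_GETREGS.
   Needs <errno.h>, <stdio.h>, <string.h>, <sys/syscall.h>, <sys/user.h>. */
if ((long long)registers.rax == -ENOSYS)
{
    /* syscall entry: orig_rax holds the system call number */
    switch (registers.orig_rax)
    {
    case __NR_open: // Example
        break;
    default:
        // x86-64 passes the first three arguments in rdi, rsi, rdx
        fprintf(stderr, "%#llx, %#llx, %#llx\n",
                registers.rdi, registers.rsi,
                registers.rdx);
        break;
    }
}
else
{
    /* rax is unsigned in user_regs_struct, so cast before testing the sign */
    long long ret = (long long)registers.rax;
    if (ret < 0)
    {
        // error condition: the kernel returns -errno
        fprintf(stderr, "#Err: %s\n", strerror((int)-ret));
    }
    else
    {
        // return code
        fprintf(stderr, "%#llx\n", registers.rax);
    }
}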
When I attach to a process using PTRACE_ATTACH, how do I know whether the tracee is currently in a syscall or not?
When you attach to a process using PTRACE_ATTACH, the tracee is sent a STOP signal.
The STOP signal can take effect while executing userspace code, when entering a syscall, while blocking in a "slow" syscall in kernel, and when returning to userspace from a syscall.
By examining the instruction pointer, and the instructions around the instruction pointer, you can usually determine whether the process was executing userspace code, but that's about it.
However, because the stop point is essentially random, you can wait for the process to stop, then single-step each of its threads using PTRACE_SINGLESTEP, until the instruction pointer changes. Then you know the thread is executing userspace code.
Alternatively, if the singlestep causes the thread to block for a long time, it means the thread is executing a slow system call that is blocking.
Put differently, if I restart the tracee using PTRACE_SYSCALL, how do I know whether the next syscall-stop is a syscall-enter-stop or a syscall-exit-stop?
You don't, unless you know the state where the tracee stopped. As I noted above, you can do that by single-stepping the code, until the instruction pointer changes.
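The bookkeeping that the man page asks for can be sketched as a small toggle. This is a minimal sketch under two assumptions: the tracer has set PTRACE_O_TRACESYSGOOD, so syscall-stops report SIGTRAP | 0x80 as the stop signal, and the starting state is already known (for example from the single-stepping probe). syscall_tracker, tracker_init and tracker_classify are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/wait.h>

/* Stop signal reported for syscall-stops when PTRACE_O_TRACESYSGOOD is set. */
#define SYSCALL_STOP_SIG (SIGTRAP | 0x80)

enum stop_kind { STOP_OTHER, STOP_SYSCALL_ENTER, STOP_SYSCALL_EXIT };

/* Hypothetical per-thread state kept by the tracer. */
struct syscall_tracker {
    bool in_syscall;   /* true between an enter-stop and its exit-stop */
};

/* starts_in_syscall must be determined by other means before tracing. */
void tracker_init(struct syscall_tracker *t, bool starts_in_syscall)
{
    assert(t != NULL);
    t->in_syscall = starts_in_syscall;
}

/* Classify one waitpid() status and update the tracked state. */
enum stop_kind tracker_classify(struct syscall_tracker *t, int status)
{
    assert(t != NULL);
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SYSCALL_STOP_SIG)
        return STOP_OTHER;            /* signal stop, exit, etc.: state unchanged */
    t->in_syscall = !t->in_syscall;   /* syscall-stops alternate enter/exit */
    return t->in_syscall ? STOP_SYSCALL_ENTER : STOP_SYSCALL_EXIT;
}
```

The tracer would call tracker_classify on every status returned by waitpid() after resuming the tracee with PTRACE_SYSCALL. If the initial state is wrong, every later classification is inverted, which is why the starting point matters.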
I don't believe that you can do that with ptrace. After all, ptrace only traces: it reports events as they happen and has no way to check the history (it has no concept of the stack of the process being traced).
But then, you could use gdb to attach to a running process in the same way.
$ gdb -p 20334
...
Attaching to process 20334
...
> bt
This would give you the stack trace of the process. Provided you have the debugging symbols for your kernel installed, you may be able to see the kernel function listed (instead of just "???").
Our multi-threaded process is deadlocked in several threads, each showing the 3 frames below at the top of the stack. GDB shows that another thread is stuck in fork (called via popen), which is presumably why malloc_atfork, instead of malloc, is being called to allocate memory.
#0 0x00007f4f02c4aeec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007f4f02bc807c in _L_lock_14817 () from /usr/lib64/libc.so.6
#2 0x00007f4f02bc51df in malloc_atfork () from /usr/lib64/libc.so.6
There is a RedHat bug (https://bugzilla.redhat.com/show_bug.cgi?id=906468) about a deadlock in glibc between fork and malloc and other reports about deadlocks in malloc_atfork.
And this link, https://sourceware.org/ml/libc-alpha/2016-02/msg00269.html, from Feb, 2016, contains a patch for removing malloc_atfork.
Does anyone know a solution to this problem?
While this is a bug in glibc, it should not be able to happen except when you are calling fork from an async-signal context, where it has interrupted code that's already holding the malloc lock and the interrupted code cannot make forward progress. Otherwise, it's another thread holding the lock, and that thread should eventually make forward progress and allow the fork to continue.
Are you possibly calling popen from a signal handler? If so, that's not valid usage, and you should expect it to be able to fail in many other ways, not just this one.
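A common way to keep popen out of a signal handler is to have the handler only set a flag and to do the real work from the normal control flow. The following is a minimal sketch of that pattern; install_deferred_handler and run_pending_command are hypothetical names, not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Set by the handler, consumed by the main loop. */
static volatile sig_atomic_t work_requested;

/* Handler: only records that the signal arrived. */
static void on_signal(int signo)
{
    (void)signo;
    work_requested = 1;
}

/* Install the flag-setting handler for signo. Returns 0 on success. */
int install_deferred_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Call from the normal main loop, never from a handler.
   Runs cmd through popen() if a signal requested it and copies its output
   into buf. Returns the number of bytes read, 0 if nothing was pending,
   or -1 on error. */
int run_pending_command(const char *cmd, char *buf, size_t buflen)
{
    assert(cmd != NULL && buf != NULL && buflen > 0);
    if (!work_requested)
        return 0;
    work_requested = 0;

    FILE *p = popen(cmd, "r");
    if (p == NULL)
        return -1;
    size_t n = fread(buf, 1, buflen - 1, p);
    buf[n] = '\0';
    if (pclose(p) == -1)
        return -1;
    return (int)n;
}
```

With this split, fork (via popen) is only ever reached from ordinary code, so it cannot interrupt the same thread while that thread holds the malloc lock.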
I know this question has been asked before, but I have read all the threads and I didn't find an answer.
From the moment I execute run to start debugging my project, I get this: Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]. When I press Ctrl+C, gdb tells me: Program received signal SIGINT, Interrupt.
0x00000000 in ?? ()
Usually it tells me which file and function it was interrupted in, not 0x00000000 in ?? ().
GDB no longer hits breakpoints, and what makes matters crazier is that a colleague and I are sharing the same session (the debugging is done using cygwin with a remote machine), and it works fine for them but not for me.
When I try to get info about the threads using info threads, here's what I get:
[New Thread 20]
[New Thread 21]
[New Thread 22]
Id Target Id Frame
4 Thread 22 (ssp=0xa9004d5c) 0x00000000 in ?? ()
3 Thread 21 (ssp=0xa9002e64) 0x00000010 in ?? ()
2 Thread 20 (ssp=0xa9000ef4) 0x00000000 in ?? ()
The current thread <Thread ID 1> has terminated. See `help thread'
There's no thread 6, and there's no * to indicate which thread gdb is using. I don't even know if that's linked to the problem.
Can anyone please help me?
You are not providing nearly enough info to help you. Details matter, and you are withholding them. Versions of GDB and gdbserver matter, how you invoke GDB and gdbserver matter, what warnings you receive from GDB (if any) matter.
Now, this error message:
Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]
usually means that gdbserver has not attached to one of the threads of your process, and that thread has tried to execute a breakpoint instruction (you do have breakpoints set before this happens, don't you?).
One of the reasons this may happen is when your GDB loads "wrong" libthread_db.so (one that doesn't match the target libc.so.6).
what makes matters crazier is that a colleague and I are sharing the same session (the debugging is done using cygwin with a remote machine), and it works fine for them but not for me.
I am not sure what you mean by "same session", but it's probably not "when he types commands, they work; but when I type the same commands into the same GDB, they don't".
One difference between you and your colleague could be the LD_LIBRARY_PATH environment variable setting. Another could be in ~/.gdbinit or in ./.gdbinit.
I suggest running gdb -nx to get rid of the latter, and unsetting LD_LIBRARY_PATH to get rid of the former.
The problem with the whole thing, which for some reason no one seemed to notice, is this:
This is how I was calling gdb: /usr/local/build/gdbx.y/gdb/gdb. What I should have been doing is this: /usr/local/build/gdbx.y/build/gdb/gdb
It was a path problem.
This is my first foray into using pthreads to create a multithreaded application.
I'm trying to debug with gdb but getting some strange, unexpected behaviour.
I'm trying to ascertain whether it's me or gdb at fault.
Scenario:
Main thread creates a child thread.
I place a breakpoint on a line in the child thread fn
gdb stops on that breakpoint no problem
I confirm there are now 2 threads with info threads
I also check that the 2nd thread is starred, i.e. it is the current thread for gdb's purposes
Here is the problem, when I now hit n to step through to the next line in the thread fn, the parent thread (thread 1) simply resumes and completes and gdb exits.
Is this the correct behaviour?
How can I step through the thread fn code that is being executed in the 2nd thread line by line with gdb?
In other words, even though thread 2 is confirmed as the current thread by gdb, when I hit n, it seems to be the equivalent of hitting c in the parent thread, i.e. the parent thread (thread 1) just resumes execution, completes and exits.
At a loss as to how to debug multiple threads with gdb behaving as it is currently
I am using gdb from within emacs25, i.e. M-x gud-gdb
What GDB does here depends on your settings, and also your system (some vendors patch this area).
Normally, in all-stop mode, when the inferior stops, GDB stops all the threads. This gives you the behavior that you'd "expect" -- you can switch freely between threads and see what is going on in each one.
When the inferior continues, including via next or step, GDB lets all threads run. So, if your second thread doesn't interact with your first thread in any way (no locks, etc), you may well see it exit.
However, you can control this using set scheduler-locking. Setting this to on will make it so that only the current thread can be resumed. And, setting it to step will make it so that only the current thread can be resumed by step and next, but will let all threads run freely on continue and the like.
The default mode here is replay, which is basically off, except when using record-and-replay mode. However, the Fedora GDB is built with the default as step; I am not sure if other distros followed this, but you may want to check.
Yes, this is the correct behaviour of gdb. You are only debugging the currently active thread; other threads are executing normally behind the scenes. Think about it: how else would you move the other threads?
But your code has a bug. Your parent thread should not exit before child thread is done. The best way to do this is to join child thread in the main thread before exiting.
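Joining the child before main returns can be sketched like this. worker and run_worker_and_join are hypothetical names used for illustration; the sketch assumes a POSIX threads implementation (older toolchains need -pthread when building).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical thread function: doubles the int it is given. */
static void *worker(void *arg)
{
    int *value = arg;
    *value *= 2;   /* a convenient line for a breakpoint */
    return NULL;
}

/* Start the worker and wait for it to finish.
   Returns 0 on success or the pthread error number on failure. */
int run_worker_and_join(int *value)
{
    pthread_t tid;

    assert(value != NULL);
    int rc = pthread_create(&tid, NULL, worker, value);
    if (rc != 0)
        return rc;
    /* Without this join, the caller could return from main and end the
       whole process while the worker is still running. */
    return pthread_join(tid, NULL);
}
```

Because the parent blocks in pthread_join, letting it run freely while stepping the child no longer ends the process: the parent simply waits until the worker returns.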
I'm pretty much using GDB for the first time.
I run
$ gdb
then I'm running
attach <mypid>
then I see that my process is stuck (which is probably ok). Now I want it to continue running, so I run
continue
and my process continues running
but from here I'm stuck if I want again to watch my current stack trace etc. I couldn't get out of continuing... I tried Ctrl-D etc. but nothing worked for me... (was just a guess).
You should interrupt the process that gdb is attached to.
Do not interrupt gdb itself.
Interrupt the process either by pressing Ctrl-C in the terminal in
which the process was started, or by sending the process SIGINT
with kill -2 procid, where procid is the ID of the attached process.
Control+C in the gdb process should bring you back to the command prompt.
Here's a short GDB tutorial, and here's a full GDB manual.
The point of debugging is to inspect interesting/suspicious parts of the program. Breakpoints allow you to stop execution at some source location, and watchpoints allow you to stop when interesting data changes.
Simple examples:
(gdb) break my_function
(gdb) cont
This will insert a breakpoint at the beginning of my_function, so when execution of the program enters the function, the program will be suspended and you will get the GDB prompt back and be able to inspect the program's state. You can also step through the code.
(gdb) watch my_var
(gdb) cont
Same, but now the program will be stopped at whatever location modifies the value of my_var.
Shameless plug - here's a link to my GDB presentation at NYC BSD User Group. Hope this helps.
interrupt
gdb> help interrupt
Interrupt the execution of the debugged program.
If non-stop mode is enabled, interrupt only the current thread,
otherwise all the threads in the program are stopped. To
interrupt all running threads in non-stop mode, use the -a option.
The interrupt command also sends SIGINT to the debugged process.
gdb> info thread
Cannot execute this command while the target is running.
Use the "interrupt" command to stop the target
and then try again.
gdb> interrupt
[New Thread 27138.27266]
[New Thread 27138.27267]
[New Thread 27138.27268]
[New Thread 27138.27269]
[New Thread 27138.27270]
Thread 1 "loader" received signal SIGINT, Interrupt.
0x0000007fb7c02e90 in nanosleep () from target:/system/lib64/libc.so