I'm trying to debug a seg fault in C in a research code. I cannot modify the source code/makefile. Since I can't modify the makefile (i.e., recompile the program) and the executable was not compiled with the -g option, I assume that throws gdb debugging out the window? Or is there a way to use gdb without compiling the executable using -g?
I could request to make changes to the source code, but I am almost certain the seg fault is due to one of my input files, so it shouldn't be a source code problem.
Someone had suggested I use "strace," which I am not very familiar with. Here is the end of the output when I strace'd my program:
close(27) = 0
munmap(0x2abe4843d000, 65536) = 0
write(2, "==== backtrace ====\n", 20==== backtrace ====
) = 20
write(2, " 2 0x00000000000597bc mxm_handle"..., 113 2 0x00000000000597bc mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.3.3055/src/mxm/util/debug/debug.c:641
) = 113
write(2, " 3 0x000000000005992c mxm_error_"..., 121 3 0x000000000005992c mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.3.3055/src/mxm/util/debug/debug.c:616
) = 121
write(2, " 4 0x00000037ccc326a0 killpg() "..., 37 4 0x00000037ccc326a0 killpg() ??:0
) = 37
write(2, " 5 0x00000000004ec6ef interpLema"..., 99 5 0x00000000004ec6ef interpLemansToMopar_linear() /home/dzdang/w16/sources/mopar_bc_interp.c:559
) = 99
write(2, " 6 0x000000000040c4ee main() /h"..., 68 6 0x000000000040c4ee main() /home/dzdang/w16/sources/lemans.c:611
) = 68
write(2, " 7 0x00000037ccc1ed5d __libc_sta"..., 48 7 0x00000037ccc1ed5d __libc_start_main() ??:0
) = 48
write(2, " 8 0x0000000000403c99 _start() "..., 37 8 0x0000000000403c99 _start() ??:0
) = 37
write(2, "===================\n", 20===================
) = 20
brk(0x2958000) = 0x2958000
tgkill(15432, 15432, SIGSEGV) = 0
rt_sigreturn(0x3c48) = 46993935941696
--- SIGSEGV (Segmentation fault) # 0 (0) ---
+++ killed by SIGSEGV +++
Segmentation fault
Any ideas what this means? Or any suggestions on how to debug?
"...but I am almost certain the seg fault is due to one of my input files"
Then your debugging should concentrate on your input files. Is there a specification of the input?
If you have many input files and checking them manually is infeasible, you could write a validator in C that checks all the input files for the proper format and reports errors. With validated files, the program in question should (hopefully) no longer crash.
(EDIT)
As for debugging the inputs, start with a minimal input and expand it step by step until you reach the complete input. The step at which the crash first occurs may give you an indication of the cause.
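A validator of the kind suggested above could start out as small as the following sketch. The input format is not known from the question, so this assumes a purely hypothetical one: every line must hold a fixed number of whitespace-separated numbers. The function name validate_line and its return convention are made up for illustration.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical format check: `line` must contain exactly `expected`
 * whitespace-separated numbers.
 * Returns 0 if the line is valid, the 1-based index of the first
 * malformed field, or -1 if every field parses but the count is wrong.
 */
int validate_line(const char *line, int expected)
{
    int count = 0;
    const char *p = line;

    for (;;) {
        /* Skip the whitespace in front of the next field. */
        while (isspace((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;

        char *end;
        errno = 0;
        (void)strtod(p, &end);

        /* Nothing parsed, or the value is out of range for a double. */
        if (end == p || errno == ERANGE)
            return count + 1;
        /* Trailing garbage glued to the number, e.g. "2x". */
        if (*end != '\0' && !isspace((unsigned char)*end))
            return count + 1;

        count++;
        p = end;
    }
    return count == expected ? 0 : -1;
}
```

Calling this for each line read with fgets and printing the file name, line number and field index on failure would point at the offending input without touching the research code.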
I assume that throws gdb debugging out the window?
I am almost certain the seg fault is due to one of my input files, so it shouldn't be a source code problem.
You are making a lot of unwarranted assumptions.
A SIGSEGV is always a source code problem: invalid input should produce an error, not a crash.
The output from strace that you show contains file and line info, which usually means that the program is in fact compiled with -g.
GDB is perfectly capable of debugging programs compiled without -g, but it requires a skilled operator.
The program appears to self-report an error of some kind. Unfortunately you've removed all the relevant parts of that error, and show only the stack trace (which doesn't mean anything without the earlier output).
What you should do:
Stop making unwarranted assumptions and guesses.
Edit your question (or start a new one), showing the error that the program actually reports.
Run the program under GDB, and observe that you can in fact see file / line info, and likely parameter and local variable values.
Read the source, understand how and why it may crash, examine variables at point of crash, understand the cause (i.e. actually debug the problem).
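To illustrate the point that invalid input should produce an error rather than a crash: the backtrace only says that the fault happens in interpLemansToMopar_linear() at mopar_bc_interp.c:559, not what that line does. The sketch below is therefore purely hypothetical. It assumes the common case of an array index derived from input data, and the name checked_lookup is invented for the example.

```c
#include <stddef.h>

/*
 * Hypothetical defensive lookup: stores table[index] in *out and
 * returns 0, or returns -1 without touching memory if the index
 * (for example, one computed from an input file) is outside [0, n).
 */
int checked_lookup(const double *table, size_t n, long index, double *out)
{
    if (table == NULL || out == NULL)
        return -1;
    if (index < 0 || (size_t)index >= n)
        return -1;
    *out = table[index];
    return 0;
}
```

A caller that checks the return value can report which input value was out of range and exit cleanly, instead of dying with SIGSEGV.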
We're learning to use GDB in my Computer Architecture class. We do most of our work over SSH on a Raspberry Pi. When I run GDB on some code our instructor gave us to debug, it ends with an error message saying it can't find raise.c.
I've tried:
installing libc6, libc6-dbg (says they're already up-to-date)
apt-get source glibc (gives me: "You must put some 'source' URIs in your sources.list")
https://stackoverflow.com/a/48287761/12015458 (apt source returns same thing as the apt-get source above, the "find $PWD" command the user gave returns nothing)
I've tried looking for it manually where told it may be? (/lib/libc doesn't exist for me)
This is the code he gave us to try debugging on GDB:
#include <stdio.h>
main()
{
int x,y;
y=54389;
for (x=10; x>=0; x--)
y=y/x;
printf("%d\n",y);
}
However, whenever I run the code in GDB I get the following error:
Program received signal SIGFPE, Arithmetic exception.
__GI_raise (sig=8) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
I asked him about it and he didn't really have any ideas on how to fix it.
It does not really matter that the source for raise() is not found. It would only show you the line where the exception is finally raised, but not the place where the error is triggered.
Run the erroneous program again in GDB. When the exception is raised, investigate the call stack and the stack frames with GDB's commands. This is the point of your task, so I won't give you more than this hint.
If you're clever you can see the error in the given source just by looking at it. ;-)
When GDB does not know any symbol, you need to compile with the option -g to get debugger support.
EDIT
Now on a Windows system this is my log (please excuse the colouring, I didn't find a language selector for plain text):
D:\tmp\StackOverflow\so_027 > type crash1.c
#include <stdio.h>
main()
{
int x,y;
y=54389;
for (x=10; x>=0; x--)
y=y/x;
printf("%d\n",y);
}
D:\tmp\StackOverflow\so_027 > gcc crash1.c -g -o crash1.out
crash1.c:2:1: warning: return type defaults to 'int' [-Wimplicit-int]
main()
^~~~
D:\tmp\StackOverflow\so_027 > dir
[...cut...]
04.09.2019 08:33 144 crash1.c
04.09.2019 08:40 54.716 crash1.out
D:\tmp\StackOverflow\so_027 > gdb crash1.out
GNU gdb (GDB) 8.1
[...cut...]
This GDB was configured as "x86_64-w64-mingw32".
[...cut...]
Reading symbols from crash1.out...done.
(gdb) run
Starting program: D:\tmp\StackOverflow\so_027\crash1.out
[New Thread 4520.0x28b8]
[New Thread 4520.0x33f0]
Thread 1 received signal SIGFPE, Arithmetic exception.
0x0000000000401571 in main () at crash1.c:7
7 y=y/x;
(gdb) backtrace
#0 0x0000000000401571 in main () at crash1.c:7
(gdb) help stack
Examining the stack.
The stack is made up of stack frames. Gdb assigns numbers to stack frames
counting from zero for the innermost (currently executing) frame.
At any time gdb identifies one frame as the "selected" frame.
Variable lookups are done with respect to the selected frame.
When the program being debugged stops, gdb selects the innermost frame.
The commands below can be used to select other frames by number or address.
List of commands:
backtrace -- Print backtrace of all stack frames
bt -- Print backtrace of all stack frames
down -- Select and print stack frame called by this one
frame -- Select and print a stack frame
return -- Make selected stack frame return to its caller
select-frame -- Select a stack frame without printing anything
up -- Select and print stack frame that called this one
Type "help" followed by command name for full documentation.
Type "apropos word" to search for commands related to "word".
Command name abbreviations are allowed if unambiguous.
(gdb) next
Thread 1 received signal SIGFPE, Arithmetic exception.
0x0000000000401571 in main () at crash1.c:7
7 y=y/x;
(gdb) next
[Inferior 1 (process 4520) exited with code 030000000224]
(gdb) next
The program is not being run.
(gdb) quit
D:\tmp\StackOverflow\so_027 >
Well, it directly marks the erroneous source line. That differs from your environment, since you use a Raspberry Pi. However, it shows you some GDB commands to try.
Concerning your video:
It is clear that inside raise() you can't access x. That's why GDB moans about it.
If an exception is raised usually the program is about to quit. So there is no value in stepping forward.
Instead, as shown in my log, use GDB commands to investigate the stack frames. I think this is the issue you are about to learn.
BTW, do you know that you should be able to copy the screen content? This will make reading so much easier for us.
From a practical standpoint the other answer is correct, but if you do want the libc sources:
apt-get source is the right way to get the sources of libc, but yes, you do need to have source repositories configured in /etc/apt/sources.list.
If you're using Ubuntu, see the deb-src lines in https://help.ubuntu.com/community/Repositories/CommandLine
For debian, see https://wiki.debian.org/SourcesList#Example_sources.list
Then apt-get source should work. Remember to tell GDB where those sources are using the "directory" command.
I had a program that was segfaulting.
When I went to investigate and ran dmesg I could see lines like this:
[955.915050] traps: foo_bar[123] general protection ip:7f5fcc2d4306 sp:7ffd9e5868b8 ...
Now the program has been fixed and I'm trying to write some analysis scripts across different systems to find similar messages and was hoping to induce a line in the dmesg log to get a baseline for what to look for and see if there's a difference between, say, a sigbus(10) and a sigill(4)
I tried to do it via kill -11 on the command line. No entry in dmesg.
I tried to do it via signal(getpid(), 11) in the code. No entry in dmesg.
I tried to do it via signal 11 after attaching in gdb. No entry in dmesg.
I tried to do it via writing bad code and it worked for SEGV, but I can't figure out how to trigger a SIGBUS (for example)
I'm guessing that there is more than one path for handling the signal depending on how it occurs and my attempts above just aren't doing it the right way.
How can I trigger/send a signal to my program that'll get a line in dmesg? Is there some kernel or log configuration I can twiddle to get those lines?
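For the SIGBUS case specifically, one way to have the kernel's fault handling raise the signal, rather than sending it with kill, is to touch a page of a file mapping that lies beyond the end of the file. The following is a minimal sketch assuming Linux (or another POSIX system with mmap); the function names are made up for illustration. Whether a line then shows up in dmesg depends on the architecture and kernel settings, so this only demonstrates how to produce the signal itself.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child side: map one page of an empty temporary file and write to it. */
static void touch_page_past_eof(void)
{
    char path[] = "/tmp/sigbus_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        _exit(2);
    unlink(path); /* the file disappears once fd is closed */

    long page = sysconf(_SC_PAGESIZE);
    volatile char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        _exit(3);

    p[0] = 1; /* the file is 0 bytes long, so this access faults */
    _exit(0); /* only reached if no signal was delivered */
}

/*
 * Runs the faulting access in a child process and returns the number
 * of the signal that killed it, 0 if it exited normally, -1 on error.
 */
int signal_from_faulting_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        touch_page_past_eof();

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Running the faulting code in a child keeps the calling program alive, which is convenient when a script needs to generate several such events in a row.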
Update:
("__builtin_trap: when to use it?" shows how to get a SIGILL, but alas it doesn't have a signal-agnostic solution.)
I'm trying to find a segmentation fault in my program that doesn't happen all the time. I'm trying to run my program in a loop in gdb until the segmentation fault happens.
My problem is that the gdb continues the while loop after receiving the seg fault and doesn't prompt me with the gdb shell.
When I run gdb I use:
set $i=0
while($i<100)
set $i = $i+1
r
end
Anybody know how to make the gdb stop at first segfault and not run 100 times??
Thanks!
The gdb documentation is huge and it's difficult to find what you want but I could make that happen, and just by tweaking your script slightly.
Upon completion, gdb sets $_exitcode to the exit code value.
If a segfault occurs, the value isn't changed. So my idea was to set it to some arbitrary sentinel value (I chose 244) before each run. If the value is still 244 after the run command, the program did not exit normally, so leave the loop (maybe there's another way to do it).
Warning: hack ahead (but that works)
set $i=0
while($i<100)
set $i = $i+1
set $_exitcode = 244
r
if $_exitcode==244
set $i = 200
end
end
I tested that with an interactive program. Type n for normal execution, and y to trigger a segfault (well, it is not guaranteed to trigger one, but there's a good chance that it happens).
#include <stdio.h>
#include <stdlib.h>
int main()
{
printf("want segfault?\n");
char c = getchar();
if (c=='y')
{
printf("%s", 'a'); // this is broken on purpose, to trigger segfault
}
return 0;
}
testing in a gdb session:
(gdb) source gdbloop.txt
[New Thread 6216.0x1d2c]
want segfault?
n
[Inferior 1 (process 6216) exited normally]
[New Thread 7008.0x1264]
want segfault?
n
[Inferior 1 (process 7008) exited normally]
[New Thread 8000.0x2754]
want segfault?
y
Breakpoint 1, 0x76b2d193 in wtoi () from C:\windows\syswow64\msvcrt.dll
(gdb)
so I get the prompt back when a segfault is triggered.
You can script GDB interaction using expect.
But the solution from this answer should really be all you need here.
break on exit didn't work for me
It's possible that your program calls _exit instead of exit, so you may need to set a breakpoint there.
It's also possible that your program executes direct SYS_exit system call without going through either exit or _exit.
On Linux, you can catch this with:
catch syscall exit
catch syscall exit_group
At least one of the four variants should fire (just run the program by hand). Once you know which variant actually fires, attach commands to the corresponding breakpoint, and use the solution above.
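To try out which breakpoint or catchpoint fires for which kind of termination, a small test program can end a child process through each of the paths mentioned above. This is a sketch assuming Linux (it uses SYS_exit_group through syscall()); the enum and function names are invented for the example.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum exit_path { VIA_EXIT, VIA__EXIT, VIA_RAW_SYSCALL };

/* Terminates the calling process through the requested path. */
static void terminate(enum exit_path path, int code)
{
    switch (path) {
    case VIA_EXIT:
        exit(code);                     /* runs atexit handlers, flushes stdio */
    case VIA__EXIT:
        _exit(code);                    /* skips the libc cleanup */
    case VIA_RAW_SYSCALL:
        syscall(SYS_exit_group, code);  /* bypasses exit() and _exit() */
    }
    _exit(127); /* not reached for the three paths above */
}

/*
 * Forks a child that terminates via `path` with `code` and returns the
 * exit status the parent observes, or -1 on error.
 */
int child_exit_status(enum exit_path path, int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        terminate(path, code);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

All three paths look identical to the parent, which is why a breakpoint on exit alone can miss the termination while the syscall catchpoints still see it.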
I am working on Pintos OS project. I get this message:
Page fault at 0xbfffefe0: not present error writing page in user context.
The problem with Pintos OS project is that it won't simply tell the line and method that caused the exception.
I know how to use breakpoints/watchpoints etc. but is there any way to step right to it without going through the WHOLE flow and ALL OS files line by line so that I could jump into line that caused exception and put breakpoint there? I looked at GDB commands but didn't find anything.
When I debug this project I have to step through the whole program until I find that error/exception which is very time consuming. There is probably a faster way to do this.
Thanks.
Whole trace:
nestilll#vdebian:~/Class/pintos/proj-3-bhling-nestilll-nsren/src/vm/build$ pintos -v -k -T 60 --qemu --gdb --filesys-size=2 -p tests/vm/pt-grow-pusha -a pt-grow-pusha --swap-size=4 -- -q -f run pt-grow-pusha
Use of literal control characters in variable names is deprecated at /home/nestilll/Class/pintos/src/utils/pintos line 909.
Prototype mismatch: sub main::SIGVTALRM () vs none at /home/nestilll/Class/pintos/src/utils/pintos line 933.
Constant subroutine SIGVTALRM redefined at /home/nestilll/Class/pintos/src/utils/pintos line 925.
warning: disabling timeout with --gdb
Copying tests/vm/pt-grow-pusha to scratch partition...
qemu -hda /tmp/N2JbACdqyV.dsk -m 4 -net none -nographic -s -S
PiLo hda1
Loading............
Kernel command line: -q -f extract run pt-grow-pusha
Pintos booting with 4,088 kB RAM...
382 pages available in kernel pool.
382 pages available in user pool.
Calibrating timer... 419,020,800 loops/s.
hda: 13,104 sectors (6 MB), model "QM00001", serial "QEMU HARDDISK"
hda1: 205 sectors (102 kB), Pintos OS kernel (20)
hda2: 4,096 sectors (2 MB), Pintos file system (21)
hda3: 98 sectors (49 kB), Pintos scratch (22)
hda4: 8,192 sectors (4 MB), Pintos swap (23)
filesys: using hda2
scratch: using hda3
swap: using hda4
Formatting file system...done.
Boot complete.
Extracting ustar archive from scratch device into file system...
Putting 'pt-grow-pusha' into the file system...
Erasing ustar archive...
Executing 'pt-grow-pusha':
(pt-grow-pusha) begin
Page fault at 0xbfffefe0: not present error writing page in user context.
pt-grow-pusha: dying due to interrupt 0x0e (#PF Page-Fault Exception).
Interrupt 0x0e (#PF Page-Fault Exception) at eip=0x804809c
cr2=bfffefe0 error=00000006
eax=bfffff8c ebx=00000000 ecx=0000000e edx=00000027
esi=00000000 edi=00000000 esp=bffff000 ebp=bfffffa8
cs=001b ds=0023 es=0023 ss=0023
pt-grow-pusha: exit(-1)
Execution of 'pt-grow-pusha' complete.
Timer: 71 ticks
Thread: 0 idle ticks, 63 kernel ticks, 8 user ticks
hda2 (filesys): 62 reads, 200 writes
hda3 (scratch): 97 reads, 2 writes
hda4 (swap): 0 reads, 0 writes
Console: 1359 characters output
Keyboard: 0 keys pressed
Exception: 1 page faults
Powering off...
To have the GDB debugger run and stop at the desired location:
gdb filename <--start debug session
br main <--set a breakpoint at the first line of the main() function
r <--run until that breakpoint is reached
br filename.c:linenumber <--set another breakpoint at the desired line of code
c <--continue until the second breakpoint is encountered
The debugger will stop at the desired location in the file, IF it ever actually gets there.
When I debug this project I have to step through the whole program
until I find what caused error/exception which is very time consuming.
There is probably a faster way to do this.
Normally what you would do is set a breakpoint just before the error. Then your program will run at full speed, without your intervention, until it reaches that point.
There are several wrinkles here.
First, sometimes it is difficult to know where to put the breakpoint. In this case I suppose I would look for the code that is printing the message, then work backward from there. Sometimes you have to stop at the failure point, examine the stack, set a new breakpoint further up, and re-run the program.
Then there is the mechanics of setting the breakpoint. One simple way is to break by function name, like break my_function. Another is to use the file name and line number, like break my_file.c:73.
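Finally, sometimes a breakpoint can be hit many times before the failure is seen. You can use ignore counts (see help ignore) or conditional breakpoints (like break my_function if variable == 27) to limit the number of stops. Note the ==: a single = in a GDB condition is an assignment, so the condition would set the variable instead of testing it.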
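Before choosing where to break, it can help to decode what the fault dump in the trace already says. The value error=00000006 is the x86 page-fault error code, whose three low bits give the kind of access. The sketch below decodes them; the struct and function names are made up for illustration and are not part of Pintos.

```c
#include <stdbool.h>
#include <stdint.h>

/* The three low bits of the x86 page-fault error code, decoded. */
struct pf_info {
    bool not_present; /* bit 0 clear: the page was not present */
    bool write;       /* bit 1 set: the access was a write */
    bool user;        /* bit 2 set: the access came from user mode */
};

struct pf_info decode_pf_error(uint32_t error_code)
{
    struct pf_info info;
    info.not_present = (error_code & 0x1u) == 0;
    info.write = (error_code & 0x2u) != 0;
    info.user = (error_code & 0x4u) != 0;
    return info;
}
```

For the value 0x6 from the trace this yields not present, write, user, which matches the message "not present error writing page in user context". The same dump also reports the faulting instruction address (eip) and the faulting data address (cr2).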
Why do we use 42 as the argument of exit when exiting the process? I am wondering whether it is some macro value (like 1 is the value of the EXIT_FAILURE macro) or whether it has some deeper meaning.
if(pid == 0) {
printf("something\n");
exit(42);
}
It is kind of clear that it doesn't matter if I use exit(1) or exit(42), but why just 42?
Any number except for 0 would have done. But 42 is the Answer to the Ultimate Question of Life, the Universe, and Everything.
Very popular among IT people...
But why did Douglas Adams pick 42?
I sat at my desk, stared into the garden and thought '42 will do'. I typed it out. End of story
Such a magic value may be used to indicate the exact exit reason to the parent process. You may treat it as a kind of minimalistic IPC. Of course, both processes must agree on the actual values and their meanings, and must not use specially reserved exit codes.
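As a minimal sketch of that idea, assuming a POSIX system, the parent can read the child's exit code with waitpid and WEXITSTATUS. The function name exit_code_seen_by_parent is made up for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child that exits immediately with `child_code` and returns
 * the exit code the parent observes, or -1 on error.
 */
int exit_code_seen_by_parent(int child_code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(child_code); /* child: report the "reason" and stop */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    /* Only meaningful if the child exited normally (not by a signal). */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Only the low 8 bits of the value reach the parent, so the agreed codes have to fit in the range 0 to 255.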
It is kind of clear that it doesn't matter if I use exit(1) or exit(42)
It actually matters a lot.
The exit code can be used by the process that launches the exiting process to know how it completed and why it failed.
The process that launches your program can inspect the exit status immediately after your program completes (in a shell it is available as the special parameter $?) to know whether it succeeded or, if it didn't, why it failed.
Let's say your program downloads a file from a remote site and stores it in a local directory. It expects to use an existing local directory and it doesn't attempt to create it if it doesn't exist. It can exit, for example, with code 37 when the remote file cannot be downloaded because the remote site returns 404 Not Found, with code 62 when it cannot download the file because the network is down (or a timeout happens) and with code 41 when the local directory does not exist.
A bash script, for example, that invokes your program can check the value of $? immediately after your program completes. If its value is 37 (remote file is not found) it must not attempt to retry because the error is permanent. On exit code 62 (network issues) it can wait a couple of seconds and try again (the error condition is transient, it could disappear after a while). On exit code 41 (local directory not found) it can create the local directory then launch your program again (a precondition was not met).
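The same decision logic can live in a C launcher instead of a bash script. The sketch below only maps the example codes 37, 62 and 41 from the text above to a next step; the enum and function names are hypothetical, and the codes themselves are just the ones chosen for this example, not any standard.

```c
/* What the launching process should do after the program finishes. */
enum next_step {
    STEP_DONE,                 /* exit code 0: success */
    STEP_GIVE_UP,              /* 37: remote file not found, permanent */
    STEP_RETRY_LATER,          /* 62: network problem, transient */
    STEP_CREATE_DIR_AND_RETRY, /* 41: local directory missing */
    STEP_UNKNOWN_FAILURE       /* any other non-zero code */
};

/* Maps an exit code of the hypothetical downloader to the next step. */
enum next_step next_step_for(int exit_code)
{
    switch (exit_code) {
    case 0:
        return STEP_DONE;
    case 37:
        return STEP_GIVE_UP;
    case 62:
        return STEP_RETRY_LATER;
    case 41:
        return STEP_CREATE_DIR_AND_RETRY;
    default:
        return STEP_UNKNOWN_FAILURE;
    }
}
```

The launcher would obtain exit_code from waitpid and WEXITSTATUS, then loop while the result is STEP_RETRY_LATER or STEP_CREATE_DIR_AND_RETRY.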