I have the following problem with my C program: there is a stack overflow somewhere. Despite compiling without optimization and with debug symbols, the program exits with this output (whether run inside or outside of gdb on Linux):
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
The only way I could detect that this actually is a stack overflow was by running the program through valgrind. Is there any way I can force the operating system to dump a call stack trace that would help me locate the problem?
Sadly, gdb does not let me easily tap into the program either.
If you allow the system to dump core files you can analyze them with gdb:
$ ulimit -c unlimited # bash command to allow cores of unlimited size
$ ./stack_overflow
Segmentation fault (core dumped)
$ gdb -c core stack_overflow
gdb> bt
#0 0x0000000000400570 in f ()
#1 0x0000000000400570 in f ()
#2 0x0000000000400570 in f ()
...
Sometimes I have seen a badly generated core file with an incorrect stack trace, but in most cases bt will yield a long chain of recursive calls to the same function.
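For reference, a minimal sketch of the kind of program that produces such a trace (the function name f and the file name stack_overflow match the transcript above; the body itself is my assumption):

/* stack_overflow.c: unbounded recursion. Each call to f adds a stack
   frame until the stack limit is exceeded and the kernel sends SIGSEGV. */
void f(void)
{
    f();
}

int main(void)
{
    f();
    return 0;
}

Compile it with gcc -g -O0 -o stack_overflow stack_overflow.c so that no optimization removes the recursion and the symbols show up in bt.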
The core file might have a different name that includes the process id; this depends on your kernel's default configuration, but it can be controlled with (run as root or with sudo):
$ sysctl kernel.core_uses_pid=1
With GCC you can try this:
-fstack-protector
Emit extra code to check for buffer overflows, such as stack smashing attacks. This is done by adding a guard variable to functions with vulnerable objects. This includes functions that call alloca, and functions with buffers larger than 8 bytes. The guards are initialized when a function is entered and then checked when the function exits. If a guard check fails, an error message is printed and the program exits.
-fstack-protector-all
Like -fstack-protector except that all functions are protected.
http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Optimize-Options.html#Optimize-Options
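As a rough illustration (this function and its input string are made up for the example, not taken from the question), a program like the following, built with gcc -fstack-protector-all -g, should abort with a *** stack smashing detected *** message when vulnerable returns, instead of silently corrupting the stack:

#include <string.h>

void vulnerable(const char *input)
{
    char buf[8];
    strcpy(buf, input); /* no bounds check: a long input overruns buf */
}

int main(void)
{
    vulnerable("far too long for an eight byte buffer");
    return 0;
}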
When a program dies with SIGSEGV, it normally dumps core on Unix. Could you load that core into the debugger and check the state of the stack?
Related
I have a big piece of code which has some network operations in it and I can't paste it here.
My problem is that when I start it with gdb, it segfaults as soon as the process starts. But when I run it without gdb, it keeps running and segfaults at some random time. What may be the reason? Is there some memory corruption for sure?
One likely reason the process immediately crashes inside GDB is that GDB disables address space layout randomization (ASLR).
You can re-enable ASLR in gdb like so:
(gdb) set disable-randomization off
(gdb) run
You can disable ASLR outside of GDB like so:
setarch x86_64 -R ./a.out ...
Or you can disable ASLR system-wide like so:
sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
Is there some memory corruption for sure?
There is a bug somewhere for sure. Whether it's memory corruption or some other bug depends on exactly how and where the process crashes, and you haven't told us any relevant details.
How can I debug a C application that does not crash when attached with gdb and run inside of gdb?
It crashes consistently when run standalone - even the same debug build!
A few of us are getting this error with a C program written for BSD/Linux, and we are compiling on macOS with OpenSSL.
app(37457,0x7000017c7000) malloc: *** mach_vm_map(size=13835058055282167808) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
ERROR: malloc(buf->length + 1) failed!
I know, not helpful.
Recompiling the application with -g -rdynamic gives the same error. OK, so now we know it isn't because of a release build, as it continues to fail.
It works when running within a gdb debugging session though!!
$ sudo gdb app
(gdb) b malloc_error_break
Function "malloc_error_break" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (malloc_error_break) pending.
(gdb) run -threads 8
Starting program: ~/code/app/app -threads 8
[New Thread 0x1903 of process 45436]
warning: unhandled dyld version (15)
And it runs for hours. Ctrl-C, then run ./app -threads 8 outside gdb, and it crashes after a second or two (a few million iterations).
Obviously there's an issue within one of the threads. But the worker functions for those threads are pretty big (a few hundred lines of code). Nothing stands out.
Note that the threads iterate through their loops about 20 million times per second.
macOS 10.12.3
Homebrew w/GNU gcc and openssl (linking to crypto)
PS: I'm not too familiar with C, especially any type of debugging. Be kind and expressive/verbose in answers. :)
One debugging technique that is sometimes overlooked is to include debug prints in the code. Of course this has its disadvantages, but it also has advantages. One thing you must keep in mind in the face of abnormal termination, though, is to make sure the printouts actually get printed. Often it's enough to print to stderr (but if that doesn't do the trick, you may need to fflush the stream explicitly).
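For instance, a small helper along these lines (the macro name is arbitrary) prints to stderr and flushes immediately, so the message is not lost in stdio buffers if the program dies right afterwards:

#include <stdio.h>

/* Debug print that flushes after every message so the output
   survives an abnormal termination. */
#define DEBUG_PRINT(...)              \
    do {                              \
        fprintf(stderr, __VA_ARGS__); \
        fflush(stderr);               \
    } while (0)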
Another trick is to stop the program before the error occurs. This requires you to know when the program is about to crash, preferably as close to the crash site as possible. You do this by using raise:
raise(SIGSTOP);
This does not terminate the program; it just suspends execution. Now you can attach gdb to it using the command gdb <program-name> <pid> (use ps to find the pid of the process). Once in gdb, you have to tell it to ignore SIGSTOP:
> handle SIGSTOP ignore
Then you can set break-points. You can also step out of the raise function using the finish command (may have to be issued multiple times to return to your code).
This technique gives the program normal behaviour up to the point where you decide to stop it; with luck, running the final part under gdb will not alter the behaviour enough to matter.
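Put together, a sketch of the stop-and-attach idea (the function name and message are mine, not part of the original answer):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Call this just before the suspected crash site; the process suspends
   itself and waits until a debugger attaches and continues it. */
void wait_for_debugger(void)
{
    fprintf(stderr, "pid %d stopped; attach with: gdb <program> %d\n",
            (int)getpid(), (int)getpid());
    fflush(stderr);
    raise(SIGSTOP);
}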
A third option is to use valgrind. Normally, when you see this kind of failure, there are errors involved that valgrind will pick up: out-of-range accesses and uses of uninitialised variables.
Many memory managers initialise memory to a known bad value to expose problems like this (e.g. Microsoft's CRT uses a range of values: 0xCD means uninitialised, 0xDD means already freed, etc.).
After each use of malloc, try memset'ing the memory to 0xCD (or some other constant value). This will allow you to identify uninitialised memory more easily with the debugger. Don't use 0x00, as this is a 'normal' value and will be harder to spot if it's wrong (it will also probably 'fix' your problem).
Something like:
void *memory = malloc(sizeof(my_object));
memset(memory, 0xCD, sizeof(my_object));
If you know the size of the blocks, you could do something similar before free (this is sometimes harder unless you know the size of your objects, or track it in some way):
memset(memory, 0xDD, sizeof(my_object));
free(memory);
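Wrapped up as helpers, that could look like the sketch below (the names are hypothetical, and the caller has to track the size passed to debug_free):

#include <stdlib.h>
#include <string.h>

/* 0xCD marks memory as allocated but uninitialised. */
void *debug_malloc(size_t size)
{
    void *p = malloc(size);
    if (p)
        memset(p, 0xCD, size);
    return p;
}

/* 0xDD marks memory as freed, making use-after-free easier to spot. */
void debug_free(void *p, size_t size)
{
    if (p)
        memset(p, 0xDD, size);
    free(p);
}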
So when glibc detects a crash, it prints a *** glibc detected *** message. It then prints a backtrace, like:
*** glibc detected *** ./odin: free(): invalid pointer: 0xbfba4444 ***
======= Backtrace: =========
/lib/tls/i686/cmov/libc.so.6(+0x6b161)[0xb75f9161]
/lib/tls/i686/cmov/libc.so.6(+0x6c9b8)[0xb75fa9b8]
/lib/tls/i686/cmov/libc.so.6(cfree+0x6d)[0xb75fda9d]
/usr/lib/libstdc++.so.6(_ZdlPv+0x1f)[0xb77da2ef]
All well and good, but in other cases when things crash, I've been calling backtrace() and then shelling out to addr2line to print the actual locations in the functions instead. But when it's a glibc crash, it quits, bypassing any signal handlers I installed.
Is there a way to hook against these glibc crashes?
That behaviour is an option of the memory functions; you can toggle it using mallopt. By the sound of it, you want to set M_CHECK_ACTION to zero to allow execution to continue, unless you want the program to exit straight away, in which case see if a value of 2 lets you do what you want.
This small program produces the normal glibc error: test1.c
This one ignores the error and carries on: test2.c
This one aborts on the error: test3.c
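The linked test files are not reproduced here, but a minimal sketch in their spirit (glibc-specific; the exact behaviour varies between glibc versions) would be:

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    /* 0 = ignore heap errors and keep going; the default prints the
       "*** glibc detected ***" message and aborts. */
    mallopt(M_CHECK_ACTION, 0);

    char *p = malloc(16);
    free(p);
    free(p); /* double free: normally fatal, now ignored */
    return 0;
}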
IIRC, glibc actually invokes abort(), so handling SIGABRT and printing a backtrace from that should give you the information you need.
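A minimal sketch of that approach, assuming backtrace and backtrace_symbols_fd from execinfo.h are available:

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

/* Dump our own backtrace when glibc abort()s after its error report.
   backtrace_symbols_fd writes directly to a file descriptor and avoids
   malloc, which may be the very thing that is corrupted here. */
static void abrt_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    (void)sig;
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(1); /* don't return into the aborting code */
}

int main(void)
{
    signal(SIGABRT, abrt_handler);
    /* ... rest of the program ... */
    return 0;
}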
However, I'd suggest trying valgrind: the message you're getting suggests that you have a memory corruption problem.
Side comment (sorry if this is redundant ;-)): core dumps are sometimes more useful than just backtraces. They can be enabled by e.g. setting ulimit -c unlimited in bash. When the program crashes, it will produce a file named core.<pid> (or just core; this depends on the system you're running, and if your system runs abrtd, core files are put into /var/cache/abrt, if I'm not mistaken). Then you can inspect the core file with gdb by running gdb -c core a.out; the gdb session will look as if the process had just crashed.
We have an embedded version of the Linux kernel running on a MIPS core. The program we have written runs a particular test suite. During one of the stress tests (it runs for about 12 hours) we get a segfault. This in turn generates a core dump.
Unfortunately the core dump is not very useful. The crash is in some system library that is dynamically linked (probably pthread or glibc). The backtrace in the core dump is not helpful because it only shows the crash point and no other callers (our user-space app is built with -g -O0, but there is still no backtrace info):
Cannot access memory at address 0x2aab1004
(gdb) bt
#0 0x2ab05d18 in ?? ()
warning: GDB can't find the start of the function at 0x2ab05d18.
GDB is unable to find the start of the function at 0x2ab05d18
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
This problem is most likely caused by an invalid program counter or
stack pointer.
However, if you think GDB should simply search farther back
from 0x2ab05d18 for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.
Another unfortunate thing is that we cannot run gdb/gdbserver: gdb/gdbserver keeps breaking on __nptl_create_event. Since the test creates threads and timers and destroys them every 5 s, it is almost impossible to sit there for a long time hitting continue.
EDIT:
Another note: backtrace and backtrace_symbols are not supported on our toolchain.
Hence:
Is there a way of trapping the segfault and generating more backtrace data (stack pointers, call stack, etc.)?
Is there a way of getting more data out of a core dump that crashed in a .so file?
Thanks.
GDB can't find the start of the function at 0x2ab05d18
What is at that address at the time of the crash?
Do info shared, and find out if there is a library that contains that address.
The most likely cause of your troubles: did you run strip libpthread.so.0 before uploading it to your target? Don't do that: GDB requires libpthread.so.0 to not be stripped. If your toolchain contains libpthread.so.0 with debug symbols (and thus too large for the target), run strip -g on it, not a full strip.
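For example:

$ strip -g libpthread.so.0 # removes only debugging symbols, keeping the symbol table GDB needs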
Update:
info shared produced Cannot access memory at address 0x2ab05d18
This means that GDB can not access the shared library list (which would then explain the missing stack trace). The most usual cause: the binary that actually produced the core does not match the binary you gave to GDB. A less common cause: your core dump was truncated (perhaps due to ulimit -c being set too low).
If all else fails, run the command under the debugger!
Just put "gdb" in front of your normal start command and enter "c" (continue) to get the process running. When the task segfaults, it will return to the interactive gdb prompt rather than dumping core. You should then be able to get more meaningful stack traces etc.
Another option is to use "truss" (strace on Linux) if it is available. This will tell you which system calls were being used at the time of the abnormal end.
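If you want the process itself to trap the segfault (useful here, since backtrace() is not in your toolchain), a handler installed with sigaction can at least report the faulting address before re-raising the signal so that you still get a core. A rough sketch (the program counter and stack pointer live in the ucontext_t argument, but their field names are architecture-specific, so they are omitted):

#include <signal.h>
#include <stdio.h>
#include <string.h>

/* fprintf is not strictly async-signal-safe, but is usually acceptable
   in a last-gasp crash reporter. */
static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    fprintf(stderr, "SIGSEGV at address %p\n", info->si_addr);
    signal(SIGSEGV, SIG_DFL); /* restore the default action... */
    raise(SIGSEGV);           /* ...and re-raise so a core is still dumped */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 42; /* deliberate fault to exercise the handler */
    return 0;
}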
Consider the following code in C:
int n;
scanf("%d",n);
it gives the error "Segmentation fault (core dumped)" when compiled with GCC on Linux Mandriva,
but the following code
int *p=NULL;
*p=8;
gives only "Segmentation fault". Why is that so?
A core dump is a file containing a dump of the state and memory of a program at the time it crashed. Since core dumps can take non-trivial amounts of disk space, there is a configurable limit on how large they can be. You can see it with ulimit -c.
Now, when you get a segmentation fault, the default action is to terminate the process and dump core. Your shell reports what has happened: if a process has terminated with a segmentation fault signal, it prints Segmentation fault, and if that process has additionally dumped core (which requires that the ulimit setting and the permissions on the directory where the core dump is to be generated allow it), it tells you so with (core dumped).
Assuming you're running both of these on the same system, with the same ulimit -c settings (which would be my first guess as to the difference you're seeing), it's possible the optimizer is "noticing" the clearly undefined behaviour in the second example and generating its own exit. You could check with objdump -x.
In the first case, 'n' could have any value: you might own the memory at that address (or not), it might be writable (or not), but it probably exists. There is no reason that n is necessarily zero.
Writing to NULL is definitely naughty and something the OS is going to notice!
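For completeness, the first snippet crashes because the garbage value of n is used as the destination address; passing the address of n is the fix:

#include <stdio.h>

int main(void)
{
    int n;
    scanf("%d", &n); /* pass the address of n, not its (garbage) value */
    printf("%d\n", n);
    return 0;
}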