Analyzing Core dump for memory leak - c

I have several core dump files created by manually killing a memory-leaking process. I'm trying to open them with GDB, but it reports "(no debugging symbols found)". As I understand it, that means the program was compiled without the -g option, which is correct, so GDB has no symbol information to work with. However, I only want to open the core dump file; I need to read it in order to find the memory leak. I could recompile the program with the -g flag, but the resulting executable would no longer be the same as the one that produced the core dump file.
When I try to do a backtrace, I get this:
#0 0x0000003c992325e5 in ?? () from /lib64/libc.so.6
#1 0x0000003c99233dc5 in abort () from /lib64/libc.so.6
#2 0x00007f961117d3f2 in PrepareDumpAreas () from /opt/mqm/lib64/libmqe_r.so
#3 0x0000000000000000 in ?? ()
That suggests GDB is for some reason unable to read the executable I provided, but that seems impossible because I'm sure the executable is the correct one. Might this be another consequence of the fact that it was not compiled for debugging?
My question is: is there another way to read core dump files? What can I do to make GDB work the way I need?
EDIT1: I also ran my own set of tests and watched whether the memory requirements of the process increased. In my environment, no leak was apparent. So it is specific to my client's environment (and, perhaps, to the message loads that my program has to handle).

Related

debugging C program with gdb

I'm trying to test a scheduler that I wrote. I schedule two processes - both are infinite while loops (just while(1) statements). When I run the program sometimes it segfaults after like ten seconds (sometimes 5 sec, sometimes 15 or more). Sometimes it doesn't segfault at all and runs as expected. I have a log file which shows me that both processes are scheduled as expected before the segfault occurs. I'm trying to debug the errors using gdb but it's not being very helpful. Here's what I got with backtrace:
#0 0x00007ffff7ff1000 in ?? ()
#1 0x000000000000002b in ?? ()
#2 0x00007ffff78b984a in new_do_write () from /lib64/libc.so.6
#3 0x000000000061e3d0 in ?? ()
#4 0x0000000000000000 in ?? ()
I don't really understand #2.
I think this may be a stack overflow related error. However, I only malloc twice in the whole process - both times when I'm setting up the two processes, I malloc a pcb block in the pcb table I wrote. Has anyone run into similar issues before? Could this be something with how I'm setting/swapping the contexts in the scheduler? Why does it segfault sometimes, and sometimes not?
You didn't say how you obtained the stack trace that you show in the question.
It is very likely that the stack trace is bogus not because the stack is corrupt, but because you've invoked GDB incorrectly, e.g. specified the wrong executable when attaching to the process or examining the core dump.
One common mistake is to build the executable with -O2 (let's call this executable E1), then rebuild it with -g (let's call this E2), and then try to analyze a core from, or a live process running, E1 while giving GDB E2 as the symbol file.
Don't do that, it doesn't work and isn't expected to work.
Since your stack seems corrupted, you're probably correct that you have a stack buffer overflow somewhere. Without the code, it's a little difficult to tell.
But this has nothing to do with your malloc calls. Overflowing dynamically allocated buffers would corrupt the heap, not the stack.
What you'll probably need to look at is local variables that aren't big enough for the data you're trying to copy into them, like:
char xyzzy[5];
strcpy (xyzzy, "this is a bad idea");
Or passing a buffer (again, most likely on the stack) to a system call that writes more data to it than the buffer has room for.
They're the most likely causes though theoretically, of course, any undefined behaviour on your part could cause this. If the solution is not evident based on this answer, you'll probably need to post the code that caused it. Try to ensure you trim it down as much as possible when you do that so that it's the shortest complete program that exhibits the bug.
Often you'll find by doing that, the problem becomes evident :-)
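As a rough illustration of the fix for the strcpy case above, here is a minimal sketch of a bounded copy. The helper name copy_bounded is hypothetical (it is not from the question or the answer); it relies only on the standard snprintf, which never writes more than the size it is given.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the original answer): copy src into dst
 * without ever writing past dst[dstsz - 1].
 * Returns 0 if the whole string fit, 1 if it had to be truncated. */
int copy_bounded(char *dst, size_t dstsz, const char *src)
{
    if (dstsz == 0)
        return 1;                   /* no room even for the terminator */

    /* snprintf writes at most dstsz bytes, including the trailing '\0',
     * and returns the length the full string would have needed. */
    int needed = snprintf(dst, dstsz, "%s", src);

    return (needed < 0 || (size_t)needed >= dstsz) ? 1 : 0;
}
```

Calling copy_bounded(xyzzy, sizeof xyzzy, "this is a bad idea") with the five-byte xyzzy from above leaves a truncated, terminated string in the buffer and reports the truncation, instead of overwriting whatever follows it on the stack.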

How to force a program compiled with '-pg' dump its stat info when it is still running?

I'm developing in C++ (g++) with a non-open-source library.
Every time I run the program, the library crashes (it double-frees some memory).
That's acceptable for my program for now, but it's bad for profiling. I use -pg to profile the program, and because of the crash no 'gmon.out' is generated, so I cannot profile it at all.
Question:
How can I profile a 'crashy' program (with gprof)?
PS. valgrind works fine for analyzing a crashy program.
regards!
There's a function you can call from your program to dump profile data (the same one that's automatically installed as an atexit handler when you link with -pg), but I don't know what it's called offhand.
The easiest thing to do is just insert an exit(0); call at a suitable point in your program. Alternatively, you can set a breakpoint and use call exit(0) in GDB (except that debugging the program will affect the profile data if you stop it in the middle).
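One way to reach that exit(0) call on demand, while the program is still running, is to let a signal request it. The following is only a sketch under assumptions: the choice of SIGUSR1 and all function names are hypothetical, and it assumes the program has a main loop where exit_if_profile_requested() can be called before the crashing code is reached. The handler only sets a flag, because calling exit() directly inside a signal handler is not async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Set by the signal handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t profile_exit_requested = 0;

static void on_profile_signal(int signo)
{
    (void)signo;
    profile_exit_requested = 1;     /* only record the request here */
}

/* Hypothetical helper: arrange for SIGUSR1 to request a clean exit.
 * Returns 0 on success, -1 on failure (as sigaction does). */
int install_profile_exit_signal(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_profile_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Returns 1 once the signal has been received, 0 before that. */
int profile_exit_pending(void)
{
    return profile_exit_requested != 0;
}

/* Call from the main loop at a point where exiting is safe. exit() runs
 * the atexit handlers, which is what writes gmon.out under -pg. */
void exit_if_profile_requested(void)
{
    if (profile_exit_pending())
        exit(0);
}
```

With this in place, sending SIGUSR1 to the process (for example with kill -USR1 and its pid) makes the next call to exit_if_profile_requested() terminate the program through the normal exit path.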

Getting better debug when Linux crashes in a C programme

We have an embedded version of the Linux kernel running on a MIPS core. The programme we have written runs a particular test suite. During one of the stress tests (which runs for about 12 hours) we get a seg fault. This in turn generates a core dump.
Unfortunately the core dump is not very useful. The crash is in some system library that is dynamically linked (probably pthread or glibc). The backtrace in the core dump is not helpful because it only shows the crash point and no other callers (our user space app is built with -g -O0, but still no back trace info):
Cannot access memory at address 0x2aab1004
(gdb) bt
#0 0x2ab05d18 in ?? ()
warning: GDB can't find the start of the function at 0x2ab05d18.
GDB is unable to find the start of the function at 0x2ab05d18
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
This problem is most likely caused by an invalid program counter or
stack pointer.
However, if you think GDB should simply search farther back
from 0x2ab05d18 for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.
Another problem is that we cannot run gdb/gdbserver: it keeps breaking on __nptl_create_event. Since the test creates threads and timers and destroys them every 5 s, it is practically impossible to sit there for hours hitting continue on each one.
EDIT:
Another note: backtrace and backtrace_symbols are not supported on our toolchain.
Hence:
Is there a way of trapping seg fault and generate more backtrace data, stack pointers, call stack, etc.?
Is there a way of getting more data from a core dump that crashed in a .so file?
Thanks.
GDB can't find the start of the function at 0x2ab05d18
What is at that address at the time of the crash?
Do info shared, and find out if there is a library that contains that address.
The most likely cause of your troubles: did you run strip libpthread.so.0 before uploading it to your target? Don't do that: GDB requires libpthread.so.0 to not be stripped. If your toolchain contains libpthread.so.0 with debug symbols (and thus too large for the target), run strip -g on it, not a full strip.
Update:
info shared produced Cannot access memory at address 0x2ab05d18
This means that GDB can not access the shared library list (which would then explain the missing stack trace). The most usual cause: the binary that actually produced the core does not match the binary you gave to GDB. A less common cause: your core dump was truncated (perhaps due to ulimit -c being set too low).
If all else fails run the command using the debugger!
Just put "gdb" in front of your normal start command and enter "c"ontinue to get the process running. When the task segfaults, it will return to the interactive gdb prompt rather than dumping core. You should then be able to get more meaningful stack traces etc.
Another option is to use "truss" if it is available (or strace on Linux). This will tell you which system calls were being used at the time of the abend.
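On the question of trapping the seg fault to get more data: a minimal sketch, assuming the target supports POSIX sigaction with SA_SIGINFO, is to install a handler that prints the faulting address before the process dies. The names format_hex and install_segv_reporter are hypothetical. The handler avoids printf and uses only write(), since most library functions are not safe to call from a signal handler. Printing the program counter or registers would need the architecture-specific ucontext layout, which is not attempted here.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Format v as "0x" followed by 16 hex digits into buf (needs 19 bytes).
 * Makes no library calls, so it can be used inside a signal handler.
 * Returns the number of characters written, excluding the '\0'. */
size_t format_hex(uintptr_t v, char *buf)
{
    static const char digits[] = "0123456789abcdef";
    size_t n = 0;
    buf[n++] = '0';
    buf[n++] = 'x';
    for (int shift = 60; shift >= 0; shift -= 4)
        buf[n++] = digits[((uint64_t)v >> shift) & 0xf];
    buf[n] = '\0';
    return n;
}

/* write() wrapper that deliberately ignores short writes and errors. */
static void put(const char *s, size_t n)
{
    ssize_t r = write(STDERR_FILENO, s, n);
    (void)r;
}

static void on_segv(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    char buf[19];
    static const char msg[] = "SIGSEGV, fault address ";
    size_t n = format_hex((uintptr_t)info->si_addr, buf);
    put(msg, sizeof msg - 1);
    put(buf, n);
    put("\n", 1);
    /* SA_RESETHAND has already restored the default action, so returning
     * re-runs the faulting instruction and the core is dumped as usual. */
}

/* Hypothetical helper: returns 0 on success, -1 on failure. */
int install_segv_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Calling install_segv_reporter() early in main() is enough; the reported address can then be compared against the process's memory map to see which library or region the bad access was aimed at.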

fatal error disappeared when running with gdb

I have a program which produces a fatal error with a testcase, and I can locate the problem by reading the log and the stack trace of the fatal - it turns out that there is a read operation upon a null pointer.
But when I try to attach gdb to it and set a breakpoint around the suspicious code, the null pointer just cannot be observed! The program works smoothly without any error.
This is a single-process, single-thread program, I didn't experience this kind of thing before. Can anyone give me some comments? Thanks.
Appended: I also tried calling the pause() syscall just before the code that triggers the fatal error, expecting the program to sleep there so that I could attach gdb to it on the fly. Sadly, no fatal error occurred.
It's only guesswork without looking at the code, but debuggers sometimes do this:
They initialize certain stuff for you
The timing of the operations is changed
I don't have a quote on GDB, but I do have one on valgrind (granted, the two do wildly different things):
My program crashes normally, but doesn't under Valgrind, or vice versa. What's happening?
When a program runs under Valgrind, its environment is slightly different to when it runs natively. For example, the memory layout is different, and the way that threads are scheduled is different.
Same would go for GDB.
Most of the time this doesn't make any difference, but it can, particularly if your program is buggy.
So the true problem is likely in your program.
There can be several things happening. The timing of the application can change. In a multi-threaded application, for example, you might set the ready flag first and only then copy the data into the buffer; without the debugger attached, the other thread might access the buffer before it is filled or before some pointer is set.
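To make the ordering problem concrete, here is a minimal sketch of the safe order: fill the buffer first, then publish the flag. The mailbox type and function names are hypothetical, and C11 atomics are an assumption (the question does not say which toolchain is used). The release store paired with the acquire load is what guarantees the reader sees the data once it sees the flag.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MSG_CAP 64

/* Hypothetical shared mailbox: one writer thread, one reader thread. */
struct mailbox {
    char data[MSG_CAP];
    atomic_int ready;               /* 0 = empty, 1 = data is valid */
};

void mailbox_init(struct mailbox *m)
{
    m->data[0] = '\0';
    atomic_init(&m->ready, 0);
}

/* Writer: fill the buffer FIRST, then publish the flag. Swapping these
 * two steps is the bug described above. */
void mailbox_publish(struct mailbox *m, const char *msg)
{
    snprintf(m->data, MSG_CAP, "%s", msg);
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Reader: returns 1 and copies the message if one is ready, else 0. */
int mailbox_try_read(struct mailbox *m, char *out, size_t outsz)
{
    if (outsz == 0 ||
        !atomic_load_explicit(&m->ready, memory_order_acquire))
        return 0;
    snprintf(out, outsz, "%s", m->data);
    return 1;
}
```

A plain int flag written before the copy can appear to work under a debugger, where the timing is different, and then fail when the program runs at full speed.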
It is also possible that the application has anti-debug functionality. Maybe the piece of code is never reached when running inside a debugger.
One way to analyze it is with a core dump. You can enable core dumps with ulimit -c unlimited, then start the application; once the core has been dumped, you can load it into gdb with gdb ./application ./core. You can find a useful write-up here: http://www.ffnn.nl/pages/articles/linux/gdb-gnu-debugger-intro.php
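If changing the shell's limit is inconvenient (for instance when the program is started by a service manager), the program can raise its own soft limit at startup. This is a sketch using the POSIX getrlimit/setrlimit calls; the function name enable_core_dumps is hypothetical. It can only raise the soft limit up to the hard limit, so if the hard limit is 0 no core will be written regardless.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* An unprivileged process may not exceed the hard limit. */
    rl.rlim_cur = rl.rlim_max;

    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling enable_core_dumps() as the first statement of main() has roughly the effect of ulimit -c being set to the hard limit for that one process.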
If it is an invalid read on a pointer, then unpredictable behaviour is possible. Since you already know what is causing the fault, you should get rid of it asap. In general, expect the unexpected when dealing with faulty pointer operations.

How to detect the point of a stack overflow

I have the following problem with my C program: there is a stack overflow somewhere. Despite compiling without optimization and with debugger symbols, the program exits with this output (within or outside of gdb on Linux):
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
The only way I could detect that this actually is stack overflow was running the program through valgrind. Is there any way I can somehow force the operating system to dump a call stack trace which would help me locate the problem?
Sadly, gdb does not allow me to easily tap into the program either.
If you allow the system to dump core files you can analyze them with gdb:
$ ulimit -c unlimited # bash command that removes the size limit on core files
$ ./stack_overflow
Segmentation fault (core dumped)
$ gdb -c core stack_overflow
(gdb) bt
#0 0x0000000000400570 in f ()
#1 0x0000000000400570 in f ()
#2 0x0000000000400570 in f ()
...
Sometimes I have seen a badly generated core file with an incorrect stack trace, but in most cases bt will show a long run of recursive calls to the same function.
The core file might have a different name, possibly including the process ID. That depends on the kernel's default configuration on your system, and can be controlled with (run as root or with sudo):
$ sysctl kernel.core_uses_pid=1
With GCC you can try this:
-fstack-protector
Emit extra code to check for buffer overflows, such as stack smashing attacks. This is done by adding a guard variable to functions with vulnerable objects. This includes functions that call alloca, and functions with buffers larger than 8 bytes. The guards are initialized when a function is entered and then checked when the function exits. If a guard check fails, an error message is printed and the program exits.
-fstack-protector-all
Like -fstack-protector except that all functions are protected.
http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Optimize-Options.html#Optimize-Options
When a program dies with SIGSEGV, it normally dumps core on Unix. Could you load that core into a debugger and check the state of the stack?

Resources