I simply try to open a file through a dialog using Gtk3. This is the code I am using, no modifications at all:
https://docs.gtk.org/gtk3/class.FileChooserNative.html#typical-usage-gtkfilechoosernative-typical-usage
The window opens fine but when I try to open the file, it segfaults. According to gdb:
Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:76
76 VPCMPEQ (%rdi), %ymm0, %ymm1
Using debuginfod of course. What am I doing wrong?
EDIT: Some backtrace from gdb:
(gdb) bt
#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:76
#1 0x00007ffff7e1b948 in __GI__IO_puts (str=0x0) at ioputs.c:35
#2 0x00005555555551d6 in main (argc=1, argv=0x7fffffffdb28) at main.c:6
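Frame #1 of this backtrace shows puts() receiving str=0x0 from main.c line 6, so the crash comes from printing a NULL string, not from the dialog itself. A minimal sketch of a guard, assuming the string comes from a call that may return NULL (the helper name print_or_placeholder is hypothetical and not part of the GTK example):

```c
#include <stdio.h>

/* Hypothetical helper: print a string that may be NULL.
 * Returns 0 if the string was printed, -1 if it was NULL. */
static int print_or_placeholder(const char *str)
{
    if (str == NULL) {
        /* puts(NULL) is undefined behavior; glibc crashes inside strlen(). */
        fputs("(no string)\n", stderr);
        return -1;
    }
    puts(str);
    return 0;
}
```

Checking the value before printing also shows which earlier call handed back NULL.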
Related
Sample errors in the core dump files:
1289 vfprintf-internal.c: No such file or directory.
111 printf-parse.h: No such file or directory.
948 libioP.h: No such file or directory.
948 libioP.h: No such file or directory.
I'm working on a fast_malloc() implementation, but getting segmentation faults for unknown reasons once I override malloc() and free() with my own implementations, but NOT before that (meaning, if I call fast_malloc() it's fine, but if I want to be able to call malloc() to get my implementation, it seems to be broken).
Why the segfault?
Sample output, before ANYTHING can be printed, including the print statement at the start of main(), and some debug prints inside my fast_malloc():
Segmentation fault (core dumped)
I have turned on core dumps as I explain here.
So, gdb path/to/my/executable core shows some of the following core file info. Note that each run may result in a different statement for what file is missing in "No such file or directory."
One run:
Reading symbols from build/fast_malloc_unit_tests...
warning: core file may not match specified executable file.
[New LWP 1257155]
Core was generated by `build/fast_malloc_unit_tests'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fd50fc7ba01 in __vfprintf_internal (s=0x7fd50fdee6a0 <_IO_2_1_stdout_>,
format=0x5622fd1b8010 "DEBUG: %s():\n", ap=ap@entry=0x7ffec28300a0,
mode_flags=mode_flags@entry=0) at vfprintf-internal.c:1289
1289 vfprintf-internal.c: No such file or directory.
(gdb) bt
#0 0x00007fd50fc7ba01 in __vfprintf_internal (s=0x7fd50fdee6a0 <_IO_2_1_stdout_>,
format=0x5622fd1b8010 "DEBUG: %s():\n", ap=ap@entry=0x7ffec28300a0,
mode_flags=mode_flags@entry=0) at vfprintf-internal.c:1289
#1 0x00007fd50fc66ebf in __printf (format=<optimized out>) at printf.c:33
#2 0x00005622fd1b53eb in fast_malloc (num_bytes=1024) at src/fast_malloc.c:225
#3 0x00005622fd1b5b66 in malloc (num_bytes=1024) at src/fast_malloc.c:496
#4 0x00007fd50fc86e84 in __GI__IO_file_doallocate (fp=0x7fd50fdee6a0 <_IO_2_1_stdout_>)
at filedoalloc.c:101
#5 0x00007fd50fc97050 in __GI__IO_doallocbuf (fp=fp@entry=0x7fd50fdee6a0 <_IO_2_1_stdout_>)
at libioP.h:948
#6 0x00007fd50fc960b0 in _IO_new_file_overflow (f=0x7fd50fdee6a0 <_IO_2_1_stdout_>, ch=-1)
at fileops.c:745
#7 0x00007fd50fc94835 in _IO_new_file_xsputn (n=7, data=<optimized out>, f=<optimized out>)
at libioP.h:948
#8 _IO_new_file_xsputn (f=0x7fd50fdee6a0 <_IO_2_1_stdout_>, data=<optimized out>, n=7)
at fileops.c:1197
#9 0x00007fd50fc7baf2 in __vfprintf_internal (s=0x7fd50fdee6a0 <_IO_2_1_stdout_>,
format=0x5622fd1b8010 "DEBUG: %s():\n", ap=ap@entry=0x7ffec28308e0,
mode_flags=mode_flags@entry=0) at ../libio/libioP.h:948
#10 0x00007fd50fc66ebf in __printf (format=<optimized out>) at printf.c:33
#11 0x00005622fd1b53eb in fast_malloc (num_bytes=1024) at src/fast_malloc.c:225
#12 0x00005622fd1b5b66 in malloc (num_bytes=1024) at src/fast_malloc.c:496
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) q
Another one:
Reading symbols from build/fast_malloc_unit_tests...
warning: core file may not match specified executable file.
[New LWP 1257787]
Core was generated by `build/fast_malloc_unit_tests'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f20b0bbba80 in __find_specmb (
format=0x5644c516d108 "DEBUG: block_map_i = %zu (num_bytes requested to allocate = %zu; smallest user block size large enough = %zu)\n") at printf-parse.h:111
111 printf-parse.h: No such file or directory.
(gdb) bt
#0 0x00007f20b0bbba80 in __find_specmb (
format=0x5644c516d108 "DEBUG: block_map_i = %zu (num_bytes requested to allocate = %zu; smallest user block size large enough = %zu)\n") at printf-parse.h:111
#1 __vfprintf_internal (s=0x7f20b0d2e6a0 <_IO_2_1_stdout_>,
format=0x5644c516d108 "DEBUG: block_map_i = %zu (num_bytes requested to allocate = %zu; smallest user block size large enough = %zu)\n", ap=ap@entry=0x7ffe7f6ea580, mode_flags=mode_flags@entry=0)
at vfprintf-internal.c:1365
#2 0x00007f20b0ba6ebf in __printf (format=<optimized out>) at printf.c:33
#3 0x00005644c516a47d in fast_malloc (num_bytes=1024) at src/fast_malloc.c:244
#4 0x00005644c516ab4e in malloc (num_bytes=1024) at src/fast_malloc.c:496
#5 0x00007f20b0bc6e84 in __GI__IO_file_doallocate (fp=0x7f20b0d2e6a0 <_IO_2_1_stdout_>)
at filedoalloc.c:101
#6 0x00007f20b0bd7050 in __GI__IO_doallocbuf (fp=fp@entry=0x7f20b0d2e6a0 <_IO_2_1_stdout_>)
at libioP.h:948
#7 0x00007f20b0bd60b0 in _IO_new_file_overflow (f=0x7f20b0d2e6a0 <_IO_2_1_stdout_>, ch=-1)
at fileops.c:745
#8 0x00007f20b0bd4835 in _IO_new_file_xsputn (n=23, data=<optimized out>, f=<optimized out>)
at libioP.h:948
#9 _IO_new_file_xsputn (f=0x7f20b0d2e6a0 <_IO_2_1_stdout_>, data=<optimized out>, n=23)
at fileops.c:1197
#10 0x00007f20b0bbbaf2 in __vfprintf_internal (s=0x7f20b0d2e6a0 <_IO_2_1_stdout_>,
format=0x5644c516d108 "DEBUG: block_map_i = %zu (num_bytes requested to allocate = %zu; smallest--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) q
another:
Reading symbols from build/fast_malloc_unit_tests...
warning: core file may not match specified executable file.
[New LWP 1258037]
Core was generated by `build/fast_malloc_unit_tests'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f901ef65e4d in __GI__IO_file_doallocate (fp=0x7f901f0cd6a0 <_IO_2_1_stdout_>)
at libioP.h:948
948 libioP.h: No such file or directory.
(gdb) q
another
Reading symbols from build/fast_malloc_unit_tests...
warning: core file may not match specified executable file.
[New LWP 1258336]
Core was generated by `build/fast_malloc_unit_tests'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f5e4b551a80 in __find_specmb (
format=0x562fac6d7108 "DEBUG: block_map_i = %zu (num_bytes requested to allocate = %zu; smallest user block size large enough = %zu)\n") at printf-parse.h:111
111 printf-parse.h: No such file or directory.
(gdb) q
My gcc build options at the moment:
-Wall -Wextra -Werror -O0 -ggdb -std=c11 -save-temps=obj -DDEBUG
Possibly related to this DEBUG_PRINTF() macro I have, which I call inside fast_malloc().
#ifdef DEBUG
/// Debug printf function.
/// See: https://stackoverflow.com/a/1941336/4561887
#define DEBUG_PRINTF(...) printf("DEBUG: "__VA_ARGS__)
#else
#define DEBUG_PRINTF(...) \
do \
{ \
} while (0)
#endif
Why is malloc() getting called before the program starts anyway? I don't call it anywhere. But, notice you can see malloc() getting called with 1024 bytes as visible in the stack traces in runs 1 and 2 (though it happens every run, those are the ones I have pasted enough you can see it in).
My malloc() and free() overrides look like this:
inline void* malloc(size_t num_bytes)
{
return fast_malloc(num_bytes);
}
inline void free(void* ptr)
{
fast_free(ptr);
}
Is my single-threaded program where malloc() is mysteriously getting called without me calling it somehow multi-threaded at startup? Does some weird program initialization stuff take place? My fast_malloc() implementation is currently NOT thread safe, so if Linux is doing some weird multi-threaded malloc() calls during some kind of program initialization or something, that could be the cause of the corruption, as again, fast_malloc(), which overrides malloc(), is NOT yet threadsafe.
It seems to be related to printing inside malloc(). Is printing inside malloc() forbidden?
Here is the bottom (first call is at very bottom) of a recent stack trace from a core dump:
#127471 0x00005626d43dca28 in malloc (num_bytes=1024) at src/fast_malloc.c:494
#127472 0x00007faa222a7e84 in __GI__IO_file_doallocate (fp=0x7faa2240f6a0 <_IO_2_1_stdout_>) at filedoalloc.c:101
#127473 0x00007faa222b8050 in __GI__IO_doallocbuf (fp=fp@entry=0x7faa2240f6a0 <_IO_2_1_stdout_>) at libioP.h:948
#127474 0x00007faa222b70b0 in _IO_new_file_overflow (f=0x7faa2240f6a0 <_IO_2_1_stdout_>, ch=-1) at fileops.c:745
#127475 0x00007faa222b5835 in _IO_new_file_xsputn (n=13, data=<optimized out>, f=<optimized out>) at libioP.h:948
#127476 _IO_new_file_xsputn (f=0x7faa2240f6a0 <_IO_2_1_stdout_>, data=<optimized out>, n=13) at fileops.c:1197
#127477 0x00007faa222aa678 in __GI__IO_puts (str=0x5626d43df227 '=' <repeats 13 times>) at libioP.h:948
#127478 0x00005626d43dca28 in malloc (num_bytes=1024) at src/fast_malloc.c:494
#127479 0x00007faa222a7e84 in __GI__IO_file_doallocate (fp=0x7faa2240f6a0 <_IO_2_1_stdout_>) at filedoalloc.c:101
#127480 0x00007faa222b8050 in __GI__IO_doallocbuf (fp=fp@entry=0x7faa2240f6a0 <_IO_2_1_stdout_>) at libioP.h:948
#127481 0x00007faa222b70b0 in _IO_new_file_overflow (f=0x7faa2240f6a0 <_IO_2_1_stdout_>, ch=-1) at fileops.c:745
#127482 0x00007faa222b5835 in _IO_new_file_xsputn (n=13, data=<optimized out>, f=<optimized out>) at libioP.h:948
#127483 _IO_new_file_xsputn (f=0x7faa2240f6a0 <_IO_2_1_stdout_>, data=<optimized out>, n=13) at fileops.c:1197
#127484 0x00007faa222aa678 in __GI__IO_puts (str=0x5626d43df227 '=' <repeats 13 times>) at libioP.h:948
#127485 0x00005626d43dca28 in malloc (num_bytes=1024) at src/fast_malloc.c:494
#127486 0x00007faa222a7e84 in __GI__IO_file_doallocate (fp=0x7faa2240f6a0 <_IO_2_1_stdout_>) at filedoalloc.c:101
#127487 0x00007faa222b8050 in __GI__IO_doallocbuf (fp=fp@entry=0x7faa2240f6a0 <_IO_2_1_stdout_>) at libioP.h:948
#127488 0x00007faa222b70b0 in _IO_new_file_overflow (f=0x7faa2240f6a0 <_IO_2_1_stdout_>, ch=-1) at fileops.c:745
#127489 0x00007faa222b5835 in _IO_new_file_xsputn (n=49, data=<optimized out>, f=<optimized out>) at libioP.h:948
#127490 _IO_new_file_xsputn (f=0x7faa2240f6a0 <_IO_2_1_stdout_>, data=<optimized out>, n=49) at fileops.c:1197
#127491 0x00007faa222aa678 in __GI__IO_puts (str=0x5626d43df238 "Running UNIT tests for the \"fast_malloc\" module.\n") at libioP.h:948
#127492 0x00005626d43dca98 in main () at src/fast_malloc_unit_tests.c:35
(gdb)
What are __GI__IO_puts and _IO_new_file_xsputn and those other function calls as you move up? Are they calls in other threads? Are they calling malloc() behind-the-scenes? It appears __GI__IO_file_doallocate is...
You are calling printf within your malloc implementation. That is not going to end well.
In the stack trace, you can clearly see that printf itself calls malloc.
If your malloc is not prepared to be called while it is in the middle of manipulating its data structures, it will crash (possibly that's what's happening here).
Alternatively, you can also end up with infinite recursion, when malloc calls printf, which calls malloc, which calls printf, etc.
TL;DR: when implementing something as low level as malloc, you must stick to either low-level functions which don't themselves allocate anything, or to direct system calls.
Why is malloc() getting called before the program starts anyway?
Because low-level functions, e.g. in the dynamic loader, need to allocate memory during their own initialization.
Your malloc must work very early in the process lifetime; long before main.
Is printing inside malloc() forbidden?
Everything that might allocate memory is forbidden.
In practice, you should call only async-signal-safe routines, because routines that are not async-signal-safe may allocate, if not now then in the future.
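As an illustration of that advice, here is a minimal sketch of a debug print that uses only write(2) and a stack buffer, so it never reaches malloc(). The helper names format_size and debug_write_size are hypothetical; this is one possible shape, not the only way to do it:

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Format an unsigned value as decimal into buf (capacity cap, including
 * the terminating NUL). Returns the number of digits written, or 0 if
 * the buffer is too small. No heap allocation, no stdio. */
static size_t format_size(size_t value, char *buf, size_t cap)
{
    char tmp[32];               /* enough for a 64-bit value in decimal */
    size_t n = 0;

    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    if (n + 1 > cap)
        return 0;

    /* Digits were produced least-significant first; reverse them. */
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}

/* Write "<label><value>\n" to stderr using only write(2). */
static void debug_write_size(const char *label, size_t value)
{
    char num[32];
    size_t len = format_size(value, num, sizeof num);

    if (write(STDERR_FILENO, label, strlen(label)) < 0)
        return;
    if (write(STDERR_FILENO, num, len) < 0)
        return;
    if (write(STDERR_FILENO, "\n", 1) < 0)
        return;
}
```

A call such as debug_write_size("num_bytes = ", num_bytes) inside the allocator would then replace a printf() call of the same content.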
To follow up and answer my own question: @Employed Russian's answer appears to be correct.
To be more-specific: I have two main problems:
Infinite recursion between malloc() and printf().
Data corruption by freeing and reusing memory the system thinks it has exclusive access to.
The 1st problem: infinite recursion
I call printf() to do some debug prints inside my fast_malloc() implementation. So long as I do NOT override malloc() with my fast_malloc(), this is fine (so long as I protect the print with a mutex to make it multi-threaded-safe). BUT, once I do override malloc() with my fast_malloc(), this is NOT fine, because printf() calls malloc() to create a buffer into which it can place formatted string data. So, once malloc() becomes overridden by fast_malloc(), we end up with infinite recursion: prior to main() even being run, the system calls malloc() to prepare some things. This calls printf(), which calls malloc(), which calls printf()...forever until stack overflow...all before it has even entered my main() function.
So, I see zero of my prints, and main() doesn't even get entered. You can see from the last stack trace I posted in my question that I had 127492 stack frames on my stack at the time of the crash, at which point the stack overflowed. Sanity check: for a stack size of ~7.4 MB, that equates to about 7400000/127492 = ~58 bytes per stack frame, which seems reasonable.
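The malloc() to printf() to malloc() cycle can be broken with a re-entrancy guard. A minimal sketch, assuming a single-threaded program: a depth counter records that the allocator is already active, and the debug output is skipped on re-entry. Every name here is a hypothetical stand-in; toy_alloc is a trivial bump allocator (no alignment handling, no free), not fast_malloc(), and debug_print_that_allocates stands for printf() asking for its stdout buffer.

```c
#include <stddef.h>

static unsigned char pool[4096];   /* toy backing store */
static size_t pool_used;
static int alloc_depth;            /* > 0 while inside toy_alloc() */
static int nested_calls_seen;      /* number of re-entrant calls observed */

static void *toy_alloc(size_t num_bytes);

/* Stand-in for a debug print that itself needs a buffer, the way
 * printf() allocates the stdout buffer on first use. */
static void debug_print_that_allocates(void)
{
    (void)toy_alloc(64);
}

static void *toy_alloc(size_t num_bytes)
{
    void *result = NULL;

    alloc_depth++;
    if (alloc_depth == 1) {
        /* Outermost call: debug output is allowed. */
        debug_print_that_allocates();
    } else {
        /* Re-entered from the debug print: do not print again. */
        nested_calls_seen++;
    }

    if (num_bytes <= sizeof pool - pool_used) {
        result = &pool[pool_used];
        pool_used += num_bytes;
    }
    alloc_depth--;
    return result;
}
```

The guard only stops the recursion. The allocator's data structures must still be in a consistent state at the moment it calls out, because the nested call allocates from the same pool.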
The 2nd problem: I'm freeing and reusing memory that the system (glibc) thinks it has safely acquired and still controls
The code I'm running is my fast_malloc_unit_tests.c program, which, among other things, re-initializes the memory pools I'm using under-the-hood many times. Each time it does this, it considers prior-allocated memory to be freed, and it reallocates it when needed. BUT, printf() and other system calls run prior to main() even being entered have already called malloc() and think they still own this memory. So, we end up with me mistakenly reusing the memory they are using, causing data corruption and crashes.
After disabling all prints inside my malloc() implementation, thereby removing the infinite recursion problem, I was able to see this behavior. In this case, the code did enter my main() function, I did see up to a few dozen of my prints before the crash, and there were only 2 calls (stack frames) on my stack at the time of the crash (rather than 127492 frames). They were:
#0 0x000055555555589d in fast_malloc_print_stats () at src/fast_malloc.c:464
#1 0x0000555555556228 in main () at src/fast_malloc_unit_tests.c:129
Full output:
Program received signal SIGSEGV, Segmentation fault.
0x000055555555589d in fast_malloc_print_stats () at src/fast_malloc.c:464
464 block = block->next_free_block;
(gdb) bt
#0 0x000055555555589d in fast_malloc_print_stats () at src/fast_malloc.c:464
#1 0x0000555555556228 in main () at src/fast_malloc_unit_tests.c:129
where fast_malloc.c line 464 contains:
while (block != NULL)
{
free_block_cnt_walked++;
block = block->next_free_block; <==== line 464
}
which as far as I can tell has nothing wrong whatsoever, as it's a simple copy and block was already guaranteed NOT to be NULL, so calling block->next_free_block couldn't possibly be dereferencing a NULL ptr. I think the segmentation fault must therefore be due to corrupted memory because that memory is being double-used, so the block ptr probably is a corrupted address which is outside the valid bounds for us to read--hence the seg fault.
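One way to turn this kind of corruption into a reportable error instead of a segfault is to validate each pointer before following it. A minimal sketch, assuming all free blocks live inside one contiguous pool; block_t, walk_free_list and its parameters are hypothetical, and only the next_free_block member name is taken from the loop above:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct block {
    struct block *next_free_block;
} block_t;

/* Walk a free list whose nodes must all lie inside [pool, pool + pool_size).
 * Returns the number of blocks walked, or -1 if a pointer leaves the pool
 * or the list is longer than max_blocks (a likely sign of corruption). */
static long walk_free_list(const block_t *head, const void *pool,
                           size_t pool_size, long max_blocks)
{
    uintptr_t lo = (uintptr_t)pool;
    uintptr_t hi = lo + pool_size;
    long count = 0;

    for (const block_t *block = head; block != NULL;
         block = block->next_free_block) {
        uintptr_t addr = (uintptr_t)block;
        if (addr < lo || addr > hi - sizeof(block_t))
            return -1;          /* pointer outside the pool */
        if (++count > max_blocks)
            return -1;          /* cycle or runaway list */
    }
    return count;
}
```

The check does not detect every overwrite (a corrupted pointer may still land inside the pool), but it catches wild addresses before they are dereferenced.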
That's it (I think). Now I've got to go do proper fixes and continue work on this. Big thanks goes out to @Employed Russian!
See also:
[my answer: a safe_printf() function which never calls malloc(), thereby solving the infinite recursion problem!] Which print calls in C do NOT ever call malloc() under the hood?
/var/log/message:
segfault at 0 ip 00007fcd16e5853a sp 00007ffd98e37e58 error 4 in libc-2.24.so[7fcd16dc9000+195000]
addr2line -e a.out 00007fcd16e5853a
??:0
gdb bt
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fcd16e5853a in ?? ()
(gdb) bt
#0 0x00007fcd16e5853a in ?? ()
#1 0x000055f2f45fe95b in ?? ()
#2 0x000055f200000080 in ?? ()
#3 0x00007fcd068c2040 in ?? ()
#4 0x000055f2f6109c48 in ?? ()
#5 0x0000000000000000 in ?? ()
build with gcc -Wall -O0 -g
How can I debug this, are there more methods?
gdb bt
Surely that is not the command you actually executed.
Most likely you did something like this:
gdb /path/to/core
(gdb) bt
Don't do that. Do this instead:
gdb /path/to/a.out /path/to/core
(gdb) bt
If you already did invoke GDB correctly, other likely reasons why bt did not work:
You are analyzing the core on a different machine from the one on which it was produced. See this answer.
You rebuilt a.out with different flags. Use the exact binary that crashed.
You have updated libc after the core was produced. Restore it to the version that was current as of when the core was produced.
P.S. This command
addr2line -e a.out 00007fcd16e5853a
makes no sense: the error message told you that the address 00007fcd16e5853a is in libc-2.24.so. The a.out has nothing to do with that address.
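The command you want runs addr2line on libc itself, with the offset of the faulting instruction inside the library. That offset is the ip minus the mapping base (0x7fcd16e5853a - 0x7fcd16dc9000 = 0x8f53a); the 195000 in the message is the size of the mapping, not an offset:
addr2line -fe /path/to/libc-2.24.so 0x8f53a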
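The address to hand to addr2line can be derived from the kernel's segfault line, which has the form "ip <ip> ... in <library>[<base>+<size>]". A small sketch of the arithmetic, assuming the mapping starts at file offset 0 of the library and the library is linked at address 0 (usual for a shared object, but an assumption here); the function names are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

/* Offset of a faulting instruction inside its mapping: ip minus the
 * mapping base reported in the kernel's segfault line. */
static uint64_t offset_in_mapping(uint64_t ip, uint64_t base)
{
    return ip - base;
}

/* Print the offset in the hex form that addr2line accepts. */
static void print_addr2line_offset(uint64_t ip, uint64_t base)
{
    printf("0x%llx\n", (unsigned long long)offset_in_mapping(ip, base));
}
```

For the message quoted above, the inputs are ip 0x7fcd16e5853a and base 0x7fcd16dc9000. The result must be smaller than the mapping size (0x195000), which is a quick sanity check on the subtraction.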
P.P.S.
segfault at 0 ip 00007fcd16e5853a ...
This means: NULL pointer dereference inside libc. The most probable cause: not checking for error return, e.g. something like:
FILE *fp = fopen("/some/file", "r");
fgets(buffer, sizeof(buffer), fp); // Oops: fp was never checked for NULL.
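For contrast, a minimal sketch of the same read with the error return checked. The helper name read_first_line is hypothetical:

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of path into buffer (capacity size).
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
static int read_first_line(const char *path, char *buffer, size_t size)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        perror(path);           /* report why the open failed */
        return -1;
    }

    int rc = (fgets(buffer, (int)size, fp) != NULL) ? 0 : -1;
    fclose(fp);
    if (rc == 0)
        buffer[strcspn(buffer, "\n")] = '\0';   /* drop trailing newline */
    return rc;
}
```

With the check in place, a missing file produces an error message and a -1 return instead of a NULL FILE pointer being passed into libc.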
Suppose I have:
#include <stdlib.h>
int main()
{
int a = 2, b = 3;
if (a!=b)
abort();
}
Compiled with:
gcc -g c.c
Running this, I'll get a coredump (due to the SIGABRT raised by abort()), which I can debug with:
gdb a.out core
How can I get gdb to print the values of a and b from this context?
Here is another way to get the values of a and b: move to the frame you are interested in, then info locals prints its local variables.
In the session below, a.out was compiled from your code, and frame 2 (main()) is the frame of interest.
$ gdb ./a.out core
[ removed some not-so-interesting info here ]
Reading symbols from ./a.out...done.
[New LWP 14732]
Core was generated by `./a.out'.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fac16269f5d in __GI_abort () at abort.c:90
#2 0x00005592862f266d in main () at f.c:7
(gdb) frame 2
#2 0x00005592862f266d in main () at f.c:7
7 abort();
(gdb) info locals
a = 2
b = 3
(gdb) q
You can also use print once you are in frame 2:
(gdb) print a
$1 = 2
(gdb) print b
$2 = 3
Did you compile with debug symbols (-g)? The command is bt for a backtrace; you can also use bt full, which additionally prints the local variables of each frame.
More info: https://sourceware.org/gdb/onlinedocs/gdb/Backtrace.html
I'm trying to invoke gdb with a stripped executable and a separate debug symbols file, on a core dump generated from running the stripped executable.
But when I use the separate debug symbols file, gdb is unable to give information on local variables for me.
Here is a log showing entirely how I produce my 3 ELF files and the core file and then run them through gdb 3 times.
First I just run gdb with the stripped executable and of course can't see any file names or line numbers, and can't inspect variables.
Then I run gdb using the stripped executable and grabbing the debug symbols from the original unstripped executable. This works pretty well but does give a disturbing and apparently unwarranted warning about the core and executable possibly mismatching.
Finally I run gdb with the stripped executable and the separate debug file. This still gives filenames and line numbers, but I can't inspect local variables and I get a "can't compute CFA for this frame" error.
Here is the log:
2016-09-16 16:01:45 barry@somehost ~/proj/segfault/segfault
$ cat segfault.c
#include <stdio.h>
int main(int argc, char **argv) {
char *badpointer = (char *)0x2398723;
printf("badpointer: %s\n", badpointer);
return 0;
}
2016-09-16 16:03:31 barry@somehost ~/proj/segfault/segfault
$ gcc -g -o segfault segfault.c
2016-09-16 16:03:37 barry@somehost ~/proj/segfault/segfault
$ objcopy --strip-debug segfault segfault.stripped
2016-09-16 16:03:40 barry@somehost ~/proj/segfault/segfault
$ objcopy --only-keep-debug segfault segfault.debug
2016-09-16 16:03:43 barry@somehost ~/proj/segfault/segfault
$ ./segfault.stripped
Segmentation fault (core dumped)
2016-09-16 16:03:48 barry@somehost ~/proj/segfault/segfault
$ ll /tmp/core.segfault.stripp.11
-rw------- 1 barry bsm-it 188416 2016-09-16 16:03 /tmp/core.segfault.stripp.11
2016-09-16 16:03:51 barry@somehost ~/proj/segfault/segfault
$ gdb ./segfault.stripped /tmp/core.segfault.stripp.11
GNU gdb (GDB) Fedora (7.0.1-50.fc12)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/barry/proj/segfault/segfault/segfault.stripped...(no debugging symbols found)...done.
warning: core file may not match specified executable file.
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/a6/8dce9115a92508af92ac4ccac24b9f0cc34d71
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./segfault.stripped'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000035fec47cb7 in vfprintf () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.11.2-3.x86_64
(gdb) bt
#0 0x00000035fec47cb7 in vfprintf () from /lib64/libc.so.6
#1 0x00000035fec4ec4a in printf () from /lib64/libc.so.6
#2 0x00000000004004f4 in main ()
(gdb) up
#1 0x00000035fec4ec4a in printf () from /lib64/libc.so.6
(gdb) up
#2 0x00000000004004f4 in main ()
(gdb) p argc
No symbol table is loaded. Use the "file" command.
(gdb) q
2016-09-16 16:04:19 barry@somehost ~/proj/segfault/segfault
$ gdb -q -e ./segfault.stripped -s ./segfault -c /tmp/core.segfault.stripp.11
Reading symbols from /home/barry/proj/segfault/segfault/segfault...done.
warning: core file may not match specified executable file.
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/a6/8dce9115a92508af92ac4ccac24b9f0cc34d71
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./segfault.stripped'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000035fec47cb7 in vfprintf () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.11.2-3.x86_64
(gdb) bt
#0 0x00000035fec47cb7 in vfprintf () from /lib64/libc.so.6
#1 0x00000035fec4ec4a in printf () from /lib64/libc.so.6
#2 0x00000000004004f4 in main (argc=1, argv=0x7fffd1c0a728) at segfault.c:4
(gdb) up
#1 0x00000035fec4ec4a in printf () from /lib64/libc.so.6
(gdb) up
#2 0x00000000004004f4 in main (argc=1, argv=0x7fffd1c0a728) at segfault.c:4
4 printf("badpointer: %s\n", badpointer);
(gdb) p argc
$1 = 1
(gdb) q
2016-09-16 16:04:39 barry@somehost ~/proj/segfault/segfault
$ gdb -q -e ./segfault.stripped -s ./segfault.debug -c /tmp/core.segfault.stripp.11
Reading symbols from /home/barry/proj/segfault/segfault/segfault.debug...done.
warning: core file may not match specified executable file.
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/a6/8dce9115a92508af92ac4ccac24b9f0cc34d71
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./segfault.stripped'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000035fec47cb7 in vfprintf () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.11.2-3.x86_64
(gdb) bt
#0 0x00000035fec47cb7 in vfprintf () from /lib64/libc.so.6
#1 0x00000035fec4ec4a in printf () from /lib64/libc.so.6
#2 0x00000000004004f4 in main (argc=can't compute CFA for this frame
) at segfault.c:4
(gdb) up
#1 0x00000035fec4ec4a in printf () from /lib64/libc.so.6
(gdb) up
#2 0x00000000004004f4 in main (argc=can't compute CFA for this frame
) at segfault.c:4
4 printf("badpointer: %s\n", badpointer);
(gdb) p argc
can't compute CFA for this frame
(gdb) q
I have some questions about this:
Why does it display the warning "warning: core file may not match specified executable file.", even though I'm using the exact same executable path as was used when the core dump was originally generated?
Why does using the separate debug symbols (-s ./segfault.debug) result in the error "can't compute CFA for this frame" when attempting to inspect local variables?
What is a CFA anyway?
Am I using an incorrect method to produce the debug symbol file?
I confirmed that using "objcopy --strip-debug" gives the same result as "strip -g".
Am I using the right options to feed the debug info into gdb?
My intention is that the stripped executables will be installed on a binary-compatible production system and any core dumps generated due to segfaults can be copied back to the devel system where we can feed them into gdb with the debug info and analyse the crash position and stack variables. But as a first step I'm trying to sort out the issues with using separate debug info files on the devel system.
It seems that using a separate debug symbols file causes the "can't compute CFA for this frame" error, even when a core file is not used.
My gcc version:
2016-09-16 16:07:39 barry@somehost ~/proj/segfault/segfault
$ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC)
I suspect that gdb might be looking for symbols related to the variables in the segfault.debug file when objcopy actually only put them in the segfault.stripped file. If this is the case, perhaps some small adjustment to the options to objcopy could put those symbols in the place gdb is looking?
I commend you for wanting to keep a set of symbol files for everything that is deployed to the production server; in my opinion this is an often overlooked practice, but you will not regret it -- one day it will save you a lot of debugging trouble.
I have had similar issues in the past, so I will try to answer some of your questions. Your toolchain is quite old, if you don't mind me saying so, so I'm not sure how much of this still applies, but here it is anyway.
CFA = Canonical Frame Address. This is the base pointer to the stack frame that every local variable is addressed relative to. If you have done some traditional x86 assembly programming, the BP register was used for this. So "can't compute CFA for this frame" basically says "I know of these local variables, but I don't know where they are located on the stack".
There used to be code in GDB that worked only for the DWARF-2 debugging format, and non-conformance triggered this particular error at least. That restriction was lifted some time ago, but that change won't be in your version.
The other thing is that debug information describing how variables are moved around is not always generated. This mostly affects newer compilers, as they get better at optimizing.
I was able to get rid of my problems by compiling like this:
gcc -g3 -gdwarf-2 -fvar-tracking -fvar-tracking-assignments -o segfault segfault.c
You can try this to see if it solves your problem, too.
Regarding the message about the location of the symbol file; it seems that the debugger wants to load it from the system directory. Maybe you have to link the executable to the symbol file with:
objcopy --add-gnu-debuglink=segfault.debug segfault
I found this question while searching for an answer to the following part of the original question:
Why does it display the warning "warning: core file may not match
specified executable file.", even though I'm using the exact same
executable path as was used when the core dump was originally
generated?
There was not an answer to this particular question but through experimentation and research I believe I have found the answer.
Below is a transcript of using gdb to debug a core file. Notice that the "warning: core file may not match specified executable file." message appears when the name of the executable that produced the core is longer than 15 characters.
[~/t]$cat do_abort.c
#include <stdlib.h>
int func4(int f) { if(f) {abort();} return 0;}
int func3(int f) { return func4(f); }
int func2(int f) { return func3(f); }
int func1(int f) { return func2(f); }
int main(void) { return func1(1); }
[~/t]$gcc -g -o 123456789012345 do_abort.c
[~/t]$./123456789012345
Aborted (core dumped)
[~/t]$ll core*
-rw-------. 1 dev wheel 240K Apr 22 03:19 core.42697
[~/t]$gdb -q -c core.42697 123456789012345
Reading symbols from /home/dev/t/123456789012345...done.
[New LWP 42697]
Core was generated by `./123456789012345'.
Program terminated with signal 6, Aborted.
#0 0x00007f0be67631d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x00007f0be67631d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f0be67648c8 in __GI_abort () at abort.c:90
#2 0x0000000000400543 in func4 (f=1) at do_abort.c:3
#3 0x000000000040055f in func3 (f=1) at do_abort.c:4
#4 0x0000000000400576 in func2 (f=1) at do_abort.c:5
#5 0x000000000040058d in func1 (f=1) at do_abort.c:6
#6 0x000000000040059d in main () at do_abort.c:7
(gdb) quit
[~/t]$rm core.42697
[~/t]$
[~/t]$mv 123456789012345 1234567890123456
[~/t]$./1234567890123456
Aborted (core dumped)
[~/t]$ll core*
-rw-------. 1 dev wheel 240K Apr 22 03:20 core.42721
[~/t]$gdb -q -c core.42721 1234567890123456
Reading symbols from /home/dev/t/1234567890123456...done.
warning: core file may not match specified executable file.
[New LWP 42721]
Core was generated by `./1234567890123456'.
Program terminated with signal 6, Aborted.
#0 0x00007f5b271fa1d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x00007f5b271fa1d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f5b271fb8c8 in __GI_abort () at abort.c:90
#2 0x0000000000400543 in func4 (f=1) at do_abort.c:3
#3 0x000000000040055f in func3 (f=1) at do_abort.c:4
#4 0x0000000000400576 in func2 (f=1) at do_abort.c:5
#5 0x000000000040058d in func1 (f=1) at do_abort.c:6
#6 0x000000000040059d in main () at do_abort.c:7
(gdb) quit
[~/t]$mv 1234567890123456 123456789012345
[~/t]$gdb -q -c core.42721 123456789012345
Reading symbols from /home/dev/t/123456789012345...done.
[New LWP 42721]
Core was generated by `./1234567890123456'.
Program terminated with signal 6, Aborted.
#0 0x00007f5b271fa1d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x00007f5b271fa1d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f5b271fb8c8 in __GI_abort () at abort.c:90
#2 0x0000000000400543 in func4 (f=1) at do_abort.c:3
#3 0x000000000040055f in func3 (f=1) at do_abort.c:4
#4 0x0000000000400576 in func2 (f=1) at do_abort.c:5
#5 0x000000000040058d in func1 (f=1) at do_abort.c:6
#6 0x000000000040059d in main () at do_abort.c:7
(gdb) quit
Following through the gdb source code I discovered that the ELF core file structure only reserves sixteen bytes to hold the executable filename, pr_fname[16], including the nul terminator (reference):
struct elf_external_linux_prpsinfo32_ugid32
{
  char pr_state; /* Numeric process state. */
  char pr_sname; /* Char for pr_state. */
  char pr_zomb; /* Zombie. */
  char pr_nice; /* Nice val. */
  char pr_flag[4]; /* Flags. */
  char pr_uid[4];
  char pr_gid[4];
  char pr_pid[4];
  char pr_ppid[4];
  char pr_pgrp[4];
  char pr_sid[4];
  char pr_fname[16]; /* Filename of executable. */
  char pr_psargs[80]; /* Initial part of arg list. */
};
gdb issues the "warning: core file may not match specified executable file." message when the name of the executable passed to gdb on the command line doesn't match the value stored in pr_fname[] in the core file (references here, here, and here).
In the demonstration at the start of this answer, when the executable is named 1234567890123456 (16 characters), the name stored in pr_fname[] is 123456789012345 (truncated to 15 characters plus the nul terminator). If gdb is started using gdb -c core.XXXX 1234567890123456, the names differ and the warning is issued. If gdb is started using gdb -c core.XXXX 123456789012345, the names match and the warning is not issued.
It should follow that in the example from the original question, if segfault.stripped was renamed to segfault.stripp and gdb was run using gdb ./segfault.stripp /tmp/core.segfault.stripp.11 then the warning should not be issued.
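To make the truncation concrete, here is a minimal sketch in C that mimics what happens to the name. The helper names core_fname() and fname_matches() are made up for illustration, and the comparison is assumed to be a plain string compare of the executable's basename against the stored 15-character name, which is what the demonstration above suggests; it is not gdb's actual code.

```c
#include <assert.h>
#include <string.h>

#define PR_FNAME_LEN 16 /* size of pr_fname[] in the struct shown above */

/* Return the part of path after the last '/', or path itself if there is none. */
const char *base_name(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Hypothetical helper: store the basename of path the way a 16-byte
   pr_fname[] field can hold it, i.e. at most 15 characters plus a nul. */
void core_fname(const char *path, char out[PR_FNAME_LEN])
{
    assert(path != NULL);
    const char *base = base_name(path);
    size_t n = strlen(base);
    if (n > PR_FNAME_LEN - 1)
        n = PR_FNAME_LEN - 1; /* anything past 15 characters is lost */
    memcpy(out, base, n);
    out[n] = '\0';
}

/* Hypothetical helper: 1 if the basename of exec_path equals the stored
   name exactly (assumed comparison), 0 if the warning would be expected. */
int fname_matches(const char *exec_path, const char *stored)
{
    return strcmp(base_name(exec_path), stored) == 0;
}
```

With this model, core_fname("./1234567890123456", buf) leaves 123456789012345 in buf, so fname_matches() fails for the 16-character name and succeeds for the 15-character one, matching the behaviour seen in the gdb sessions above.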
I have a C program that quits unexpectedly on Linux and I have a hard time finding out why (no core dump, see XIO: fatal IO error 11). I placed an atexit() at the beginning of the program and the callback function is indeed being called when the crash happens.
How can I know what called the atexit callback function? From reading the man page, atexit handlers are called at exit (d'oh!) or on return from main. I can exclude the latter because there are a bunch of printf calls at the end of main and I don't see their output. And I can exclude the former simply because there are no exit() calls in my program.
That leaves only one solution: exit is being called from a library function. Is that the only possibility? And how can I know from where? Is it possible to print out a stack trace or force a core dump from inside the atexit callback?
Call abort(), for example, in your atexit handler and inspect the core dump in gdb. If the atexit handler is run, the gdb backtrace command shows you where the program is exiting from. Here's a demonstration:
#include <stdlib.h>
void exit_handler(void)
{
abort();
}
void startup()
{
#ifdef DO_EXIT
exit(99);
#endif
}
int main(int argc, char *argv[])
{
atexit(exit_handler);
startup();
return 0;
}
And doing this:
$ gcc -DDO_EXIT -g atexit.c
$ ulimit -c unlimited
$ ./a.out
Aborted (core dumped)
$ gdb ./a.out core.28162
GNU gdb (GDB) Fedora 7.7.1-19.fc20
..
Core was generated by `./a.out'.
Program terminated with signal SIGABRT, Aborted.
#0 0xb77d7424 in __kernel_vsyscall ()
Missing separate debuginfos, use: debuginfo-install glibc-2.18-16.fc20.i686
(gdb) bt
#0 0xb77d7424 in __kernel_vsyscall ()
#1 0x42e1a8e7 in raise () from /lib/libc.so.6
#2 0x42e1c123 in abort () from /lib/libc.so.6
#3 0x0804851b in exit_handler () at atexit.c:6
#4 0x42e1dd61 in __run_exit_handlers () from /lib/libc.so.6
#5 0x42e1ddbd in exit () from /lib/libc.so.6
#6 0x0804852d in startup () at atexit.c:12
#7 0x08048547 in main (argc=1, argv=0xbfc39fb4) at atexit.c:21
As expected, it shows startup() calling exit.
You can of course debug this interactively too: start your program in gdb and set a breakpoint in the atexit handler.
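If you would rather print a stack trace than force a core dump, glibc's backtrace() and backtrace_symbols_fd() from <execinfo.h> can be called from inside the handler. The following is only a sketch: the names dump_stack() and trace_exit_handler() are made up for illustration, and it assumes a glibc system. Function names from your own executable typically only show up if you link with -rdynamic; otherwise you get raw addresses.

```c
#include <assert.h>
#include <execinfo.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Write the current call stack to fd and return the number of frames captured. */
int dump_stack(int fd)
{
    void *frames[MAX_FRAMES];
    assert(fd >= 0);
    int n = backtrace(frames, MAX_FRAMES);
    /* Writes one line per frame straight to fd, without calling malloc(). */
    backtrace_symbols_fd(frames, n, fd);
    return n;
}

/* Hypothetical handler: register it with atexit(trace_exit_handler)
   at the start of main(). */
void trace_exit_handler(void)
{
    static const char msg[] = "atexit handler reached from:\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
    dump_stack(STDERR_FILENO);
}
```

When exit() is called from a library, the frames printed below the handler should include exit() and whatever called it, much like the gdb backtrace above.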
The standard only says "at normal program termination", so maybe on Linux this is more than exit or return from main. Also you forgot pthread_exit, which also may terminate the thread of main and thus the whole program.
In any case, there is no way to see immediately where the termination was triggered. The atexit handlers are usually called by the initialization function. By definition, all application code other than the atexit handlers is gone at that point.
You could try to trace execution through a debugger to nail down the place where the termination happens.