Fixing AddressSanitizer: strcpy-param-overlap with memmove? - c

I am poking around in an old & quite buggy C program. When compiled with gcc -fsanitize=address I got this error while running the program itself:
==635==ERROR: AddressSanitizer: strcpy-param-overlap: memory ranges [0x7f37e8cfd5b5,0x7f37e8cfd5b8) and [0x7f37e8cfd5b5, 0x7f37e8cfd5b8) overlap
#0 0x7f390c3a8552 in __interceptor_strcpy /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cc:429
#1 0x56488e5c1a08 in backupExon src/BackupGenes.c:72
#2 0x56488e5c2df1 in backupGene src/BackupGenes.c:134
#3 0x56488e5c426e in BackupArrayD src/BackupGenes.c:227
#4 0x56488e5c0bb1 in main src/geneid.c:583
#5 0x7f390b6bfee2 in __libc_start_main (/usr/lib/libc.so.6+0x26ee2)
#6 0x56488e5bf46d in _start (/home/darked89/proj_soft/geneidc/crg_github/geneidc/bin/geneid+0x1c46d)
0x7f37e8cfd5b5 is located 3874229 bytes inside of 37337552-byte region [0x7f37e894b800,0x7f37eace71d0)
allocated by thread T0 here:
#0 0x7f390c41bce8 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cc:153
#1 0x56488e618728 in RequestMemoryDumpster src/RequestMemory.c:801
#2 0x56488e5bfcea in main src/geneid.c:305
#3 0x7f390b6bfee2 in __libc_start_main (/usr/lib/libc.so.6+0x26ee2)
The error was caused by this line:
/* backupExon src/BackupGenes.c:65 */
strcpy(d->dumpSites[d->ndumpSites].subtype, E->Acceptor->subtype);
I have replaced it with:
memmove(d->dumpSites[d->ndumpSites].subtype, E->Acceptor->subtype,
strlen(d->dumpSites[d->ndumpSites].subtype));
The error went away and the program output produced with 2 different data inputs is identical to the results obtained before the change. BTW, more of strcpy bugs remain further down in the source. I need a confirmation that this is the way to fix it.
The issue & the rest of the code is here:
https://github.com/darked89/geneidc/issues/2

Assuming that E->Acceptor->subtype is at least as long as d->dumpSites[d->ndumpSites].subtype then there's no problem. You might want to check that first if you didn't already. Actually, you need a +1 to also copy the string terminator (\0), thanks #R.. for spotting it.
Your previous code was making a different assumption: it was assuming that d->dumpSites[d->ndumpSites].subtype was at least as long as E->Acceptor->subtype (the opposite basically).
The real equivalent would be:
memmove(
d->dumpSites[d->ndumpSites].subtype,
E->Acceptor->subtype,
strlen(E->Acceptor->subtype) + 1
);
This is the correct way to fix the code to allow overlapping.

Related

I was expecting segmentation fault or some kind of out of bound exception but did not get it when using command line arguments in a C program

I am learning C programming from "Learn c the hard way by Zed Shaw". He asks the learner to try and break their own code.
So I tried the following C code and thought printing more values that I gave argv will break it but it did not until later.
#include<stdio.h>
int main(int argc, char *argv[])
{
int i = 0;
printf("This is argc: %d\n",argc);
printf("This is argv[argc]: %s\n",argv[argc]);
printf("This is argv[0]: %s\n",argv[0]);
for(i=argc;i<100;i++)
printf("This is argv[%d]: %s\n",i,argv[i]);
for(i=1;i<argc;i++)
{
printf("arg %d: %s\n",i,argv[i]);
}
return 0;
}
When I try to print argv upto 100:
I see the following when I was expecting some kind of out of bound or segmentation fault.
./exp10_so These are cmd args
This is argc: 5
This is argv[argc]: (null)
This is argv[0]: ./exp10_so
This is argv[5]: (null)
This is argv[6]: TERMINATOR_DBUS_NAME=net.tenshu.Terminator21a9d5db22c73a993ff0b42f64b396873
This is argv[7]: GTK_RC_FILES=/etc/gtk/gtkrc:/home/ab/.gtkrc:/home/ab/.config/gtkrc
This is argv[8]: _=/home/ab/Projects/learn_c_the_hard_way/./exp10_so
This is argv[9]: LANG=en_IN
This is argv[10]: GTK3_MODULES=xapp-gtk3-module
This is argv[11]: XDG_CURRENT_DESKTOP=KDE
This is argv[12]: QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1
This is argv[13]: LC_IDENTIFICATION=en_IN
This is argv[14]: XCURSOR_THEME=breeze_cursors
This is argv[15]: XDG_SESSION_CLASS=user
This is argv[16]: XDG_SESSION_TYPE=x11
This is argv[17]: SHLVL=1
This is argv[18]: TERMINATOR_UUID=urn:uuid:4496f24b-8a64-43af-ab5a-03fc7e722242
This is argv[19]: DESKTOP_SESSION=plasma
This is argv[20]: LC_MEASUREMENT=en_IN
This is argv[21]: OLDPWD=/home/ab/Projects
This is argv[22]: HOME=/home/ab
This is argv[23]: KDE_SESSION_VERSION=5
This is argv[24]: USER=ab
This is argv[25]: TERMINATOR_DBUS_PATH=/net/tenshu/Terminator2
This is argv[26]: SESSION_MANAGER=local/tgh:#/tmp/.ICE-unix/2372,unix/tgh:/tmp/.ICE-unix/2372
This is argv[27]: XDG_SESSION_PATH=/org/freedesktop/DisplayManager/Session1
This is argv[28]: DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
This is argv[29]: XDG_VTNR=1
This is argv[30]: XDG_SEAT=seat0
This is argv[31]: LC_NUMERIC=en_IN
This is argv[32]: BROWSER=/usr/bin/firefox
This is argv[33]: GTK_MODULES=canberra-gtk-module
This is argv[34]: XDG_SEAT_PATH=/org/freedesktop/DisplayManager/Seat0
This is argv[35]: XDG_DATA_DIRS=/home/ab/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share:/var/lib/snapd/desktop
This is argv[36]: XDG_SESSION_DESKTOP=KDE
This is argv[37]: VTE_VERSION=6401
This is argv[38]: KDE_SESSION_UID=1000
This is argv[39]: LC_TIME=en_IN
This is argv[40]: MAIL=/var/spool/mail/ab
This is argv[41]: LOGNAME=ab
This is argv[42]: QT_AUTO_SCREEN_SCALE_FACTOR=0
This is argv[43]: LC_PAPER=en_IN
This is argv[44]: PATH=/usr/local/nginx/sbin:/home/ab/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/var/lib/snapd/snap/bin
This is argv[45]: QT_SCREEN_SCALE_FACTORS=LVDS1=1;DP1=1;HDMI1=1;VGA1=1;VIRTUAL1=1;
This is argv[46]: XDG_RUNTIME_DIR=/run/user/1000
This is argv[47]: SHELL=/bin/zsh
This is argv[48]: XDG_SESSION_ID=2
This is argv[49]: LC_MONETARY=en_IN
This is argv[50]: GTK2_RC_FILES=/etc/gtk-2.0/gtkrc:/home/ab/.gtkrc-2.0:/home/ab/.config/gtkrc-2.0
This is argv[51]: LC_TELEPHONE=en_IN
This is argv[52]: EDITOR=/usr/bin/nano
This is argv[53]: COLORTERM=truecolor
This is argv[54]: MOTD_SHOWN=pam
This is argv[55]: KDE_APPLICATIONS_AS_SCOPE=1
This is argv[56]: PAM_KWALLET5_LOGIN=/run/user/1000/kwallet5.socket
This is argv[57]: KDE_FULL_SESSION=true
This is argv[58]: XAUTHORITY=/home/ab/.Xauthority
This is argv[59]: LC_NAME=en_IN
This is argv[60]: DISPLAY=:0
This is argv[61]: LC_ADDRESS=en_IN
This is argv[62]: PWD=/home/ab/Projects/learn_c_the_hard_way
This is argv[63]: XCURSOR_SIZE=24
This is argv[64]: TERM=xterm-256color
This is argv[65]: ZSH=/home/ab/.oh-my-zsh
This is argv[66]: PAGER=less
This is argv[67]: LESS=-R
This is argv[68]: LSCOLORS=Gxfxcxdxbxegedabagacad
This is argv[69]: LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
This is argv[70]: LD_LIBRARY_PATH=/usr/local/lib
This is argv[71]: (null)
AddressSanitizer:DEADLYSIGNAL
=================================================================
==69851==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000021 (pc 0x7f3c30d7b4c6 bp 0x7ffe273b2ba0 sp 0x7ffe273b22e8 T0)
==69851==The signal is caused by a READ memory access.
==69851==Hint: address points to the zero page.
#0 0x7f3c30d7b4c6 in __sanitizer::internal_strlen(char const*) /build/gcc/src/gcc/libsanitizer/sanitizer_common/sanitizer_libc.cpp:167
#1 0x7f3c30d0d057 in printf_common /build/gcc/src/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors_format.inc:545
#2 0x7f3c30d0d41c in __interceptor_vprintf /build/gcc/src/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1639
#3 0x7f3c30d0d517 in __interceptor_printf /build/gcc/src/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1697
#4 0x562c5e03f290 in main /home/ab/Projects/learn_c_the_hard_way/exp10_so.c:13
#5 0x7f3c30b0ab24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
#6 0x562c5e03f0bd in _start (/home/ab/Projects/learn_c_the_hard_way/exp10_so+0x10bd)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /build/gcc/src/gcc/libsanitizer/sanitizer_common/sanitizer_libc.cpp:167 in __sanitizer::internal_strlen(char const*)
==69851==ABORTING
A segmentation fault happens when the code try to access a memory region that is not available.
Accessing an array out of bounds doesn't means that the memory before or after the area occupied by the array is not available: The compiler or the runtime usually put all varibales or data in general in a given block of memory. If your array is the last item of such a memory block, the accessing it with a to big index will produce a Segmentaion Fault but is the array is in the middle of the memory block, you will just access memory used for other data, giving unexpected result and undefined behavior.
If the array (In may example, but valid for anything) is written, accessing available memory will not produce a segmentation fault but will overwrite something else. It may produce unexpected results or crash or segmentation fault later! This kind of bug is frequently very difficult to find because the unexpected result/behavior looks completely independent of the root cause.

Jenkins hash in C, keys with size that is not multiple of 4 and address sanitizer

In the project I'm currently working on (in C), we're currently
keeping a hash table of some opaque objects. We're using the DPDK for
I/O in our app (version 16.07.2, unfortunately), and we're using the
rte_hash code for hashing our object. Trouble is, the objects we want
to hash have weird, non-rounded sizes like say 83 (or 18 as in the
example below), and address sanitizer complains about
heap-buffer-overflow (on read) - trying to read bytes after the end of
the region:
==4926==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60300007a9c0 at pc 0x000000451573 bp 0x7fff69175040 sp 0x7fff69175030
READ of size 4 at 0x60300007a9c0 thread T10ESC[1mESC[0m
#0 0x451572 in __rte_jhash_2hashes /path/to/../dpdk/usr/include/dpdk/rte_jhash.h:155
#1 0x452bb6 in rte_jhash_2hashes /path/to/../dpdk/usr/include/dpdk/rte_jhash.h:266
#2 0x452c75 in rte_jhash /path/to/../dpdk/usr/include/dpdk/rte_jhash.h:309
0x60300007a9c2 is located 0 bytes to the right of 18-byte region [0x60300007a9b0,0x603
00007a9c2)
As far as I can tell, the problem is here in rte_jhash.h (see here for
code in the latest DPDK, it's unchanged as far as I can tell:
http://dpdk.org/doc/api/rte__jhash_8h_source.html):
case 6:
b += k[1] & LOWER16b_MASK; a += k[0]; break;
The code reads k[1] as a uint32_t, and then ANDs the value so that the
last 2 bytes are discarded. As far as I can tell, address sanitizer
complains about the uint32_t read, when only the first 2 bytes are
actually marked as readable. This makes sense, but the rte_hash code
boasts that it can use keys of any size. So my question is - is this
problem theoretical only? Or would it be possible to cause a crash with
this, maybe with a weird sized object that happens to be at the end of
a page? We're running on x86-64.
A few months ago, a change in the DPDK added something in the comments about this (see http://dpdk.org/browse/dpdk/commit/lib/librte_hash?id=0c57f40e66c8c29c6c92a7b0dec46fcef5584941), but I would've expected the wording to be more harsh if a crash was possible.
UPDATE: sample code to reproduce the warning. Compile with:
gcc -o jhash_malloc -Wall -g -fsanitize=address -I /path/to/dpdk/x86_64-native-linuxapp-gcc/include/ jhash_malloc.c
And the code:
#include <stdio.h>
#include <rte_jhash.h>
#include <stdlib.h>
#include <unistd.h>
int main()
{
size_t strSize = 13;
char *str = malloc(strSize);
memset(str, 'g', strSize);
uint32_t hval = rte_jhash(str, strSize, 0);
printf("Hash of %s (size %zu) is %u\n", str, strSize, hval);
free(str);
return 0;
}
UPDATE2: And the output:
==27276==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200000effc at pc 0x000000401315 bp 0x7ffdea936f80 sp 0x7ffdea936f70
READ of size 4 at 0x60200000effc thread T0
#0 0x401314 in __rte_jhash_2hashes /home/stefan/src/dpdk-17.08/x86_64-native-linuxapp-gcc/include/rte_jhash.h:165
#1 0x402771 in rte_jhash_2hashes /home/stefan/src/dpdk-17.08/x86_64-native-linuxapp-gcc/include/rte_jhash.h:266
#2 0x402830 in rte_jhash /home/stefan/src/dpdk-17.08/x86_64-native-linuxapp-gcc/include/rte_jhash.h:309
#3 0x4028e7 in main /home/stefan/src/test/misc/jhash_malloc.c:12
#4 0x7f470cb1f82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#5 0x400978 in _start (/home/stefan/src/test/misc/jhash_malloc+0x400978)
0x60200000effd is located 0 bytes to the right of 13-byte region [0x60200000eff0,0x60200000effd)
UPDATE3: the original Jenkins hash code seems to be this: http://burtleburtle.net/bob/c/lookup3.c. There is an interesting comment in the source that suggests the asan / valgrind warning can be ignored:
* "k[2]&0xffffff" actually reads beyond the end of the string, but
* then masks off the part it's not allowed to read. Because the
* string is aligned, the masked-off tail is in the same word as the
* rest of the string. Every machine with memory protection I've seen
* does it on word boundaries, so is OK with this. But VALGRIND will
* still catch it and complain. The masking trick does make the hash
* noticably faster for short strings (like English words).
Of course, if you want to hash parts of a larger object that's malloc-ed, you could still be in trouble.
You are right, if the key you are passing to rte_jhash() is happened to be at the end of the page and the next page if not readable, the app will crash. The commit you are referring to is basically fixing it, but in the documentation, not in the code.
The solution would be either to:
make sure all the keys in your code are aligned and padded to the 4 bytes; (also see notes below)
OR fix the key length in your code to be multiply of 4;
OR copy-paste rte_jhash() in your project and fix it and later send the fix to the DPDK mailing list.
Note 1: usually structures in C are already aligned and padded to the largest primitive data type of a structure. So this explicit padding should not cause any performance/memory issues, unless the structure is packed.
Note 2: if the keys are manages by DPDK library (i.e. you use DPDK Cuckoo Hash library), the storage for the keys will be aligned and padded internally, so there is nothing to be worried about.
Overall, if your keys are managed externally (i.e. by another process, or you receive those from network etc), it might be a real issue. Otherwise, there are quite easy ways to fix those...

Call stack backtrace in C

I am trying to get call stack backtrace at my assert/exception handler. Can't include "execinfo.h" therefore can't use int backtrace(void **buffer, int size);.
Also, tried to use __builtin_return_address() but acording to :http://codingrelic.geekhold.com/2009/05/pre-mortem-backtracing.html
... on some architectures, including my beloved MIPS, only __builtin_return_address(0) works.MIPS has no frame pointer, making it difficult to walk back up the stack. Frame 0 can use the return address register directly.
How can I reproduce full call stack backtrace?
I have successfully used the method described here, to get a call trace from stack on MIPS32.
You can then print out the call stack:
void *retaddrs[16];
int n, i;
n = get_call_stack_no_fp (retaddrs, 16);
printf ("CALL STACK: ");
for (i = 0; i < n; i++) {
printf ("0x%08X ", (uintptr_t)retaddrs[i]);
}
printf ("\r\n");
... and if you have the ELF file, then use the addr2line to convert the return addresses to function names:
addr2line -a -f -p -e xxxxxxx.elf addr addr ...
There are of course many gotchas, when using a method like this, including interrupts and exception handlers or results of code optimization. But nevertheless, it might be helpful sometimes.
I have successfully used the method suggested by #Erki A and described here.
Here is a short summary of the method:
The problem:
get a call stack without a frame pointer.
Solution main idea:
conclude from the assembly code what the debugger understood from debug info.
The information we need:
1. Where the return address is kept.
2. What amount the stack pointer is decremented.
To reproduce the whole stack trace one need to:
1. Get the current $sp and $ra
2. Scan towards the beginning of the function and look for "addui
sp,sp,spofft" command (spofft<0)
3. Reprodece prev. $sp (sp- spofft)
4. Scan forward and look for "sw r31,raofft(sp)"
5. Prev. return address stored at [sp+ raofft]
Above I described one iteration. You stop when the $ra is 0.
How to get the first $ra?
__builtin_return_address(0)
How to get the first $sp?
register unsigned sp asm("29");
asm("" : "=r" (sp));
***Since most of my files compiled with micro-mips optimisation I had to deal with micro-mips-ISA.
A lot of issues arose when I tried to analyze code that compiled with microMips optimization(remember that the goal at each step is to reproduce prev. ra and prev. sp):
It makes things a bit more complicated:
1. ra ($31) register contain unaligned return address.
You may find more information at Linked questions.
The unaligned ra helps you understand that you run over different
ISA(micro-mips-isa)
2. There are functions that do not move the sp. You can find more
information [here][3].
(If a "leaf" function only modifies the temporary registers and returns
to a return statement in its caller's code, then there is no need for
$ra to be changed, and there is no need for a stack frame for that
function.)
3. Functions that do not store the ra
4. MicroMips instructions can be both - 16bit and 32bit: run over the
commnds using unsinged short*.
5. There are functions that perform "addiu sp, sp, spofft" more than once
6. micro-mips-isa has couple variations for the same command
for example: addiu,addiusp.
I have decided to ignore part of the issues and that is why it works for 95% of the cases.

Modified stack in multi-threaded case

We're loading a symbol from a shared library via dlsym() under GNU/Linux and obviously get some kind of race condition resulting in a segmentation fault. The backtrace looks something like this:
(gdb) backtrace
#0 do_lookup_x at dl-lookup.c:366
#1 _dl_lookup_symbol_x at dl-lookup.c:829
#2 do_sym at dl-sym.c:168
#3 _dl_sym at dl-sym.c:273
#4 dlsym_doit at dlsym.c:50
#5 _dl_catch_error at dl-error.c:187
#6 _dlerror_run at dlerror.c:163
#7 __dlsym at dlsym.c:70
#8 ... (our code)
My local machine uses glibc-2.23.
I discovered, that the library handle given to __dlsym() in frame #7 is different to the handle passed to _dlerror_run(). It runs wild in the following lines in dlsym.c:
void *
__dlsym (void *handle, const char *name DL_CALLER_DECL)
{
# ifdef SHARED
if (__glibc_unlikely (_dlfcn_hook != NULL))
return _dlfcn_hook->dlsym (handle, name, DL_CALLER);
# endif
struct dlsym_args args;
args.who = DL_CALLER;
args.handle = handle; /* <------------------ this isn't my handle! */
args.name = name;
/* Protect against concurrent loads and unloads. */
__rtld_lock_lock_recursive (GL(dl_load_lock));
void *result = (_dlerror_run (dlsym_doit, &args) ? NULL : args.sym);
__rtld_lock_unlock_recursive (GL(dl_load_lock));
return result;
}
GDB says
(gdb) frame 7
#7 __dlsym at dlsym.c:70
(gdb) p *(struct link_map *)args.handle
$36 = {l_addr= 140736951484536, l_name = 0x7fffe0000078 "\300\215\r\340\377\177", ...}
so this is obviously garbage. The same occurs in the higher frames, e.g. in frame #2:
(gdb) frame 2
#2 do_sym at dl-sym.c:168
(gdb) p handle
$38 = {l_addr= 140736951484536, l_name = 0x7fffe0000078 "\300\215\r\340\377\177", ...}
Unfortunately the parameter handle in frame #7 can't be displayed:
(gdb) p handle
$37 = <optimized out>
but surprisingly in frame #8 and further down in our code the handle was correct:
(gdb) frame 8
#8 ...
(gdb) p *(struct link_map *)libHandle
$38 = {l_addr = 140737160646656, l_name = 0x7fffd8005b60 "/path/to/libfoo.so", ...}
Now my conclusion is, that the variable args must be modified during the execution inside __dlsym() but I can't see where and why.
I have to confess, there's a second aspect to this problem: It only occurs in a multi-threaded environment and only sometimes. But as you can see, there are some counter measures for race conditions in the implementation of __dlsym() since they're calling __rtld_lock_(un)lock_recursive() and the local variable args isn't shared across threads. And curiously enough, the problem still persists, if I make frame #8 mutual exclusive among my threads.
Questions: What are possible sources for the discrepancy in the library handle between frame #8 and frame #7?
Question 2: Does dlopen() yield different values for different threads? Or to put it differently: Is it possible to share the handles returned by dlopen() between different threads.
Update: I thank everybody commenting on this question and trying to answer it despite the lack of almost any viable information to do so. I found the solution of this problem. As foreseen by the commenters, it was totaly unrelated to the stacktraces and other information I provided. Hence, I consider this question as closed and will flag it for deletion. So Long, and Thanks for All the Fish
What are possible sources for the discrepancy in the library handle between frame #8 and frame #7?
The most likely cause is mismatch between ld-linux.so and libdl.so. As stated in this answer, ld-linux and libdl must come from the same build of GLIBC, or bad things will happen.
The mismatch can come from (A) trying to point to a different libc build via LD_LIBRARY_PATH, or (B) by static linking of libdl.a into the program.
The (gdb) info shared should show you which libraries are currently loaded. If you see something other than installed system ld-linux and libdl, then (A) is likely your problem.
For (B), you probably got (and ignored) a linker warning to the effect that your program will require at runtime the same libc version that you used to link it. Contrary to popular belief, fully-static binaries are less portable on Linux, not more.

Debugging segfault with no apparent cause in gdb?

gdb was reporting that my C code was crashing somewhere in malloc(), so I linked my code with Electric Fence to pinpoint the actual source of the memory error. Now my code is segfaulting much earlier, but gdb's output is even more confusing:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x30026b00 (LWP 4003)]
0x10007c30 in simulated_status (axis=1, F=0x300e7fa8, B=0x1003a520, A=0x3013b000, p=0x1003b258, XS=0x3013b000)
at ccp_gch.c:799
EDIT: The full backtrace:
(gdb) bt
#0 0x10007c30 in simulated_status (axis=1, F=0x300e7fa8, B=0x1003a520, A=0x3013b000, p=0x1003b258, XS=0x3013b000)
at ccp_gch.c:799
#1 0x10007df8 in execute_QUERY (F=0x300e7fa8, B=0x1003a520, iData=0x7fb615c0) at ccp_gch.c:836
#2 0x10009680 in execute_DATA_cmd (P=0x300e7fa8, B=0x7fb615cc, R_type=0x7fb615d0, iData=0x7fb615c0)
at ccp_gch.c:1581
#3 0x10015bd8 in do_volley (client=13) at session.c:76
#4 0x10015ef4 in do_dialogue (v=12, port=2007) at session.c:149
#5 0x10016350 in do_session (starting_port=2007, ports=1) at session.c:245
#6 0x100056e4 in main (argc=2, argv=0x7fb618f4) at main.c:271
The relevant code (slightly modified due to reasons):
796 static uint32_t simulated_status(
797 unsigned axis, struct foo *F, struct bar *B, struct Axis *A, BAZ *p, uint64_t *XS)
798 {
799 uint32_t result = A->status;
800 *XS = get_status(axis);
801 if (!some_function(p)) {
802 ...
The obvious thing to check would be whether A->status is valid memory, but it is. Removing the assignment pushes the segfault to line 800, and removing that assignment causes some other assignment in the if-block to segfault. It looks as though either accessing an argument passed to the function or writing to a local variable is what's causing the segfault, but everything points to valid memory according to gdb.
How am I to interpret this? I've never seen anything like this before, so any suggestions / pointers in the right direction would be appreciated. I'm using GNU gdb 6.8-debian, Electric Fence 2.1, and running on a PowerPC 405 (uname reports Linux powerpmac 2.6.30.3 #24 [...] ppc GNU/Linux).
I'm guessing, but your symptoms are similar to what could happen in a stack overflow situation. The -fstack-protector suggestion in the comments is on the right track here. I'd recommend adding the -fstack-check option as well.
If the SEGV is occurring because of writes to the guard page protecting the stack then an info registers and info frame in gdb would help confirm if this is the case.

Resources