EXC_BAD_ACCESS (KERN_INVALID_ADDRESS) during execution of malloc() - c

I am compiling a C library on Mac OS X Snow Leopard with the following GCC:
Diderot:~ brandizzi$ gcc -v
Using built-in specs.
Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5666.3~6/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5666) (dot 3)
When I run the unit tests of this library (written with CuTest), one of the tests fails with an EXC_BAD_ACCESS signal. That is a common problem and I have some understanding of it: I'm a Linux guy who knows it as a "Segmentation fault", and I know what is happening and the usual ways to solve it. What is surprising is that the bad access happens inside the malloc function itself. Look at this backtrace I got in GDB:
(gdb) bt
#0 0x00007fff89000a34 in tiny_free_list_add_ptr ()
#1 0x00007fff88ffe147 in tiny_malloc_from_free_list ()
#2 0x00007fff88ffcfdd in szone_malloc_should_clear ()
#3 0x00007fff88ffceaa in malloc_zone_malloc ()
#4 0x00007fff88ffb1a8 in malloc ()
#5 0x0000000100008c72 in util_copy_string (string=0x100008e48 "libsecretary") at src/util.c:7
#6 0x0000000100008126 in project_new (name=0x100008e48 "libsecretary") at src/project.c:8
#7 0x00000001000078b9 in secretary_start (secretary=0x10080b000, name=0x100008e48 "libsecretary") at src/secretary.c:23
#8 0x00000001000020f8 in test_secretary_move_task_from_project_to_project (test=0x1001005b0) at src/test/secretary.c:146
#9 0x0000000100006eae in CuTestRun (tc=0x1001005b0) at cutest/CuTest.c:143
#10 0x00000001000075c1 in CuSuiteRun (testSuite=0x100800000) at cutest/CuTest.c:289
#11 0x0000000100001527 in RunAllTests () at src/test/run_all.c:22
#12 0x000000010000156b in main () at src/test/run_all.c:32
This test case has the following lines, and the error always happens at the fourth one. If I switch the lines in any way, the problem still happens at the fourth one:
Secretary *secretary = secretary_new();
Task *task = secretary_appoint(secretary, "Test task transference");
Project *destination = secretary_start(secretary, "Chocrotary");
Project *origin = secretary_start(secretary, "libsecretary");
So, how can malloc() cause such a problem? I do not even pass a pointer to it! Is it a bug? Has anyone seen something like this?
Thanks in advance!

Most likely, something earlier in the program's execution is writing to memory it isn't entitled to, corrupting the heap's data structures. Then, later, malloc gets called and tries to follow a pointer that's been overwritten with nonsense (or index something via a value that's being overwritten with nonsense, or whatever), and boom.
You might want to try running your test suite under valgrind to see where things first start going wrong.

There are an awful lot of possible reasons for this problem: the memory was never allocated, the pointer points to the wrong place, and so on.
In my case, I was allocating an array of e.g. Project with MAX_PROJECT_COUNT positions. I wrote
Project *array = malloc(MAX_PROJECT_COUNT);
but it does not consider the size of the Project struct! The correct solution would be
Project *array = malloc(MAX_PROJECT_COUNT*sizeof(Project));
Note, however, that your problem may be considerably different so it is not possible to apply the same solution.
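One way to make this mistake harder to repeat is to let the compiler derive the element size from the pointer itself. The sketch below is only an illustration: the Project struct layout, the MAX_PROJECT_COUNT value and the alloc_projects helper are assumptions, not code from the library in the question.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the Project type in the question. */
typedef struct {
    char *name;
    int task_count;
} Project;

/* Hypothetical value; the real constant is not shown in the question. */
#define MAX_PROJECT_COUNT 64

/* Allocate `count` zero-initialized Project slots.
 * calloc takes the element count and the element size as separate
 * arguments, so forgetting sizeof is harder, and `sizeof *array`
 * stays correct even if the type of `array` changes later. */
Project *alloc_projects(size_t count)
{
    Project *array = calloc(count, sizeof *array);
    return array; /* NULL on failure; the caller checks it and later calls free() */
}
```

A call such as alloc_projects(MAX_PROJECT_COUNT) then reserves room for the whole array rather than MAX_PROJECT_COUNT bytes.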

Related

Manipulating x64 Unwind Info To Match Assembly Hook

Edit: I appear to have been mistaken, the backtrace works wonderfully from anywhere on Linux -- it is only when remote debugging from gdb on ubuntu to remote windows that the stacktrace gets absolutely destroyed after entering one of the memory allocation functions in msvcrt... dammit microsoft.
And this happens for both 64bit and 32bit windows, so I'm not sure this is related to the unwind information...
Edit: It appears adding -g3 and -Og has helped with part of the issue in some programs but the problem still persists in other programs, cannot post their source here as it is IP of my company -- sorry!
Background
I am using gcc to compile ubuntu->ubuntu and mingw to compile ubuntu->windows.
I have created a cross platform (linux + windows) memory tracking & leak detection library which hooks malloc/calloc/realloc/free with an assembly bytepatch on the first instructions (not IAT/PLT hooking).
The hook redirects to a gate which checks if the hooks are enabled in the current thread and redirects to the memory tracking hook function if they are, otherwise it just redirects to the trampoline of the real function if they are disabled for that thread.
The library works great and detects leaks on linux/windows (probably would work on mac but I don't have one).
I use the library to programmatically detect leaks from within my code, I can install callbacks on the memory allocation routines and programmatically raise breakpoints (by looping and waiting for debugger to attach then executing asm("int3")) inside the callbacks so that I can attach to my program while it's inside of a call that leaks memory.
Everything works great up until I try to view a backtrace from within my callback. I understand this is probably because the unwind information no longer matches my stack, since I have inserted new frames and data via my hook routines.
Edit: If I am mistaken about the unwind info mismatching the stack being the cause of the incorrect backtrace then please correct me!
The Question
Are there any small hacks I can do to trick GDB into correctly rebuilding the backtrace from within my hook callbacks?
I understand that I can manually walk and edit the unwind info with libdwarf or something but I imagine that would be incredibly cumbersome and large.
So I am wondering if perhaps there is a hack or a cheat I can do which would trick GDB into properly rebuilding the backtrace?
If there are no easy hacks or tricks then what are all of my options for fixing this issue?
Edit: Just to clear up the exact call order of everything:
program
V
malloc
V
hook_malloc -> hooks are disabled -> return malloc trampoline -> real malloc -> program
V
hooks are enabled
V
Call original malloc -> malloc trampoline -> real malloc -> returns to hook
V
Record memory size/info etc from malloc
V
Call user defined callback -> User defined callback -> returns to hook
V
return to program
It is the "User Defined Callback" where I want to capture a backtrace
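The call order above can be sketched in plain C. This is only an illustration under stated assumptions: the names hooks_enabled, set_hooks_enabled, set_malloc_callback and record_size are hypothetical, the "trampoline" is simply a direct call to malloc instead of a byte-patched jump, and the re-entry guard is an addition that the question does not describe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread switch standing in for the gate's check (C11). */
static _Thread_local int hooks_enabled = 0;

/* Hypothetical callback type: receives the requested size and the result. */
typedef void (*malloc_callback_fn)(size_t size, void *result);
static malloc_callback_fn user_callback = NULL;

void set_hooks_enabled(int enabled) { hooks_enabled = enabled; }
void set_malloc_callback(malloc_callback_fn cb) { user_callback = cb; }

/* Example callback that only records the last requested size. */
size_t last_recorded_size = 0;
void record_size(size_t size, void *result)
{
    (void)result;
    last_recorded_size = size;
}

/* Stand-in for gate + hook. In the real library the call to malloc()
 * below would go through the trampoline holding the original bytes. */
void *hook_malloc(size_t size)
{
    if (!hooks_enabled)
        return malloc(size);          /* hooks disabled: straight to the real function */

    hooks_enabled = 0;                /* assumed guard: do not re-enter from the callback */
    void *result = malloc(size);      /* call the original, which returns to the hook */
    if (user_callback)
        user_callback(size, result);  /* the frame where the backtrace is wanted */
    hooks_enabled = 1;
    return result;                    /* return to the program */
}
```

With hooks disabled, hook_malloc(16) behaves like malloc(16) and the callback is never called; after set_hooks_enabled(1) and set_malloc_callback(record_size), each allocation passes through record_size before returning to the caller.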
Apparently this is the same problem as "GDB Windows ?? in Backtraces".
And the solution was to simply add -g3 to the mingw compile flags and voilà, I have non-broken backtraces!
Edit: Never mind, this isn't the whole answer. This fix worked for some test programs, but other programs still show incorrect backtraces like:
(gdb) bt
#0 malloc_callback (s=38, rv=0x2c5058) at test_dll.c:729
#1 0x000000000040731d in hook_malloc_raw (file=0x410ea1 <__FUNCTION__.63079+55> "", function=0x410ea1 <__FUNCTION__.63079+55> "", line=0, s=38, rv=8791758343065)
#2 0x0000000000407367 in hook_malloc (s=38)
#3 0x000007fefda20b9e in ?? ()
#4 0x0000000000000026 in ?? ()
#5 0x0000000000410ea1 in __FUNCTION__.63079 ()
#6 0x0000000000000000 in ?? ()
Obviously Frame #4 isn't actually a stack frame, and I'm not sure why frame #5 is labeled "__FUNCTION__.63079".
Edit2: If people are going to downvote this at least leave a comment saying why

Is there a utility to print function call sequence for C programs?

I am trying to profile some DBMSs written in C, primarily the postgres binary. I would like to be able to use a utility to print out the sequence of function calls made by the program. As a simple example, take this program:
void func1 () {
printf("x\n");
}
void func2 () {
printf("y\n");
func1();
}
int main () {
func2();
func1();
return 0;
}
When compiled and executed with this "utility", I would like to see something along the lines of this:
-> main
-> func2
-> func1
-> func1
<-
Also, I cannot modify either the source code or the makefile, however -g is already enabled.
I know that I have used a profiler in the past that did something similar to this, but I cannot remember which one. I did some googling, and I couldn't find a good solution that did not require me to change either the source or the makefile.
What profiling tool can I use to accomplish this? Or does one not exist?
Thanks.
I know of no tools to do this directly, but you could use GDB breakpoint commands to get the stack traces along with regex break to break on the functions you're interested in. At that point you should be able to postprocess it to get the output in a format you desire.
For example, you could do something like this (edited for brevity):
$ gdb ./program
(gdb) rbreak program.c:.
...
(gdb) commands
>silent
>bt
>cont
>end
(gdb) run
...
#0 main () at program.c:22
...
#0 foo (number=113383) at program.c:4
#1 main () at program.c:22
...
Program exited normally.
(gdb)
What you want is "tracing", where entry and exit from "every" function is recorded in a file. This is provided in some interpreted languages, but not in compiled, that I know of. I don't know of any profilers that do this.
I quoted "every" because there are huge layers of function calls I'm guessing you don't want, like I/O, memory allocation and freeing, etc.
In fact, almost every single machine instruction is in effect a function call into the micro-code or internal functioning of the CPU chip. I suppose you don't care to see those :)
For any but the tiniest programs, such traces are way too long for any human to read.
If what you're really after is to understand what the program's doing with respect to time, the subject has been very much discussed, and this is my contribution.

How do I get the variable names and line numbers to appear in GDB?

I have a program that I'm writing and I'm getting some memory errors. I run it through gdb and do a where call after the memory error occurs, and I get the following output:
#0 0xb7fdd424 in __kernel_vsyscall ()
#1 0xb7f189b1 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2 0xb7e979fe in ?? () from /lib/i386-linux-gnu/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
I've had similar errors before, and the solution was to use the -ggdb compiler option. But I am using that option as you'll see in my Makefile:
shell: myshell.c
gcc -ansi -ggdb -Wall -pedantic-errors -o myshell myshell.c
Why aren't the line numbers or variable names showing up in gdb?
The reason you are not seeing line numbers or variable names is already provided by GDB itself:
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
GDB believes that some error created a situation that it cannot safely recover from, and does not try to decode the backtrace stack any further. It diagnoses the possible cause of the error to be a corrupt stack.
You can proceed to debug the issue in a few ways. One way is to run the program under valgrind to see whether it can identify the source of the corruption. If it is not already installed on your system, a package should be available through your distribution's package manager (probably apt-get or yum, depending on your Linux distribution).
You can also try stepping through your program within the debugger until the error occurs. This will at least tell you the line of code within your program at which the error occurs. You can then repeat the experiment up to the error, and inspect the state of your program before the error actually happens.

Easiest way to locate a Segmentation Fault

I encountered my first Segmentation Fault today (newbie programmer). After reading up on what a segmentation fault is (thanks for all of the helpful info on this site, as well as Wikipedia's lengthy explanation), I'm trying to determine the easiest way to find where my fault is occurring. The program is written in C and the error is occurring on a *NIX based system (I'm not sure which one to be honest... 99% sure it's Linux). I can't exactly post my code as I have numerous files that I'm compiling that are all quite lengthy. I was just hoping for some best practices you have all observed. Thanks for your help.
P.s. I'm thinking the error is coming from dereferencing a NULL pointer or using an uninitialized pointer. However, I could definitely be wrong.
Use a debugger such as gdb or, if that is not applicable, a tool such as strace to get better insight into where the segfault occurs.
If you use gcc, make sure you compile with the -g switch to include debugging information. Then gdb will show you the exact location in the source code where it segfaults.
For example, if we have this obvious segfaulty program:
new.c
#include <stdio.h>
int main()
{
int *i = (int *)0x478734;
printf("%d", *i);
}
We compile it with gcc -g new.c -o new and then run the gdb session with gdb new:
We issue the run command in the interactive session and the rest is clear:
(gdb) run
Starting program: /home/Tibor/so/new
[New Thread 9596.0x16a0]
[New Thread 9596.0x1de4]
Program received signal SIGSEGV, Segmentation fault.
0x0040118a in main () at new.c:6
6 printf("%d", *i);
(gdb)
As DasMoeh and netcoder have pointed out, once the segfault has occurred, you can use the backtrace command in the interactive session to print the call stack. This can help pinpoint the location of the segfault further.
The easiest way is to use valgrind. It will pinpoint the location where the invalid access occurs (and other problems which didn't cause a crash but were still invalid). Of course the real problem could be somewhere else in the code (e.g. an invalid pointer), so the next step is to check the source and, if still confused, use a debugger.
+1 for Tibor's answer.
On larger programs, or if you use additional libraries, it may also be useful to look at the backtrace with gdb: ftp://ftp.gnu.org/pub/old-gnu/Manuals/gdb/html_node/gdb_42.html
I am reviving this post for people passing by, since I have just fixed a segfault of my own while using gcc.
You should consider using the flag -fsanitize=address which can sometimes highlight your segfault with high precision.

Is there a downside to using -Bsymbolic-functions?

I recently discovered the linker option "-Bsymbolic-functions" in GNU ld:
-Bsymbolic
When creating a shared library, bind references to global symbols to the
definition within the shared library, if any. Normally, it is possible
for a program linked against a shared library to override the definition
within the shared library.
This option is only meaningful on ELF platforms which support shared libraries.
-Bsymbolic-functions
When creating a shared library, bind references to global function symbols
to the definition within the shared library, if any.
This option is only meaningful on ELF platforms which support shared libraries.
This seems to be the inverse of the GCC option -fvisibility=hidden, in that instead of preventing the export of the referenced function to other shared objects, it prevents library-internal references to that function from being bound to an exported function of a different shared object. I have read that -Bsymbolic-functions will prevent the creation of PLT entries for the functions, which is a nice side effect.
But I was wondering whether there is perhaps a finer-grained control over this, like overwriting -Bsymbolic for individual function definitions of a library.
Should I be aware of any pitfalls of using -Bsymbolic-functions? I plan to only use that, because the -Bsymbolic will break exceptions, I think (it will make it so that references to typeinfo objects are not unified, I think).
Thanks!
Answering my own question because I just earned a Tumbleweed badge for it... and I found out subsequently
But I was wondering whether there is perhaps a finer-grained control over this, like overwriting -Bsymbolic for individual function definitions of a library.
Yes, there is the option --dynamic-list which does exactly that
Should I be aware of any pitfalls of using -Bsymbolic-functions? I plan to only use that, because the -Bsymbolic will break exceptions, I think (it will make it so that references to typeinfo objects are not unified, I think).
I looked more into it, and it seems there is no issue. The libstdc++ library apparently does it or at least did consider it and they only had to add --dynamic-list-cpp-new to still have operator new unified (to prevent issues with multiple allocator / deallocators mixing up in a program but I would argue such programs are broken anyway). Ubuntu uses it or used it by default, and it seems it causes conflicts with some packages. But overall it should work nicely I expect.
I recently discussed this with one of the toolchain experts at SUSE. Here are his remarks:
"-Bsymbolic-functions is a thing from an old world which doesn't
exist anymore. It completely bypasses everything about what ELF can
provide, including visibility. When you're using it, everything is bound
locally. IOW: don't use it :)
No one should use -Bsymbolic-functions, it's a too big hammer for
most purposes."
How does -Bsymbolic-functions relate to library versioning (--version-script) ?
"-Bsymbolic-functions overrides anything, from linker
scripts, from GCC attributes or anywhere, about symbol visibilities or anything. It makes everything bind local, always, irrespective of
anything else that you might have added on command lines, or extra files,
or object files. (And yes, --dynamic-list= was a mis-guided attempt to
fix some of that and make -Bsymbolic* somewhat more friendly). So, yes, it takes precedence over linker script. It's a big hammer :)
"
"To be extra precise: -Bsymbolic-functions is not quite that same as linker
script global/local, which is probably a reason why people still use it
sometimes. While -Bsymbolic-functions does bind references to definitions
locally (like local: in linker scripts), it also keeps them exported
(like the global: ones). In ELF speak that would be somewhat like
PROTECTED visibility. Unfortunately that can't be expressed in a symbol
version script right now, only via GCC's __attribute__(visibility). So,
when people try to get the speed advantage of local binding (fewer symbol
lookups at library load time), while still exporting all their functions
from the shared lib, they unfortunately often end up first finding that
-Bsymbolic-functions "does what I want", without realizing that it creates
problems down the line."
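The PROTECTED visibility mentioned in the remark above can be requested per function with GCC's visibility attribute. A minimal sketch, assuming a GCC-compatible compiler targeting ELF; the macro API_PROTECTED and the functions lib_helper and lib_entry are hypothetical names chosen for illustration:

```c
/* Apply protected visibility only where the attribute is meaningful;
 * elsewhere the macro expands to nothing (assumption: __ELF__ is
 * defined by the compiler on ELF targets, as GCC and Clang do). */
#if defined(__GNUC__) && defined(__ELF__)
#define API_PROTECTED __attribute__((visibility("protected")))
#else
#define API_PROTECTED
#endif

/* Exported from the shared library, but references to it from inside
 * the library bind to this definition and cannot be interposed. */
API_PROTECTED int lib_helper(int x)
{
    return x + 1;
}

/* Another library function; when built into a shared object, its call
 * to lib_helper resolves locally. */
int lib_entry(int x)
{
    return lib_helper(x) * 2;
}
```

Unlike -Bsymbolic-functions, this marks individual symbols, so functions that are meant to stay overridable can be left with default visibility.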
Well, you could say it is a "hardening" option, as it ensures that your calls to in-library functions really end up there. But one issue that I found concerns the test suites of some projects.
For example, the libvirt test suite wants to call into the just-built libvirt0.so but also mock some of the calls that are made from there.
Because -Bsymbolic-functions is used in the build, the test breaks: the original function is called instead of the mocked one.
Example backtraces
Good case:
#0 virHostCPUGetThreadsPerSubcore (arch=VIR_ARCH_PPC64) at ../../../tests/virhostcpumock.c:30
#1 0x00007ffff7c1e4c4 in virHostCPUGetInfoPopulateLinux (cpuinfo=<optimized out>, arch=VIR_ARCH_PPC64, cpus=0x7fffffffdf38, mhz=<optimized out>, nodes=0x7fffffffdf40, sockets=0x7fffffffdf44, cores=0x7fffffffdf48, threads=0x7fffffffdf4c)
at ../../../src/util/virhostcpu.c:661
#2 0x0000555555557e6f in linuxTestCompareFiles (outputfile=0x55555558f150 "/build/libvirt-OUKR8i/libvirt-4.10.0/tests/virhostcpudata/linux-ppc64-subcores2.expected", arch=VIR_ARCH_PPC64,
cpuinfofile=0x5555555a3f10 "/build/libvirt-OUKR8i/libvirt-4.10.0/tests/virhostcpudata/linux-ppc64-subcores2.cpuinfo") at ../../../tests/virhostcputest.c:44
#3 linuxTestHostCPU (opaque=<optimized out>) at ../../../tests/virhostcputest.c:189
#4 0x000055555555914d in virTestRun (title=0x55555555c0a1 "subcores2", body=0x555555557cc0 <linuxTestHostCPU>, data=0x7fffffffe0c0) at ../../../tests/testutils.c:176
#5 0x000055555555781a in mymain () at ../../../tests/virhostcputest.c:263
#6 0x0000555555559df4 in virTestMain (argc=1, argv=0x7fffffffe2c8, func=0x5555555577b0 <mymain>) at ../../../tests/testutils.c:1114
#7 0x00007ffff79bb09b in __libc_start_main (main=0x5555555576a0 <main>, argc=1, argv=0x7fffffffe2c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe2b8) at ../csu/libc-start.c:308
#8 0x00005555555576ea in _start () at ../../../tests/virhostcputest.c:278
Bad case:
#0 virHostCPUGetThreadsPerSubcore (arch=arch@entry=VIR_ARCH_PPC64) at ../../../src/util/virhostcpu.c:1119
#1 0x00007ffff7c27e04 in virHostCPUGetInfoPopulateLinux (cpuinfo=<optimized out>, arch=VIR_ARCH_PPC64, cpus=0x7fffffffdea8, mhz=<optimized out>, nodes=0x7fffffffdeb0, sockets=0x7fffffffdeb4, cores=0x7fffffffdeb8, threads=0x7fffffffdebc)
at ../../../src/util/virhostcpu.c:661
#2 0x0000555555557e6f in linuxTestCompareFiles (outputfile=0x5555555a5c30 "/build/libvirt-4biJ7f/libvirt-4.10.0/tests/virhostcpudata/linux-ppc64-subcores2.expected", arch=VIR_ARCH_PPC64,
cpuinfofile=0x55555558fd20 "/build/libvirt-4biJ7f/libvirt-4.10.0/tests/virhostcpudata/linux-ppc64-subcores2.cpuinfo") at ../../../tests/virhostcputest.c:44
#3 linuxTestHostCPU (opaque=<optimized out>) at ../../../tests/virhostcputest.c:189
#4 0x000055555555914d in virTestRun (title=0x55555555c0a1 "subcores2", body=0x555555557cc0 <linuxTestHostCPU>, data=0x7fffffffe030) at ../../../tests/testutils.c:176
#5 0x000055555555781a in mymain () at ../../../tests/virhostcputest.c:263
#6 0x0000555555559df4 in virTestMain (argc=1, argv=0x7fffffffe238, func=0x5555555577b0 <mymain>) at ../../../tests/testutils.c:1114
#7 0x00007ffff79b009b in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x00005555555576ea in _start () at ../../../tests/virhostcputest.c:278
Compare the source for virHostCPUGetThreadsPerSubcore in those two and you will see the difference.
Other cases I have seen are:
static variables becoming multiple instances in singularity
segfaulting tests in sssd
autofs issues with global variables
Since the original question was about potential drawbacks, I thought it worth mentioning this fairly common category of related issues as well.
There are cases with side effects. A documented one:
https://bugs.launchpad.net/ubuntu/+source/xfe/+bug/644645
I would also like to figure out more about it, because I have such a case right now.
Building glibc with -Bsymbolic-functions is not recommended either. Here is the result I got:
Core was generated by `/home/lano1106/dev/packages/glibc/repos/core-i686/src/glibc-build/elf/ld-linux .'.
Program terminated with signal 11, Segmentation fault.
#0 0x400a3e90 in _int_free ()
from /home/lano1106/dev/packages/glibc/repos/core-i686/src/glibc-build/libc.so.6
(gdb) where
#0 0x400a3e90 in _int_free ()
from /home/lano1106/dev/packages/glibc/repos/core-i686/src/glibc-build/libc.so.6
#1 0x4016b94b in __libc_dlsym ()
from /home/lano1106/dev/packages/glibc/repos/core-i686/src/glibc-build/libc.so.6
#2 0x4004c2c7 in __gconv_find_shlib ()
from /home/lano1106/dev/packages/glibc/repos/core-i686/src/glibc-build/libc.so.6
#3 0x40042320 in find_derivation ()
from /home/lano1106/dev/packages/glibc/repos/core-i686/src/glibc-build/libc.so.6
#4 0x40042889 in __gconv_find_transform ()
from /home/lano1106/dev/packages/glibc/repos/core-i686/src/glibc-build/libc.so.6
#5 0x400d6f00 in __wcsmbs_load_conv ()
from /home/lano1106/dev/packages/glibc/repos/core-i686/src/glibc-build/libc.so.6
#6 0x400c86f6 in mbrtowc ()
from /home/lano1106/dev/packages/glibc/repos/core-i686/src/glibc-build/libc.so.6
#7 0x08048914 in ?? ()
#8 0x00000000 in ?? ()

Resources