I'm running out of good ideas on how to crack this bug. I have 1000 lines of code that crashes every 2 or 3 runs. It is currently a prototype command line application written in C. An issue is that it's proprietary and I cannot give you the source, but I'd be happy to send a debug compiled executable to any brave soul on a Debian Squeeze x86_64 machine.
Here is what I got so far:
When I run it in GDB, it always completes successfully.
When I run it in Valgrind, it always completes successfully.
The issue seems to emanate from a recursive function call that is very basic. In an effort to pinpoint the error in this recursive function I wrote the same function in a separate application. It always completes successfully.
I built my own gcc 4.7.1 compiler, compiled my code with it and I'm still getting the same behavior.
I FTPed my application to another machine to eliminate the risk of HW issues and I still get the same behavior.
I FTPed my source code to another machine to eliminate the risk of a corrupt build environment and I still get the same behavior.
The application is single threaded and does no signal handling that might cause race conditions. I memset(,0,) all large objects.
There are no exotic dependencies; ldd gives me this:
ldd tst
linux-vdso.so.1 => (0x00007fff08bf0000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007fe8c65cd000)
libm.so.6 => /lib/libm.so.6 (0x00007fe8c634b000)
libc.so.6 => /lib/libc.so.6 (0x00007fe8c5fe8000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe8c67fc000)
Are there any tools out there that could help me?
What would be your next step if you were in my position?
Thanks!
This is what got me going in the right direction: -Wextra. I was already using -Wall.
THANKS!!! This was really driving me crazy.
I suggested in comments :
to compile with -Wall -Wextra and improve the source code till no warnings are given;
to compile with both -g and -O; this is helpful to inspect dumped core files with gdb (you may want to set a big enough coredump size limit with e.g. ulimit bash builtin)
to show your code to a colleague and explain the issue?
to use ltrace or strace
Apparently -Wextra was helpful. It would be nice to understand why and how.
BTW, for larger programs, you could even add your own warnings to GCC by extending it with MELT; this may take days and is worthwhile mostly in big projects.
In this case, I think that you have some memory problems (read the output of Valgrind carefully), because GDB and Valgrind change the original program by adding their own memory tracking, so your original addresses are changed. You can compile with the -ggdb option, enable core dumps (ulimit -c unlimited), and then try to analyze what's going on from the core file. This link may help you:
http://en.wikipedia.org/wiki/Unusual_software_bug
Regards.
I understand that symbol tables are created by the compiler to help with its process.
They exist per object file for when they are being linked together.
Assume:
void test(void){
//
}
int main(void){
return 0;
}
compiling above with gcc and running nm a.out shows:
0000000100000fa0 T _main
0000000100000f90 T _test
Why are these symbols still needed? why doesn't the linker remove them once done? aren't they potentially a security risk for hackers to read the source?
Edit
Is this what you mean by debugging a release binary (the ones compiled without -g)?
Assume:
int test2(){
int *p = (int*) 0x123;
return *p;
}
int test1(){
return test2();
}
int main(){
return test1();
}
This segfaults in test2. Running gdb ./a.out and then where shows:
(gdb) where
#0 0x000055555555460a in test2 ()
#1 0x000055555555461c in test1 ()
#2 0x000055555555462c in main ()
But stripping a.out and doing the same shows:
(gdb) where
#0 0x000055555555460a in ?? ()
#1 0x000055555555461c in ?? ()
#2 0x000055555555462c in ?? ()
Is this what you mean by keeping symbol tables for debugging release builds? is this the normal way of doing it? are there other tools used?
Why are these symbols still needed?
They are not needed for correctness of execution, but they are helpful for debugging.
Some programs can record their own stack trace (e.g. TCMalloc performs allocation sampling), and report it on crash (or other kind of errors).
While all such stack traces could be symbolized off-line (given a binary which did contain symbols), it is often much more convenient for the program to produce symbolized stack trace, so you don't need to find a matching binary.
Consider a case where you have 1000s of different applications running in the cloud at multiple versions, and you get 100 reports of a crash. Are they the same crash, or are there different causes?
If all you have are bunches of hex numbers, it's hard to tell. You'd have to find a matching binary for each instance, symbolize it, and compare to all the other ones (automation could help here).
But if you have the stack traces in symbolized form, it's pretty easy to tell at a glance.
This does come with a little bit of cost: your binaries are perhaps 1% larger than they have to be.
why doesn't the linker remove them once done?
You have to remember traditional UNIX roots. In the environment in which UNIX was developed everybody had access to the source for all UNIX utilities (including ld), and debuggability was way more important than keeping things secret. So I am not at all surprised that this default (keep symbols) was chosen.
Compare this to the choice made by Microsoft: keep everything in .DBG (later .PDB) files.
aren't they potentially a security risk for hackers to read the source?
They are helpful in reverse engineering, yes. They don't contain the source, so unless the source is already open, they don't add that much.
Still, if your program contains something like CheckLicense(), this helps hackers to concentrate their efforts on bypassing your license checks.
Which is why commercial binaries are often shipped fully-stripped.
Update:
Is this what you mean by keeping symbol tables for debugging release builds?
Yes.
is this the normal way of doing it?
It's one way of doing it.
are there other tools used?
Yes: see best practice below.
P.S. The best practice is to build your binaries with full debug info:
gcc -c -g -O2 foo.c bar.c
gcc -g -o app.dbg foo.o bar.o ...
Then keep the full debug binary app.dbg for when you need to debug crashes, but ship a fully-stripped version app to your customers:
strip app.dbg -o app
P.P.S.
gcc -g is used for gdb. gcc without -g still has symbol tables.
Sooner or later you will find out that you must perform debugging on a binary that is built without -g (such as when the binary built without -g crashes, but one built with -g does not).
When that moment comes, your job will be much easier if the binary still has a symbol table.
I'm running OS X 10.12 and I'm developing a basic text-based operating system. I have developed a boot loader and that seems to be running fine. My only problem is that when I attempt to compile my kernel into pure binary, the linker won't work. I have done some research and I think that this is because of the fact OS X runs the Darwin linker and not the GNU linker. Because of this, I have downloaded and installed the GNU binutils. However, it still won't work...
Here is my kernel:
void main() {
// Create pointer to a character and point it to the first cell of video
// memory (i.e. the top-left)
char* video_memory = (char*) 0xb8000;
// At that address, put an x
*video_memory = 'x';
}
And this is when I attempt to compile it:
Hazims-MacBook-Pro:32 bit root# gcc -ffreestanding -c kernel.c -o kernel.o
Hazims-MacBook-Pro:32 bit root# ld -o kernel.bin -T text 0x1000 kernel.o --oformat binary
ld: unknown option: -T
Hazims-MacBook-Pro:32 bit root#
I would love to know how to solve this issue. Thank you for your time.
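-T is an option of the GNU linker (gcc also accepts it and passes it on to the linker). The "ld: unknown option: -T" message suggests that the ld you are running is still Apple's linker, which does not support it. A common approach is to link through a cross-compiler instead. Have a look at this: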
With these components you can now actually build the final kernel. We use the compiler as the linker as it allows it greater control over the link process. Note that if your kernel is written in C++, you should use the C++ compiler instead.
You can then link your kernel using:
i686-elf-gcc -T linker.ld -o myos.bin -ffreestanding -O2 -nostdlib boot.o kernel.o -lgcc
Note: Some tutorials suggest linking with i686-elf-ld rather than the compiler, however this prevents the compiler from performing various tasks during linking.
The file myos.bin is now your kernel (all other files are no longer needed). Note that we are linking against libgcc, which implements various runtime routines that your cross-compiler depends on. Leaving it out will give you problems in the future. If you did not build and install libgcc as part of your cross-compiler, you should go back now and build a cross-compiler with libgcc. The compiler depends on this library and will use it regardless of whether you provide it or not.
This is all taken directly from OSDev, which documents the entire process, including a bare-bones kernel, very clearly.
You're correct in that you probably want binutils for this, especially if you're coding bare metal. While clang as is purports to be a cross compiler, it's far from optimal or usable here, for various reasons. I infer that you're developing on ARM, in which case you want this:
https://developer.arm.com/open-source/gnu-toolchain/gnu-rm
Aside from the fact that gcc does this markedly better than clang, there's also the issue that ld from the binutils package does not build on OS X. In some configurations it fails silently, so you may never have actually installed it despite watching libiberty etc. build; it will sometimes even go through the motions of compiling the source for that target and then just refuse to link it. To the fellow with the lousy tone blaming the OP: if you had relevant experience, i.e. had ever built this under these conditions, you would know that is patently obnoxious. It'd be nice if you'd refrain from discouraging people from asking legitimate questions.
In the CXXfilt package they mumble about apple-darwin not being a target; try changing FAKE_TARGET from mn10003000-whatever, or whatever they used, to apple-rhapsody some time.
You're still in way better shape just building them from current if you, say, need to strip relocations from something or want to work on restoring static linkage to the system, which is missing by default from that clang installation as well. Anyhow, it's not really that ld couldn't work with Mach-O; it's all there codewise, in fact. That I am sure of.
Regarding locating things in memory, you may want to refer to a linker script
http://svn.screwjackllc.com/?p=noid.git;a=blob_plain;f=new_mbed_bs.link_script.ld
I have some code in there that places things directly in memory; rather than doing it on the command line, it is more reproducible to go with the linker script. It's a little complex, but what it does is set up a couple of regions of memory to be used with my memory allocators. You can use malloc, but you should prefer not to use actual malloc; dynamic memory is fine when it isn't dynamic... heh...
The script also sets flags for the stack and heap locations, although they are just markers and are not loaded until go time. The stack and heap actually get placed by the startup code, which is in assembly and rather readable and well commented (hard to believe, I know). Neat trick: you have some persistence in volatile memory, so I set aside a very tiny bit to flip, and you can do things like have it control which bootloader to run on the next power cycle. Again, you are 100% correct regarding the linker; it seems you are headed in the right direction.
Incidentally, there is another way to modify objects prior to loading them and to preload things in memory, similar to this method. There are a ton of ways, but check out objcopy and objdump. You can use gdb to dump S-records of structures in memory, note the address, and then, after assembly but before linking, use dd to insert the records you extracted with gdb back into the extracted sections. That is one of my favorite ways, just because it is the smartass route :D Also, if you are ever tight on memory and need to precalculate constants, it's one way to optimize things; that way is actually closer to what ld is doing, just done by hand. The path of least resistance on this now, though, is probably the linker script.
I have used dlsym to create a malloc/calloc wrapper in the efence code so as to be able to access the libc malloc (occasionally, apart from the efence malloc/calloc). Now when I link it and run it, it gives the following error: "RTLD_NEXT used in code not dynamically loaded"
bash-3.2# /tool/devel/usr/bin/gcc -g -L/tool/devel/usr/lib/ efence_time_interval_measurement_test.c -o dev.out -lefence -ldl -lpthread
bash-3.2# export LD_LIBRARY_PATH=/tool/devel/usr/lib/
bash-3.2# ./dev.out
eFence: could not resolve 'calloc' in 'libc.so': RTLD_NEXT used in code not dynamically loaded
Now, if I use "libefence.a", this is what happens:
bash-3.2# /tool/devel/usr/bin/gcc -g -L/tool/devel/usr/lib/ -static efence_time_interval_measurement_test.c -o dev.out -lefence -ldl -lpthread
/tool/devel/usr/lib//libefence.a(page.o): In function `stringErrorReport':
/home/raj/eFence/BUILD/electric-fence-2.1.13/page.c:50: warning: `sys_errlist' is deprecated; use `strerror' or `strerror_r' instead
/home/raj/eFence/BUILD/electric-fence-2.1.13/page.c:50: warning: `sys_nerr' is deprecated; use `strerror' or `strerror_r' instead
/tool/devel/usr/lib//libc.a(malloc.o): In function `__libc_free':
/home/rpmuser/rpmdir/BUILD/glibc-2.9/malloc/malloc.c:3595: multiple definition of `free'
/tool/devel/usr/lib//libefence.a(efence.o):/home/raj/eFence/BUILD/electric-fence-2.1.13/efence.c:790: first defined here
/tool/devel/usr/lib//libc.a(malloc.o): In function `__libc_malloc':
/home/rpmuser/rpmdir/BUILD/glibc-2.9/malloc/malloc.c:3551: multiple definition of `malloc'
/tool/devel/usr/lib//libefence.a(efence.o):/home/raj/eFence/BUILD/electric-fence-2.1.13/efence.c:994: first defined here
/tool/devel/usr/lib//libc.a(malloc.o): In function `__libc_realloc':
/home/rpmuser/rpmdir/BUILD/glibc-2.9/malloc/malloc.c:3647: multiple definition of `realloc'
/tool/devel/usr/lib//libefence.a(efence.o):/home/raj/eFence/BUILD/electric-fence-2.1.13/efence.c:916: first defined here
Please help me. Is there any problem in linking?
NO ONE IN STACK OVERFLOW WHO CAN RESOLVE THIS
The problem is with your question, not with us ;-)
First off, efence is most likely the wrong tool to use on a Linux system. For most bugs that efence can find, Valgrind can find them and describe them to you (so you could fix them) much more accurately. The only good reason for you to use efence is if your application runs for many hours, and Valgrind is too slow.
Second, efence is not intended to work with static linking, so the errors you get with -static flag are not at all surprising.
Last, you didn't tell us what libc is installed on your system (in /lib), and what libraries are present in /tool/devel/usr/lib/. It is exceedingly likely that there is a libc.so.6 present in /tool/devel/usr/lib/, and that its version does not match the one installed in /lib.
That would explain the RTLD_NEXT used in code not dynamically loaded error. The problem is that glibc consists of multiple binaries, which all must match exactly. If the system has e.g. libc-2.7 installed, then you are using /lib/ld-linux.so.2 from glibc-2.7 (the dynamic loader is hard-coded into every executable and is not affected by environment variables), and mixing it with libc.so.6 from glibc-2.9. The usual result of doing this is a SIGSEGV, weird unresolved symbol errors, and other errors that make no sense.
I am using a GCC cross-compiler to compile for an ARM platform. I have a problem where using optimization -O3 gives me a "bad immediate value for offset (4104)" error on a temp file, ccm4baaa.s. I can't find this file either.
How do I debug this, or find the source of the error? I know that it's located somewhere in hyper.c, but it's impossible to find it because there are no errors showing in hyper.c. Only the cryptic error message above.
Best Regards
Mr Gigu
There have been similar known bugs in previous releases of GCC. It might just be a matter of updating your version of the GCC toolchain. Which one are you using currently?
In order to debug the problem and find the offending source, in these cases it helps to add the gcc option -save-temps to the compilation. The effect is that the compiler keeps the intermediate assembly files (and the pre-processor output) for you to examine.
Plus, the program runs on an ARM device running Linux; I can print out stack info and register values in the SIGSEGV handler I install.
The problem is that I can't add the -g option when compiling the source, since the bug may not reproduce due to the performance downgrade.
Compiling with the -g option to gcc does not cause a "performance downgrade". All it does is cause debugging symbols to be included; it does not affect the optimisation or code generation.
If you install your SIGSEGV handler using the sa_sigaction member of the sigaction struct passed to sigaction(), then the si_addr member of the siginfo_t structure passed to your handler contains the faulting address.
I tend to use valgrind which indicates leaks and memory access faults.
This seems to work
http://tlug.up.ac.za/wiki/index.php/Obtaining_a_stack_trace_in_C_upon_SIGSEGV
static void signal_segv(int signum, siginfo_t* info, void*ptr) {
// info->si_addr is the illegal address
}
If you are worried about using -g on the binary that you load on the device, you may be able to use gdbserver on the ARM device with a stripped version of the executable and run arm-gdb on your development machine with the unstripped version of the executable. The stripped version and the unstripped version need to match up to do this, so do this:
# You may add your own optimization flags
arm-gcc -g program.c -o program.debug
arm-strip --strip-debug program.debug -o program
# or
arm-strip --strip-unneeded program.debug -o program
You'll need to read the gdb and gdbserver documentation to figure out how to use them. It's not that difficult, but it isn't as polished as it could be. Mainly it's very easy to accidentally tell gdb to do something that it ends up thinking you meant to do locally, so it will switch out of remote debugging mode.
You may also want to use the backtrace() function, if available; it provides the call stack at the time of the crash. You can use it to dump the stack, much as a high-level programming language does, when a C program gets a segmentation fault, bus error, or other memory violation error.
backtrace() is available on both Linux and Mac OS X.
If the -g option makes the error disappear, then knowing where it crashes is unlikely to be useful anyway. It's probably writing to an uninitialized pointer in function A, and then function B tries to legitimately use that memory, and dies. Memory errors are a pain.