Debug core file with no symbols - C

I have a C application we have deployed to a customer's site. It was compiled and runs on HP-UX. The user has reported a crash and we have obtained a core dump. So far, I've been unable to duplicate the crash in house.
As you would suspect, the core file/deployed executable is completely devoid of any sort of symbols. When I load it up in gdb and do a bt, the best I get is this:
(gdb) bt
#0 0xc0199470 in ?? ()
I can run 'strings core' on the file, but my understanding is that all that gives me is the strings present in the executable, so it seems nearly impossible to track anything down that way.
I do have a debug version (compiled with -g) of the executable, which is unfortunately a couple of months newer than the released version. If I try to load the core in gdb against that executable, I see this:
warning: exec file is newer than core file.
Core was generated by `program_name'.
Program terminated with signal 11, Segmentation fault.
__dld_list is not valid according to __dld_flags.
#0 0xc0199470 in ?? ()
(gdb) bt
#0 0xc0199470 in ?? ()
While it would be feasible to compile a debug version and deploy it at the customer's site and then wait for another crash, it would be relatively difficult and undesirable for a number of reasons.
I am quite familiar with the code and have a relatively good idea of where in code it is crashing based on the customer's bug report.
Is there ANY way I can glean any more information from this core dump? Via strings or another debugger or anything? Thanks.

This type of response from gdb:
(gdb) bt
#0 0xc0199470 in ?? ()
can also happen when the stack has been smashed by a buffer overrun: the return address is overwritten in memory, so the program counter is loaded with a seemingly random address.
This is one of the ways that even a build with a corresponding symbol database can produce a symbol lookup error (or strange-looking backtraces). If you still get this after you have the symbol table, your problem is likely that your customer's data is causing some issue in your code.
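To make the failure mode concrete, here is a minimal sketch (the function and buffer names are illustrative) of the classic pattern that produces exactly this kind of backtrace:

    #include <string.h>

    /* If 's' holds more than 15 characters, strcpy() writes past 'buf' and
     * over the saved return address; the function's return then jumps to
     * whatever bytes landed there, so gdb shows "#0 0x???????? in ?? ()". */
    void copy_name(const char *s)
    {
        char buf[16];
        strcpy(buf, s);   /* no bounds check */
    }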

For the future:
- Make sure that you always build with an external symbols database (this is not a debug build -- it's a release build, but you store the symbol table separately)
- Keep that database around for every version you deploy
For this situation:
You know the general area, so to check your hypothesis, go to the crash address and look at the assembly -- eyeball it and see whether you think it matches your source (this is easier if you have some idea of what assembly your source generates). If it looks right, you have some confirmation of your hypothesis. You might also be able to figure out the values of the local variables by looking at the raw stack (since you know what you passed in and declared).
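A sketch of the kind of gdb session this involves (gdb's $pc and $sp pseudo-registers are available on HP-UX targets as well):

    (gdb) info registers        # full register state at the moment of the crash
    (gdb) x/16i $pc - 32        # disassemble around the faulting instruction
    (gdb) x/64wx $sp            # raw stack words: look for plausible saved
                                # return addresses and recognizable arguments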

Under gdb, "info registers" should give you enough of the execution state at the time of the crash to use with a disassembly of the executable and relevant shared libraries. I usually use objdump to disassemble, redirect the output to a file, then bring up the file in my favorite editor - this is useful for keeping notes as things are figured out. Also, gdb's "info target" and "info sharedlib" can be useful for figuring out where shared libraries are loaded.
With register state, stack contents, and disassembly in hand along with a little luck, it should be straightforward (if tedious) to reconstruct the callstack (unless, of course, the stack has been trashed by a buffer overrun or similar catastrophe... might need an Ouija board or crystal ball in that case.)
You might also be able to correlate a disassembly of the newer version built with -g against the disassembly of the stripped version.
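For example (assuming GNU binutils is installed on the HP-UX box; the library path is illustrative):

    $ objdump -d program > program.asm          # the stripped binary: addresses, no names
    $ objdump -d /usr/lib/libc.sl > libc.asm    # repeat for each mapped shared library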

- Always use source control (CVS/Git/Subversion/etc.), even for test releases
- Tag all releases
- Consider (in the future) making a build with debugging (-g) and stripping the executable before shipping (see the sketch after this list). NOTE: don't make two builds, one with and one without -g; they may well not match up, since -g can occasionally cause different code to be generated even at the same optimization level. In super-performance-critical code you can forgo -g for the critical files - for most files it won't make a difference.
- If you're really stuck, dump the stack and the relevant parts of the heap to hex and look at them by hand; perhaps take an instrumented copy and look for similar "signatures" in the generated code and on the stack. This is real "old-school" debugging... :-)
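A minimal sketch of the build-then-strip flow from the -g item above (compiler, flags, and file names are placeholders):

    $ cc -g -O2 -o program main.c util.c   # one build, with symbols
    $ cp program program.debug             # archive this next to the release tag
    $ strip program                        # ship the stripped copy

    # ...months later, when a customer core file arrives:
    $ gdb program.debug core               # symbols line up: it is the same build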

Do you have the exact source that you used to compile the old version (e.g. through a tag in the source tree or something like that)? Maybe you could rebuild using that, and possibly get an insight into where the crash occurred?

Try running a "pmap" against the core file (if hp/ux has this tool). This should report the starting addresses of all modules in the core file. With this info, you should be able to take the address of the failure location and figure out what library crashed. Further address comparison between the crash address and the addresses of the known functions in the library ("nm" against the library should get that) may help you determine what function crashed.
Even if you do manage to identify the function at the top of the stack, it isn't very likely that this function is the source of the problem... hopefully it has actually crashed in your code and not, say, the standard C string library. Rebuilding the stack trace is the next-best thing at that point.

There is not much information here, since the binary is stripped. But looking at the segmentation fault, you should look for places where there is a possibility that you are overwriting a piece of memory.
This is just a suggestion; there can be many causes.
By the way, if you are not able to reproduce it on your local machine, then the volume of data at the customer's site might be the trigger.

I don't think the core file is supposed to contain symbols. You need to be able to build a version of your program that is exactly the same as what you shipped to your customer, but with -g. If you strip that debug executable, it should be identical to the shipped version. Only then can gdb give you anything useful.
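A way to verify that you have reproduced the shipped build (command names are the common GNU ones; HP-UX's native tools may spell things differently):

    $ cc -g <exact release flags and sources> -o program.debug
    $ cp program.debug program.check && strip program.check
    $ cmp program.check program.shipped    # byte-identical => symbols will line up
    $ gdb program.debug core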

Unreadable instruction at address

I get a segmentation fault in a certain scenario (it is C code with DEC VAX FMS (Forms Management System) calls to get a certain field on a CRT screen - pretty old legacy code). I am on an AIX machine, and have only dbx installed on it. GDB, Valgrind etc. are not available.
Here is what I get when I try to debug:
Unreadable instruction at address 0x53484950
I do not know how to proceed from here.
I have tried a few things:
(dbx) up
not that many levels
(dbx) down
not that many levels
(dbx) n
where
Segmentation fault in . at 0x53484950 ($t1)
0x53484950 (???) Unreadable instruction at address 0x53484950
Tried tracei (for machine instructions), dump (dump gives so much output, I am unable to make sense of it) etc., but nothing seems to help.
(dbx) &0x53484950/X
expected variable, found "1397246288"
I am used to getting a stack trace on "where" and going on from there. This is something I have not encountered before, and it appears I am not very good at dbx either. Any help to get to at least the line of code that is causing trouble is appreciated.
Once you have hit a segfault, there is no way to continue, so the n command is not going to do anything. At that point, all you can do is examine the stack and the variables, and that will be meaningless unless you have the source code and can recompile it. (It may be worth noticing that the bytes of the fault address 0x53484950 are the ASCII characters "SHIP": the program counter was almost certainly loaded from memory that had been overwritten with character data, i.e. the stack was smashed.)
In fact, without the source code, I am not sure how you could possibly proceed with fixing the program. Even if you could "decompile" the program, or at least disassemble the program, the risk of making a mistake when trying to patch the binary in order to fix it is virtually 100%.
I'm sorry. Given the limitations you are working under, I would argue that the problem is unsolvable. Without tools such as gdb or valgrind, it will be difficult to find the problem, and without the source code, it will be very difficult to fix once found.

What is the structure of a MPW tool's main symbol?

This question is about Mac OS Classic, which has been obsolete for several years now. I hope someone still knows something about it!
I've been building a PEF executable parser for the past few weeks and I've plugged a PowerPC interpreter to it. With a good dose of wizardry, I would expect to be able to run (to some extent) some Mac OS 9 programs under Mac OS X. In fact, I'm now ready to begin testing with small applications.
To help me with that, I have installed an old version of Mac OS inside SheepShaver and downloaded the (now free) MPW Tools [1], and I built a "hello world" MPW tool (just your classic puts("Hello World!") C program, except compiled for Mac OS 9).
When built, this generates a program with a code section and a data section. I expected that I would be able to just jump to the main symbol of the executable (as specified in the header of the loader section), but I hit a big surprise: the compiler placed the main symbol inside the data section.
Obviously, there's no executable code in the data section.
Going back to the Mac OS Runtime Architectures document (published in 1997, surprisingly still up on Apple's website), I found out that this is totally legal:
Using the Main Symbol as a Data Structure
As mentioned before, the main symbol does not have to point to a routine, but can point to a block of data instead. You can use this fact to good effect with plug-ins, where the block of data referenced by the main symbol can contain essential information about the plug-in. Using the main symbol in this fashion has several advantages:
- The Code Fragment Manager returns the address of the main symbol when you programmatically prepare a fragment, so you do not need to call FindSymbol.
- You do not have to reserve and document the specific name of an export for your plug-in.
However, not having a specific symbol name means that the plug-in's purpose is not quite as obvious. A plug-in can store its name, icon, or information about its symbols in the main symbol data structure. Storing symbolic information in this fashion eliminates the need for multiple FindSymbol calls.
My conclusion, therefore, is that MPW tools run as plugins inside the MPW shell, and that the executable's main symbol points to some data structure that should tell it how to start.
But that still doesn't help me figure out what's in that data structure, and just looking at its hex dump has not been very instructive (I have an idea where the compiler put the __start address for this particular program, but that's definitely not enough to make a generic MPW shell "replacement"). And obviously, most valuable information sources on this topic seem to have disappeared with Mac OS 9 in 2004.
So, what is the format of the data structure pointed by the main symbol of a MPW tool?
[1] Apparently, Apple very recently pulled the plug on the FTP server that I got the MPW Tools from, so it is probably not available anymore, though a Google search for "MPW_GM.img.bin" does find some alternatives.
As it turns out, it's not too complicated. That "data structure" is simply a transition vector.
I didn't realize it right away because of bugs in my implementation of the relocation virtual machine that made these two pointers look like garbage.
Transition vectors are structures that contain (in this order) an entry point (4 bytes) and a "table of contents" offset (4 bytes). This offset should be loaded into register r2 before executing the code pointed to by the entry point.
(The Mac OS Classic runtime only uses the first 8 bytes of a transition vector, but they can technically be of any size. The address of the transition vector is always passed in r12 so the callee may access any additional information it would need.)
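For concreteness, here is a minimal sketch of the layout just described (the struct and field names are mine, not Apple's):

    /* A PEF transition vector as used by the Mac OS Classic runtime.
     * Only the first 8 bytes are required; since the vector's own address
     * is passed in r12, a callee can reach any extra fields it defines. */
    struct TVector {
        void *entry_point;   /* address of the first instruction to execute */
        void *rtoc;          /* value the caller loads into r2 before the jump */
        /* ...optionally, more per-fragment data... */
    };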

Stack corruption; what is supposed to happen to a variable stored in a register across function calls?

I've been chasing down a crash that seems to be due to memory corruption. The setting is C, building using llvm for iOS.
The memory corruption is absent in debug mode and with optimization level 0 (-O0). In order to be able to step through the code, I've rebuilt with debug symbols but optimization level 1 (-O1). This is enough to reproduce the crash and it allows me to step through the program.
I've narrowed the problem down to a particular function call. Before it, the value of a certain pointer is correct. After it, the value is corrupted (it ends up equalling 0x02, for whatever that is worth).
LLDB doesn't seem to want to watch variables or memory locations. Switching to gdb, I find that if I try to print the address of the aforementioned variable, I encounter the following message: "Address requested for identifier 'x' which is in register $r4".
I understand that, as an optimization, the compiler may decide to keep the value of a variable in a register. True enough, if I print the value of $r4 before and after the function call, I see the correct value before and 0x02 after.
I'm in a bit over my head at this point and not sure how to split this into smaller problems. My questions are therefore these:
assuming that the compiler is storing the value of a variable in a register as an optimization, what is supposed to happen to that register when another function is invoked?
Is there some mechanism whereby the value is stored and restored once the new function returns?
Any recommendations on debugging techniques?
All help and suggestions appreciated. Links to reading material on the subject also quite welcome.
Thanks
EDIT: adding version information
iOS version: 5.1
llvm version: i686-apple-darwin10-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2377.00)
Xcode version: 4.3.1
Gdb version: GNU gdb 6.3.50-20050815 (Apple version gdb-1708)
Running on an iPhone 3Gs (crash does not appear in simulator)
Not a full answer, but
assuming that the compiler is storing the value of a variable in a register as an optimization, what is supposed to happen to that register when another function is invoked?
The register should be pushed to the stack by the callee, if the callee uses it: under the ARM AAPCS calling convention, r4 is a callee-saved register, so any function that touches it must save it in its prologue and restore it before returning.
Is there some mechanism whereby the value is stored and restored once the new function returns?
That depends on the calling convention, but in general, whoever pushed it onto the stack is responsible for popping it off again.
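One way to check this under gdb (the function name and addresses here are hypothetical) is to disassemble the callee and look at its prologue and epilogue:

    (gdb) disassemble suspect_function
       0x000020c0 <suspect_function+0>:    push  {r4, r5, lr}   ; r4 saved by the callee
       ...
       0x00002140 <suspect_function+128>:  pop   {r4, r5, pc}   ; ...and restored on return

If r4 appears in the push but its stack slot is overwritten before the matching pop (for example by an overrun of a local buffer inside the callee), the "restored" value is garbage - which would fit a pointer suddenly becoming 0x02.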
Last thing:
If you hit a case where things "work" at one optimization level and break at another, you very likely have undefined behavior in your code. If you can't find it yourself, you can ask about it here, posting the actual code.
Try using Valgrind if possible.
Also, try enabling -fstack-protector for your program.
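For instance (a sketch; -fstack-protector-all extends the checking to every function, and on a glibc system an overrun of a protected buffer then aborts with a "stack smashing detected" message instead of silently corrupting saved registers):

    $ gcc -g -fstack-protector-all -o myprog myprog.c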

Methods/Tools for solving a Mystery Segfault while running on condor

I'm writing a C application which is run across a compute cluster (using Condor). I've tried many methods to reveal the offending code, but to no avail.
Clues:
On average, when I run the code on 15 machines for 2 days, I get two or three segfaults (signal 11).
When I run the code locally I do not get a segfault. I ran it for nearly 3 weeks on my home machine.
Attempts:
I ran the code under Valgrind for four days locally with no memory errors.
I captured the segfault signal by defining my own signal handler so that I can output some of the program state.
Now when a segfault happens I can print out the current stack using backtrace.
I can print out variable values.
I created a variable which is set to the current line number.
I have also tried commenting chunks of the code out, hoping that if the problem goes away I will have located the offending code.
Sadly, the line number output is fairly random. I'm not entirely sure what I can do with the stack trace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?
Suspicions:
I suspect that the checkpointing system Condor uses to move jobs across machines is more sensitive to memory corruption, and that this is why I don't see the problem locally.
That indices are being corrupted by the bug, and that these corrupted indices are causing the segfault. This would explain why the segfaults occur on fairly random line numbers.
UPDATE
Researching this some more I've found the following links:
LibSegFault - a library for automatically catching and printing state data about segfaults.
Stack unwinding (stack trace) with GCC - a tutorial on catching segfaults and getting the line numbers of the offending instructions.
UPDATE 2
Greg suggested looking at the Condor log and correlating the segfaults with Condor restarting the executable from a checkpoint. Looking at the logs, the segfaults all occur immediately after a restart, and all of the failures appear to occur when a job switches from one type of machine to another.
UPDATE 3
The segfault was being caused by differences between hosts; by setting the 'requirements' field in the Condor submit file, the problem completely disappeared.
One can set individual machines:
requirements = machine == "hostname1" || machine == "hostname2"
or an entire class of machines:
requirements = classOfMachinesName
See the Condor documentation for more requirements examples.
If you can, compile with debugging and run under gdb.
Alternatively, get a core dumped and load that into the debugger.
MPICH has a built-in debugger, or you can buy a commercial parallel debugger.
Then you can step through the code in the debugger to see what is happening:
http://nmi.cs.wisc.edu/node/1610
http://nmi.cs.wisc.edu/node/1611
Can you create a core dump when your segfault happens? You can then debug this dump to try to figure out the state of the code when it crashed.
Look at what instruction caused the fault. Was it even a valid instruction, or are you trying to execute data? If valid, what memory is it trying to access, and where did that pointer come from? You need to narrow down the location of your fault (stack corruption, heap corruption, uninitialized pointer, accessing invalid memory). If it's a corruption, see if there's any tell-tale data in the corrupted area (pointers to symbols, data that looks like something in your structures, ...). Your memory allocator may already have built-in features to debug some corruption (see MALLOC_CHECK_ on Linux or MallocGuardEdges on Mac OS). A common case is using memory that has been free()'d, so logging your malloc()/free() pairs might help.
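A minimal sketch of that kind of pairing log (the wrapper names are mine; a real project might use a dedicated tool instead):

    #include <stdio.h>
    #include <stdlib.h>

    /* Wrappers that record every allocation and free with its call site,
     * so a post-crash scan of the log can spot a free with no matching
     * malloc, a double free, or a use of a pointer after its free line. */
    void *log_malloc(size_t n, const char *file, int line)
    {
        void *p = malloc(n);
        fprintf(stderr, "malloc %lu -> %p at %s:%d\n",
                (unsigned long)n, p, file, line);
        return p;
    }

    void log_free(void *p, const char *file, int line)
    {
        fprintf(stderr, "free %p at %s:%d\n", p, file, line);
        free(p);
    }

    #define MALLOC(n) log_malloc((n), __FILE__, __LINE__)
    #define FREE(p)   log_free((p), __FILE__, __LINE__)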
If you have used the condor_compile tool to relink your code with the Condor checkpointing code, it does a few things differently than a normal link. Most importantly, it statically links your code and uses its own malloc. Another big difference is that Condor will then run it on a foreign machine, where the environment may be different enough from what you expect to cause problems.
The executable generated by condor_compile is runnable as a standalone binary outside of the condor system. If you run the binary emitted from condor_compile locally, outside of condor, do you still see the segfaults?
If it doesn't, can you correlate the segfaults with when Condor restarts the executable from a checkpoint (the user log will tell you when this happens)?
You've tried most of what I'd think of. The only other thing I'd suggest is start adding a lot of logging code and hope you can narrow down where the error is happening.
The one thing you do not say is how much flexibility you have to solve the problem.
Can you, for example, have the system come to a halt and just run your application?
Also, how important are these crashes to solve? I am assuming that for the most part they are. Solving this may require a lot of resources.
The short-term step is to put tons of "asserts" (semi-handwritten, as sketched below) on each variable to make sure it hasn't changed when you don't want it to. This can continue to work as you go through the long-term process.
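One way to write those semi-handwritten asserts (a sketch; the macro name is mine):

    #include <stdio.h>
    #include <stdlib.h>

    /* Like assert(), but never compiled out; abort() leaves a core whose
     * stack points at the first moment the bad state was observed. */
    #define CHECK(cond)                                             \
        do {                                                        \
            if (!(cond)) {                                          \
                fprintf(stderr, "CHECK failed at %s:%d: %s\n",      \
                        __FILE__, __LINE__, #cond);                 \
                abort();                                            \
            }                                                       \
        } while (0)

Used as, e.g., CHECK(index >= 0 && index < table_size); after every suspect update.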
Long term: try running it on a cluster of two (maybe your home computer and a VM). Do you still see the segfaults? If not, increase the cluster size until you start seeing them.
Run it on a minimum configuration (that still produces segfaults) and record all your inputs up to a crash. Then automate running the system with the recorded inputs, tweaking them until you can consistently get a crash with minimal input.
At that point look around. If you still can't find the bug, then you will have to ask again with the extra data you gathered from those runs.

C code on Linux under gdb runs differently if run standalone?

I have built a plain C application on Linux (Fedora) using the CodeSourcery toolchain. It targets ARM Cortex-A8 and runs on a Cortex-A8 board under embedded Linux.
When I run this code for a test case which does a large (10MB) dynamic memory allocation (malloc), it crashes after some time with the error message below:
select 1 (init), adj 0, size 61, to kill
select 1030 (syslogd), adj 0, size 64, to kill
select 1032 (klogd), adj 0, size 74, to kill
select 1227 (bash), adj 0, size 378, to kill
select 1254 (ppp), adj 0, size 1069, to kill
select 1255 (TheoraDec_Corte), adj 0, size 1159, to kill
send sigkill to 1255 (TheoraDec_Corte), adj 0, size 1159
Program terminated with signal SIGKILL, Killed.
Then, when I debug this code for the same test case using a gdb built for the target, at the point where this dynamic memory allocation happens, malloc fails to allocate the memory and returns NULL. But during a normal stand-alone run, malloc presumably also fails to allocate, yet strangely does not return NULL; instead the program crashes and the OS kills my process.
Why is this behaviour different when run under gdb and when run without the debugger?
Why would malloc fail yet not return NULL? Is this possible, or is the error message I am getting caused by something else?
How do I fix this?
thanks,
-AD
So, for this part of the question, there is a surefire answer:
Why would malloc fail yet not return NULL? Is this possible, or is the error message I am getting caused by something else?
In Linux, by default, the kernel interfaces for allocating memory almost never fail outright. Instead, they set up your page table so that on the first access to the memory you asked for, the CPU generates a page fault, at which point the kernel handles it and looks for physical memory to back that (virtual) page. So, in an out-of-memory situation, you can ask the kernel for memory and it will "succeed"; the first time you try to touch the memory it returned is when the allocation actually fails, killing your process. (Or perhaps some other unfortunate victim; there are heuristics for choosing one that I'm not incredibly familiar with. See "oom-killer".)
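A minimal sketch demonstrating the difference (the 10MB size mirrors the question; what happens at the memset depends on the kernel's overcommit setting):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t sz = 10 * 1024 * 1024;   /* 10MB, as in the question */
        char *p = malloc(sz);
        if (p == NULL) {                /* strict accounting fails here */
            fprintf(stderr, "malloc returned NULL\n");
            return 1;
        }
        /* With overcommit, pages are only faulted in on first touch;
         * this is where the OOM killer may SIGKILL the process. */
        memset(p, 0, sz);
        puts("all pages touched");
        free(p);
        return 0;
    }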
Some of your other questions, the answers are less clear for me.
Why is this behaviour different when run under gdb and when run without the debugger? It could be (just a guess, really) that GDB has its own malloc, and is tracking your allocations somehow. On a somewhat related point, I've frequently found that heap bugs in my code aren't reproducible under debuggers. This is frustrating and makes me scratch my head, but it's basically something I've pretty much figured one has to live with...
How do I fix this?
This is a bit of a sledgehammer solution (it changes the behavior for all processes rather than just your own, and it's generally not a good idea to have your program alter global state like that), but you can write the string 2 to /proc/sys/vm/overcommit_memory to turn strict accounting on.
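For example, as root (this is system-wide, so use it with care):

    # echo 2 > /proc/sys/vm/overcommit_memory   # 2 = strict accounting; malloc
                                                # now returns NULL rather than
                                                # overcommitting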
Failing that... I'd just make sure you're not allocating more than you expect to.
By definition, running under a debugger is different from running standalone. Debuggers can and do hide many bugs. Compiling for debugging can add a fair amount of code, similar to compiling completely unoptimized (allowing you to single-step or watch variables, for example), whereas compiling for release can remove debugging aids and code you relied on; there are many optimization traps you can fall into. I don't know from your post who controls the compile options or what they are.
Unless you plan to deliver the product to be run under the debugger you should do your testing standalone. Ideally do your development without the debugger as well, saves you from having to do everything twice.
It sounds like a bug in your code, slowly re-read your code using new eyes as if you were explaining it to someone, or perhaps actually explain it to someone, line by line. There may be something right there that you cannot see because you have been looking at it the same way for too long. It is amazing how many times and how well that works.
It could also be a compiler bug. Doing things like printing out the return value, or not, can cause the compiler to generate different code. Adding another variable and saving the result to that variable can kick the compiler into doing something different. Try changing the compiler options: reduce or remove the optimization options, reduce or remove the debugger compile options, etc.
Is this a proven system, or are you developing on new hardware? Try running without any of the caches enabled, for example. If it works in the debugger but not standalone, and it is not a compiler bug, it can be a timing issue: single-stepping flushes the pipeline, mixes the cache up differently, and gives the cache and memory system an eternity to come up with a result which it doesn't have in real time.
In short, there is a very long list of reasons why running under a debugger hides bugs that you cannot find until you test in an environment like that of the final deliverable; I have only touched on a few. Having it work in the debugger and not standalone is not unexpected; it is simply how the tools work. Based on the description you have given so far, it is likely your code, the hardware, or your tools.
The fastest way to rule out your code and the tools is to disassemble the section and inspect how the passed values and return values are handled. If the return value is optimized out, there is your answer.
Are you linking against the shared C library or the static one? Perhaps try linking statically...
