How to debug an issue where memory is changed randomly - c

My application is a multi-thread program that runs on Solaris.
Recently, I found that it may crash. The reason is that one member of a pointer array is changed from a valid value to NULL, so accessing it crashes the program.
Because the occurrence rate is very low (it has occurred only twice in the past 2 months) and the changed members of the array aren't the same each time, I can't find steps to reproduce it, and reviewing the code has not turned up any useful clue.
Could anyone give some advice on how to debug an issue where memory is changed randomly?

Since you aren't able to reproduce the crash, debugging it isn't going to be easy.
However, there are some things you can do:
1. Go through the code and make a list of all of the places in the code that write to that variable, particularly the ones that could write a NULL to it. It's likely that one of them is your culprit.
2. Try to develop some kind of torture test that makes the fault more likely to occur (e.g. running through simulated or random transactions at top speed). If you can reproduce the crash this way you'll be in a much better situation, as you can then analyze the actual cause of the crash instead of just speculating.
3. If possible, run the program under valgrind or purify or similar. If they give any warnings, track down what is causing those warnings and fix it; it's possible that your program is e.g. accessing memory that has been freed, which might seem to work most of the time (if the freed memory hasn't been reused for anything when it is accessed) but would fail occasionally (when something is reusing it).
4. Add a memory checker like Electric Fence to your code, or just replace free() with a custom version that overwrites the freed memory with random garbage in the hopes that this will make the crash more likely to occur.
5. Recompile your program using different compilers (especially new/fancy ones like clang++ with the static analyzer enabled) and fix whatever they warn about. This may point you to your problem.
6. Run the program on different hardware and OSes; sometimes an obscure problem under one OS gives really obvious symptoms on another.
7. Review the various machines where the crash is known to have occurred. Do they all have anything in common? What about the machines where it hasn't crashed? Is there something different about them?
Step 2 is really the most important one, because even if you think you have fixed the problem, you won't be able to prove it unless you can reproduce the crash in the old code, and cannot reproduce it with the fixed code. Without being able to reproduce the fault, you're just guessing about whether a particular code change actually helps or not.
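As a rough illustration of the custom free() idea from the list above, here is a minimal sketch. The names dbg_malloc, dbg_free, dbg_poison and POISON_BYTE are made up for this example. It uses a fixed fill pattern rather than random garbage so that a stale pointer is easy to recognize in a debugger. It assumes a C11 compiler (for max_align_t) and that every block passed to dbg_free came from dbg_malloc.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0xDD   /* hypothetical fill pattern, easy to spot in a dump */

/* Header stored in front of every block so dbg_free knows how much to poison.
   The union keeps the user pointer suitably aligned for any type. */
typedef union {
    size_t size;
    max_align_t align;
} dbg_header;

void *dbg_malloc(size_t size) {
    dbg_header *h;
    if (size > SIZE_MAX - sizeof *h) return NULL;   /* avoid overflow */
    h = malloc(sizeof *h + size);
    if (h == NULL) return NULL;
    h->size = size;
    return h + 1;                                   /* user data follows header */
}

/* Size that was requested for a block returned by dbg_malloc. */
size_t dbg_size(const void *p) {
    return ((const dbg_header *)p - 1)->size;
}

/* Overwrite the user-visible part of the block with the poison pattern. */
void dbg_poison(void *p) {
    memset(p, POISON_BYTE, dbg_size(p));
}

/* Poison first, then release, so any later use of a stale pointer reads
   obvious garbage instead of plausible old data. */
void dbg_free(void *p) {
    if (p == NULL) return;
    dbg_poison(p);
    free((dbg_header *)p - 1);
}
```

With something like this in place, a pointer loaded from freed memory shows up as 0xDDDD... rather than a value that happens to still look valid, which tends to turn a rare crash into a frequent one.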

Related

What are best practices for finding a bug in a C program that only shows up in optimized build

My program uses a third-party library that throws a segmentation fault at some point. I tried to compile the library with debug symbols and without compiler optimization, and the crash went away. My suspicion is that compiler optimizations revealed this bug. What are best practices for debugging cases like this?
EDIT - (corrected the statement above: "revealed" instead of "caused")
I think I was misunderstood. I didn't intend to blame the compiler, or anything like that. I only asked for best practices for finding a bug in such a situation, where I don't have debug symbols in the 3rd party library (the crash backtrace leads to the 3rd party library).
What you describe is quite common. And it's almost never ever a bug in the compiler optimization. Optimization does a lot of things to your code. Variables get reordered/optimized away etc. If you have one buffer overflow, it might just overflow memory that's no big deal in the debug build, but that memory is very important in the optimized build.
Use valgrind to track down memory errors - they're almost always the cause of the symptoms you see.
Your suspicion is that optimization caused a bug. My suspicion is that your code has constructs that lead to Undefined Behavior, and when the optimizer is on, this Undefined Behavior manifests itself as erroneous behavior or crash. Don't blame the optimizer. Find UB in your code... might be tricky, though. Possible culprits:
Out-of-bounds index
Returning the address of a temporary
A zillion other things
Compile with debug symbols and compiler optimization, it will "hopefully" fail as well. Allow the system to generate a core file (ulimit -c unlimited, then re-run the program). Load the core file into gdb to see what happened.
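If you can't control the shell that launches the program, roughly the same effect as ulimit -c unlimited can be had from inside the program with setrlimit(). This is only a sketch. The function name enable_core_dumps is made up, and it can only raise the soft limit as far as the hard limit allows, so it won't help if an administrator has set the hard limit to 0.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
   Returns 0 on success, -1 if either system call fails. */
int enable_core_dumps(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0) return -1;
    rl.rlim_cur = rl.rlim_max;   /* as high as we are allowed to go */
    if (setrlimit(RLIMIT_CORE, &rl) != 0) return -1;
    return 0;
}
```

Call it early in main(), before the code that crashes.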
Another powerful tool is valgrind. Run your program within valgrind with the option --db-attach=yes and it will stop and run the debugger as soon as it detects an invalid read or write (newer Valgrind releases dropped this option in favour of the built-in gdbserver, e.g. --vgdb-error=0). Invalid reads/writes are likely to provoke a segfault, and even if they don't, they should be removed anyway.
Good luck,
Keep putting debug statements or messageboxes in the place you think the code is crashing. The crash will occur between two messageboxes and this will help you locate the faulty code as long as the code wasn't changed too much.
Also comment out blocks of code until the crash stops coming. Keep commenting back in until the crash returns. What you last commented back in must be causing the crash, directly or indirectly.
Both of these methods are useful for general debugging and half your work is already done if you are able to reliably reproduce the crash.
I did not give specific advice for debugging compiler optimisations because it's highly unlikely the crash is caused by that. The optimisations are generally tested very robustly to ensure they do not change the function or semantics of the code in any way.
If the backtrace leads to the third-party library, use gdb to break before the library call. Verify that the parameters you're passing to the library are valid (i.e., aren't uninitialized pointers, aren't pointers to free'd memory, aren't out of range, etc.)
Can you use strace to trace the system calls and then try to determine the execution path in the third-party library? Use a printf or some other system call before the failing library call so you have a starting point in the strace output.
If you really think it's a bug in the third-party library, you'll have to compile it with optimizations on so you can reproduce the failure. Are you saying that your compiler can only include debug symbols for non-optimized builds? gdb should still work for optimized builds.
Well, going through the compiled binary isn't going to help.
So that leaves going through your code to find out what part is causing the segfault. I would just work through your code manually and start commenting things out. Once you find what's causing the error, then you can determine what to do with it. It might be worth adding printfs in select locations to see exactly where the program fails.
Think of it as doing a binary search for the error ;)
If it only blows up when you turn on optimization, then that's a strong hint you've invoked undefined behavior somewhere. Unfortunately, that UB may be nowhere near the code that actually generated the segfault (as I've discovered several times in the past).
Every time this has happened to me (which hasn't been that often), the cause was a buffer overflow somewhere else in the code. I never developed a repeatable, generally applicable technique for finding the problem, though (unless you want to call hours stepping through a debugger and swearing a generally applicable technique).

Boehm GC: how to effectively debug smashed heap objects?

When running my program I get the following errors from the Boehm GC (with GC_DEBUG defined):
GC_check_heap_block: found smashed heap objects:
0x8ef1008 in or near object at 0x8ef1010(<smashed>, appr. sz = 29)
0x8ef1188 in or near object at 0x8ef1190(<smashed>, appr. sz = 29)
...
The above continues about 20 times.
Oddly, I can't find anything wrong with the program, it does what it is supposed to, and does not crash.
I can compile my program disabling the GC. Then I can run valgrind with it, but oddly enough, valgrind doesn't find any problems!
Could it be a problem within Boehm GC -- should I just ignore it?
Does anyone have any ideas how to effectively debug this?
Or, can anyone explain what precisely the above message means?
To answer my own question more than 3 months later...
I've tried logging every pointer into a file, and comparing with pointers that gave the smashed warning. However, that didn't lead anywhere, the suspect pointers were coming from various allocations all over the codebase (no one particular place that was maybe broken).
In the meantime, without GC, valgrind didn't report any errors, but of course that doesn't mean it's not possible errors still exist.
However, I figured I'd try if this particular version of the GC has a subtle bug maybe. I was using the latest stable version GC 7.1. I upgraded to 7.2alpha4, and the problem went away!
If someone runs across this, hopefully this will help.

Methods/Tools for solving a Mystery Segfault while running on condor

I'm writing a C application which is run across a compute cluster (using condor). I've tried many methods to reveal the offending code but to no avail.
Clues:
On average, when I run the code on 15 machines for 2 days, I get two or three segfaults (signal 11).
When I run the code locally I do not get a segfault. I ran it for nearly 3 weeks on my home machine.
Attempts:
I ran the code in Valgrind for four days locally with no memory errors.
I captured the segfault signal by defining my own signal handler so that I can output some of the program state.
Now when a segfault happens I can print out the current stack using backtrace.
I can print out variable values.
I created a variable which is set to the current line number.
Have also tried commenting chunks of the code out, hoping that if the problem goes away I will discover the segfault.
Sadly the line number outputted is fairly random. I'm not entirely sure what I can do with the stacktrace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?
Suspicions:
I suspect that the checkpointing system which condor uses to move jobs across machines is more sensitive to memory corruption and this is why I don't see it locally.
That indices are being corrupted by the bug, and that these indices are causing the segfault. This would explain the fact that the segfaults are occurring on fairly random line numbers.
UPDATE
Researching this some more I've found the following links:
LibSegFault - a library for automatically catching and printing state data about segfaults.
Stack unwinding (stack trace) with GCC - a tutorial on catching segfaults and getting the line numbers of the offending instructions.
UPDATE 2
Greg suggested looking at the condor log and to 'correlate the segfaults to when condor restarts the executable from a checkpoint'. Looking at the logs the segfaults all occur immediately after a restart. All of the failures appear to occur when a job switches from one type of machine to another type.
UPDATE 3
The segfault was being caused by differences between hosts. After setting the 'requirements' field in the condor submit file, the problem completely disappeared.
One can set individual machines:
requirements = machine == "hostname1" || machine == "hostname2"
or an entire class of machines:
requirements = classOfMachinesName
See requirements example here
If you can, compile with debugging and run under gdb.
Alternatively, get a core dump and load that into the debugger.
mpich has a built-in debugger, or you can buy a commercial parallel debugger.
Then you can step through the code in the debugger to see what is happening.
http://nmi.cs.wisc.edu/node/1610
http://nmi.cs.wisc.edu/node/1611
Can you create a core dump when your segfault happens? You can then debug this dump to try to figure out the state of the code when it crashed.
Look at what instruction caused the fault. Was it even a valid instruction or are you trying to execute data? If valid, what memory is it trying to access? Where did this pointer come from? You need to narrow down the location of your fault (stack corruption, heap corruption, uninitialized pointer, accessing invalid memory). If it's a corruption, see if there's any tell-tale data in the corrupted area (pointers to symbols, data that looks like something in your structures, ...). Your memory allocator may already have built-in features to debug some corruption (see MALLOC_CHECK_ on Linux or MallocGuardEdges on Mac OS). A common case for these is using memory that has been free()'d, so logging your malloc() / free() pairs might help.
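Logging malloc() / free() pairs can be done with a pair of thin wrappers. The following is only a sketch: log_malloc, log_free, log_live_blocks and the LOG_MALLOC / LOG_FREE macros are names invented here, and the live-block counter is not thread-safe as written.

```c
#include <stdio.h>
#include <stdlib.h>

static long live_blocks = 0;   /* blocks allocated but not yet freed */

/* malloc wrapper that records the pointer, size and call site on stderr. */
void *log_malloc(size_t size, const char *file, int line) {
    void *p = malloc(size);
    if (p != NULL) live_blocks++;
    fprintf(stderr, "malloc %p size=%zu at %s:%d\n", p, size, file, line);
    return p;
}

/* free wrapper that records the pointer and call site before releasing it. */
void log_free(void *p, const char *file, int line) {
    fprintf(stderr, "free   %p at %s:%d\n", p, file, line);
    if (p != NULL) live_blocks--;
    free(p);
}

long log_live_blocks(void) {
    return live_blocks;
}

/* Call-site macros so the log shows where each call came from. */
#define LOG_MALLOC(n) log_malloc((n), __FILE__, __LINE__)
#define LOG_FREE(p)   log_free((p), __FILE__, __LINE__)
```

After a crash, searching the log for the faulting address shows whether that block had already been freed, and where.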
If you have used the condor_compile tool to relink your code with the condor checkpointing code, it does a few things differently than a normal link. Most importantly, it statically links your code, and uses its own malloc. Another big difference is that condor will then run it on a foreign machine, where the environment may be different enough from what you expect to cause problems.
The executable generated by condor_compile is runnable as a standalone binary outside of the condor system. If you run the binary emitted from condor_compile locally, outside of condor, do you still see the segfaults?
If it doesn't, can you correlate the segfaults to when condor restarts the executable from a checkpoint? (The user log will tell you when this happens.)
You've tried most of what I'd think of. The only other thing I'd suggest is start adding a lot of logging code and hope you can narrow down where the error is happening.
The one thing you do not say is how much flexibility you have to solve the problem.
Can you, for example, have the system come to a halt and just run your application?
Also how important are these crashes to solve?
I am assuming that for the most part you do. This may require a lot of resources.
The short-term step is to put in tons of (semi-handwritten) asserts on each variable to make sure it hasn't changed when you don't want it to. These can continue to work as you go through the long-term process.
Long term: try running it on a cluster of two (maybe your home computer and a VM).
Do you still see the segfaults? If not, increase the cluster size until you start seeing segfaults.
Run it on the minimum configuration that gives segfaults and record all your inputs until a crash. Automate running the system with the inputs that you recorded, tweaking them until you can consistently get a crash with minimal input.
At that point look around. If you still can't find the bug, then you will have to ask again with some extra data you gathered with those runs.

Need help with buffer overrun

I've got a buffer overrun I absolutely can't seem to figure out (in C). First of all, it only happens maybe 10% of the time or so. The data that it is pulling from the DB each time doesn't seem to be all that much different between executions... at least not different enough for me to find any discernible pattern as to when it happens. The exact message from Visual Studio is this:
A buffer overrun has occurred in
hub.exe which has corrupted the
program's internal state. Press
Break to debug the program or Continue
to terminate the program.
For more details please see Help topic
'How to debug Buffer Overrun Issues'.
If I debug, I find that it is broken in __report_gsfailure() which I'm pretty sure is from the /GS flag on the compiler and also signifies that this is an overrun on the stack rather than the heap. I can also see the function it threw this on as it was leaving, but I can't see anything in there that would cause this behavior, the function has also existed for a long time (10+ years, albeit with some minor modifications) and as far as I know, this has never happened.
I'd post the code of the function, but it's decently long and references a lot of proprietary functions/variables/etc.
I'm basically just looking for either some idea of what I should be looking for that I haven't or perhaps some tools that may help. Unfortunately, nearly every tool I've found only helps with debugging overruns on the heap, and unless I'm mistaken, this is on the stack. Thanks in advance.
You could try putting some local variables on either end of the buffer, or even sentinels into the (slightly expanded) buffer itself, and trigger a breakpoint if those values aren't what you think they should be. Obviously, using a pattern that is not likely in the data would be a good idea.
While it won't help you in Windows, Valgrind is by far the best tool for detecting bad memory behavior.
If you are debugging the stack, you need to get to low-level tools: place a canary in the stack frame (perhaps a buffer filled with something like 0xA5) around any potential suspects. Run the program in a debugger and see which canaries no longer contain the right contents. You will gobble up a large chunk of stack doing this, but it may help you spot exactly what is occurring.
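A minimal sketch of the canary idea follows. The type guarded_buf, the functions guarded_init and guarded_ok, and the sizes are all made up for illustration. The suspect buffer and its guards are wrapped in one struct because the compiler is free to reorder separate local variables, whereas struct members keep their declared order.

```c
#include <stddef.h>
#include <string.h>

#define CANARY_BYTE 0xA5   /* pattern unlikely to appear in real data */
#define CANARY_LEN  16     /* hypothetical guard size */
#define BUF_LEN     64     /* hypothetical size of the suspect buffer */

/* The suspect buffer with a guard region on each side. */
typedef struct {
    unsigned char front[CANARY_LEN];
    char          buf[BUF_LEN];
    unsigned char back[CANARY_LEN];
} guarded_buf;

/* Fill both guards with the pattern and clear the buffer. */
void guarded_init(guarded_buf *g) {
    memset(g->front, CANARY_BYTE, sizeof g->front);
    memset(g->buf, 0, sizeof g->buf);
    memset(g->back, CANARY_BYTE, sizeof g->back);
}

/* Returns 1 if both guards are intact, 0 if either has been overwritten. */
int guarded_ok(const guarded_buf *g) {
    for (size_t i = 0; i < CANARY_LEN; i++) {
        if (g->front[i] != CANARY_BYTE || g->back[i] != CANARY_BYTE) {
            return 0;
        }
    }
    return 1;
}
```

Declare a guarded_buf as a local in the suspect function, use its buf member where the original array was used, and check guarded_ok() (or set a breakpoint on its failure path) after each call that writes into the buffer.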
One thing I have done in the past to help narrow down a mystery bug like this was to create a variable with global visibility named checkpoint. Inside the culprit function, I set checkpoint = 0; as the very first line. Then, I added ++checkpoint; statements before and after function calls or memory operations that I even remotely suspected might be able to cause an out-of-bounds memory reference (plus peppering the rest of the code so that I had a checkpoint at least every 10 lines or so). When your program crashes, the value of checkpoint will narrow down the range you need to focus on to a handful of lines of code. This may be a bit overkill, I do this sort of thing on embedded systems (where tools like valgrind can't be used) but it should still be useful.
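In code, the checkpoint counter described above might look something like this. The function suspect_function and its body are placeholders invented for the example; only the checkpoint variable and the ++checkpoint pattern come from the description.

```c
#include <string.h>

/* Global so it can be inspected from a debugger or a crash dump.
   volatile keeps the compiler from merging or dropping the updates. */
volatile int checkpoint;

/* Placeholder for the function that trips the overrun check. */
int suspect_function(char *dst, size_t dst_len, const char *src) {
    size_t n;

    checkpoint = 0;              /* very first statement */

    n = strlen(src);
    ++checkpoint;                /* 1: strlen returned */

    if (n >= dst_len) {
        return -1;               /* would not fit; checkpoint stays at 1 */
    }
    ++checkpoint;                /* 2: size check passed */

    memcpy(dst, src, n + 1);
    ++checkpoint;                /* 3: copy finished */

    return 0;
}
```

If the program dies with checkpoint equal to 2, the fault happened during or after the memcpy and before the next increment.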
Wrap it in an exception handler and dump out useful information when it occurs.
Does this program recurse at all? If so, I'd check there to ensure you don't have an infinite recursion bug. If you can't see it manually, sometimes you can catch it in the debugger by pausing frequently and observing the stack.

C code on Linux under gdb runs differently if run standalone?

I have built plain C code on Linux (Fedora) using the CodeSourcery toolchain. This is for an ARM Cortex-A8 target. The code is running on a Cortex-A8 board running embedded Linux.
When I run this code for some test case, which does dynamic memory allocation (malloc) for some large size (10MB), it crashes after some time giving error message as below:
select 1 (init), adj 0, size 61, to kill
select 1030 (syslogd), adj 0, size 64, to kill
select 1032 (klogd), adj 0, size 74, to kill
select 1227 (bash), adj 0, size 378, to kill
select 1254 (ppp), adj 0, size 1069, to kill
select 1255 (TheoraDec_Corte), adj 0, size 1159, to kill
send sigkill to 1255 (TheoraDec_Corte), adj 0, size 1159
Program terminated with signal SIGKILL, Killed.
Then, when I debug this code for the same test case using gdb built for the target, at the point where this dynamic memory allocation happens, the code fails to allocate that memory and malloc returns NULL. During a normal stand-alone run, I believe malloc should also be failing to allocate, but strangely it might not be returning NULL; instead the program crashes and the OS kills my process.
Why is this behaviour different when run under gdb and when run without a debugger?
Why would malloc fail yet not return NULL? Could this be possible, or is the reason for the error message I am getting something else?
How do I fix this?
thanks,
-AD
So, for this part of the question, there is a surefire answer:
Why would malloc fail yet not return NULL? Could this be possible, or is the reason for the error message I am getting something else?
In Linux, by default the kernel interfaces for allocating memory almost never fail outright. Instead, they set up your page table in such a way that on the first access to the memory you asked for, the CPU will generate a page fault, at which point the kernel handles this and looks for physical memory that will be used for that (virtual) page. So, in an out-of-memory situation, you can ask the kernel for memory, it will "succeed", and the first time you try to touch that memory it returned back, this is when the allocation actually fails, killing your process. (Or perhaps some other unfortunate victim. There are some heuristics for that, which I'm not incredibly familiar with. See "oom-killer".)
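One way to make that deferred failure show up at a place you control is to touch every page right after allocating it. This is a sketch only, and alloc_and_touch is a made-up name. Under overcommit the touch may still get the process killed rather than returning NULL, but at least it happens at the allocation site instead of at some later, unrelated access.

```c
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate size bytes and write one byte to every page so the kernel has to
   back the whole block with real memory now, not on first use later.
   Returns NULL if malloc itself reports failure. */
void *alloc_and_touch(size_t size) {
    unsigned char *p = malloc(size);
    volatile unsigned char *v = p;   /* volatile: keep the stores from being
                                        optimized away */
    long page = sysconf(_SC_PAGESIZE);
    size_t step = (page > 0) ? (size_t)page : 4096;   /* fallback guess */

    if (p == NULL) return NULL;
    for (size_t off = 0; off < size; off += step) {
        v[off] = 0;
    }
    return p;
}
```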
Some of your other questions, the answers are less clear for me.
Why is this behaviour different when run under gdb and when without debugger?
It could be (just a guess really) that GDB has its own malloc, and is tracking your allocations somehow. On a somewhat related point, I've actually frequently found that heap bugs in my code often aren't reproducible under debuggers. This is frustrating and makes me scratch my head, but it's basically something I've pretty much figured one has to live with...
How do I fix this?
This is a bit of a sledgehammer solution (that is, it changes the behavior for all processes rather than just your own, and it's generally not a good idea to have your program alter global state like that), but you can write the string 2 to /proc/sys/vm/overcommit_memory. See this link that I got from a Google search.
Failing that... I'd just make sure you're not allocating more than you expect to.
By definition, running under a debugger is different from running standalone. Debuggers can and do hide many bugs. If you compile for debugging you can add a fair amount of code, similar to compiling completely unoptimized (allowing you to single-step or watch variables, for example), whereas compiling for release can remove debugging options and remove code that you needed; there are many optimization traps you can fall into. I don't know from your post who is controlling the compile options or what they are.
Unless you plan to deliver the product to be run under the debugger you should do your testing standalone. Ideally do your development without the debugger as well, saves you from having to do everything twice.
It sounds like a bug in your code, slowly re-read your code using new eyes as if you were explaining it to someone, or perhaps actually explain it to someone, line by line. There may be something right there that you cannot see because you have been looking at it the same way for too long. It is amazing how many times and how well that works.
It could also be a compiler bug. Doing things like printing out the return value, or not printing it, can cause the compiler to generate different code. Adding another variable and saving the result to that variable can kick the compiler into doing something different. Try changing the compiler options, reduce or remove any optimization options, reduce or remove the debugger compiler options, etc.
Is this a proven system or are you developing on new hardware? Try running without any of the caches enabled, for example. Working in a debugger and not in standalone, if not a compiler bug, can be a timing issue: single-stepping flushes the pipeline, mixes the cache up differently, and gives the cache and memory system an eternity to come up with a result, which it doesn't have in real time.
In short there is a very long list of reasons why running under a debugger hides bugs that you cannot find until you test in the final deliverable like environment, I have only touched on a few. Having it work in the debugger and not in standalone is not unexpected, it is simply how the tools work. It is likely your code, the hardware, or your tools based on the description you have given so far.
The fastest way to eliminate it being your code or the tools is to disassemble the section and inspect how the passed values and return values are handled. If the return value is optimized out there is your answer.
Are you compiling for a shared C library or static? Perhaps compile for static...

Resources