Bizarre bug in C [closed] - c

So I have a C program. And I don't think I can post any code snippets due to complexity issues. But I'll outline my error, because it's weird, and see if anyone can give any insights.
I set a pointer to NULL. If, in the same function where I set the pointer to NULL, I printf() the pointer (with "%p"), I get 0x0, and when I print that same pointer a million miles away at the end of my program, I get 0x0. If I remove the printf() and make absolutely no other changes, then when the pointer is printed later, I get 0x1, and other random variables in my structure have incorrect values as well. I'm compiling it with GCC on -O2, but it has the same behavior if I take off optimization, so that's not the problem.
This sounds like a Heisenbug, and I have no idea why it's happening, nor how to fix it. Does anyone who has dealt with something like this in the past have advice on how they approached this kind of problem? I know this may sound kind of vague.
EDIT: Somehow, it works now. Thank you, all of you, for your suggestions.
The debugger told me interesting things - that my variable was getting optimized away. So I rewrote the function so it didn't need the intermediate variable, and now it works with and without the printf(). I have a vague idea of what might have been happening, but I need sleep more than I need to know what was happening.

Are you using multiple threads? I've often found that the act of printing something out can be enough to effectively suppress a race condition (i.e. not remove the bug, just make it harder to spot).
As for how to diagnose/fix it... can you move the second print earlier and earlier until you can see where it's changing?
Do you always see 0x1 later on when you don't have the printf in there?
One way of avoiding the delay/synchronization of printf would be to copy the pointer value into another variable at the location of the first printf and then print out that value later on - so you can see what the value was at that point, but in a less time-critical spot. Of course, as you've got odd value "corruption" going on, that may not be as reliable as it sounds...
EDIT: The fact that you're always seeing 0x1 is encouraging. It should make it easier to track down. Not being multithreaded does make it slightly harder to explain, admittedly.
I wonder whether it's something to do with the extra printf call making a difference to the size of the stack. What happens if you print the value of a different variable in the same place as the first printf call was?
EDIT: Okay, let's take the stack idea a bit further. Can you create another function with the same sort of signature as printf and with enough code to avoid it being inlined, but which doesn't actually print anything? Call that instead of printf, and see what happens. I suspect you'll still be okay.
Basically I suspect you're screwing with your stack memory somewhere, e.g. by writing past the end of an array on the stack; changing how the stack is used by calling a function may be disguising it.
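A minimal sketch of the stand-in function suggested above, assuming GCC or Clang (the question mentions GCC). The name fake_printf, the NOINLINE macro and the sink variable are illustrative choices, not anything from the question. The function has the same signature as printf, does a little work so the call is not discarded, and prints nothing:

```c
/* Assumption: GCC or Clang, which accept the noinline attribute. */
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/* Written by fake_printf so the compiler cannot discard the call. */
static volatile int fake_printf_sink;

/* Hypothetical stand-in: same signature as printf, but it prints nothing.
   It only counts the '%' characters in the format string. */
NOINLINE int fake_printf(const char *fmt, ...)
{
    int count = 0;

    for (const char *p = fmt; *p != '\0'; ++p) {
        if (*p == '%')
            ++count;
    }
    fake_printf_sink = count;
    return count;
}
```

Swap the original printf("%p\n", (void *)ptr) for fake_printf("%p\n", (void *)ptr). If the bug stays hidden, the output itself probably doesn't matter and the extra call's effect on the stack is the more likely factor.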

If you're running on a processor that supports hardware data breakpoints (like x86), just set a breakpoint on writes to the pointer.

Do you have a debugger available to you? If so, what do the values look like in that? Can you set any kind of memory/hardware breakpoint on the value? Maybe there's something trampling over the memory elsewhere, and the printf moves things around enough to move or hide the bug?
Probably worth looking at the asm to see if there's anything obviously wrong there. Also, if you haven't already, do a full clean rebuild. If the definition of the struct has changed recently, there's a vague chance that the compiler could be getting it wrong if the dependency checking failed to correctly rebuild everything it needed to.

Have you tried setting a condition in your debugger which notifies you when that value is modified? Or running it through Valgrind? These are the two major things that I would try, especially Valgrind if you're using Linux. There's no better way to figure out memory errors.

Without code, it's a little hard to help, but I understand why you don't want to foist copious amounts on us.
Here's my first suggestion: use a debugger and set a watchpoint on that pointer location.
If that's not possible, or the bug disappears again, here's my second suggestion.
1/ Start with the buggy code, the one where you print the pointer value and you see 0x1.
2/ Insert another printf a little way back from there (in terms of code execution path).
3/ If it's still 0x1, go back to step 2, moving a little back through the execution path each time.
4/ If it's 0x0, you know where the problem lies.
If there's nothing obvious between the 0x0 printf and the 0x1 printf, it's likely to be corruption of some sort. Without a watchpoint, that'll be hard to track down - you need to check every single stack variable to ensure there's no possibility of overrun.
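The printf bisection in the steps above can be made a little less tedious with a helper that only speaks up when the value is wrong. This is a rough sketch; check_ptr and CHECK_PTR are made-up names, and it assumes you know what the pointer is supposed to hold (NULL in the question):

```c
#include <stdio.h>

/* Hypothetical helper: report the place where the watched pointer no
   longer holds the expected value. Returns 1 if intact, 0 if changed. */
int check_ptr(const void *actual, const void *expected,
              const char *file, int line)
{
    if (actual != expected) {
        fprintf(stderr, "%s:%d: pointer is %p, expected %p\n",
                file, line, (void *)actual, (void *)expected);
        return 0;
    }
    return 1;
}

/* Records the file and line of each check automatically. */
#define CHECK_PTR(p, expected) check_ptr((p), (expected), __FILE__, __LINE__)
```

Sprinkle CHECK_PTR(my_ptr, NULL); along the execution path. The first line number reported on stderr bounds the region where the value changed. Since the check is itself a function call, it may disturb the bug just like the original printf did.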
I'm assuming that pointer is a global since you set it and print it "a million miles away". If it is, look at the variables you define on either side of it (in the source). They're the ones most likely to be causing overrun.
Another possibility is to turn off the optimization to see if the problem still occurs. We've occasionally had to ship code like that in cases where we couldn't fix the bug before deadlines (we'll always go back and fix it later, of course).

Related

How to debug the memory is changed randomly issue

My application is a multi-threaded program that runs on Solaris.
Recently I found that it can crash. The cause is that one member of a pointer array changes from a valid value to NULL, so the program crashes when it accesses that member.
The crash is very rare: it has occurred only twice in the past 2 months, and a different member of the array was changed each time. I can't find steps to reproduce it, and reviewing the code has turned up no useful clues.
Could anyone give some advice on how to debug this kind of random memory change?
Since you aren't able to reproduce the crash, debugging it isn't going to be easy.
However, there are some things you can do:
Go through the code and make a list of all of the places in the code that write to that variable--particularly the ones that could write a NULL to it. It's likely that one of them is your culprit.
Try to develop some kind of torture test that makes the fault more likely to occur (e.g. running through simulated or random transactions at top speed). If you can reproduce the crash this way you'll be in a much better situation, as you can then analyze the actual cause of the crash instead of just speculating.
If possible, run the program under valgrind or purify or similar. If they give any warnings, track down what is causing those warnings and fix it; it's possible that your program is e.g. accessing memory that has been freed, which might seem to work most of the time (if the free memory hasn't been reused for anything when it is accessed) but would fail occasionally (when something is reusing it).
Add a memory checker like Electric Fence to your code, or just replace free() with a custom version that overwrites the free memory with random garbage in the hopes that this will make the crash more likely to occur.
Recompile your program using different compilers (especially new/fancy ones like clang++ with the static analyzer enabled) and fix whatever they warn about. This may point you to your problem.
Run the program under different hardware and OS's; sometimes an obscure problem under one OS gives really obvious symptoms on another.
Review the various machines where the crash is known to have occurred. Do they all have anything in common? What about the machines where it hasn't crashed? Is there something different about them?
Step 2 (the torture test) is really the most important one, because even if you think you have fixed the problem, you won't be able to prove it unless you can reproduce the crash in the old code, and cannot reproduce it with the fixed code. Without being able to reproduce the fault, you're just guessing about whether a particular code change actually helps or not.
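For the suggestion above about replacing free() with a version that scribbles over the memory, here is one possible sketch. All the names (dbg_header, dbg_malloc, dbg_size, dbg_free) are hypothetical, and it uses a fixed 0xDD pattern rather than random garbage so that stale data is easy to recognize in a debugger. Because free() is not told the block size, the allocator wrapper stores it in a small header. It assumes a C11 compiler (for max_align_t):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header stored in front of each block so dbg_free knows its size.
   The union keeps the user data suitably aligned. */
typedef union {
    size_t      size;
    max_align_t align;
} dbg_header;

/* Hypothetical malloc replacement that records the requested size. */
void *dbg_malloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(dbg_header))
        return NULL;

    dbg_header *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    return h + 1;               /* user data starts after the header */
}

/* Size recorded for a block that came from dbg_malloc. */
size_t dbg_size(const void *p)
{
    return ((const dbg_header *)p - 1)->size;
}

/* Hypothetical free replacement: poison the block, then release it.
   The volatile writes make it less likely that the compiler drops the
   stores as dead code right before free(). */
void dbg_free(void *p)
{
    if (p == NULL)
        return;

    size_t n = dbg_size(p);
    volatile unsigned char *bytes = p;
    for (size_t i = 0; i < n; ++i)
        bytes[i] = 0xDD;
    free((dbg_header *)p - 1);
}
```

Only blocks obtained from dbg_malloc may be passed to dbg_free. The allocator may still reuse or overwrite the poisoned memory later, so this raises the odds of a visible failure rather than guaranteeing one.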

Float value suddenly becoming huge

I would rather not dump code, but explain my problem. After hours of debugging I managed to understand that at some point in my code, a float value that is not explicitly modified turns HUGE (more than 1e15). I do use a lot of memory in my program (a string array containing 800+ words), other than that though, I have no idea what could cause this.
If anyone has any ideas regarding this, please share. Otherwise, I'll post a pastebin of the code soon.
EDIT:
Here is the code: http://pastebin.com/vgiZweNq. The problem rests in the next_generation() function, where the sumfit variable goes nuts at random times in the loop.
Also, I've compiled this on linux using -fno-stack-limit and -fstack-check, to avoid stack overflows.
EDIT 2:
I've changed the program to use a dynamically allocated linked list, to further avoid stack overflows. Still, sumfit gets changed to Floatzilla at random points, usually pretty early on.
Cheers!
Since the variable is obviously being modified from an unexpected point, you might want to check some possibilities:
Is it being modified from a different thread or from an interrupt / event handler? If so, is the access properly synchronized to prevent a data race?
Are you doing pointer arithmetic that might be buggy and cause access outside the intended buffer?
Are you casting pointers between types of different sizes?
Especially if you are working on an embedded device: Maybe the memory is full and your stack is overlapping the heap, or the global variables.
More information about the platform this happens on would be helpful.
You're using strcpy on the chrom array, but I don't see where they ever get null terminated.
Maybe I'm just missing it, though.
You've got a huge string array. I reckon you're probably going off the end of it. Keep track of the size of data going into that array.
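As an illustration of the two answers above (unterminated strings and running off the end of an array), here is a sketch of a bounded copy that could stand in for a bare strcpy. The name copy_word and its parameters are made up for this example; the pastebin code is not reproduced here, so how it would slot into next_generation() is an assumption:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most src_size bytes of src into dst
   (capacity dst_size bytes), always leaving dst null-terminated.
   Returns 0 if src was terminated and fit completely, -1 otherwise. */
int copy_word(char *dst, size_t dst_size, const char *src, size_t src_size)
{
    if (dst == NULL || src == NULL || dst_size == 0)
        return -1;

    /* Look for the terminator without reading past src_size bytes. */
    const char *end = memchr(src, '\0', src_size);
    size_t len = (end != NULL) ? (size_t)(end - src) : src_size;

    /* Copy what fits, keeping one byte for the terminator. */
    size_t n = (len < dst_size) ? len : dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';

    return (end != NULL && len < dst_size) ? 0 : -1;
}
```

A return value of -1 flags exactly the two situations the answers worry about: a source that was never terminated, or data too long for the destination.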

All my local variables are deleted

In my C program, after I call a function, all the variables in the outer function are disappearing. The program no longer recognizes that they exist, and trying to access them causes an error.
void outer_function()
{
    int x = 0;
    inner_function();
    printf("%d\n", x); // Throws an error because x does not exist
}
I'm not sure what in inner_function() is causing it, and the function is too long to paste here. What sort of behavior could cause the local variables in outer_function() to disappear? The only thing I can think of is that inner_function() is writing over outer_function()'s memory, but it seems like that would only change the contents of the variables, not delete them.
Edit: I don't think there's really a whole lot more I can tell you. gcc said EXC_BAD_ACCESS and then "warning: Unable to restore previously selected frame," and then crashed. I know it's difficult for you to say what's actually causing it without seeing the whole function, which is why I initially just asked what sort of bug could cause behavior like this.
Without seeing a complete, compilable code snippet, it's impossible to say. The only thing I can think of is that inner_function() is actually some perverse macro that's screwing things up.
Are you 100% sure that printf("%d\n", x); is the line that is causing the error? Have you stepped through this? I would add some lines to print the output of x before, during, and after the inner_function() to see exactly where the problem lies. I have a feeling that you have a problem inside the inner_function().
Once you enter the realm of undefined behaviour all bets are off, so if there is any undefined behaviour at all inside inner_function() the subsequent behaviour of your entire program and hence outer_function() is also undefined.
Maybe you declare and define inner_function in different ways (cdecl and stdcall).
Though you should still go back and edit your question to add some information about how your program is failing and what "local variables are being deleted" actually means, this is the type of thing that could cause a program to lose the value of a variable from a different scope.
#include <string.h>

void inner_function(void) {
    int x[1];
    /* Deliberate bug: writes 10 * sizeof(x) bytes into a one-element array. */
    memset(x, 0, 10 * sizeof(x));
}
This should actually fail when the function tries to return. This is called a buffer overflow because you have a buffer (a range of memory used to hold something) that you have permission (from the C programming language) to edit, but you edit that and a lot more. That "a lot more" data is other memory that the compiler expected that you would not edit like the return address and variables in other scopes.
This example is a very general case and is intended to be easily understood, but if your inner_function does suffer from this type of error it very likely won't be as clear as this. It is also possible to make a buffer overflow that does not overwrite the return address, so that inner_function would return without failing, but then you might find local variables from outer_function changed (which is what I think you were saying is happening in your code). To write a usable example of this on purpose, though, I would need to know a lot more about what platform, compiler, and compiler options you were using, so that I could make educated guesses about where on the stack things would probably be, relative to the top of the stack (which is the current function's stack frame).
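One way to see the "neighbouring variable silently changes" effect without guessing at stack layout is to put the buffer and the victim in the same struct, where member order is fixed. This is only a sketch of the mechanism, not the asker's bug; struct frame and clobber_neighbour are hypothetical names. The overlong write stays inside the struct, so the program remains well defined while still showing the damage:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: a small buffer followed by an unrelated variable.
   Inside one struct the order of the members is guaranteed. */
struct frame {
    char buf[8];
    int  x;
};

/* Fill buf with 0xFF but keep going to the end of the struct, the way an
   overlong memset or strcpy would. Returns the resulting value of x. */
int clobber_neighbour(void)
{
    struct frame f;
    memset(f.buf, 0, sizeof f.buf);
    f.x = 7;

    unsigned char *start = (unsigned char *)&f + offsetof(struct frame, buf);
    size_t too_many = sizeof f - offsetof(struct frame, buf);
    memset(start, 0xFF, too_many);  /* intended length was sizeof f.buf */

    return f.x;                     /* no longer 7 */
}
```

Nothing crashes and nothing is "deleted"; x just stops being 7. With separate local variables the same overrun is undefined behaviour, and what it hits depends on the compiler and options, as the answer says.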

Need help with buffer overrun

I've got a buffer overrun I absolutely can't seem to figure out (in C). First of all, it only happens maybe 10% of the time or so. The data that it is pulling from the DB each time doesn't seem to be all that much different between executions... at least not different enough for me to find any discernible pattern as to when it happens. The exact message from Visual Studio is this:
A buffer overrun has occurred in hub.exe which has corrupted the program's internal state. Press Break to debug the program or Continue to terminate the program.
For more details please see Help topic 'How to debug Buffer Overrun Issues'.
If I debug, I find that it is broken in __report_gsfailure() which I'm pretty sure is from the /GS flag on the compiler and also signifies that this is an overrun on the stack rather than the heap. I can also see the function it threw this on as it was leaving, but I can't see anything in there that would cause this behavior, the function has also existed for a long time (10+ years, albeit with some minor modifications) and as far as I know, this has never happened.
I'd post the code of the function, but it's decently long and references a lot of proprietary functions/variables/etc.
I'm basically just looking for either some idea of what I should be looking for that I haven't or perhaps some tools that may help. Unfortunately, nearly every tool I've found only helps with debugging overruns on the heap, and unless I'm mistaken, this is on the stack. Thanks in advance.
You could try putting some local variables on either end of the buffer, or even sentinels into the (slightly expanded) buffer itself, and trigger a breakpoint if those values aren't what you think they should be. Obviously, using a pattern that is not likely in the data would be a good idea.
While it won't help you in Windows, Valgrind is by far the best tool for detecting bad memory behavior.
If you are debugging the stack, you need to get to low-level tools - place a canary in the stack frame (perhaps a buffer filled with something like 0xA5) around any potential suspects. Run the program in a debugger and see which canaries are no longer the right size or no longer contain the right contents. You will gobble up a large chunk of stack doing this, but it may help you spot exactly what is occurring.
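A rough sketch of the sentinel/canary idea from the two answers above. The names (guarded_buf, guard_init, guard_ok), the 64-byte buffer and the 16-byte canaries are all placeholders. The buffer and its canaries are wrapped in one struct because compilers are free to reorder separate local variables, whereas struct members keep their declared order:

```c
#include <stddef.h>
#include <string.h>

#define CANARY_BYTE 0xA5
#define CANARY_LEN  16

/* Hypothetical wrapper: the suspect buffer with a canary on each side. */
struct guarded_buf {
    unsigned char head[CANARY_LEN];
    char          data[64];         /* stands in for the real buffer */
    unsigned char tail[CANARY_LEN];
};

/* Fill both canaries with the pattern and clear the buffer. */
void guard_init(struct guarded_buf *g)
{
    memset(g->head, CANARY_BYTE, sizeof g->head);
    memset(g->data, 0, sizeof g->data);
    memset(g->tail, CANARY_BYTE, sizeof g->tail);
}

/* Returns 1 if both canaries are intact, 0 if either was overwritten. */
int guard_ok(const struct guarded_buf *g)
{
    for (size_t i = 0; i < CANARY_LEN; ++i) {
        if (g->head[i] != CANARY_BYTE || g->tail[i] != CANARY_BYTE)
            return 0;
    }
    return 1;
}
```

Declare a struct guarded_buf as the local, use its data member in place of the old buffer, and call guard_ok after each suspicious operation (or break in the debugger when it returns 0). A damaged tail points to a write past the end, a damaged head to a write before the start.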
One thing I have done in the past to help narrow down a mystery bug like this was to create a variable with global visibility named checkpoint. Inside the culprit function, I set checkpoint = 0; as the very first line. Then, I added ++checkpoint; statements before and after function calls or memory operations that I even remotely suspected might be able to cause an out-of-bounds memory reference (plus peppering the rest of the code so that I had a checkpoint at least every 10 lines or so). When your program crashes, the value of checkpoint will narrow down the range you need to focus on to a handful of lines of code. This may be a bit overkill; I do this sort of thing on embedded systems (where tools like valgrind can't be used), but it should still be useful.
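The checkpoint technique above might look something like this. Only the variable name checkpoint comes from the answer; the CHECKPOINT macro and suspect_function are invented for the illustration, and the loop merely stands in for real work:

```c
#include <stddef.h>

/* Global progress marker, readable from the debugger after a crash.
   volatile keeps the compiler from merging or dropping the updates. */
volatile int checkpoint;

#define CHECKPOINT() (++checkpoint)

/* Hypothetical suspect function; the real one would do actual work
   between the checkpoints. */
void suspect_function(char *buf, size_t len)
{
    checkpoint = 0;

    CHECKPOINT();                   /* 1: about to clear the buffer */
    for (size_t i = 0; i < len; ++i)
        buf[i] = 0;
    CHECKPOINT();                   /* 2: buffer cleared */
}
```

If the program dies inside suspect_function, inspecting checkpoint in the debugger shows which pair of markers the failure falls between.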
Wrap it in an exception handler and dump out useful information when it occurs.
Does this program recurse at all? If so, I'd check there to ensure you don't have an infinite recursion bug. If you can't see it manually, sometimes you can catch it in the debugger by pausing frequently and observing the stack.

c runtime error message

This error appeared while creating a file using fopen in the C programming language:
The NTVDM CPU has encountered an illegal instruction CS:0000 IP0075
OP:f0 00 f0 37 05 Choose 'Close' to terminate the operation
This kind of thing typically happens when a program tries to execute data as code. In turn, this typically happens when something tramples the stack and overwrites a return address.
In this case, I would guess that "IP0075" is the instruction pointer, and that the illegal instructions executed were at address 0x0075. My bet is that this address is NOT mapped to the app's executable code.
UPDATE on the possible connection with 'fopen': The OP states that deleting the fopen code makes the problem go away. Unfortunately, this does not prove that the fopen code is the cause of the problem. For example:
The deleted code may include extra local variables, which may mean that the stack trampling is hitting the return address in one case ... and in the other case, some word that is not going to be used.
The deleted code may cause the size of the code segment to change, causing some significant address to point somewhere else.
The problem is almost certainly that your application has done something that has "undefined behavior" per the C standard. Anything can happen, and the chances are that it won't make any sense.
Debugging this kind of problem can be really hard. You should probably start by running "lint" or the equivalent over your code and fixing all of the warnings. Next, you should probably use a good debugger and single step the application to try to find where it is jumping to the bad code/address. Then work back to figure out what caused it to happen.
Assuming that it's really the fopen() call that causes problems (it's hard to say without your source code), have you checked that the 2 character pointers that you pass to the function are actually pointers to a correctly allocated memory?
Maybe they are not properly initialized?
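A small sketch of the kind of check this answer is hinting at, assuming a C99 compiler. open_checked is a hypothetical wrapper, not part of the asker's program. It can only catch NULL or empty arguments; a dangling or uninitialized non-NULL pointer would still get through:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: refuse obviously bad arguments before they reach
   fopen, and report why the open failed otherwise. */
FILE *open_checked(const char *path, const char *mode)
{
    if (path == NULL || mode == NULL || path[0] == '\0' || mode[0] == '\0') {
        fprintf(stderr, "open_checked: missing path or mode\n");
        return NULL;
    }

    FILE *fp = fopen(path, mode);
    if (fp == NULL) {
        /* Assumption: the C library sets errno when fopen fails (POSIX
           requires it; ISO C does not). */
        fprintf(stderr, "open_checked: cannot open '%s': %s\n",
                path, strerror(errno));
    }
    return fp;
}
```

If the crash still happens with valid-looking arguments, that supports the other answer's point that fopen is probably just where earlier memory corruption becomes visible.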
Hmmm... you did mention NTVDM, which sounds like an old 16-bit application that crashed inside an old command window with application compatibility set, somehow. As no code was posted, I can only hazard a guess that it's something to do with files (but fopen - how do you know that without showing a hint?). Perhaps there was a particular file whose name is longer than the conventional 8.3 DOS filename and it borked when attempting to read it, or the 16-bit application is running inside a folder that, again, has a name longer than 8.3?
Hope this helps,
Best regards,
Tom.
