Memory read failed for 0x0 in a game loop, need a proper debugging method - c

My program crashes somewhere during a game loop at variable times independent of input events. I am looking for a suitable debugging method that I can use to find the problem in my code.
When I run the executable under lldb without any breakpoints set, I get the following output after some time, when the program crashes:
Process 86823 stopped
* thread #2, queue = 'com.apple.libdispatch-manager', stop reason = EXC_BAD_ACCESS (code=1, address=0x1)
frame #0: 0x0000000000000001
error: memory read failed for 0x0
Target 0: (smoke) stopped.
This tells me that I am trying to read something from a bad memory address. The issue is that I am not sure how to pinpoint it in my code, as it may or may not occur in a given cycle of the game loop, so just setting a breakpoint is problematic. What is the proper debugging method in this case?

Related

C malloc "can't allocate region" error, but can't repro with GDB?

How can I debug a C application that does not crash when attached with gdb and run inside of gdb?
It crashes consistently when run standalone - even the same debug build!
A few of us are getting this error with a C program written for BSD/Linux, and we are compiling on macOS with OpenSSL.
app(37457,0x7000017c7000) malloc: *** mach_vm_map(size=13835058055282167808) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
ERROR: malloc(buf->length + 1) failed!
I know, not helpful.
Recompiling the application with -g -rdynamic gives the same error. Ok, so now we know it isn't because of a release build as it continues to fail.
It works when running within a gdb debugging session though!!
$ sudo gdb app
(gdb) b malloc_error_break
Function "malloc_error_break" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (malloc_error_break) pending.
(gdb) run -threads 8
Starting program: ~/code/app/app -threads 8
[New Thread 0x1903 of process 45436]
warning: unhandled dyld version (15)
And it runs for hours. CTRL-C, and run ./app -threads 8 and it crashes after a second or two (a few million iterations).
Obviously there's an issue within one of the threads. But those workers for the threads are pretty big (a few hundred lines of code). Nothing stands out.
Note that the threads iterate over loops of about 20 million per second.
macOS 10.12.3
Homebrew w/GNU gcc and openssl (linking to crypto)
Ps, not familiar with C too much - especially any type of debugging. Be kind and expressive/verbose in answers. :)
One debugging technique that is sometimes overlooked is to include debug prints in the code. Of course it has its disadvantages, but it also has advantages. One thing you must keep in mind in the face of abnormal termination is to make sure the printouts actually get printed. Often it's enough to print to stderr (but if that doesn't do the trick, you may need to fflush the stream explicitly).
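As a minimal sketch of the debug-print idea, the macro below writes to stderr and flushes after every message. The names DEBUG_LOG and process_item are hypothetical and only serve as illustration; they are not from the question's code.

```c
#include <stdio.h>

/* Hypothetical helper macro: prints file, line and a formatted message to
   stderr, then flushes so the text is not lost on abnormal termination. */
#define DEBUG_LOG(...)                                      \
    do {                                                    \
        fprintf(stderr, "%s:%d: ", __FILE__, __LINE__);     \
        fprintf(stderr, __VA_ARGS__);                       \
        fputc('\n', stderr);                                \
        fflush(stderr);                                     \
    } while (0)

/* Hypothetical worker step, used only to show where the prints would go. */
static int process_item(int item)
{
    DEBUG_LOG("processing item %d", item);
    int result = item * 2;
    DEBUG_LOG("item %d done, result %d", item, result);
    return result;
}
```

The last message printed before the crash narrows down where to look. In a loop running millions of iterations per second you would probably want to print only every N-th iteration, since the prints themselves change the timing.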
Another trick is to stop the program before the error occurs. This requires you to know when the program is about to crash, preferably as close as possible. You do this by using raise:
raise(SIGSTOP);
This does not terminate the program, it just suspends execution. Now you can attach with gdb using the command gdb <program-name> <pid> (use ps to find the pid of the process). Now in gdb you have to tell it to ignore SIGSTOP:
> handle SIGSTOP ignore
Then you can set break-points. You can also step out of the raise function using the finish command (may have to be issued multiple times to return to your code).
This technique lets the program behave normally up to the point where you decide to stop it; hopefully the final part, running under gdb, does not alter the behavior enough to hide the bug.
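A sketch of how the raise(SIGSTOP) trick could be made conditional, so the program only suspends itself when something already looks wrong. The helper name stop_for_debugger_if and the idea of passing in a "suspicious" condition are assumptions for illustration.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: suspend this process when `suspicious` is non-zero so
   that a debugger can be attached. Returns 1 if the process was stopped and
   later resumed, 0 if nothing happened. */
static int stop_for_debugger_if(int suspicious)
{
    if (!suspicious)
        return 0;

    /* Print the pid so there is no need to look it up with ps. */
    fprintf(stderr, "pid %ld stopping, attach a debugger now\n",
            (long)getpid());
    fflush(stderr);

    raise(SIGSTOP);   /* execution resumes here after the debugger continues */
    return 1;
}
```

For the malloc failure in the question, the condition might be something like a buffer length above a limit you consider sane, checked just before the malloc call.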
A third option is to use valgrind. Normally, when you see this kind of error, there are underlying errors that valgrind will pick up, such as out-of-range accesses and uninitialized variables.
Many memory managers initialise memory to a known bad value to expose problems like this (e.g. Microsoft's CRT uses a range of values: 0xCD means uninitialised, 0xDD means already freed, etc.).
After each use of malloc, try memset'ing the memory to 0xCD (or some other constant value). This will allow you to identify uninitialised memory more easily with the debugger. Don't use 0x00, as this is a 'normal' value and will be harder to spot if it's wrong (it will also probably 'fix' your problem).
Something like:
void *memory = malloc(sizeof(my_object));
memset(memory, 0xCD, sizeof(my_object));
If you know the size of the blocks, you could do something similar before free (this is sometimes harder unless you know the size of your objects, or track it in some way):
memset(memory, 0xDD, sizeof(my_object));
free(memory);
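One way to track the size, sketched under the assumption that you can route allocations through your own wrappers: store the size in a small header in front of each block. The names debug_malloc and debug_free are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Header stored in front of each block so the size is known at free time.
   The union keeps the user pointer suitably aligned for any type (C11). */
typedef union {
    size_t size;
    max_align_t align;
} block_header;

/* Hypothetical wrapper: allocate `size` bytes and fill them with 0xCD. */
static void *debug_malloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(block_header))
        return NULL;                      /* header + size would overflow */

    block_header *h = malloc(sizeof(block_header) + size);
    if (h == NULL)
        return NULL;

    h->size = size;
    void *user = h + 1;                   /* user data starts after the header */
    memset(user, 0xCD, size);
    return user;
}

/* Hypothetical wrapper: fill the block with 0xDD, then release it. */
static void debug_free(void *user)
{
    if (user == NULL)
        return;

    block_header *h = (block_header *)user - 1;
    memset(user, 0xDD, h->size);
    free(h);
}
```

Pointers from debug_malloc must only be released with debug_free, never with plain free. An optimizing compiler may also drop the memset before free, so this is mainly useful in unoptimized debug builds.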

debugging C program with gdb

I'm trying to test a scheduler that I wrote. I schedule two processes - both are infinite while loops (just while(1) statements). When I run the program sometimes it segfaults after like ten seconds (sometimes 5 sec, sometimes 15 or more). Sometimes it doesn't segfault at all and runs as expected. I have a log file which shows me that both processes are scheduled as expected before the segfault occurs. I'm trying to debug the errors using gdb but it's not being very helpful. Here's what I got with backtrace:
#0 0x00007ffff7ff1000 in ?? ()
#1 0x000000000000002b in ?? ()
#2 0x00007ffff78b984a in new_do_write () from /lib64/libc.so.6
#3 0x000000000061e3d0 in ?? ()
#4 0x0000000000000000 in ?? ()
I don't really understand #2.
I think this may be a stack overflow related error. However, I only malloc twice in the whole process - both times when I'm setting up the two processes, I malloc a pcb block in the pcb table I wrote. Has anyone run into similar issues before? Could this be something with how I'm setting/swapping the contexts in the scheduler? Why does it segfault sometimes, and sometimes not?
You didn't say how you obtained the stack trace that you show in the question.
It is very likely that the stack trace is bogus not because the stack is corrupt, but because you've invoked GDB incorrectly, e.g. specified the wrong executable when attaching to the process or examining a core dump.
One common mistake is to build the executable with -O2 (let's call this executable E1), then rebuild it with -g (let's call this E2), and try to analyze a core of, or a live process running, E1 while giving GDB E2 as the symbol file.
Don't do that, it doesn't work and isn't expected to work.
Since your stack seems corrupted, you're probably correct that you have a stack buffer overflow somewhere. Without the code, it's a little difficult to tell.
But this has nothing to do with your malloc calls. Overflowing dynamically allocated buffers would corrupt the heap, not the stack.
What you'll probably need to be looking at is local variables that aren't big enough for the data you're trying to copy into them, like:
char xyzzy[5];
strcpy (xyzzy, "this is a bad idea");
Or passing a buffer (again, most likely on the stack) to a system call that writes more data to it than you provide for.
They're the most likely causes though theoretically, of course, any undefined behaviour on your part could cause this. If the solution is not evident based on this answer, you'll probably need to post the code that caused it. Try to ensure you trim it down as much as possible when you do that so that it's the shortest complete program that exhibits the bug.
Often you'll find by doing that, the problem becomes evident :-)
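For the undersized local buffer case above, a minimal sketch of a bounded copy that reports truncation instead of writing past the end of the buffer. The helper name copy_bounded is hypothetical; snprintf is standard C99.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy src into dst, which has room for dst_size bytes,
   without ever writing past the end. Returns 1 if the whole string fit and
   0 if it was truncated or an error occurred. */
static int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return 0;

    /* snprintf writes at most dst_size bytes including the terminating NUL
       and returns the length the full string would have needed. */
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

With char xyzzy[5], calling copy_bounded(xyzzy, sizeof xyzzy, "this is a bad idea") stores "this" and returns 0, so the caller can log or handle the truncation instead of silently corrupting the stack.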

How to debug a pointer being overwritten?

I am having trouble with a bug caused by overwriting a pointer with an invalid value. I have not been able to find the bug using valgrind (in its default mode) or with GDB because they only point me to the invalid pointer, and NOT to what overwrote that pointer with the incorrect value.
It's always the same variable; however, I do not explicitly set it to the bad value. Some other line in the program must be accessing memory out of its bounds, and by chance it happens to hit the storage for this pointer.
I am unsure what debugging tools/options I should use to approach this bug.
Example crash:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6ffc700 (LWP 2425)]
0x00000000004058b2 in writeToConn (conn=0x7ffff0004f40) at streamHandling.c:115
115 ssize_t result = send(conn->fd, conn->head->data->string + position, conn->head->data->size - position, 0);
(gdb) print conn
$1 = (struct connection *) 0x7ffff0004f40
(gdb) print conn->head->data
$2 = (struct dbstring *) 0x35
Unfortunately I can't simply watch the variable conn->head->data because I have about 5,000 conn structs.
This code works most of the time, however if run under a moderately heavy load it will crash after a few seconds.
You can have gdb automatically execute commands when a breakpoint is triggered, with Break Commands.
You could set up a Break Command to run whenever a struct connection is allocated, and have it add a watchpoint on the field of interest.
Would a stack backtrace help? Here is a page that tells how to do it.
How can one grab a stack trace in C?
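As a rough sketch of grabbing a stack trace from inside a C program, assuming a platform that ships execinfo.h (glibc on Linux and macOS do; it is not part of standard C). The helper name dump_stack_trace is hypothetical.

```c
#include <execinfo.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Hypothetical helper: write the current call stack to stderr and return the
   number of frames captured. */
static int dump_stack_trace(void)
{
    void *frames[MAX_FRAMES];

    /* backtrace fills the array with return addresses of the active calls. */
    int count = backtrace(frames, MAX_FRAMES);

    /* backtrace_symbols_fd prints one line per frame without calling malloc,
       which matters if the heap may already be corrupted. */
    backtrace_symbols_fd(frames, count, STDERR_FILENO);
    return count;
}
```

Calling this at the point where the bad pointer is first detected shows how the program got there. On Linux, function names typically only appear if the program is linked with -rdynamic; otherwise you get raw addresses.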

gdb watch huge amount of memory to find out corruption, no seg fault here

Updated:
Now, with valgrind --tool=memcheck --track-origins=yes --leak-check=full ./prog it runs correctly, but without valgrind it still goes wrong. How can that happen?
I'm doing a project on Linux, which stores lots of data in memory, and I need to know which data block is changed in order to find out the problem in my program.
Updated: This is a multithreaded program, and the writes and reads are done by different threads, which are created by system calls.
The code is like this
for(j=0;j<save_size;j++){
e->blkmap_mem[blk_offset+save_offset + j] = get_mfs_hash_block();
memcpy(e->blkmap_mem[blk_offset + save_offset +j]->data, (char *)buff + j * 4096, 4096);
e->blkmap_mem[save_offset+j]->data = (char *)(buff + j* 4096);
e->blkmap_mem[blk_offset+save_offset + j]->size = 4096;
e->blkmap_addr[blk_offset+save_offset + j] = 1;
}
And I want to know if e->blkmap_mem[blk_offset+save_offset+j]->data is changed in somewhere else.
I know awatch exp in gdb can check whether the value changes, but there are too many of them here. Is there some way to trace them all? There may be nearly 6,000.
Thanks, guys.
Reverse debugging has a great use case here, assuming you have some way to detect the corruption once it's happened (a seg fault will do fine).
Once you've detected the corruption in a debugging session, you put a watch point on the corrupted variable, and then run the program backwards until the variable was written to.
Here's a step-by-step guide:
1. Compile the program with debugging symbols as usual and load it into gdb.
2. Start the program using start. This puts a breakpoint at the very beginning of main and runs the program until it hits it.
3. Put a breakpoint somewhere where the memory corruption is detected. You don't need to do this if you're detecting the corruption with a seg fault.
4. Type record to start recording program execution. This is why we called start before: you can't record when there's no process running.
5. Type continue to set the program running again. While recording, the program will run very slowly. It may tell you the record buffer is full; if this happens, tell it to wrap around.
6. When the corruption is detected by your breakpoint or the seg fault, the program will stop. Now put a watch on whatever the corrupted variable is.
7. Type reverse-continue to run the program backwards until the corrupted variable is written to.
8. When the watchpoint hits, you've found your corruption.
Note that it's not always the first or only corruption of that variable. But you can always keep running backwards until you run out of reverse execution history - and now you've got something to fix.
There's a useful tutorial here, which also discusses how to control the size of the record buffer, in case that becomes an issue for you.

fatal error disappeared when running with gdb

I have a program which produces a fatal error with a testcase, and I can locate the problem by reading the log and the stack trace of the fatal - it turns out that there is a read operation upon a null pointer.
But when I try to attach gdb to it and set a breakpoint around the suspicious code, the null pointer just cannot be observed! The program works smoothly without any error.
This is a single-process, single-thread program, I didn't experience this kind of thing before. Can anyone give me some comments? Thanks.
Appended: I also tried to call pause() syscall before the fatal-trigger code, and expected to make the program sleep before fatal point and then attach the gdb on it on-the-fly, sadly, no fatal occurred.
It's only guesswork without looking at the code, but debuggers sometimes do this:
They initialize certain stuff for you
The timing of the operations is changed
I don't have a quote on GDB, but I do have one on valgrind (granted the two do wildly different things..)
My program crashes normally, but doesn't under Valgrind, or vice versa. What's happening?
When a program runs under Valgrind, its environment is slightly different to when it runs natively. For example, the memory layout is different, and the way that threads are scheduled is different.
Same would go for GDB.
Most of the time this doesn't make any difference, but it can, particularly if your program is buggy.
So the true problem is likely in your program.
There can be several things happening. The timing of the application can change: in a multithreaded application, for example, one thread might first set the ready flag and only then copy the data into the buffer. Without the debugger attached, the other thread might access the buffer before it is filled or before some pointer is set.
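To make the ready-flag example concrete, here is a minimal sketch of the safe ordering using C11 atomics: the data is written first and the flag is published last. The struct mailbox type and the publish and try_consume functions are hypothetical, and the sketch assumes one producer thread and one consumer thread.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64

/* Hypothetical state shared between one producer and one consumer thread. */
struct mailbox {
    char buffer[BUF_SIZE];
    atomic_int ready;      /* 0 = empty, 1 = buffer holds a complete message */
};

/* Producer side: fill the buffer first, set the flag last. The release store
   makes the buffer contents visible to a thread that observes ready == 1. */
static void publish(struct mailbox *m, const char *msg)
{
    strncpy(m->buffer, msg, BUF_SIZE - 1);
    m->buffer[BUF_SIZE - 1] = '\0';
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Consumer side: returns 1 and copies the message if one is ready, else 0.
   The acquire load pairs with the release store in publish. */
static int try_consume(struct mailbox *m, char *out, size_t out_size)
{
    if (out_size == 0 ||
        atomic_load_explicit(&m->ready, memory_order_acquire) == 0)
        return 0;

    strncpy(out, m->buffer, out_size - 1);
    out[out_size - 1] = '\0';
    return 1;
}
```

Setting the flag before the copy, or using a plain int for the flag, is the kind of bug that can disappear under a debugger because the changed timing happens to let the copy finish first.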
It could also be that the application has anti-debug functionality. Maybe the piece of code is never reached when running inside a debugger.
One way to analyze it is with a core dump, which you can create by running ulimit -c unlimited and then starting the application. When the core is dumped, you can load it into gdb with gdb ./application ./core. You can find a useful write-up here: http://www.ffnn.nl/pages/articles/linux/gdb-gnu-debugger-intro.php
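If changing the shell's ulimit is inconvenient, a program can raise its own core-file limit at startup. This is a sketch using the POSIX getrlimit and setrlimit calls; the helper name enable_core_dumps is hypothetical.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard limit,
   roughly the in-process counterpart of ulimit -c. Returns 0 on success and
   -1 on error. */
static int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Calling this early in main should be enough. If the hard limit itself is 0, no core file will be written, and where the core ends up depends on the system's configuration.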
If it is an invalid read on a pointer, then unpredictable behaviour is possible. Since you already know what is causing the fault, you should get rid of it asap. In general, expect the unexpected when dealing with faulty pointer operations.

Resources