How to debug memory issues in an embedded application - C

I'm new to embedded programming but I have to debug a quite complex application running on an embedded platform. I use GDB through a JTAG interface.
My program crashes at some point in an unexpected way. I suspect this is due to some memory-related issue. Does GDB allow me to inspect the memory after the system has crashed and become completely unresponsive?

It depends a bit on your setup. In particular, since you're using JTAG, you may be able to set your debugger up to halt the processor when it detects an exception (for example, an illegal access to protected memory). If not, you can replace your exception handlers with infinite loops. Then you can manually unwind from the exception to see what the processor was doing when it crashed. Normally you'll still have access to memory in that situation, so you can either use GDB to look around directly, or just dump everything to a file so you can look around later.
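For instance, here is a minimal sketch of the "infinite loop in the exception handler" idea, assuming an ARM Cortex-M target that uses the CMSIS-style handler name HardFault_Handler (the vector and handler names differ between vendors and startup files):

#include <stdint.h>

volatile uint32_t fault_spin = 1;

void HardFault_Handler(void)
{
    /* Park the CPU here so registers and memory stay intact; halt the
       target over JTAG with GDB and inspect the stacked exception frame. */
    while (fault_spin) {
        /* spin until a debugger clears fault_spin or resets the board */
    }
}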

It depends on what has crashed. If the system is only unresponsive (in some infinite loop, deadlock or similar), then it will normally respond to GDB and you will be able to see a backtrace (call stack), etc.
If the system/bus/CPU has actually crashed at a lower level, then it probably will not respond. In that case you can try setting breakpoints or watchpoints at suspicious places/variables and observing what happens. A simulator (ISS or RTL, if applicable) can also come in handy for comparing behaviour against the hardware.
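For the breakpoint/watchpoint approach, a typical GDB session over the JTAG connection might look like the following sketch (the port number, function and variable names are placeholders for your own setup):

(gdb) target remote localhost:3333      # connect to the JTAG/GDB server
(gdb) break suspicious_function         # stop before the suspect code runs
(gdb) watch suspicious_variable         # stop whenever the variable changes
(gdb) continue
(gdb) backtrace                         # once stopped, inspect the call stack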

Related

Qemu: trace MMU operation

Currently I'm trying to run qemu-system-arm for the armv7 architecture, do some initial setup for paging and then enable the MMU.
I run qemu with the gdb stub and connect to it with gdb.
I must have screwed something up with the translation tables/registers/etc.; the moment I set the MMU-enable bit in the control register, gdb can't fetch data from memory any more: after the ni command that executes the MMU-enable instruction, it doesn't fetch the next instruction and I can't access memory.
Is there any way to see what happens inside Qemu's MMU - where it takes the translation tables from, what it calculates, and so on?
Or should I just recompile it with my additional debug output?
"No, there's no way to trace this without modifying QEMU's sources yourself."
So I did. For the ARM architecture, the relevant code is in target-arm/helper.c - the get_phys_addr* functions.
No, there's no way to trace this without modifying QEMU's sources yourself to add debugging output. This is a specific case of a more general tendency, which is that QEMU's design and approach is largely "run correct code quickly", not to provide detailed introspection into the behaviour of possibly buggy guests. Sometimes, as in this case, there's a nice easy location to add your own debug printing; sometimes, as in the fairly common desire to print all the memory accesses made by the guest, there is nowhere in the C code where tracing can be put to catch all accesses.
When I used QEMU for debugging VM issues in an operating system kernel that I had built with a colleague, I ended up connecting GDB to debug QEMU instead (instead of debugging the guest process inside QEMU).
You can place breakpoints on the MMU table walking function and step through it.
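Roughly, and assuming a QEMU binary built with debug information, that looks like the sketch below (the machine type and kernel path are placeholders for whatever you already run; get_phys_addr is the table-walking code mentioned above):

gdb --args qemu-system-arm -M vexpress-a9 -kernel my_kernel.elf
(gdb) break get_phys_addr        # the table walk in target-arm/helper.c
(gdb) run
(gdb) backtrace                  # see how the walk was entered
(gdb) next                       # step through the translation, print locals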

What Can Cause a C Program to Crash the Operating System

I recently found that a fairly large image manipulation program I'm writing in C on a Windows 8 machine has a bug when used in very particular circumstances. Unfortunately, the bug is causing my entire computer to come to a standstill so that my only option is to pull the plug on the computer (especially annoying when I'm working remotely...)
Because it's an image manipulation program, I can't just flood it with print statements to isolate the problematic section - the problem occurs somewhere in a loop that's called billions of times, so adding a printf slows it down to the point that it would take days to get to a failing iteration.
I understand, therefore, if this question is too broad, as it isn't really reasonable for me to include all of the code that could be causing my problem. I'm simply asking:
What are the circumstances in which C code can, instead of seg faulting or halting the program, actually freeze the entire OS?
When I search for this problem, I see code-golf questions like this:
A C program which crashes the system (shuts down the system)
This is not what I'm asking - obviously I haven't written system("shutdown") anywhere in my loop.
I'm most familiar with Python and Java, so this problem is not what I'm used to, but in my experience:
Dividing by zero produces a seg fault
Accessing memory by accident that is slightly outside an intended array causes a seg fault (sometimes down the road a little)
Accessing protected memory causes the program to hang
Stack overflow causes a seg fault
Dereferencing a non-initialized pointer causes a seg fault
Is this impression false - could those cases cause the whole system to crash? What cases am I missing? Is it dependent on my version of gcc, or my permission status?
I haven't been able to try to reproduce it on a different operating system yet, as it requires a few dependencies to run the entire program.
If my only option is to sit for days waiting for the program to run with print statements, or avoid weird situations, then, of course, so be it. I'm looking for key places to look for the bug.
On modern systems with hardware-enforced privilege separation between user mode and kernel mode, and an operating system that correctly configures these mechanisms, you simply cannot crash the system from a user-mode process.
Any of those errors is trapped by the CPU, which calls an exception handler in the OS, and the OS will quickly pull the plug on your process.
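As an illustration (a minimal sketch, not taken from the original answer): the following user-mode program faults, the MMU traps the access, and the kernel terminates only this process with SIGSEGV while the rest of the system keeps running.

#include <stdio.h>

int main(void)
{
    int *p = NULL;
    printf("about to dereference a NULL pointer...\n");
    return *p;   /* the CPU traps this access; the OS kills the process, not itself */
}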
If I had to guess, a piece of hardware is overheating or malfunctioning:
Overheating CPU due to poor thermal conductivity with heatsink
Failing / under-sized power supply
Failing DIMMs
Failing hard drive
Failing CPU
Failing / overheating GPU
I've seen cryptocoin-mining software bring a system to its knees because it was pushing the limits of the GPU. When the card would lock up or reset, the driver would get confused or lock up, and the system would end up needing a reboot.
Your system is doing next to nothing when you're just sitting there browsing the web. But if your system locks up when you start running a CPU-intensive application, that can bring out problems you didn't know were there.
While this is a little out of place on Stack Overflow, it falls into one of those grey areas between hardware and software. I would stress-test your system, keeping an eye on CPU/GPU/memory temperatures and power-supply voltages. Check out MemTest86 and Stresslinux.
The most trivial cause of OS freezing is "memory full". If you have processes that use a lot of memory, the system starts swapping between main memory (typically RAM) and secondary memory (typically disk), which leads to a huge overhead. What you usually observe as a user is an almost frozen computer, sometimes so frozen that you think it has crashed. If your OS is badly designed, it may sometimes actually crash!
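A minimal sketch of that scenario (be careful where you run it - it is deliberately written to drive the machine into heavy swapping):

#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t chunk = 64 * 1024 * 1024;      /* 64 MiB per allocation */
    for (;;) {
        char *block = malloc(chunk);
        if (block == NULL)
            break;                              /* allocation finally refused */
        memset(block, 1, chunk);                /* touch every page so it becomes resident */
    }
    return 0;
}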

Create a Debugger using C

I have been asked to write a program in C which should debug another program written in C and then store the value of each variable of every line, loop or function in a log file.
I have been searching over the internet and I found articles on debugging using gdb.
Can I somehow use GDB in my program for this purpose and then store the values of each variable line by line?
I've got basic knowledge of C/C++ so please reply in simple terms.
Thanks
Debuggers depend on some special capability of the hardware, which must be exposed by the operating system (if any).
The basic idea is that the hardware is configured to transfer control to a debugger stub either after every instruction of the target program, after certain types of instructions such as system calls, or when a hardware breakpoint condition is met. Typically this looks like an interrupt, a supervisor exception, or the like - a very platform-specific detail.
As mentioned in the comments, on Linux you use the kernel's ptrace functionality to interact with the debugger support provided by the hardware and the kernel, abstracting away a lot of the hardware-specific detail and managing the permission issues. Typically you must either be the same user id as the process being debugged, or be the superuser (root). Linux's ptrace also gives you an indirect ability to do things like access the memory (literally, the address space) of the target application - something critical to debugger functionality which you cannot ordinarily do from another user-mode program on a multitasking operating system.
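To make the ptrace idea concrete, here is a minimal sketch for Linux on x86-64 (error handling trimmed, and the register name rip is architecture-specific): it forks a child that requests tracing, then single-steps the child and prints its instruction pointer after every instruction.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* ask the kernel to trace us */
        execvp(argv[1], &argv[1]);               /* child stops at exec with SIGTRAP */
        perror("execvp");
        return 1;
    }

    int status;
    waitpid(child, &status, 0);                  /* wait for the initial stop */
    while (WIFSTOPPED(status)) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        printf("rip = %llx\n", (unsigned long long)regs.rip);
        ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);  /* execute one instruction */
        waitpid(child, &status, 0);
    }
    return 0;
}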
Other operating systems will have different methods. Some embedded targets use debug pods which connect your development machine to the embedded board by a few wires. In other cases, debug capability built into the hardware is managed by a small program running on the target processor, which then talks back over a serial or network port to the full debugger program residing on the development machine.
A program such as GDB can do more than just the basics of setting debug stop conditions, dumping registers, and dumping program instructions. Much of its code deals with annotating what it displays based on debug metadata optionally left behind by compilers, walking back through stack frames, and giving the user powerful tools to configure all of this - and of course it does most of this in a target-independent way, with the target-unique code mostly confined to a few interchangeable directories.
You can indeed "drive" GDB from another program - many, many GUI-type debuggers do exactly that, existing as graphical front ends for GDB. However, if you were assigned to write a debugger, doing it that way may or may not be consistent with your assignment.
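As a flavour of what "driving" GDB can look like, here is a minimal sketch that feeds GDB/MI commands to a GDB child process over a pipe. The target name ./a.out is a placeholder, and a real front end would also capture and parse GDB's MI output rather than letting it go to the terminal.

#include <stdio.h>

int main(void)
{
    FILE *gdb = popen("gdb --interpreter=mi2 ./a.out", "w");
    if (gdb == NULL)
        return 1;

    fprintf(gdb, "-break-insert main\n");   /* MI equivalent of "break main" */
    fprintf(gdb, "-exec-run\n");            /* MI equivalent of "run" */
    fprintf(gdb, "-gdb-exit\n");
    fflush(gdb);

    return pclose(gdb);
}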

GDB prevents errors

I got this problem on different C projects when using gdb.
If I run my program without it, it crashes consistently at a given event, probably because of an invalid read of memory. I try debugging it with gdb, but when I do so, the crash never seems to occur!
Any idea why this could happen?
I'm using mingw toolchain on Windows.
Yes, it sounds like a race condition, heap corruption, or something else that is typically responsible for Heisenbugs. The problem is that your code is likely incorrect somewhere, but the debugger has to behave well even when the debugged application does funny things, so problems tend to disappear under the debugger. And race conditions often won't appear in the first place, because some debuggers can only handle one thread at a time, and all debuggers cause the code to run slower, which may already make race conditions go away.
Try Valgrind on the application. Since you are using MinGW, chances are that your application will also compile in an environment where Valgrind can run (even though it doesn't run directly on Windows). I've been using Valgrind for about three years now and it has solved a lot of mysteries quickly. The first thing I do when I get a crash report on the code I work with (which runs on AIX, Solaris, the BSDs, Linux and Windows) is make one test run of the code under Valgrind on x64 and x86 Linux respectively.
Valgrind, and in your particular case its default tool Memcheck, essentially emulates your code as it runs. Whenever you allocate memory, it marks all bytes of that memory as "tainted" until you actually initialize them explicitly. The tainted status of memory bytes is inherited when you memcpy uninitialized memory, and leads to a report from Valgrind as soon as an uninitialized byte is used to make a decision (if, for, while, ...). It also keeps track of orphaned memory blocks and reports leaks at the end of the run. But that's not all: more tools are part of the Valgrind family and test various aspects of your code, including race conditions between threads (Helgrind, DRD).
Assuming Linux now: make sure that you have all the debug symbols of your supporting libraries installed. Usually those come in the *-debug version of packages or in *-devel. Also, make sure to turn off optimization in your code and include debug symbols. For GCC that's -ggdb -g3 -O0.
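For example, on Linux the build-and-check cycle might look like this sketch (the file and program names are placeholders):

gcc -ggdb -g3 -O0 -o app app.c
valgrind --tool=memcheck --leak-check=full --track-origins=yes ./app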
Another hint: I've had pointer aliasing cause some grief. Although Valgrind was able to help me track it down, I actually had to take the last step and verify the generated code in its disassembly. It turned out that at -O3 the GCC optimizer got ahead of itself and turned a loop copying bytes into a sequence of instructions copying 8 bytes at once, but assumed alignment. That last part was the problem: the assumption about alignment was wrong. Ever since, we've resorted to building at -O2 - which, as you will see in this Gentoo Wiki article, is not the worst idea. To quote the relevant part:
-O3: This is the highest level of optimization possible, and also the riskiest. It will take a longer time to compile your code with this option, and in fact it should not be used system-wide with gcc 4.x. The behavior of gcc has changed significantly since version 3.x. In 3.x, -O3 has been shown to lead to marginally faster execution times over -O2, but this is no longer the case with gcc 4.x. Compiling all your packages with -O3 will result in larger binaries that require more memory, and will significantly increase the odds of compilation failure or unexpected program behavior (including errors). The downsides outweigh the benefits; remember the principle of diminishing returns. Using -O3 is not recommended for gcc 4.x.
Since you are using GCC in MinGW, I reckon this could well apply to your case as well.
Any idea why this could happen?
There are several usual reasons:
Your application has multiple threads, has a race condition, and running under GDB affects timing in such a way that the crash no longer happens
Your application has a bug that is affected by memory layout (often reading of uninitialized memory), and the layout changes when running under GDB.
One way to approach this is to let the application trap whatever unhandled exception it is being killed by, print a message, and spin forever. Once in that state, you should be able to attach GDB to the process, and debug from there.
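A minimal sketch of that approach (the signal choice and messages are illustrative, not taken from the answer): install a handler for the fatal signal, report the PID, and spin so that you can attach with gdb -p <pid> and inspect the live process.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void crash_handler(int sig)
{
    /* fprintf is not async-signal-safe, but is tolerable as a last-gasp message */
    fprintf(stderr, "caught signal %d in pid %d - attach a debugger now\n",
            sig, (int)getpid());
    for (;;)
        pause();        /* park here until GDB attaches and you poke around */
}

int main(void)
{
    signal(SIGSEGV, crash_handler);   /* invalid memory reads/writes */
    signal(SIGABRT, crash_handler);

    /* ... the application code that eventually crashes ... */
    return 0;
}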
Although it's a bit late, you can read this question's answer to set up the system to catch a core dump without using gdb. You can then load the core file using
gdb <path_to_executable_file> <path_to_core_file>
and then issue
thread apply all bt
in gdb.
This will show stack traces for all threads that were running when the application crashed, and one may be able to locate the last function and the corresponding thread that caused the illegal access.
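On Linux the end-to-end workflow is roughly the following sketch (the paths are placeholders, and some distributions route core files elsewhere, for example via systemd-coredump):

ulimit -c unlimited            # allow the shell's child processes to write core files
./app                          # reproduce the crash; a core file is written
gdb ./app core                 # load the executable together with its core file

after which the thread apply all bt command above shows where each thread was.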
Your application is probably receiving signals and gdb might not pass them on depending on its configuration. You can check this with the info signals or info handle command. It might also help to post a stack trace of the crashed process. The crashed process should generate a core file (if it hasn't been disabled) which can be analyzed with gdb.

Does attaching to a process make it behave differently?

While I am aware of the differences between debug and release builds, I am curious: does attaching the debugger to a process (built release or debug) change that process's behaviour?
For reference, I'm developing on HP 11.31 Itanium but still am curious for the general case.
http://en.wikipedia.org/wiki/Heisenbug#Heisenbug
Of course, attaching a debugger will change the timing (which can change e.g. thread race conditions), and also some system calls can detect if a debugger is attached.
It certainly can, depending on the platform and the method of debugging. For example, when debugging on Windows, there is actually the IsDebuggerPresent function. As noted, that function can be circumvented, but then there are other means. So basically, it's complicated.
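A minimal sketch of a process checking for a debugger on Windows (just to illustrate the mechanism; real anti-debugging code combines several such checks):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    if (IsDebuggerPresent())
        printf("a debugger is attached - taking a different path\n");
    else
        printf("no debugger attached - normal path\n");
    return 0;
}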
Yep, lots of things inside the Windows data structures change when a debugger is attached. It changes how memory is allocated and freed, and it adds additional housekeeping code and "markers" on the stack (ever noticed the F00D fill values in newly allocated memory?). In fact, many of these changes are used by anti-debugging code to detect whether an application is being debugged.
In interpreted languages (Java, .NET) the runtime will often generate different machine instructions when running under a debugger to help it trap and display exceptions, show the original code, etc. It will usually generate unoptimized code as well when a debugger is attached.
Some of these changes affect the way the software behaves and can result in complicated transient bugs that are caused by optimizations or extremely fine timing dependencies.
Yes, I've often found that attaching a debugger to a process instantly makes bugs disappear, only to have them reappear when I compile my app in release mode. Unfortunately I usually can't really ask all my users to open a debugger just to run my app, so it can be quite frustrating.
Another thing to keep in mind is that for multithreaded apps attaching the debugger definitely can yield very different results. These are the kind of things referred to as "Heisenbugs."
Sure, in multithreaded apps attaching a debugger can yield different results.
However, what about code that is not related to threads?
I have seen a release build that shows no problems while a debugger is attached, but does have problems when no debugger is attached.
If it is launched first and a debugger is attached to it afterwards, it still shows the same problems.
