How to avoid "(null)" StackTrace in DPH_BLOCK_INFORMATION?

I'm having a blast tracking down some heap corruption. I've enabled standard page heap verification with
gflags /p /enable myprogram.exe
and this succeeds in confirming the corruption:
===========================================================
VERIFIER STOP 00000008: pid 0x1040: corrupted suffix pattern
10C61000 : Heap handle
19BE0CF8 : Heap block
00000010 : Block size
00000000 :
===========================================================
When I turn on full page heap verification (gflags /p /enable myprogram.exe /full), expecting it to raise an error at the moment the corruption is introduced, I get nothing more.
I started to get my hopes up while reading Advanced Windows Debugging: Memory Corruption Part II—Heaps, which is a chapter from Advanced Windows Debugging. I installed WinDbg, and downloaded debug symbols for user32.dll, kernel32.dll, ntdll.dll according to http://support.microsoft.com/kb/311503. Now when the program halts in the debugger I can issue this command to see information about the heap page:
0:000> dt _DPH_BLOCK_INFORMATION 19BE0CF8-0x20
ntdll!_DPH_BLOCK_INFORMATION
+0x000 StartStamp : 0xabcdaaaa
+0x004 Heap : 0x90c61000
+0x008 RequestedSize : 0x10
+0x00c ActualSize : 0x38
+0x010 FreeQueue : _LIST_ENTRY [ 0x0 - 0x0 ]
+0x010 TraceIndex : 0
+0x018 StackTrace : (null)
+0x01c EndStamp : 0xdcbaaaaa
I am dismayed by the (null) stack trace. Now, http://msdn.microsoft.com/en-us/library/ms220938%28VS.80%29.aspx says:
The StackTrace field will not always contain a non-null value for various reasons. First of all stack trace detection is supported only on x86 platforms and second, even on x86 machines the stack trace detection algorithms are not completely reliable. If the block is an allocated block the stack trace is for the allocation moment. If the block was freed, the stack trace is for the free moment.
But I wonder if anyone has any thoughts on increasing the chances of seeing the stack trace from the allocation moment.
Thanks for reading!

Ah ha! Turns out I needed to enable more gflags options:
gflags /i myprogram.exe +ust
Which has this effect:
ust - Create user mode stack trace database
Seems straightforward once I read the parameter description. Silly me. But I also seem to need to set the size of the trace database before it will take effect:
gflags /i myprogram.exe /tracedb 512
...or whatever (in MB).
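Putting the pieces together, the whole setup looks something like this (the same commands as above, collected in one place):

```
gflags /p /enable myprogram.exe /full
gflags /i myprogram.exe +ust
gflags /i myprogram.exe /tracedb 512
```

After that, re-running under WinDbg should show a populated StackTrace field in dt _DPH_BLOCK_INFORMATION.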

According to Microsoft, the malloc function in the C run-time (CRT) module is built with frame pointer omission (FPO) in some Windows versions, so you may not see complete stack information for malloc. (http://support.microsoft.com/kb/268343)
If possible, link against the debug version of the CRT (e.g. build with the /MDd option) to work around this.

Related

How to set a breakpoint on malloc_error_break in lldb

When I run my program, I get this error:
a.out(56815,0x10bb2c5c0) malloc: *** error for object 0x7fe5ea402a90: pointer being freed was not allocated
a.out(56815,0x10bb2c5c0) malloc: *** set a breakpoint in malloc_error_break to debug
How can I set a breakpoint on malloc_error_break with lldb?
Thanks for the help.
(lldb) break set -n malloc_error_break
is the lldb command for setting a breakpoint by symbol name.
That breakpoint will stop you at the point where the errant free occurred, which is some information. It may be for instance that you have a code path where you don't initialize something that you later free. The backtrace when you hit malloc_error_break will show you what you are trying to free, so you can trace back why it didn't get initialized.
If the problem ends up being more complicated (a double free, for instance), that's going to be a little harder to track down from the stop point, since you can't tell from there where the first free was. In that case, it's definitely worthwhile to rebuild your program with ASAN enabled and run it again in the debugger. For double frees and the like, since ASAN records the whole malloc history of the program, it can tell you every time the errant pointer was handled by the malloc system, making the error easier to spot. It also does a bunch of pre-flight checks and can often catch memory errors early on, and at a surprisingly small performance cost as well...
There's more on ASAN here:
https://clang.llvm.org/docs/AddressSanitizer.html
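If you do go the ASAN route, the rebuild is just a compiler flag; something like the following (assuming clang and a single source file; the file names are hypothetical):

```shell
# Rebuild with AddressSanitizer plus debug info, then run under lldb
clang -g -fsanitize=address -o a.out main.c
lldb ./a.out
```

ASAN's report for an invalid free includes both the allocation stack and the free stack, so often you won't even need a breakpoint.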
If it's too difficult to rebuild with ASAN you can use the MallocStackLoggingNoCompact environment variable and then use the malloc_history program to print the errors. There's more info about that in the "malloc" manpage, e.g.:
https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc.3.html
But if you are planning to do more than a little development, it's a good idea to get familiar with ASAN. It makes tracking down allocation & deallocation errors much easier.

How to make valgrind ignore certain line?

For example:
==26460== 2 bytes in 1 blocks are still reachable in loss record 2 of 105
==26460== at 0x4C28BE3: malloc (vg_replace_malloc.c:299)
==26460== by 0x580D889: strdup (in /usr/lib64/libc-2.17.so)
==26460== by 0x4F50AF: init (init.c:468)
==26460== by 0x406D75: main (main.c:825)
I don't want it to check init.c:468 (mode = strdup); I'm sure this mallocs only once, and the allocation lasts for the whole life of the process.
Is it possible to make valgrind not check this line?
As I said in my comment: I recommend not to.
But Valgrind does have a feature to suppress warnings.
The most convenient way of suppressing a specific message is the feature dedicated to exactly that purpose:
--gen-suppressions=yes
which will output the precise suppression syntax for each generated message.
See 5.1 in the FAQ:
http://valgrind.org/docs/manual/faq.html#faq.writesupp
(I love their style:
"F:Can you write ... for me?" and I expected a totally adequate
"A:No." But they actually answer
"A: Yes ...". Beyond cool.)
You should fix the leaks; it is far better to do so.
You can't stop Valgrind checking for the leaks, but you can stop it reporting them by suppressing the leaks.
Use:
valgrind --gen-suppressions=yes --leak-check=full -- tested-program …
You can then save the suppressions in a file, say tp.suppressions, and subsequently you use:
valgrind --suppressions=tp.suppressions -- tested-program …
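For reference, a generated suppression for the strdup leak in the question would look roughly like this once saved into tp.suppressions (the name on the first line is free-form; the fun: frames mirror the reported stack):

```
{
   init_strdup_once
   Memcheck:Leak
   match-leak-kinds: reachable
   fun:malloc
   fun:strdup
   fun:init
   fun:main
}
```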
If you work on a Mac like I do, and work with bleeding edge systems, you'll often find it necessary to suppress leaks from the system startup code — memory that's allocated before main() is called and which you cannot therefore control.
OTOH, it is routine that after a new release of macOS, it takes a while to get Valgrind running again. I upgraded to macOS High Sierra 10.13; Valgrind has stopped working again because the kernel isn't recognized.

Valgrind mmap error 22

I am trying to run a program on Valgrind. But I am getting this error:
valgrind: mmap(0x67d000, 1978638336) failed in UME with error 22 (Invalid argument).
valgrind: this can be caused by executables with very large text, data or bss segments.
I am unsure what the issue is. I know that I have plenty of memory (I am running on a server with 500+ GB of ram). Is there a way of making this work?
Edit: Here are my program and machine details:
So my machine (it is a server for research purposes) has this much RAM:
$ free -mt
total used free shared buff/cache available
Mem: 515995 8750 162704 29 344540 506015
Swap: 524277 762 523515
Total: 1040273 9513 686219
And the program (named Tardis) size info:
$ size tardis
text data bss dec hex filename
509180 2920 6273605188 6274117288 175f76ea8 tardis
Unfortunately there is no easy answer to this. The Valgrind host has to load its text somewhere (and also put its heap and stack somewhere). There will always be conflicts with some guest applications.
It would be nice if we could have an argument like --host-text-address=0x68000000. That's not possible as the link editor writes it into the binary. It isn't possible to change this with ld.so. The only way to change it is to rebuild Valgrind with a different value. The danger then is that you get new conflicts.

Why can code in the stack or heap segment be executed?

In the security field, there are heap exploitation and stack smashing attack.
But I found that in the /proc/*/maps file, the heap and stack segments
only have rw-p permissions.
There is no execute permission on these two segments.
My engineer friends told me that if you have rw permission on an Intel CPU, your code gets execute permission automatically.
But I cannot understand why Intel designed it this way.
That is because all segments in Linux (and Windows too) have the same base address and the same size. Code is always accessed via the code segment, and the code segment covers exactly the same area as the stack (or any other) segment, so you can execute code wherever it is.
EDIT:
you can read more here: http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf
Chapter 3.2 USING SEGMENTS

Debug stack overruns in kernel modules

I am working on driver code that is causing stack overrun
issues and memory corruption. Presently, running the module gives an
"Exception stack", and the stack trace looks corrupted.
The module had compile warnings. The warnings were resolved
with the gcc option "-Wframe-larger-than=len".
The issue is possibly being caused by excessive inlining, lots of
function arguments, and a large number of nested functions. I need to continue
testing and re-factoring the code; is it possible to
modify the kernel to increase the stack size? Also, how would you go about debugging such issues?
Though your module compiles once the "-Wframe-larger-than=len" warnings are resolved, it can still overrun the stack and corrupt in-core data structures, leading the system to an inconsistent state.
The Linux kernel stack size was limited to 8 KiB in kernel versions before 3.18, and is 16 KiB in 3.18 and later; the stack was extended to 16 KiB in a commit prompted by numerous issues with virtio and qemu-kvm.
Now if you want to increase stack size to 32KiB, then you would need to recompile the kernel, after making the following change in the kernel source file:(arch/x86/include/asm/page_64_types.h)
// for 32K stack
- #define THREAD_SIZE_ORDER 2
+ #define THREAD_SIZE_ORDER 3
A commit in Linux kernel version 3.18 shows the kernel stack size already being increased to 16K, which should be enough in most cases:
commit 6538b8ea886e472f4431db8ca1d60478f838d14b
Author: Minchan Kim <minchan#kernel.org>
Date: Wed May 28 15:53:59 2014 +0900
x86_64: expand kernel stack to 16K
Refer LWN: [RFC 2/2] x86_64: expand kernel stack to 16K
As for debugging such issues, there is no single-line answer, but here are some tips I can share. Use dump_stack() within your module to get a stack trace in the syslog, which really helps in debugging stack-related issues.
Use debugfs, turn on the stack depth checking functions with:
# mount -t debugfs nodev /sys/kernel/debug
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
and regularly capture the output of the following files:
# cat /sys/kernel/debug/tracing/stack_max_size
# cat /sys/kernel/debug/tracing/stack_trace
The above files will report the highest stack usage when the module is loaded and tested.
Leave the below command running:
while true; do date; cat /sys/kernel/debug/tracing/stack_max_size;
cat /sys/kernel/debug/tracing/stack_trace; echo ======; sleep 30; done
If you see the stack_max_size value exceeding maybe ~14000 bytes (for the 16 KiB stack version of the kernel), then the stack trace would be worth capturing and looking into further. Also, you may want to set up the crash tool to capture vmcore files in case of panics.
