I am trying to run a program on Valgrind. But I am getting this error:
valgrind: mmap(0x67d000, 1978638336) failed in UME with error 22 (Invalid argument).
valgrind: this can be caused by executables with very large text, data or bss segments.
I am unsure what the issue is. I know that I have plenty of memory (I am running on a server with 500+ GB of RAM). Is there a way to make this work?
Edit: Here are my program and machine details:
So my machine (it is a server for research purposes) has this much RAM:
$ free -mt
total used free shared buff/cache available
Mem: 515995 8750 162704 29 344540 506015
Swap: 524277 762 523515
Total: 1040273 9513 686219
And the program (named Tardis) size info:
$ size tardis
text data bss dec hex filename
509180 2920 6273605188 6274117288 175f76ea8 tardis
Unfortunately there is no easy answer to this. The Valgrind host has to load its own text somewhere (and also put its heap and stack somewhere), so there will always be conflicts with some guest applications.
It would be nice if we could have an argument like --host-text-address=0x68000000. That isn't possible, because the link editor writes the address into the Valgrind binary and it cannot be changed via ld.so. The only way to change it is to rebuild Valgrind with a different value; the danger then is that you get new conflicts.
Related
I have a core file generated from a segfault. When I try to load it into gdb, it doesn't appear to matter how I load it or if I use the correct executable or not - I always get this warning from gdb about the core file being truncated:
$ gdb -q /u1/dbg/bin/exdoc_usermaint_pdf_compact /tmp/barry/core.exdoc_usermaint.11
Reading symbols from /u1/dbg/bin/exdoc_usermaint_pdf_compact...done.
BFD: Warning: /tmp/barry/core.exdoc_usermaint.11 is truncated: expected core file size >= 43548672, found: 31399936.
warning: core file may not match specified executable file.
Cannot access memory at address 0x7f0ebc833668
(gdb) q
I am concerned with this error:
"BFD: Warning: /tmp/barry/core.exdoc_usermaint.11 is truncated: expected core file size >= 43548672, found: 31399936."
Why does gdb think the core file is truncated? Is gdb right? Where does gdb obtain an expected size for the core file, and can I double-check it?
Background:
I am attempting to improve our diagnosis of segfaults on our production systems. My plan is to take core files from stripped executables in production and use them with debug versions of the executables on our development system, to quickly diagnose segfault bugs. In an earlier version of this question I gave many details related to the similar-but-different systems, but I have since been granted an account on our production system and determined that most of the details were unimportant to the problem.
gdb version:
$ gdb
GNU gdb (GDB) Fedora (7.0.1-50.fc12)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Linux version:
$ uname -a
Linux somehost 2.6.32.23-170.fc12.x86_64 #1 SMP Mon Sep 27 17:23:59 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
I read this question (and many others) before posting. The asker has somewhat similar goals to mine, but is not getting any error from gdb about a truncated core file. Therefore the information related to that question does not help me with my problem.
The Core Dump File Format
On a modern Linux system, core dump files are formatted using the ELF object file format, with a specific configuration.
ELF is a structured binary file format, with file offsets used as references between data chunks in the file.
(See the documentation on core dump files and on the ELF object file format for full details.)
For core dump files, the e_type field in the ELF file header will have the value ET_CORE.
Unlike most ELF files, core dump files make all their data available via program headers, and no section headers are present.
You may therefore choose to ignore section headers in calculating the size of the file, if you only need to deal with core files.
Calculating Core Dump File Size
To calculate the ELF file size:
Consider all the chunks in the file, each described by its file offset and size:
the ELF file header (0 + e_ehsize, where e_ehsize is 52 for ELF32 and 64 for ELF64)
the program header table (e_phoff + e_phentsize * e_phnum)
the program data chunks, aka "segments" (p_offset + p_filesz)
the section header table (e_shoff + e_shentsize * e_shnum) - not required for core files
the section data chunks (sh_offset + sh_size) - not required for core files
Eliminate the data chunks of sections whose sh_type is SHT_NOBITS: their headers merely record the position of data that has been stripped and is no longer present in the file (again, not relevant for core files).
Eliminate any chunks of size 0, as they contain no addressable bytes and therefore their file offset is irrelevant.
The end of the file will be the end of the last chunk, which is the maximum of the offset + size for all remaining chunks listed above.
If you find the offsets to the program header or section header tables are past the end of the file, then you will not be able to calculate an expected file size, but you will know the file has been truncated.
Although an ELF file could potentially contain unaddressed regions and be longer than the calculated size, in my limited experience the files have been exactly the size calculated by the above method.
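As an illustration of the calculation above, here is a minimal sketch in C for 64-bit (ELF64) core files only; the file name check_core_size.c is my own, it ignores section headers entirely (which is fine for core files, as noted above), skips the PN_XNUM corner case, and does almost no error handling:

/* check_core_size.c - rough expected-size check for an ELF64 core file */
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64
#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <corefile>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1) { perror("fread"); return 1; }

    /* start with the end of the ELF header itself (0 + e_ehsize) */
    unsigned long long expected = eh.e_ehsize;

    /* end of the program header table (e_phoff + e_phentsize * e_phnum) */
    unsigned long long phtab_end =
        eh.e_phoff + (unsigned long long)eh.e_phentsize * eh.e_phnum;
    if (phtab_end > expected)
        expected = phtab_end;

    /* end of each segment's data chunk (p_offset + p_filesz) */
    for (int i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        fseeko(f, eh.e_phoff + (unsigned long long)i * eh.e_phentsize, SEEK_SET);
        if (fread(&ph, sizeof ph, 1, f) != 1) { perror("fread"); return 1; }
        if (ph.p_filesz == 0)
            continue;                     /* zero-size chunks are irrelevant */
        if (ph.p_offset + ph.p_filesz > expected)
            expected = ph.p_offset + ph.p_filesz;
    }

    /* compare against the actual size on disk */
    fseeko(f, 0, SEEK_END);
    unsigned long long actual = (unsigned long long)ftello(f);
    printf("expected core file size >= %llu, found: %llu%s\n",
           expected, actual, actual < expected ? " (truncated)" : "");
    fclose(f);
    return 0;
}

Compiled with gcc -o check_core_size check_core_size.c, it reports numbers in the same form as the BFD warning shown in the question.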
Truncated Core Files
gdb likely performs a calculation similar to the above to calculate the expected core file size.
In short, if gdb says your core file is truncated, it is very likely truncated.
One of the most likely causes for truncated core dump files is the system ulimit. This can be set on a system-wide basis in /etc/security/limits.conf, or on a per-user basis using the ulimit shell command [footnote: I don't know anything about systems other than my own].
Try the command "ulimit -c" to check your effective core file size limit:
$ ulimit -c
unlimited
Also, it's worth noting that gdb doesn't actually refuse to operate because of the truncated core file. It still attempts to produce a stack backtrace, and in your case it only fails when it tries to access data on the stack and finds that the specific memory locations addressed are beyond the end of the truncated core file.
A related Stack Overflow answer, "How to get core file greater than 2GB", addresses a similar issue: according to its author, the truncation/overwrite problem was resolved by changing the defaults in /etc/systemd/coredump.conf.
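On systems that use systemd-coredump, the relevant settings live in the [Coredump] section of /etc/systemd/coredump.conf; something like the following (the 16G values here are only an illustration, not a recommendation) raises the limits that otherwise cap how much of the process image gets saved:

[Coredump]
ProcessSizeMax=16G
ExternalSizeMax=16G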
I am working on driver code that is causing stack overrun issues and memory corruption. At present, running the module produces an "Exception stack" and the stack trace looks corrupted.
The module compiled with warnings, which were dealt with using the gcc option "-Wframe-larger-than=len".
The issue is possibly caused by excessive inlining, functions with many arguments, and a large number of nested functions. I need to keep testing and keep refactoring the code, but is it possible to modify the kernel to increase the stack size? Also, how would you go about debugging such issues?
Though your module compiles, with warnings, under "-Wframe-larger-than=len", it will still cause the stack overrun and could corrupt in-core data structures, leaving the system in an inconsistent state.
The Linux kernel stack size was limited to 8 KiB in kernel versions before 3.18, and is 16 KiB in versions from 3.18 onwards. The stack was extended to 16 KiB by a recent commit prompted by numerous problems with virtio and qemu-kvm.
If you want to increase the stack size further, to 32 KiB, you will need to recompile the kernel after making the following change in the kernel source file arch/x86/include/asm/page_64_types.h (THREAD_SIZE is defined as PAGE_SIZE << THREAD_SIZE_ORDER, so with 4 KiB pages an order of 2 gives 16 KiB and an order of 3 gives 32 KiB):
// for 32K stack
- #define THREAD_SIZE_ORDER 2
+ #define THREAD_SIZE_ORDER 3
A commit in Linux kernel version 3.18 shows the kernel stack size already increased to 16K, which should be enough in most cases:
commit 6538b8ea886e472f4431db8ca1d60478f838d14b
Author: Minchan Kim <minchan@kernel.org>
Date: Wed May 28 15:53:59 2014 +0900
    x86_64: expand kernel stack to 16K
See the LWN posting: [RFC 2/2] x86_64: expand kernel stack to 16K
As for debugging such issues, there is no one-line answer, but here are some tips I can share. Use dump_stack() within your module to get a stack trace in the syslog, which really helps when debugging stack-related issues.
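For example, a bare-bones module like the following (stackdemo is just a made-up name for illustration) prints the current call chain to the syslog when loaded; in a real driver you would call dump_stack() (or WARN_ON(condition), which also dumps the stack) at the points you suspect:

/* stackdemo.c - minimal example of calling dump_stack() from module code */
#include <linux/module.h>
#include <linux/kernel.h>

static int __init stackdemo_init(void)
{
    pr_info("stackdemo: dumping the current stack trace to the syslog\n");
    dump_stack();          /* prints the current call chain via printk */
    return 0;
}

static void __exit stackdemo_exit(void)
{
}

module_init(stackdemo_init);
module_exit(stackdemo_exit);
MODULE_LICENSE("GPL");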
Use debugfs and turn on the stack-depth checking with:
# mount -t debugfs nodev /sys/kernel/debug
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
and regularly capture the output of the following files:
# cat /sys/kernel/debug/tracing/stack_max_size
# cat /sys/kernel/debug/tracing/stack_trace
The above files will report the highest stack usage when the module is loaded and tested.
Leave the below command running:
while true; do date; cat /sys/kernel/debug/tracing/stack_max_size;
cat /sys/kernel/debug/tracing/stack_trace; echo ======; sleep 30; done
If you see the stack_max_size value exceeding roughly 14000 bytes (for a kernel with a 16 KiB stack), then the corresponding stack_trace is worth capturing and looking into further. You may also want to set up the crash tool to capture a vmcore in case of panics.
I need to find out the dynamic (assembly) instruction counts for my C program. The output I expect is similar to the following:
mov 200
pop 130
jne 48
I tried valgrind --tool=callgrind --cache-sim=yes --dump-instr=yes <my program name> and viewed the result in KCachegrind. I could find the instruction types, but the count information was nowhere to be seen. I would also like to filter the output to discard the instructions that come from system libraries etc.
I need to find out the address and the size of memory allocated using malloc in some specific functions and parts of my program. I did some heap profiling, but it only gives the whole heap size. Any suggestions?
I want to know which memory locations are accessed by a function of my program. In other words, I need to find out the memory access pattern of my program. Will counting loads help? If yes, how can I count loads?
Take a look at objdump:
http://sourceware.org/binutils/docs/binutils/objdump.html
I'd get started with objdump -S myprog
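For example (myprog is a placeholder), a rough tally of mnemonics in the disassembly can be produced with a pipeline like this; note that these are static occurrence counts taken from the binary, not dynamic execution counts, for which you would still need an instrumentation tool:

$ objdump -d myprog | awk -F'\t' 'NF >= 3 { split($3, a, " "); print a[1] }' | sort | uniq -c | sort -rn | head

Each output line is a count followed by a mnemonic, e.g. "1523 mov".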
I have quite a big driver module that I am trying to compile for a recent Linux kernel (3.4.4). I can successfully compile and insmod the same module with a 2.6.27.25 kernel.
The GCC versions are also different: 4.7.0 vs 4.3.0. Note that this module is quite complicated and I cannot simply go through all the code and all the makefiles.
When "inserting" the module I get a Cannot allocate memory with the following traces:
vmap allocation for size 30248960 failed: use vmalloc=<size> to increase size.
vmalloc: allocation failure: 30243566 bytes
insmod: page allocation failure: order:0, mode:0xd2
Pid: 5840, comm: insmod Tainted: G O 3.4.4-5.fc17.i686 #1
Call Trace:
[<c092702a>] ? printk+0x2d/0x2f
[<c04eff8d>] warn_alloc_failed+0xad/0xf0
[<c05178d9>] __vmalloc_node_range+0x169/0x1d0
[<c0517994>] __vmalloc_node+0x54/0x60
[<c0490825>] ? sys_init_module+0x65/0x1d80
[<c0517a60>] vmalloc+0x30/0x40
[<c0490825>] ? sys_init_module+0x65/0x1d80
[<c0490825>] sys_init_module+0x65/0x1d80
[<c050cda6>] ? handle_mm_fault+0xf6/0x1d0
[<c0932b30>] ? spurious_fault+0xae/0xae
[<c0932ce7>] ? do_page_fault+0x1b7/0x450
[<c093665f>] sysenter_do_call+0x12/0x28
-- clip --
The obvious answer seems to be that the module is allocating too much memory, however:
I have no problem with the old kernel version, whatever size this module is
if I prune some parts of this module to get a much lower memory consumption, I still always get the same error message with the new kernel
I can unload a lot of other modules, but it has no impact (and is that even relevant? Is there a global limit in Linux on the total memory used by modules?)
I am therefore suspecting a problem with the new kernel not directly related to limited memory.
The new kernel is complaining about a vmalloc() of about 30,000 KB, but with the old kernel, lsmod reports a module size of 4,800 KB. Should these figures be directly related? Is it possible that something went wrong during the build and far too much RAM is being requested? When I compare the section sizes of both .ko files, I do not see big differences.
So I am trying to understand where the problem comes from. When I check the dumped stack, I am unable to find the matching piece of code: it seems that the faulty vmalloc() is done by sys_init_module(), which is init_module() from kernel/module.c, but the code does not match, and when I check the object code of my .ko, its init_module() does not match either.
I am more or less blocked, as I do not know the kernel well enough and the build system and module loading are quite tough to understand. The error occurs before the module is actually loaded; I suspect that some functions are missing and that insmod does not report these errors at this point.
I believe the allocation is done in layout_and_allocate, which is called by load_module. Both are static functions, so they may be inlined and therefore not appear in the stack trace.
So it's not an allocation done by your code, but an allocation done by Linux in order to load your code.
If the module took 4.8 MB under the old kernel and needs 30 MB under the new one, that would explain why it fails.
So the question is why it is so large.
The size may be due to the amount of code (not likely that it has grown so much) or statically allocated data.
A likely explanation is that you have a large statically allocated array whose size is determined by a constant defined in the kernel headers. If that constant has grown significantly between versions, your array grows with it.
A guess: an array whose size depends on NR_CPUS.
You should be able to use commands such as nm or objdump to find such an array. I'm not sure how exactly to do it however.
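For example, sorting the module's symbols by size should make such an array stand out; a huge entry of type 'b' or 'B' (bss) would be the prime suspect (mymodule.ko is a placeholder for your module):

$ nm --size-sort --print-size mymodule.ko | tail

nm prints the symbols in increasing order of size, so the largest ones appear last.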
The problem was actually due to the debug sections in the module. The old kernel was able to ignore these sections, but the new one was counting them in the total size to allocate. However, when enabling the pr_debug() traces from module.c at loading time, these sections were not dumped with the others.
How to get rid of them and solve the problem:
objcopy -R .debug_aranges \
-R .debug_info \
-R .debug_abbrev \
-R .debug_line \
-R .debug_frame \
-R .debug_str \
-R .debug_loc \
-R .debug_ranges \
original.ko new.ko
It is also possible that the specific build files for this project were adding debug information "tailored" for the old kernel version, but when trying with a dummy module, I find exactly the same kind of debug sections appended, so I would rather suspect some policy change regarding module management in the kernel or in Fedora.
Any information regarding these changes is welcome.
My overall goal is to figure out, from a post-mortem core file, why a specific process is consuming a lot of memory. Is there a summary I can get somehow? Obviously Valgrind is out of the question, because I can't get access to the process live.
First of all, getting output similar to /proc/"pid"/maps would help, but
maintenance info sections
(as described here: GDB: Listing all mapped memory regions for a crashed process) in gdb didn't show me the heap memory consumption.
info proc map
is an option, as I can get access to a machine with the exact same code, but as far as I have seen its output is not correct: my process was using 700 MB, yet the mappings shown only accounted for some 10 MB. And I didn't see the .so files there which are visible in
maintenance print statistics
Do you know any other command which might be useful?
I can always instrument the code, but that's not easy. Besides, chasing all the allocated data through pointers is like looking for a needle in a haystack.
Do you have any ideas?
Post-mortem debugging of this sort in gdb is more of an art than a science.
The most important tool for it, in my opinion, is the ability to write scripts that run inside gdb. The manual will explain this to you. The reason I find this so useful is that it lets you do things like walking data structures and printing out information about them.
Another possibility for you here is to instrument your version of malloc -- write a new malloc function that saves statistics about what is being allocated so that you can then look at those post mortem. You can, of course, call the original malloc to do the actual memory allocation work.
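A minimal sketch of such an instrumented malloc, built as an LD_PRELOAD library, might look like this; the file name mallocstats.c and the counters are my own invention, and a real version would also wrap free/calloc/realloc and worry about thread safety:

/* mallocstats.c - sketch of a malloc wrapper that keeps simple statistics.
 * Build: gcc -shared -fPIC -o mallocstats.so mallocstats.c -ldl
 * Run:   LD_PRELOAD=./mallocstats.so ./yourprogram
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *(*real_malloc)(size_t);
static unsigned long long call_count;
static unsigned long long byte_count;

void *malloc(size_t size)
{
    if (!real_malloc)                      /* look up libc's malloc once */
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    call_count++;
    byte_count += size;

    /* To get a core dump at a chosen point, you could abort() here
     * once call_count reaches some threshold. */

    return real_malloc(size);              /* let libc do the real work */
}

/* report the totals when the program exits */
__attribute__((destructor)) static void report(void)
{
    fprintf(stderr, "malloc calls: %llu, bytes requested: %llu\n",
            call_count, byte_count);
}

The statistics are kept in plain static variables so that the wrapper itself never needs to allocate before real_malloc has been resolved.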
I'm sorry that I can't give you an obvious and simple answer that will simply yield an immediate fix for you here -- without tools like valgrind this is a very hard job.
If it's Linux, you don't have to worry about adding statistics to your malloc. Use the utility called 'memusage'.
For a sample program (sample_mem.c) like the one below,
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int i = 1000;
    char *buff = NULL;

    srand(time(NULL));
    while (i--)                      /* 1000 allocations of random size */
    {
        buff = malloc(rand() % 64);  /* request between 0 and 63 bytes */
        free(buff);
    }
    return 0;
}
The output of memusage will be:
$ memusage sample_mem
Memory usage summary: heap total: 31434, heap peak: 63, stack peak: 80
total calls total memory failed calls
malloc| 1000 31434 0
realloc| 0 0 0 (nomove:0, dec:0, free:0)
calloc| 0 0 0
free| 1000 31434
Histogram for block sizes:
0-15 253 25% ==================================================
16-31 253 25% ==================================================
32-47 247 24% ================================================
48-63 247 24% ================================================
But if you are writing a malloc wrapper, you can make your program dump core after a given number of malloc calls, so that you can get a clue.
You might be able to use a simple tool like log-malloc.c, which compiles into a shared library that is LD_PRELOADed before your application and logs all the malloc-type functions to a file. At least it might help narrow down the search in your dump.