How does perf determine the load address of each loaded image (e.g., shared libraries) during post-processing? For example, perf report uses this information to make each symbol address relative to the beginning of its image, as shown in the screenshot below (unwind: _int_malloc...).
Is it stored somewhere in the ELF binary or in the profiling output (i.e., perf.data)?
Shared library load addresses are stored inside the perf.data file written during the perf record command. You can use the perf script -D command to dump the contents of perf.data in a partially decoded format. When your program is loaded by ld-linux*.so.2 (or, when requested, with dlopen), the loader searches for each library and loads its segments with the mmap syscall. These mmap events are recorded by the kernel and appear with type PERF_RECORD_MMAP or PERF_RECORD_MMAP2 in the perf.data file. perf report (and perf script) then reconstructs the memory offsets to decode symbol names.
$ perf record echo 1
$ perf script -D|grep MMAP -c
7
$ perf script -D|less
PERF_RECORD_MMAP2 ... r-xp /bin/echo
...
PERF_RECORD_MMAP2 ... r-xp /lib/x86_64-linux-gnu/libc-2.27.so
The basic ideas of perf are described in https://github.com/torvalds/linux/blob/master/tools/perf/design.txt. Profiling starts with the perf_event_open syscall, which takes a perf_event_attr *attr argument. The man page describes the mmap-related fields of attr:
The perf_event_attr structure provides detailed configuration
information for the event being created.
mmap : 1, /* include mmap data */
mmap_data : 1, /* non-exec mmap data */
mmap2 : 1, /* include mmap with inode data */
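As a rough illustration (not perf's actual code), a profiler can ask the kernel for these records by setting those bits in attr before calling perf_event_open; the event type and sampling settings below are purely illustrative:

/* minimal sketch: a sampling event with mmap/mmap2 records enabled */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

static int open_sampling_event(pid_t pid)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.freq = 1;
    attr.sample_freq = 4000;                 /* illustrative frequency */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
    attr.mmap = 1;                           /* emit PERF_RECORD_MMAP */
    attr.mmap2 = 1;                          /* emit PERF_RECORD_MMAP2 (with inode data) */
    attr.exclude_kernel = 1;

    /* there is no glibc wrapper for perf_event_open */
    return syscall(SYS_perf_event_open, &attr, pid, -1 /* cpu */, -1 /* group_fd */, 0);
}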
The Linux kernel, in its perf_events subsystem (kernel/events), records the required events for profiled processes and exports the data to the profiler through a file descriptor and mmap. perf record usually dumps this data from the kernel into the perf.data file without heavy processing (check the "Woken up 1 times to write data" lines of your perf record output). Mmap events are recorded in the kernel by perf_event_mmap_output, called from perf_event_mmap_event, which in turn is called from perf_event_mmap. The mmap syscall implementation in mm/mmap.c has unconditional calls to perf_event_mmap.
perf's design.txt mentions munmap, but the current implementation has no munmap field or event; event code 2 was reused for PERF_RECORD_LOST. There have been ideas that munmap events could be helpful: https://www.spinics.net/lists/netdev/msg524414.html, with links to https://lkml.org/lkml/2016/12/10/1 and https://lkml.org/lkml/2017/1/27/452
The perf tool is part of the Linux kernel sources and can be browsed online on the LXR/Elixir website: https://elixir.bootlin.com/linux/v5.4/source/tools/perf/
The processing code for mmap/mmap2 events is in perf/util/machine.c: machine__process_mmap_event and machine__process_mmap2_event. The logged mmap arguments, returned address, offset, and file name are recorded with the help of map__new and thread__insert_map for the process (pid/tid), and are used later to convert sample addresses into symbol names.
PS: Your perf.data is 300+ MB, which is huge, and processing it can be slow. For long-running programs you may want to lower the sampling frequency with the -F freq option of perf record (e.g. perf record -F40) or set a fixed period with the -c option.
You can't find it just by looking at the ELF executable and libraries. It can vary from run to run; even if perf disabled ASLR the way GDB does, a program might use mmap(MAP_FIXED) before calling dlopen on some optional library that is only loaded later, so dlopen would have to pick a different address than usual to map that library. (Normal dynamic linking happens before main runs, not via dlopen, but presumably perf wants to be able to record address -> file mappings for any file-backed mapping.)
Probably perf saves the runtime base address of each ELF object in perf.data.
I implemented a magic ring buffer (MRB) on Linux using memfd_create, ftruncate, mmap, and munmap. The fd returned by memfd_create gets close()'d after the buffer is fully constructed. The MRB itself runs and works perfectly fine.
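For reference, a hedged sketch of this kind of construction (not the asker's actual code; it assumes a glibc with a memfd_create wrapper, a page-aligned size, and omits most error handling):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

void *mrb_create(size_t size)
{
    int fd = memfd_create("magicringbuffer", 0);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, size) < 0) { close(fd); return NULL; }

    /* reserve 2*size of contiguous address space, then map the same memfd
       pages twice, back to back, so writes past the end wrap around */
    char *base = mmap(NULL, 2 * size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { close(fd); return NULL; }
    mmap(base,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);

    close(fd);   /* the fd is not needed once the mappings exist */
    return base;
}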
The problem:
A core file is created from a process running this MRB with gcore.
It is then opened with gdb <executable> -c <core-file>.
gdb then prints a warning:
warning: Can't open file /memfd:magicringbuffer (deleted) during file-backed mapping note processing
Additional notes:
"magicringbuffer" is the string passed as the name parameter in memfd_create(const char *name, unsigned int flags);
built and run on CentOS version 7
Questions:
What does this warning mean exactly? What causes it? Is it because the "file" is virtual, or because it was close()'d?
What are the implications? Could it lead to missing debug symbols? The <executable> is indeed a binary built with debug symbols.
I tried to look for an answer on the internet, but I found nothing satisfactory.
GDB is trying to reconstruct the virtual address space of the former process, at the time of the core dump, as accurately as possible. This includes re-creating all mmap regions. The message means simply that GDB tried, and failed, to re-create the mmap region that was backed by the memfd. IIRC, the annotation in the core file that tells GDB that an mmap region existed -- the "file-backed mapping note" -- was designed before memfd_create was a thing, and so GDB doesn't know it should be calling memfd_create() instead of regular old open() for this one. And even if it did, it wouldn't be able to recover access to the original memfd area (which might be completely gone by the time you get around to debugging from the core dump).
The practical upshot of this is that, while debugging from the core dump, you won't be able to look at the contents of memory within your magic ring buffer. Debug symbols, however, should be unaffected.
This is arguably a bug in either the kernel or gcore (not sure which); the contents of memfd-backed memory regions should arguably be dumped into the core file like MAP_ANONYMOUS regions, rather than generating file-backed mapping notes.
I've now been able to get perf to capture a user-space stack, but I'm not sure how to convince it to capture values passed by reference as pointers, or to snapshot globals of interest.
Specifically, I'm trying to analyse the system-wide performance of PostgreSQL under various loads with and without a performance related patch. One of the key things I need to be able to do is tell which queries are associated with which block I/O requests in the kernel.
perf records the pid and the userspace stack, which sometimes contains the current_query, but since it's a string it's passed by reference so all I get is an opaque pointer. Not very useful. It doesn't appear in all traces, either, so ideally I'd fish the value out of the global PostgreSQL stores it in and get perf to record that with each trace sample. Matching the pid to the query after the fact might be viable, but a given PostgreSQL backend (pid) doesn't run just one query over its lifetime so lots of correlating timestamps between perf traces and PostgreSQL logs would be required.
This seems like the kind of thing you'd expect it to be able to do: a stack alone often doesn't tell you all that much about what's going on, and since it can already read the symbol table it should be able to look up globals and know which function arguments are pointers that need to be dereferenced, with the first 'n' bytes copied.
I cannot for the life of me figure out how to do it or if this is possible, though. Am I just out of luck? Will I need to hack perf inject to merge this information from a separate timestamped log recorded by PostgreSQL?
It turns out that perf probe already has the required features, but only for kernel space at the moment.
perf probes can take arguments, which may be virtuals like $retval, registers like %ax, or C identifiers and simple expressions for local or global variables.
So, if perf did support user-space symbolic probes for arguments, you'd create a probe to capture the query_string argument that exec_simple_query is invoked with using something like:
perf probe -x /path/to/postgres exec_simple_query debug_query_string:string
The :string tells perf it's a C string, so it should deref the pointer and copy the data.
There are multiple places queries can come in - the simple protocol, the v3 parse/bind/execute protocol, the SPI, etc. This is only one of them. You could capture the query from the parser in raw_parse instead, or grab the debug_query_string global value from probes on events of interest.
Unfortunately, none of this will work yet, because perf won't do symbolic lookups on user-space binaries:
$ sudo perf probe -x /path/to/postgres exec_simple_query debug_query_string:string
Debuginfo-analysis is not yet supported with -x/--exec option.
Error: Failed to add events. (-38)
$ perf --version; uname -r
perf version 3.11.6
3.11.6-201.fc19.x86_64
So, once support for symbolic lookups is added to perf, you'll be able to do exciting things like capture the query text in the executor by looking up a struct member:
perf probe -x `which postgres` standard_ExecutorStart 'queryDesc->sourceText:string'
... but again, perf doesn't know how to do the required symbolic look-ups yet, and it can't capture C strings from registers and $retval. So: wait for a new perf, unless you're keen on enhancing the tooling yourself. Oh well.
Perf on Fedora 22 supports user-space probes:
# perf --version; uname -r
perf version 4.0.6-300.fc22.x86_64
4.0.4-301.fc22.x86_64
I've got an application that makes use of System V shared memory segments. Normally it manages these internally and no one needs to touch them. But for emergencies we've got a utility that manually clears the shared memory segments.
The problem is that to do it, it runs ipcs, and grabs chunks of the output using cut. That seems pretty fragile. It already runs slightly different commands on different platforms to reflect the fact that the ipcs output is formatted differently on Linux / AIX / Solaris etc.
Is there a better way to find shared memory segments than parsing ipcs output?
You could write your own version of ipcs that gives the same output regardless of the OS. This requires some system-level programming, though.
On Linux ipcs uses shmctl(0, SHM_INFO, ...) to find out the index of the highest used shared memory segment, then runs shmctl(index, SHM_STAT, ...) in a loop over all indexes from 0 to the highest index in order to obtain information about each segment. This should also work on FreeBSD (not documented but apparent from the kernel source), although on that OS ipcs uses sysctl to read the values of kern.ipc.shm*.
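A minimal sketch of that Linux approach (simplified; constants and fields as documented in shmctl(2)):

/* enumerate SysV shared memory segments via SHM_INFO / SHM_STAT */
#define _GNU_SOURCE              /* for SHM_INFO, SHM_STAT, struct shm_info */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    struct shm_info info;
    int max_index = shmctl(0, SHM_INFO, (struct shmid_ds *)&info);
    if (max_index < 0) { perror("shmctl(SHM_INFO)"); return 1; }

    for (int i = 0; i <= max_index; i++) {
        struct shmid_ds ds;
        int shmid = shmctl(i, SHM_STAT, &ds);   /* i is a kernel index, not an id */
        if (shmid < 0)
            continue;                           /* unused slot */
        printf("shmid=%d size=%zu nattch=%lu\n",
               shmid, ds.shm_segsz, (unsigned long)ds.shm_nattch);
    }
    return 0;
}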
On Solaris ipcs uses shmids(NULL, 0, &nids) to obtain the number of segments IDs, then calls shmids(&ids, nids, ...) to obtain the list of actual IDs and then uses shmctl(id, IPC_STAT, ...) to obtain information on each segment.
ipcs is a fairly old instrument and one would not expect its output to change much in the future, at least not until POSIX shared memory completely displaces SysV IPC.
The memory usage of a process can be displayed by running:
$ ps -C processname -o size
SIZE
3808
Is there any way to retrieve this information without executing ps (or any external program), or reading /proc?
On a Linux system, a process's memory usage can be queried by reading /proc/[pid]/statm, where [pid] is the PID of the process. If a process wants to query its own data, it can read /proc/self/statm instead. man 5 proc says:
/proc/[pid]/statm
Provides information about memory usage, measured in pages. The
columns are:
size total program size
(same as VmSize in /proc/[pid]/status)
resident resident set size
(same as VmRSS in /proc/[pid]/status)
share shared pages (from shared mappings)
text text (code)
lib library (unused in Linux 2.6)
data data + stack
dt dirty pages (unused in Linux 2.6)
You could just open the file with fopen("/proc/self/statm", "r") and read the contents.
Since the file reports results in pages, you will also want to find the page size. getpagesize() returns the size of a page in bytes.
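Putting those two together, a small sketch that reads only the first two columns:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long size_pages, resident_pages;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) { perror("fopen"); return 1; }
    if (fscanf(f, "%ld %ld", &size_pages, &resident_pages) != 2) {
        fclose(f);
        return 1;
    }
    fclose(f);

    long page_size = getpagesize();   /* bytes per page */
    printf("VmSize: %ld bytes, VmRSS: %ld bytes\n",
           size_pages * page_size, resident_pages * page_size);
    return 0;
}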
You have a few options to find the memory usage of a program:
Run it within a profiler like Valgrind or memprof.
exec/proc_open/fork a new process to use ps, top, or pmap as you would from the command line
bundle the ps source into your app and use it directly (it's open source, of course!)
Use the /proc system (which is all that ps does, anyways...)
Get a report from the kernel, which watches over process memory operations. The /proc filesystem is just a view into the kernel's internal data structures, so this is really already done for you.
Develop your own mechanism to compute memory usage without kernel assistance.
The earlier options are all educational from a system administration perspective and would be the best options in a real-life situation, but the last bullet point is probably the most interesting. You'd probably want to read the source of Valgrind or memprof to see how it works, but essentially what you'd need to do is insert your mechanism between the app and the kernel and intercept any requests for memory allocation. Additionally, when the process started, you would want to initialize its memory space with a preset value like 0xDEADBEEF. Then, after the process finished, you could read through the memory space and count the occurrences of words other than your preset value, giving you an estimate of memory usage.
Of course, things are always more complicated than they seem. What about memory used by shared libraries? Pipes? Shared memory between your processes and another? System calls? Virtual memory allocated but not used? Data buffered to the disk? There's a lot of calls to be made beyond your question 'memory of process', see this post for some additional concerns.
I am new to Linux. I want to know the starting address and size of the different segments of a process (stack, heap, data, etc.) and their current usage.
I'd like to know how to find this both for a running process and from a core dump.
Thanks in advance.
Start by looking into the proc(5) filesystem. man is your friend.
/proc/[number]/maps A file containing the currently mapped memory regions and their access permissions
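For example, a process can dump its own map list directly (a trivial sketch; for another process substitute its PID for "self"):

#include <stdio.h>

int main(void)
{
    char line[512];
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) { perror("fopen"); return 1; }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);   /* each line: address range, perms, offset, dev, inode, path */
    fclose(f);
    return 0;
}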
in gdb, you can use
$ gdb -q
(gdb) help info proc
Show /proc process information about any running process.
Specify any process id, or use the program being debugged by default.
Specify any of the following keywords for detailed info:
mappings -- list of mapped memory regions.
stat -- list a bunch of random process info.
status -- list a different bunch of random process info.
all -- list all available /proc info.
Have a look at info proc mappings in particular; note that it doesn't work when there is no /proc (such as during post-mortem debugging).
objdump on Linux gives information about a binary; check man objdump. It shows the sections, disassembly, and debugging symbols.
objdump -h <binary>
objdump --section=name
A better way, if possible (i.e. if you can build the executable yourself from source), is to generate a map file while compiling and linking, by passing the appropriate compiler/linker option (for GNU ld, -Map=<file>, e.g. gcc -Wl,-Map=output.map). The map file will have all the information about the sizes and starting addresses of the different sections.
There is the pmap command. It displays the information available in /proc/PID/maps in different ways. Plus, it adds header and summary lines. This may be more readable to you than the /proc/PID/maps pseudo file.
Sadly, it does not have the ability to analyze core dumps.
Use `maintenance info sections' within gdb to print all the segments that are mapped into the process address space.