Capturing user-space variables at "perf" events

I've now been able to get perf to capture a user-space stack, but I'm not sure how to convince it to capture values passed by reference as pointers, or to snapshot globals of interest.
Specifically, I'm trying to analyse the system-wide performance of PostgreSQL under various loads with and without a performance related patch. One of the key things I need to be able to do is tell which queries are associated with which block I/O requests in the kernel.
perf records the pid and the userspace stack, which sometimes contains the current_query, but since it's a string it's passed by reference, so all I get is an opaque pointer. Not very useful. It doesn't appear in all traces, either, so ideally I'd fish the value out of the global that PostgreSQL stores it in and get perf to record that with each trace sample. Matching the pid to the query after the fact might be viable, but a given PostgreSQL backend (pid) doesn't run just one query over its lifetime, so lots of correlating of timestamps between perf traces and PostgreSQL logs would be required.
This seems like the kind of thing you'd expect perf to be able to do: a stack alone often doesn't tell you much about what's going on, and if perf can already read the symbol table it should be able to look up globals and know which function arguments are pointers that need to be dereferenced and have their first 'n' bytes copied.
I cannot for the life of me figure out how to do it or if this is possible, though. Am I just out of luck? Will I need to hack perf inject to merge this information from a separate timestamped log recorded by PostgreSQL?

It turns out that perf already has the required features, via perf probe, but only for kernel-space at the moment.
perf probe events can take arguments, which may be virtual variables like $retval, registers like %ax, or C identifiers and simple expressions for local or global variables.
So, if perf did support user-space symbolic probes with arguments, you'd create a probe to capture the query string that exec_simple_query is invoked with using something like:
perf probe -x /path/to/postgres exec_simple_query debug_query_string:string
The :string tells perf it's a C string, so it should dereference the pointer and copy the data it points to.
There are multiple places queries can come in - the simple protocol, the v3 parse/bind/execute protocol, the SPI, etc. This is only one of them. You could capture the query from the parser in raw_parse instead, or grab the debug_query_string global value from probes on events of interest.
Unfortunately, none of this will work yet, because perf won't do symbolic lookups on user-space binaries:
$ sudo perf probe -x /path/to/postgres exec_simple_query debug_query_string:string
Debuginfo-analysis is not yet supported with -x/--exec option.
Error: Failed to add events. (-38)
$ perf --version; uname -r
perf version 3.11.6
3.11.6-201.fc19.x86_64
So, once perf has support for user-space symbolic look-ups added, you'll be able to do exciting things like capture the query text in the executor by looking up a struct member:
perf probe -x `which postgres` standard_ExecutorStart 'queryDesc->sourceText:string'
... but again, perf doesn't know how to do the required symbolic look-ups yet, and it can't capture C strings from registers and $retval. So: wait for a new perf, unless you're keen on enhancing the tooling yourself. Oh well.

Perf on Fedora 22 supports user-space probes:
# perf --version; uname -r
perf version 4.0.6-300.fc22.x86_64
4.0.4-301.fc22.x86_64

Related

Ensure that UID/GID check in system call is executed in RCU-critical section

Task
I have a small kernel module I wrote for my Raspberry Pi 2 which implements an additional system call for generating power consumption metrics. I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it. Otherwise, the call just skips the bulk of its body and returns success.
Background Work
I've read into the issue at length, and I've found a similar question on SO, but there are numerous problems with it, from my perspective (noted below).
Question
The linked question notes that struct task_struct contains a pointer element to struct cred, as defined in linux/sched.h and linux/cred.h. The latter of the two headers doesn't exist on my system(s), and the former doesn't show any declaration of a pointer to a struct cred element. Does this make sense?
Silly mistake. This is present in its entirety in the kernel headers (i.e. /usr/src/linux-headers-$(uname -r)/include/linux/cred.h); I was searching in the gcc-build headers in /usr/include/linux.
Even if the above worked, it doesn't mention whether I would be getting the real, effective, or saved UID for the process. Is it even possible to get each of these three values from within the system call?
cred.h already contains all of these.
Is there a safe way in the kernel module to quickly determine which groups the user belongs to without parsing /etc/group?
cred.h already contains all of these.
Update
So, the only valid question remaining is the following:
Note, that iterating through processes and reading process's credentials should be done under RCU-critical section.
... how do I ensure my check is run in this critical section? Are there any working examples of how to accomplish this? I've found some existing kernel documentation that instructs readers to wrap the relevant code with rcu_read_lock() and rcu_read_unlock(). Do I just need to wrap any read operations against the struct cred and/or struct task_struct data structures?
First, adding a new system call is rarely the right way to do things. It's best to do things via the existing mechanisms because you'll benefit from already-existing tools on both sides: existing utility functions in the kernel, existing libc and high-level language support in userland. Files are a central concept in Linux (like other Unix systems) and most data is exchanged via files, either device files or special filesystems such as proc and sysfs.
I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it.
You can't do this in the kernel. Not only is it wrong from a design point of view, but it isn't even possible. The kernel knows nothing about user names. The only knowledge about users in the kernel is that some privileged actions are reserved to user 0 in the root namespace (don't forget that last part! And if that's new to you, it's a sign that you shouldn't be doing advanced things like adding system calls). (Many actions actually look for a capability rather than being root.)
What you want to use is sysfs. Read the kernel documentation and look for non-ancient online tutorials or existing kernel code (code that uses sysfs is typically pretty clean nowadays). With sysfs, you expose information through files under /sys. Access control is up to userland — have a sane default in the kernel and do things like calling chgrp, chmod or setfacl in the boot scripts. That's one of the many wheels that you don't need to reinvent on the user side when using the existing mechanisms.
The sysfs show method automatically takes a lock around the file, so only one kernel thread can be executing it at a time. That's one of the many wheels that you don't need to reinvent on the kernel side when using the existing mechanisms.
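For illustration, here is a minimal sketch of what that can look like; the names (powermetrics, metric) and the choice of /sys/kernel as the parent are made up for this example, and error handling is trimmed:

/* Sketch: expose a value under /sys/kernel/powermetrics/metric.
 * The names here are invented for illustration. */
#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/sysfs.h>

static struct kobject *pm_kobj;
static int metric_value;

static ssize_t metric_show(struct kobject *kobj,
                           struct kobj_attribute *attr, char *buf)
{
        /* Per the answer above, sysfs serializes access to this file,
         * so only one show() runs at a time. */
        return sprintf(buf, "%d\n", metric_value);
}

static struct kobj_attribute metric_attr = __ATTR_RO(metric);

static int __init pm_init(void)
{
        int ret;

        pm_kobj = kobject_create_and_add("powermetrics", kernel_kobj);
        if (!pm_kobj)
                return -ENOMEM;

        ret = sysfs_create_file(pm_kobj, &metric_attr.attr);
        if (ret)
                kobject_put(pm_kobj);
        return ret;
}

static void __exit pm_exit(void)
{
        kobject_put(pm_kobj);
}

module_init(pm_init);
module_exit(pm_exit);
MODULE_LICENSE("GPL");

Boot scripts can then chgrp/chmod /sys/kernel/powermetrics/metric to decide which users get to read it, as described above.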
The linked question concerns a fundamentally different issue. To quote:
Please note that the uid that I want to get is NOT of the current process.
Clearly, a thread which is not the currently executing thread can in principle exit at any point or change credentials. Measures need to be taken to ensure the stability of whatever we are fiddling with. RCU is often the right answer. The answer provided there is somewhat misleading in that RCU is not the only way.
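For instance, reading another task's real UID might look roughly like this (a sketch; task is assumed to be a struct task_struct * you found, and keep stable, under the same read-side section or via a reference):

/* Sketch: read another task's real UID under the RCU read lock. */
#include <linux/cred.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

static kuid_t task_real_uid(struct task_struct *task)
{
        kuid_t uid;

        rcu_read_lock();
        /* __task_cred() may only be dereferenced between
         * rcu_read_lock() and rcu_read_unlock(). */
        uid = __task_cred(task)->uid;
        rcu_read_unlock();

        return uid;
}

The kernel already wraps this pattern for you as task_uid().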
Meanwhile, if you want to operate on the thread executing the very code, you know it won't exit (because it is executing your code, as opposed to an exit path). A question arises: what about the stability of credentials? Good news: they are also guaranteed to be there and can be accessed with no preparation whatsoever. This can be easily verified by checking the code that does credential switching.
We are left with the question of which primitives can be used to do the access. To that end, one can use make_kuid, uid_eq and similar helpers.
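As a rough sketch of such a check for the calling task (the assumption that user "pi" has UID 1000 is specific to this example, and the function name is invented):

/* Sketch: is the caller UID 0, or the assumed UID of user "pi" (1000)? */
#include <linux/cred.h>
#include <linux/types.h>
#include <linux/uidgid.h>

static bool caller_is_allowed(void)
{
        kuid_t root = GLOBAL_ROOT_UID;
        kuid_t pi   = make_kuid(current_user_ns(), 1000);

        /* current_uid()/current_euid() need no extra locking for the
         * task executing this code, as noted above. */
        return uid_eq(current_uid(), root) ||
               (uid_valid(pi) && uid_eq(current_euid(), pi));
}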
The real question is why this is a syscall as opposed to just a /proc file.
See this blog post for a somewhat more elaborate description of credential handling: http://codingtragedy.blogspot.com/2015/04/weird-stuff-thread-credentials-in-linux.html

How to use Readlink

How do I use Readlink for fetching the values?
The answer is:
Don't do it
At least not in the way you're proposing.
You specified a solution here without specifying what you really want to do [and why?]. That is, what are your needs/requirements? Assuming you get it, what do you want to do with the filename? You posted a bare fragment of your userspace application but didn't post any of your kernel code.
As a long time kernel programmer, I can tell you that this won't work, can't work, and is a terrible hack. There is a vast difference in methods to use inside the kernel vs. userspace.
/proc is strictly for userspace applications to snoop on kernel data. The /proc filesystem drivers assume userspace, so they always do copy_to_user. Data will be written to user address space, and not kernel address space, so this will never work from within the kernel.
Even if you could use /proc from within the kernel, it is a genuinely awful way to do it.
You can get the equivalent data, but it's a bit more complicated than that. If you're intercepting the read syscall inside the kernel, you [already] have access to the current task struct and the fd number used in the call. From this, you can locate the struct for the given open file, and get whatever you want, directly, without involving /proc at all. Use this as a starting point.
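As a rough sketch of that starting point (in-kernel, for an fd in the current task's file table; error paths are trimmed and this is just one possible way to do it):

/* Sketch: resolve an fd of the current task to a path, in-kernel,
 * without going anywhere near /proc. */
#include <linux/dcache.h>
#include <linux/err.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/printk.h>

static void print_path_of_fd(unsigned int fd)
{
        struct file *filp;
        char *buf, *name;

        filp = fget(fd);                /* takes a reference on the file */
        if (!filp)
                return;

        buf = (char *)__get_free_page(GFP_KERNEL);
        if (buf) {
                name = d_path(&filp->f_path, buf, PAGE_SIZE);
                if (!IS_ERR(name))
                        pr_info("fd %u -> %s\n", fd, name);
                free_page((unsigned long)buf);
        }
        fput(filp);
}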
Note that doing this will necessitate that you read the kernel documentation, the sources for filesystem drivers, syscalls, etc.: how to lock data structures and lists with the various locking methods (e.g. RCU, rw locks, spinlocks); per-cpu variables; kernel thread preemption; how to properly traverse the necessary filesystem-related lists and structs to get the information you want. All this without causing lockups, panics, segfaults, deadlocks, or UB based on stale or inconsistent/dynamically changing data.
You'll need to study all this to become familiar with the way the kernel does things internally, and understand it, before you try doing something like this. If you had, you would have read the source code for the /proc drivers and already known why things were failing.
As a suggestion, forget anything that you've learned about how a userspace application does things. It won't apply here. Internally, the kernel is organized in a completely different way than what you've been used to.
You have no need to use readlink inside the kernel in this instance. That's the way a userspace application would have to do it, but in the kernel it's like driving 100 miles out of your way to get data you already have nearby, and, as I mentioned previously, won't even work.

how to know current thread running on which cpu in c [duplicate]

Presumably there is a library or simple asm blob that can get me the number of the current CPU that I am executing on.
Use sched_getcpu to determine the CPU on which the calling thread is running. See man getcpu (the system call) and man sched_getcpu (a library wrapper). However, note what it says:
The information placed in cpu is only guaranteed to be current at the time of the call: unless the CPU affinity has been fixed using sched_setaffinity(2), the kernel might change the CPU at any time. (Normally this does not happen because the scheduler tries to minimize movements between CPUs to keep caches hot, but it is possible.) The caller must be prepared to handle the situation when cpu and node are no longer the current CPU and node.
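For reference, a minimal usage sketch (glibc needs _GNU_SOURCE for sched_getcpu):

/* Print the CPU this thread is currently running on.
 * Subject to the caveat quoted above: it may be stale immediately. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    int cpu = sched_getcpu();

    if (cpu == -1)
        perror("sched_getcpu");
    else
        printf("running on CPU %d\n", cpu);
    return 0;
}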
You need to do something like:
Call sched_getaffinity and identify the CPU bits
Iterate over the CPUs, doing sched_setaffinity to each one
(I'm not sure if, after sched_setaffinity, you're guaranteed to be on that CPU, or whether you need to yield explicitly?)
Execute CPUID (the asm instruction)... there is a way of getting a unique per-core ID out of one of its outputs (see the Intel docs). I vaguely recall it's the "APIC ID" (see the sketch after this list).
Build a table (a std::map ?) from APIC IDs to a CPU number or affinity mask or something.
If you did this on your main thread, don't forget to set sched_setaffinity back to all CPUs!
Now you can CPUID again whenever you need to and lookup which core you're on.
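If you do go the CPUID route, here's a hedged sketch of reading the initial APIC ID using GCC's <cpuid.h> helper. CPUID leaf 1 reports it in bits 31:24 of EBX; mapping APIC IDs to the kernel's CPU numbering is still up to you:

/* Read the initial APIC ID via CPUID leaf 1 (EBX[31:24]). x86-only. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 1 not supported\n");
        return 1;
    }
    printf("initial APIC ID: %u\n", ebx >> 24);
    return 0;
}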
But I'd query why you need to do this; normally you want to take control via sched_setaffinity rather than finding out which core you're on (and even that's a pretty rare thing to want/need). (That's why I don't know the crucial detail of what to pull out of CPUID exactly, sorry!)
Update: Just learned about sched_getcpu from litb's response here. Much better! (my Debian/etch libc is too old to have it though).
I don't know of anything to get your current core id. With kernel level task/process migration, you wouldn't be guaranteed that it would remain constant for any length of time, unless you were running in some form of real-time mode.
If you want to be on a specific core, you can use the sched_setaffinity() function or the taskset command to launch your program. I believe that these need elevated permissions to work, though. In your program, you could then run sched_getaffinity() to see the mask that was set earlier and use that as a best guess at the core on which you are executing.
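A small sketch of that approach: pin the calling process to CPU 0 with sched_setaffinity(), then read the mask back with sched_getaffinity():

/* Pin the calling process to CPU 0, then read the affinity mask back
 * as a "best guess" at where we are running, as described above. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf("allowed on CPU %d\n", cpu);
    return 0;
}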
sysconf(_SC_NPROCESSORS_ONLN);

Is it possible to use vtune on certain code snippets in a binary and not an entire binary?

I am adding a small library to a large existing piece of software and would like to analyse its overhead and where that overhead comes from, in finer detail than just in/out rdtsc() or gettimeofday() calls. Using things like rdtsc() I can get a sense of the latency of calls into my library's functions, but I cannot attribute that latency unless I can also see whether branches are being mispredicted, caches are missing, etc. I looked into PAPI, since I imagined watching certain hardware events on entry to and exit from a routine in my library within the context of the bigger binary, but it seems I would need a specific kernel module for PAPI to work for me (Linux 2.6.18 && Intel Xeon 5570). There is VTune, which is specifically geared for Intel processors, but it seems like something that profiles the entire binary rather than specific code snippets (the 3-4 calls into my library).
Is there a way for me to use VTune for my goal, or possibly something that can give me access to such counters without having to patch my kernel?
Matias is right - you can start the profiling paused ("Start paused" in VTune-speak) and then use the __itt_pause / __itt_resume calls from the VTune API in your program to limit data collection to the code region of interest.
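Roughly like this, building against the ittnotify header and library that ship with VTune (the exact include and link paths depend on your installation, and the workload functions here are placeholders):

/* Collect VTune data only around the calls of interest. */
#include <ittnotify.h>

static void busy_work(void)                 /* stand-in for uninteresting code */
{
    for (volatile long i = 0; i < 100000000L; i++)
        ;
}

static void interesting_library_call(void)  /* stand-in for your library */
{
    for (volatile long i = 0; i < 100000000L; i++)
        ;
}

int main(void)
{
    __itt_pause();              /* matches "Start paused" in the GUI */
    busy_work();                /* not collected */

    __itt_resume();             /* collect only around the call of interest */
    interesting_library_call();
    __itt_pause();

    busy_work();                /* not collected */
    return 0;
}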
You may also want to set the "Target duration type" to "Under one minute" in the project properties - this makes the sampling more fine grained (10 KHz instead of default 1 KHz frequency). Or manually adjust the Sample After value in the list of events to collect. The latter is often more useful when you want to profile something specific like mispredicted branches.
It's not possible to define an entry point in VTune at which to start recording.
However, what you can do is start the collection without recording and then, when you expect to hit your library, resume it and let it record the calls. After the calls, you can pause it again and look up the library calls using the top-down tab in VTune.
With it, you should be able to see all the information regarding the calls, and the time spent in each.
If you want to be sure that you only trace while the calls are active, you may start the application under gdb and insert breakpoints when accessing and leaving the functions you wish to examine and then start and stop the profiler appropriately.

Use callgrind as a sampling profiler?

I've been searching for a Linux sampling profiler, and callgrind has come the closest to showing useful results. However, the overhead is estimated at 20-100x slower than normal. Additionally, I'm only interested in time spent per function (with particular emphasis on blocking calls such as read() and write(), which no other profiler will faithfully display).
Is there a way to turn off excess options, so that just the minimum data is recorded for generating times spent in various call stacks?
Does callgrind's cachegrind heritage imply that excess stuff is being done with regards to cache profiling etc?
I assume callgrind operates like a debugger. Can this be adjusted to sample the process at intervals, rather than every single instruction?
3) Callgrind works like a dynamic translator which instruments the original code with counting code. Instrumentation is done for each memory-access instruction (for cache simulation) and, I suspect, for each jmp-like instruction to track the execution count of every basic block.
I have a small sampling profiler which acts just like a debugger: it injects a setitimer() profiling timer into the application, then intercepts every SIGALRM and prints the current $eip value.
There were some earlier sampling profilers using the setitimer approach, and there is also profil() for something similar. This is used by glibc/gmon/gmon.c and gprof -p (to be exact, by gcc -pg). The profil() function can profile a single contiguous code fragment, sampling virtual CPU time every 1 or 10 milliseconds. There is also the sprofil() function.
Also check LD_PRELOAD=/lib/libpcprofile.so PCPROFILE_OUTPUT=output.file - but I don't know whether it works or how it works.
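For what it's worth, here is a minimal sketch of the setitimer()-style sampler described above. It assumes x86-64 Linux (for REG_RIP) and uses ITIMER_PROF/SIGPROF rather than SIGALRM; the fprintf in the handler is not async-signal-safe and is only there for illustration:

/* Toy setitimer()-based sampler: prints the interrupted instruction
 * pointer every ~10 ms of CPU time. x86-64 Linux only. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

static void prof_handler(int sig, siginfo_t *si, void *uc_void)
{
    ucontext_t *uc = uc_void;

    (void)sig;
    (void)si;
    /* Not async-signal-safe; acceptable only for this toy example. */
    fprintf(stderr, "sample: ip=%#llx\n",
            (unsigned long long)uc->uc_mcontext.gregs[REG_RIP]);
}

int main(void)
{
    struct sigaction sa;
    struct itimerval it;
    volatile unsigned long x = 0;
    unsigned long i;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = prof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = 10000;     /* sample every 10 ms of CPU time */
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);

    for (i = 0; i < 500000000UL; i++)   /* busy work so there is something to sample */
        x += i;
    return 0;
}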
For numbered questions:
2) "Callgrind is an extension to Cachegrind. It provides all the information that Cachegrind does, plus extra information about callgraphs." - So it can provide anything that Cachegrind does, but it also allows the user to turn off cache simulation: --simulate-cache=no (which is the default value).
For speed: according to http://www.valgrind.org/docs/manual/nl-manual.html - the manual of the Nul Valgrind tool (aka Nulgrind), which does no additional instrumentation - the slowdown is about 5x. That is because the program is dynamically translated by Valgrind itself. So there can be no Valgrind tool that works faster than Nulgrind.
Have you tried gprof? It does not have the big overhead that Valgrind does.
Try using Zoom from RotateRight. It has a "Thread Time" configuration that samples all threads in a single process whether they are running or blocked.

Resources