Linux library for profiling - c

Is there a Linux library that can run performance profiling within a running process?
I have a rather large Linux program that is heavily script-based. Depending on the scripts, the program can have wildly different behaviors (and performance problems). What would be nice is a low-overhead performance library that I can embed in the same process, one that monitors the process and gives it real-time feedback about its own performance.
Oprofile would be fantastic, if I could start it within the program and keep it isolated to only that program. From the documentation I've read, it doesn't appear possible.
Does anyone know of any such library?
Thanks!
Andrew Klofas

Check out gprof - it should do what you want.

I think gperftools works well for profiling. The runtime performance penalty for CPU profile data is very small.

Related

Does gprof support multithreaded applications?

We're developing a multithreaded project. My colleague said that gprof works perfectly with multithreaded programs, with no workaround needed. I read otherwise some time ago.
http://sam.zoy.org/writings/programming/gprof.html
http://lists.gnu.org/archive/html/bug-binutils/2010-05/msg00029.html
I also read this:
How to profile multi-threaded C++ application on Linux?
So I'm guessing the workaround is no longer needed? If so, since when is it not needed?
Unless you change the processing, gprof should work fine.
Changing the processing means using coprocessors or GPUs as computing units. In the worst case you have to call the setitimer function manually in every thread. But with recent versions (2013-14), that is no longer needed.
In certain cases it misbehaves, so I advise using Intel VTune, which gives more accurate and more detailed information.

Simple cache profiling API

Is there a way I can access the (Intel) hardware counters for each core programmatically? (That is, no perf, perfmon, or valgrind, and I should add "simple", so no PAPI, for example.) I'd like to know, for each core, how many L1-LLC cache hits/misses a certain program running on that core incurred. This is for Linux 3.2.0-32, C, and using GCC.
The performance counters in the processor cannot be read from "user-mode" code, so you need some sort of kernel module to do this. Once you have that, it's not terribly hard; the counters are accessed through a number of MSRs.
You can perhaps also use /dev/cpu/core-number/msr to read the values without a kernel module.
To describe all the details of how you do this is a little too much for an answer (unless I copy'n'paste the entire section of Intel's programmers manual (Vol3) - which I don't think is quite what we want here...)
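As a rough illustration of the /dev/cpu/core-number/msr route, here is a minimal sketch in C. It assumes the msr kernel module is loaded and the process has permission to open the device (normally root). The function name read_msr is made up for this example, and the register numbers for the cache counters themselves still have to come from the Intel manual.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one 64-bit MSR on the given core through the msr device.
 * The file offset passed to pread() is the MSR number.
 * Returns 0 on success, -1 on any failure (device missing,
 * no permission, register not readable, ...). */
int read_msr(int cpu, uint32_t reg, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);

    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

For example, read_msr(0, 0x10, &v) would try to read the time-stamp counter MSR on core 0. Programming the event-select registers so that the counters actually count cache hits or misses needs writes as well, which this sketch does not cover.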

How to profile thread load balancing?

I need to see the load balancing characteristics of my multithreaded program. Is there any tool that will give me the information to, e.g., plot this? I need something simple that will give me information per core, for example, but not Intel VTune and the like... that is so bloated it hurts to even look at it.
Take a look at Linux Trace Toolkit - next generation. You can also use GNU gprof; it's not sexy, but it does the job :)
EDIT :
You can use gprof in threaded environment : Using gprof with pthreads
EDIT2 : Oprofile may help also
I've only scratched the surface of the capabilities of AMD's CodeAnalyst, but what I have found so far is impressive, especially all the performance counters and how they fit into the detailed picture. As to per-thread profiling, I mostly write massively parallel applications running for extended periods of time on dedicated cores, which may not be applicable to your case.
It appears quite stingy with respect to its own CPU needs. I don't know if it will profile on intel CPUs. There is a Linux version.
Give it a spin!
You can also use perf, the official implementation for supporting performance counters in the Linux kernel. In addition to reading performance counters, it also gives access to other metrics such as context switches, CPU migrations, page faults, etc.
Unfortunately the official wiki does not contain much information. But you can check this page for more information on how to use the different tools included in perf.
To investigate this, I used the following command:
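The same kernel interface that the perf tools use can also be called from inside a program through the perf_event_open system call. The following is only a minimal sketch, assuming the kernel headers are installed and the perf_event_paranoid setting allows unprivileged use. The helper names open_sw_counter and read_counter are made up for illustration.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a software event counter (for example
 * PERF_COUNT_SW_PAGE_FAULTS or PERF_COUNT_SW_CPU_MIGRATIONS)
 * for the calling thread on any CPU.
 * Returns a file descriptor, or -1 on failure. */
int open_sw_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof attr;
    attr.config = config;
    attr.exclude_kernel = 1; /* user-space only, friendlier to unprivileged use */
    attr.exclude_hv = 1;

    /* glibc has no wrapper for this call, so go through syscall().
     * pid = 0: this thread, cpu = -1: any CPU, no group, no flags. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Read the current count. Returns 0 on success, -1 on failure. */
int read_counter(int fd, uint64_t *count)
{
    ssize_t n = read(fd, count, sizeof *count);
    return n == (ssize_t)sizeof *count ? 0 : -1;
}
```

A thread would open the counter once, call read_counter() before and after the region of interest, take the difference, and close() the descriptor when done. Counters opened this way cover only the calling thread unless the inherit attribute is set.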
ps -AL -o lwp,fname,psr | grep ammp
The application under study was ammp, which uses as many threads as there are cores. The command shows which core each thread is currently on. If you run it several times, you will see how a given thread moves between cores and how the load balancing algorithm behaves.
I hope you find this useful.
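If you would rather log the core from inside the program than poll with ps as above, each thread can ask the kernel which CPU it is currently running on. This is a minimal sketch using the raw getcpu system call. The function name current_cpu is made up for illustration, and glibc's sched_getcpu() offers the same thing if you compile with _GNU_SOURCE.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Return the CPU number the calling thread is running on right now,
 * or -1 if the system call fails. The value is only a snapshot: the
 * scheduler may migrate the thread immediately afterwards. */
int current_cpu(void)
{
    unsigned int cpu = 0;
    if (syscall(SYS_getcpu, &cpu, NULL, NULL) != 0)
        return -1;
    return (int)cpu;
}
```

Calling current_cpu() periodically from each worker thread and recording the values gives the same kind of per-thread core history as running the ps command repeatedly.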

Timing Kernel Executions on CUDA

I've used code from CUDA C Best Practices to implement an execution timer. However, there is something strange, and I don't know if it's an anomaly or if that's normal: I get different readouts each time I run my CUDA app.
Could these readings be related to design, or is that something I should expect?
I'm not running any graphic intensive applications on my machine, other than Windows 7.
Well, it depends on how big the differences are. One thing that can cause anomalies is the kernel scheduler. It may just happen that the scheduler gives some extra timeslices to kernel functions (because graphics API calls involve error checking), which shows up as more execution time. If the differences are very large, I would say check your code, but if they are small, on the order of milliseconds, I wouldn't worry about it: +/- 10 ms is the usual timeslicing quantum in most OSes (Windows probably included).
Also Aero is kind of intensive so that may be adding to the discrepancies you are seeing.
I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take you a bit of hassle to set up and figure out which options you want to use, but it's worth it.

Using valgrind to find most intensive function?

I have a program running extremely slow. Is there a way to use valgrind to find out which function I need to optimize?
Thanks.
You can use the callgrind tool for valgrind, which should be part of each valgrind distribution. It runs the program in the valgrind "virtual machine" and counts the number of instructions spent in each function/line of code.
The best UI for visualizing the results is kcachegrind (part of KDE).
Advantage: It works quite well if your bottleneck is CPU-bound, as it completely simulates the application, so you get very accurate and detailed results if CPU instructions are what interests you. If not, the results might be distorted.
Disadvantage: It's slow (like valgrind). If your problem is I/O-bound, the slow execution speed will distort the results (making I/O faster in comparison) and also influence the behavior. In such cases, a profiler taking samples is the better approach.
No, Valgrind is a dynamic analysis tool used to flush out memory allocation errors and thread race conditions (among other things).
You're looking for a code profiler, such as Luke Stackwalker. I don't know of any for *NIX systems off the top of my head, sorry.
Not as far as I know. oprofile is the best tool for what you want.
