How to profile thread load balancing? - c

I need to see the load balancing characteristics of my multithreaded program. Is there a tool that will give me the data to, e.g., plot this? I need something simple that reports information per core, for example, but not Intel VTune and the like; that is so bloated it hurts to even look at it.

Take a look at Linux Trace Toolkit - next generation (LTTng). You can also use GNU gprof; it's not sexy, but it does the job :)
EDIT:
You can use gprof in a threaded environment: Using gprof with pthreads
EDIT2: OProfile may also help.

I've only scratched the surface of the capabilities of AMD's CodeAnalyst, but what I have found so far is impressive, especially all the performance counters and how they feed into the detailed picture. As to per-thread profiling, I mostly write massively parallel applications that run for extended periods of time on dedicated cores, so my experience may not apply to your case.
It appears quite stingy with respect to its own CPU needs. I don't know if it will profile on Intel CPUs. There is a Linux version.
Give it a spin!

You can also use perf, the official tool for working with performance counters in the Linux kernel. In addition to reading performance counters, it also gives access to other metrics such as context switches, CPU migrations, page faults, etc.
Unfortunately the official wiki does not contain much information, but you can check this page for more on how to use the different tools included in perf.

To research this subject I used the following command:
ps -AL -o lwp,fname,psr | grep ammp
The application under study was ammp, which uses the same number of threads as cores. The command shows which core each thread is currently on. If you run it several times, you will see how a given thread moves between cores and how the load balancing algorithm works.
I hope you find this useful.

Related

Can we profile per core with dtrace?

Is dtrace usable in multithreaded applications, and can I profile individual cores? If so, would someone point me to an example?
DTrace is very suitable for lock analysis, due to its ability to dynamically instrument lock events as required. The commands and providers that can be used for lock analysis were first shipped with Solaris 10.
Since DTrace can be used for lock analysis, it can be used on multithreaded applications. See http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Locks
Thanks & Regards,
Alok Thaker
There are a lot of different scripts here, for example threaded.d - sample multi-threaded CPU usage.

Profiling an OpenMP program with explicit openMP tasks

As part of my research I am looking for alternatives for profiling an OpenMP code with explicit tasks (as per OpenMP 3.0). My main objective is to study the amount of overhead incurred when tasks are lying idle at a global barrier (such as a taskwait), prior to being scheduled and executed.
I looked into using the latest version of TAU, which has support for Opari, which in turn instruments the source code to produce profiling statistics. Unfortunately, since it instruments the source code, it adds a large amount of overhead to program execution.
Tools like Gprof and PGprof do not provide the detail I am looking for. I have already tried and tested with them.
I am looking for a tool that can help me profile an OpenMP program with tasks while imposing minimal overhead. I am tempted to look into HPCToolkit and Scalasca, but I am not sure if they provide support for OpenMP tasks.
Looking for directions and your suggestions.
Thanks!!
Try LIKWID = Like I Knew What I’m Doing.
It is very reliable and free.

Simple cache profiling API

Is there a way I can access the Intel hardware counters for each core programmatically (that is, no perf, perfmon, or valgrind, and I should add "simple", so no PAPI either)? I'd like to know, for each core, how many L1-LLC cache hits/misses a certain program running on that core incurred. This is for Linux 3.2.0-32, C, and GCC.
The performance counters in the processor cannot be read from "user-mode" code, so you need some sort of kernel module to do this. Once you have that, it's not terribly hard; there are a number of MSRs involved.
You can perhaps also use /dev/cpu/core-number/msr to read the values without a kernel module.
To describe all the details of how you do this is a little too much for an answer (unless I copy'n'paste the entire section of Intel's programmers manual (Vol3) - which I don't think is quite what we want here...)

Multicore and OProfile

Is oprofile thread-aware/safe (meaning I can safely profile multithreaded apps), and if so, what is the difference with perf?
1. Yes, OProfile is thread-aware.
Verbatim from man opcontrol (oprofile's control tool):
--separate=[none,lib,kernel,thread,cpu,all]
Separate samples based on the given separator. 'lib' separates dynamically
linked library samples per application. 'kernel' separates kernel and kernel module
samples per application; 'kernel' implies 'library'. 'thread' gives separation for
each thread and task. 'cpu' separates for each CPU. 'all' implies all of the above
options and 'none' turns off separation.
2. OProfile is a system-wide profiler; it runs as a daemon and by default profiles all system activity.
Both OProfile and perf are thread-aware and can profile multithreaded apps. They can even profile the kernel if you ask them to.
OProfile is a profiler (one tool that can record and annotate). It was one of the first (if not the first) profilers to actually use hardware performance counters.
Perf is a set of profiling tools that help you understand what's going on with an application (stat, top, record, annotate, etc.). It is part of the Linux kernel project (although the tools work in userland). It is still in active development, and from what I hear the API changes dramatically from time to time.

Linux library for profiling

Is there a Linux library that can run performance profiling within a running process?
I have a rather large Linux program that is heavily script-based. Depending on the scripts, the program can have wildly different behaviors (and performance problems). What would be nice is a low-overhead performance library that I can embed in the same process and that monitors the process and gives it real-time feedback about its own performance.
Oprofile would be fantastic, if I could start it within the program and keep it isolated to only that program. From the documentation I've read, it doesn't appear possible.
Does anyone know of any such library?
Thanks!
Andrew Klofas
Check out gprof - it should do what you want.
I think gperftools works well for profiling. The runtime performance penalty for CPU profile data is very small.
