Multicore and OProfile

Is oprofile thread-aware/safe (meaning I can safely profile multithreaded apps), and if so, what is the difference with perf?

1. Yes, oprofile is thread-aware.
Verbatim from man opcontrol (oprofile's control tool):
--separate=[none,lib,kernel,thread,cpu,all]
Separate samples based on the given separator. 'lib' separates dynamically
linked library samples per application. 'kernel' separates kernel and kernel module
samples per application; 'kernel' implies 'library'. 'thread' gives separation for
each thread and task. 'cpu' separates for each CPU. 'all' implies all of the above
options and 'none' turns off separation.
2. oprofile is a system-wide profiler: it runs as a daemon and by default profiles all system activity.
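As a sketch, a per-thread profiling session with the legacy opcontrol interface might look like this (newer OProfile releases replaced opcontrol with operf; ./myapp is a placeholder for the program under study):

```shell
opcontrol --init                # load the oprofile kernel module
opcontrol --separate=thread     # keep samples separate for each thread/task
opcontrol --start               # start system-wide sample collection
./myapp                         # run the multithreaded workload
opcontrol --stop                # stop collecting samples
opreport ./myapp                # report the samples attributed to myapp
opcontrol --deinit              # shut the daemon down and unload the module
```

Note that opcontrol needs root privileges, since it collects samples system-wide.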

Both Oprofile and Perf are thread-aware and can profile multithreaded apps. They can even profile the kernel if you ask them.
OProfile is a profiler (one tool that can record and annotate). It was one of the first profilers (if not the first) to actually use hardware performance counters.
Perf is a set of profiling tools to help you understand what's going on with an application (stat, top, record, annotate, etc.). It is part of the Linux kernel project (although the tools work in userland). It is still in active development, and from what I hear the API changes dramatically from time to time.
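For comparison, a sketch of typical perf commands for a multithreaded app (options vary between kernel versions, so check perf help on your system; ./myapp is a placeholder for the program under study):

```shell
perf stat ./myapp       # quick summary of counters (cycles, instructions, ...)
perf record -s ./myapp  # sample the run, keeping per-thread counts (-s)
perf report -T          # report, including a per-thread event table (-T)
perf annotate           # annotate the hottest code from the last record
```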

Related

Can we profile per core with dtrace?

Is dtrace usable in multithreaded applications, and can I profile individual cores? If so, would someone point me to an example?
DTrace is very suitable for lock analysis, due to its ability to dynamically instrument lock events as required. The following commands and providers can be used for lock analysis, and were first shipped with Solaris 10.
Since DTrace can be used for lock analysis, it is usable in multithreaded applications; see http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Locks for examples.
There are a lot of different scripts here, for example threaded.d - sample multi-threaded CPU usage.
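In the same spirit as those scripts, a hypothetical one-liner that counts profiling samples per thread of a single process could look like this (./myapp is a placeholder for the program under study):

```shell
# Fire a probe ~997 times per second on every CPU; for samples belonging to
# the target process, count them per thread id (tid).
dtrace -n 'profile-997 /pid == $target/ { @[tid] = count(); }' -c ./myapp
```

The -c flag runs the command under DTrace and sets $target to its pid, so samples from other processes are filtered out.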

How to profile thread load balancing?

I need to see the load balancing characteristics of my multithreaded program. Is there any tool that will give me the information to, e.g. plot this? I need something simple that will give me information per core, for example, but not Intel VTune and the such... that is so bloated it hurts to even look at it.
Take a look at Linux Trace Toolkit - next generation. You can also use GNU gprof; it's not sexy, but it does the job :)
EDIT :
You can use gprof in a threaded environment: Using gprof with pthreads
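For reference, a minimal gprof workflow might look like this (a sketch; myapp.c stands in for your program):

```shell
gcc -pg myapp.c -o myapp -lpthread   # -pg inserts gprof instrumentation
./myapp                              # writes gmon.out into the current directory
gprof myapp gmon.out                 # flat profile plus call graph
```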
EDIT2 : Oprofile may also help
I've only scratched the surface of the capabilities of AMD's CodeAnalyst, but what I have found so far is impressive, especially all the performance counters and getting them into the detailed picture. As for per-thread profiling, I mostly write massively parallel applications running for extended periods of time on dedicated cores, which may not be applicable to your case.
It appears quite stingy with respect to its own CPU needs. I don't know if it will profile on Intel CPUs. There is a Linux version.
Give it a spin!
You can also use perf, the official implementation for supporting performance counters in the Linux kernel. In addition to reading performance counters, it also gives access to some other metrics such as context switches, CPU migrations, page faults, etc.
Unfortunately the official wiki does not contain much information, but you can check this page for more information on how to use the different tools included in perf.
To investigate this, I used the following command:
ps -AL -o lwp,fname,psr | grep ammp
The application under study was ammp; it uses the same number of threads as cores. The command shows which core each thread is running on. By executing it several times you can see how a given thread moves between cores and how the load-balancing algorithm works.
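A small self-contained sketch of that approach, with a throwaway sleep process standing in for ammp:

```shell
# Start a background process to observe (stands in for the real application).
sleep 5 &
target=$!

# Sample which core (psr) each thread (lwp) of the process runs on.
for i in 1 2 3; do
    ps -L -o lwp,psr -p "$target"
    sleep 1
done
kill "$target" 2>/dev/null
```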
I hope you find it useful.

Linux library for profiling

Is there a Linux library that can run performance profiling within a running process?
I have a rather large Linux program that is heavily script-based. Depending on the scripts, the program can have wildly different behaviors (and performance problems). What would be nice is a low-overhead performance library that I can embed in the same process that monitors and provides real-time feedback to the process about its own performance.
Oprofile would be fantastic, if I could start it within the program and keep it isolated to only that program. From the documentation I've read, it doesn't appear possible.
Does anyone know of any such library?
Check out gprof - it should do what you want.
I think gperftools works well for profiling. The runtime performance penalty for CPU profile data is very small.
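A sketch of how embedding it could look: gperftools' CPU-profiler API (ProfilerStart/ProfilerStop from &lt;gperftools/profiler.h&gt;) lets a process confine profiling to itself and to chosen code regions, which is close to what the question asks for. Assuming gperftools is installed, and with work() as a stand-in for the script engine:

```shell
cat > scripted.c <<'EOF'
#include <gperftools/profiler.h>  /* gperftools CPU profiler API */
#include <stdio.h>

static long work(long n) {        /* placeholder for the script engine */
    long s = 0;
    for (long i = 0; i < n; i++) s += i % 7;
    return s;
}

int main(void) {
    ProfilerStart("scripted.prof");  /* profile only this process, from here on */
    printf("%ld\n", work(100000000));
    ProfilerStop();                  /* flush samples to scripted.prof */
    return 0;
}
EOF
gcc scripted.c -o scripted -lprofiler
./scripted
pprof --text ./scripted scripted.prof   # human-readable report
```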

How to monitor machine code calls by binary program

My goal is to record the number of processor instructions executed by a given binary program through the duration of its run. While it's easy to get the actual machine code from the source code (through gdb or any other disassembler), this does not take into account function calls and branches within the program that cause instructions to be executed more than once or skipped altogether.
Is there a straightforward solution to this?
This is very hardware specific, but most processors offer a facility that counts the exact number of machine instructions (and other events) that have flowed through them. That's how profilers work to capture things like cache misses: by querying these internal registers.
The PAPI library provides calls to query this data on a variety of major processors. If you're on Linux+x86, PerfSuite gives you some more high-level tools which may be easier to start with.
Intel has a monitor app you can use to watch the chip's internal counters in realtime, and their Performance Analysis Guide describes the various Performance Monitoring Units on the chip and how to read them.
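On Linux, perf can read the retired-instruction counter directly; for example:

```shell
# Count instructions retired in user space during one run of the program.
# /bin/ls is just an example workload.
perf stat -e instructions:u /bin/ls
```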
If you're on Linux, you should be able to run your program through cachegrind to get instruction counts.
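For instance (assuming valgrind is installed; the "Ir" column in the output is the number of instructions read, i.e. the dynamic instruction count):

```shell
# Simulate the run and write counts to cg.out; /bin/ls is an example workload.
valgrind --tool=cachegrind --cachegrind-out-file=cg.out /bin/ls
cg_annotate cg.out          # per-function instruction (Ir) counts
```

Since cachegrind simulates the program instruction by instruction, expect the run to be an order of magnitude slower than native execution.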
It may also be possible to use OllyDbg's Run Trace function to obtain an instruction count, but that may be limited by memory.
Alternately, it is possible to write a small debugger that simply runs the program in single steps.
The raw tools for tracking system calls are platform specific.
Solaris: truss or dtrace
MacOS X: dtrace
Linux: strace
HP-UX: tusc
AIX: truss
Windows: ...
For example (Solaris):
truss -o ls.truss ls $HOME
This will capture all the system calls made by ls as it lists your home directory.
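The Linux equivalent with strace, including its summary mode:

```shell
strace -o ls.strace ls "$HOME"   # log every system call ls makes
strace -c ls "$HOME"             # or just count calls and time per syscall
```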
OTOH, this may not be what you're after...in which case it is of limited value.

Running MPI code in my laptop

I am new to the parallel computing world. Can you tell me whether it is possible to run C++ code that uses MPI routines on my dual-core laptop, or is there a simulator/emulator for doing that?
Most MPI implementations use shared memory for communication between ranks that are located on the same host. Nothing special is required in terms of setting up the laptop.
Using a dual-core laptop, you can run two ranks and the OS scheduler will tend to place them on separate cores. The WinXP scheduler tends to enforce some degree of "cpu binding", because by default jobs tend to be scheduled on the core where they last ran. However, most MPI implementations also allow an explicit "cpu binding" that forces a rank to be scheduled on one specific core. The syntax is non-standard and must be taken from the specific implementation's documentation.
You should try to use "the same" version and implementation of MPI on your laptop that the university computers are running. That will help to ensure that the MPI runtime flags are the same.
Most MPI implementations ship with some kind of "compiler wrapper" or at least a set of instructions for building an application that will include the MPI library. Either use those wrappers, or follow those instructions.
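A sketch of that workflow with a minimal program, assuming an MPI implementation such as Open MPI or MPICH is installed:

```shell
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc hello.c -o hello    # compiler wrapper adds MPI headers and libraries
mpirun -np 2 ./hello      # one rank per core on a dual-core laptop
```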
If you are interested in a simulator of MPI applications, you should probably check SMPI.
This open-source simulator (in which I'm involved) can run many MPI C/C++/Fortran applications unmodified, and forecast rather accurately the runtime of the application, provided that you have an accurate description of your hardware platform. Both online and offline studies are possible.
There are many other advantages to using a simulator to study MPI applications:
Reproducibility: several runs lead to exactly the same behavior unless you specify otherwise. You won't have any heisenbugs where adding some more tracing changes the application's behavior;
What-if Analysis: Ability to test on platform that you don't have access to, or that is not built yet;
Clairvoyance: you can observe every part of the system, even at the network core.
For more information, see this presentation or this article.
The SMPI framework can even formally study the correctness of MPI applications through exhaustive testing, as shown in that presentation.
MPI messages are transported via TCP networking (there are other high-performance possibilities, such as shared memory, but networking is the default). So it doesn't matter where the application runs, as long as the nodes can connect to each other. I guess you want to test the application on your laptop, so the nodes all run locally and can easily connect to each other via the loopback network.
I am not quite sure I understand your question, but a laptop is a computer just like any other. Provided you have set up your MPI libs correctly and set your paths, you can, of course, use MPI routines on your laptop.
For my part, I use Debian Linux (http://www.debian.org) for all my parallel work. I have written a short article on how to get MPI running on Debian machines. You may want to refer to it.
