Profiling an OpenMP program with explicit OpenMP tasks - C

As part of my research I am looking for ways to profile an OpenMP code with explicit tasks (as per OpenMP 3.0). My main objective is to study the overhead incurred while tasks lie idle at a global barrier (such as a taskwait), prior to being scheduled and executed.
I looked into using the latest version of TAU, which has support for Opari, which in turn instruments the source code to produce profiling statistics. Unfortunately, since it instruments the source code, this leads to a large amount of overhead in program execution.
Tools like gprof and PGprof do not provide the detail I am looking for; I have already tried and tested them.
I am looking for a tool that can help me profile an OpenMP program with tasks while incurring minimal overhead. I am tempted to look into HPCToolkit and Scalasca, but I am not sure whether they support OpenMP tasks.
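For reference, the kind of structure I mean is roughly the following; work() is just a placeholder, and bracketing the taskwait with omp_get_wtime() only gives a crude upper bound, since the interval also includes the tasks' execution, not just scheduling overhead:

#include <omp.h>
#include <stdio.h>

void work(int i) { (void)i; /* placeholder for the real task body */ }

int main(void)
{
    double t0 = 0.0, t1 = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < 1000; i++) {
            #pragma omp task firstprivate(i)
            work(i);
        }
        t0 = omp_get_wtime();
        #pragma omp taskwait        /* generating thread blocks here */
        t1 = omp_get_wtime();
        printf("blocked at taskwait for %f s\n", t1 - t0);
    }
    return 0;
}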
Looking for directions and your suggestions.
Thanks!!

Try LIKWID = Like I Knew What I’m Doing.
It is very reliable and free.
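A rough sketch of its marker API (the header is likwid-marker.h in recent releases, likwid.h in older ones; compile with -DLIKWID_PERFMON and run under likwid-perfctr -m, and note that threaded codes may need per-thread initialization depending on the version):

#include <likwid-marker.h>   /* older releases: likwid.h */

int main(void)
{
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_START("tasks");   /* region name is arbitrary */
    /* ... the task-parallel region to measure ... */
    LIKWID_MARKER_STOP("tasks");
    LIKWID_MARKER_CLOSE;
    return 0;
}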

Related

How to profile thread load balancing?

I need to see the load-balancing characteristics of my multithreaded program. Is there any tool that will give me the information to, e.g., plot this? I need something simple that gives me per-core information, for example, but not Intel VTune and the like... that is so bloated it hurts to even look at it.
Take a look at Linux Trace Toolkit - next generation. You can also use GNU gprof; it's not sexy, but it does the job :)
EDIT:
You can use gprof in a threaded environment: Using gprof with pthreads
EDIT2: OProfile may also help.
I've only scratched the surface of the capabilities of AMD's CodeAnalyst, but what I have found so far is impressive, especially all the performance counters and getting them into the detailed picture. As for per-thread profiling: I mostly write massively parallel applications that run for extended periods on dedicated cores, which may not be applicable to your situation.
It appears quite stingy with respect to its own CPU needs. I don't know whether it will profile on Intel CPUs. There is a Linux version.
Give it a spin!
You can also use perf, the official interface for performance counters in the Linux kernel. In addition to reading performance counters, it also gives access to other metrics such as context switches, CPU migrations, page faults, etc.
Unfortunately, the official wiki does not contain much information, but you can check this page for more on how to use the different tools included in perf.
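If you prefer reading the counters from inside the program itself, something along the lines of the perf_event_open example in its man page works; this sketch counts retired instructions around a toy loop:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* thin wrapper, since glibc provides no perf_event_open() stub */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long sum = 0;                     /* code under measurement */
    for (long i = 0; i < 1000000; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}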
For researching this subject I've used the following command:
ps -AL -o lwp,fname,psr | grep ammp
The application under study was ammp; it uses the same number of threads as cores. The command shows which core each thread is running on. If you execute it several times, you will see how a given thread moves across cores and how the load-balancing algorithm works.
I hope you find it useful.
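As an alternative to polling ps from outside, each thread can report its own core with sched_getcpu(), a glibc extension (so Linux-only); a minimal sketch:

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>

void *worker(void *arg)
{
    (void)arg;
    /* each thread reports the core it is currently running on */
    printf("thread %lu on core %d\n",
           (unsigned long)pthread_self(), sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}

Logging this periodically from each thread gives you the same migration picture from inside the program.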

Linux library for profiling

Is there a Linux library that can run performance profiling within a running process?
I have a rather large Linux program that is heavily script-based. Depending on the scripts, the program can have wildly different behaviors (and performance problems). What would be nice is a low-overhead performance library that I can embed in the same process, which monitors and provides real-time feedback to the process about its own performance.
OProfile would be fantastic if I could start it within the program and keep it isolated to only that program. From the documentation I've read, that doesn't appear to be possible.
Does anyone know of any such library?
Thanks!
Andrew Klofas
Check out gprof - it should do what you want.
I think gperftools works well for profiling. The runtime performance penalty for CPU profile data is very small.
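In particular, you can start and stop its CPU profiler programmatically, which keeps the profiling confined to your own process; a sketch, where run_scripts() is just a stand-in for the script-driven workload and the output path is arbitrary (link with -lprofiler):

#include <gperftools/profiler.h>

/* stand-in for the script-driven workload described in the question */
static void run_scripts(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 50000000; i++) x += i * 0.5;
}

int main(void)
{
    ProfilerStart("/tmp/myapp.prof");   /* sampling confined to this process */
    run_scripts();
    ProfilerStop();                     /* flush samples to the output file */
    return 0;
}

The resulting dump can then be inspected offline with pprof.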

CPU operations windows profiler tools

Do you know of any profiler tool that reports the total number of CPU operations a C/C++ program performs? I need something like Valgrind's callgrind on Linux...
Intel has some tools such as VTune. They also provide a performance counter library that you can use to instrument your code manually, by reading the hardware performance counter registers before and after a piece of code.
Visual Studio has an instrumented profiler but I don't know if it gets down to the "instructions retired" level of detail.
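If you go the manual bracket-the-code route mentioned above, the crudest portable version reads the timestamp counter before and after the region. Note this counts TSC ticks (effectively wall time on modern CPUs), not instructions retired, so it is only a rough stand-in for what callgrind gives you; true instruction counts need the PMU via a library:

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
#include <stdio.h>

int main(void)
{
    unsigned long long start = __rdtsc();   /* read timestamp counter */

    volatile long sum = 0;                  /* code under measurement */
    for (long i = 0; i < 1000000; i++) sum += i;

    unsigned long long end = __rdtsc();
    printf("TSC ticks: %llu\n", end - start);
    return 0;
}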
You should ask yourself what information you really want: do you want to count the number of cycles spent in a function, or do you want to know how much wall-clock time your app spends in each function? The latter is more useful in most cases, and you can get it more easily by sampling. (See also Mike Dunlavey's simple do-it-by-hand method, which works for big hotspots.)
Counting actual instructions retired and branch mispredicts and so on is only useful if you really understand the details of the CPU pipeline and how to optimize around it. Microseconds-per-function is typically what you really want to optimize instead.

Timing Kernel Executions on CUDA

I've used code from CUDA C Best Practices to implement an execution timer. However, there is something strange, and I don't know if it's an anomaly or normal: I get different readouts each time I run my CUDA app.
Could these readings be related to the design, or is this something I should expect?
I'm not running any graphic intensive applications on my machine, other than Windows 7.
Well, it depends how big the differences are. One source of anomalies is the kernel scheduler: it may just happen that the scheduler gives some extra timeslices to kernel functions (because graphics API calls have error checking involved), which shows up as extra execution time. If the differences are very large, I would say check your code; but if they are on the order of milliseconds, I wouldn't worry about it. ±10 ms is the usual timeslicing quantum in most OSes (Windows probably included).
Also, Aero is fairly intensive, so that may be adding to the discrepancies you are seeing.
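For reference, the event-based timing pattern from the Best Practices guide looks roughly like this (here timing a device-to-device copy; the important detail is synchronizing on the stop event before reading the elapsed time):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t n = 1 << 24;
    float *a, *b, ms;
    cudaMalloc((void **)&a, n * sizeof(float));
    cudaMalloc((void **)&b, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               /* enqueue start marker */
    cudaMemcpy(b, a, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);                /* enqueue stop marker */
    cudaEventSynchronize(stop);              /* wait until the work finished */
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    return 0;
}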
I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It takes a bit of hassle to set up and to figure out which options you want to use, but it's worth it.

How do you profile your code?

I hope not everyone is using Rational Purify.
So what do you do when you want to measure:
time taken by a function
peak memory usage
code coverage
At the moment, we do it manually (using log statements with timestamps and another script to parse the log and output to Excel... phew).
What would you recommend? Pointing to tools or any techniques would be appreciated!
EDIT: Sorry, I didn't specify the environment first. It's plain C on a proprietary mobile platform.
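Given that constraint, one low-tech way to tidy the manual timestamp logging is a small wrapper built on standard C clock(), which should exist even on a bare-bones libc; the resolution and meaning of clock() vary by platform, so treat the numbers as relative (busy() is just a stand-in for a function being measured):

#include <stdio.h>
#include <time.h>

/* wraps a call and prints how long it took; relies only on ISO C clock() */
#define TIME_CALL(label, call)                                      \
    do {                                                            \
        clock_t t0_ = clock();                                      \
        call;                                                       \
        clock_t t1_ = clock();                                      \
        printf("%s: %.3f ms\n", (label),                            \
               1000.0 * (double)(t1_ - t0_) / CLOCKS_PER_SEC);      \
    } while (0)

static void busy(void)   /* stand-in for a function being measured */
{
    volatile long s = 0;
    for (long i = 0; i < 10000000; i++) s += i;
}

int main(void)
{
    TIME_CALL("busy", busy());
    return 0;
}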
I've done this a lot. If you have an IDE, or an ICE, there is a technique that takes some manual effort, but works without fail.
Warning: modern programmers hate this, and I'm going to get downvoted. They love their tools. But it really works, and you don't always have the nice tools.
I assume in your case the code is something like DSP or video that runs on a timer and has to be fast. Suppose what you run on each timer tick is subroutine A. Write some test code to run subroutine A in a simple loop, say 1000 times, or long enough to make you wait at least several seconds.
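Something like this, where subroutine_A is just a stand-in name for whatever actually runs on each timer tick:

void subroutine_A(void)
{
    /* stand-in for the routine that runs on each timer tick */
}

int main(void)
{
    /* run it long enough that you have several seconds to take samples */
    for (int i = 0; i < 1000; i++)
        subroutine_A();
    return 0;
}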
While it's running, randomly halt it with a pause key and sample the call stack (not just the program counter) and record it. (That's the manual part.) Do this some number of times, like 10. Once is not enough.
Now look for commonalities between the stack samples. Look for any instruction or call instruction that appears on at least 2 samples. There will be many of these, but some of them will be in code that you could optimize.
Do so, and you will get a nice speedup, guaranteed. The 1000 iterations will take less time.
The reason you don't need a lot of samples is you're not looking for small things. Like if you see a particular call instruction on 5 out of 10 samples, it is responsible for roughly 50% of the total execution time. More samples would tell you more precisely what the percentage is, if you really want to know. If you're like me, all you want to know is where it is, so you can fix it, and move on to the next one.
Do this until you can't find anything more to optimize, and you will be at or near your top speed.
You probably want different tools for performance profiling and code coverage.
For profiling I prefer Shark on MacOSX. It is free from Apple and very good. If your app is vanilla C you should be able to use it, if you can get hold of a Mac.
For profiling on Windows you can use LTProf. Cheap, but not great:
http://successfulsoftware.net/2007/12/18/optimising-your-application/
(I think Microsoft are really shooting themselves in the foot by not providing a decent profiler with the cheaper versions of Visual Studio.)
For coverage I prefer Coverage Validator on Windows:
http://successfulsoftware.net/2008/03/10/coverage-validator/
It updates the coverage in real time.
For complex applications I am a great fan of Intel's VTune. It requires a slightly different mindset from a traditional profiler that instruments the code: it works by sampling the processor 1,000 times a second to see where the instruction pointer is. It has the huge advantage of not requiring any changes to your binaries, which as often as not would change the timing of what you are trying to measure.
Unfortunately it is no good for .NET or Java, since there isn't a way for VTune to map the instruction pointer to a symbol like there is with traditional code.
It also allows you to measure all sorts of other processor/hardware centric metrics, like clocks per instruction, cache hits/misses, TLB hits/misses, etc which let you identify why certain sections of code may be taking longer to run than you would expect just by inspecting the code.
If you're doing an 'on the metal' embedded 'C' system (I'm not quite sure what 'mobile' implied in your posting), then you usually have some kind of timer ISR, in which it's fairly easy to sample the code address at which the interrupt occurred (by digging back in the stack or looking at link registers or whatever). Then it's trivial to build a histogram of addresses at some combination of granularity/range-of-interest.
It's usually then not too hard to concoct some combination of code/script/Excel sheets which merges your histogram counts with addresses from your linker symbol/list file to give you profile information.
If you're very RAM-limited, it can be a bit of a pain to collect enough data for this to be both simple and useful, but you would need to tell us more about your platform.
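A rough sketch of the histogram part, assuming your ISR can hand the interrupted PC to a sampling hook (how you obtain it is platform-specific, and the base address, bucket size, and bucket count here are arbitrary choices):

#include <stdint.h>

#define CODE_BASE    0x00000000u   /* start of the code region of interest */
#define BUCKET_SHIFT 6             /* 64-byte buckets */
#define NUM_BUCKETS  4096

static uint16_t histogram[NUM_BUCKETS];

/* call this from the timer ISR with the interrupted program counter */
void profile_sample(uint32_t pc)
{
    uint32_t bucket = (pc - CODE_BASE) >> BUCKET_SHIFT;
    if (bucket < NUM_BUCKETS && histogram[bucket] != UINT16_MAX)
        histogram[bucket]++;       /* saturating, to survive long runs */
}

Afterwards you dump histogram[] and correlate the bucket addresses with your linker map, exactly as described above.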
nProf - Free, does that for .NET.
Gets the job done, at least enough to see the 80/20. (20% of the code, taking 80% of the time)
Windows (.NET and Native Exes): AQTime is a great tool for the money. Standalone or as a Visual Studio plugin.
Java: I'm a fan of JProfiler. Again, can run standalone or as an Eclipse (or various other IDEs) plugin.
I believe both have trial versions.
The Google Perftools are extremely useful in this regard.
I use DevPartner with MSVC 6 and XP.
How are any tools going to work if your platform is a proprietary OS? I think you're doing the best you can right now.
