I am using OpenMP to parallelize loops in my code in order to optimize it.
I hear that OpenMP programs can also show good or bad cache behavior.
How do I see these cache interactions so that I can arrange good cache behavior for my OpenMP pragma loop program?
OpenMP itself cannot be used to get information about the cache usage of your program. Depending on your platform, there are some tools that will give insight into the cache behavior.
On Linux systems you can use perf.
perf stat -e cache-references,cache-misses <your-exe>
outputs statistics about the cache-misses. There are a lot more events that can be used (see here for further details). Common events are collected if you simply run:
perf stat <your-exe>
Another tool, which also works on Windows, is the Intel® Performance Counter Monitor. Although it only works with Intel CPUs, it can collect additional information such as the memory bandwidth in use (on supported models).
However, these tools only help you measure the cache usage of your program; they do not improve it. You have to optimize your code manually and then recheck whether the cache misses have been reduced.
If you're looking at a specific kernel, you might want to consider PAPI.
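As a rough illustration of what "manually optimize" can mean for an OpenMP loop, here is a minimal sketch that sums the same row-major matrix in two loop orders. The function names `sum_rows_first` and `sum_cols_first` are made up for this example and do not come from any library. The first version walks memory contiguously in the inner loop, which is usually kinder to the cache; the second strides by `n` elements per access. Building each variant into its own executable and running the `perf stat -e cache-references,cache-misses` command from above on both should show the difference, although the actual numbers depend on the matrix size and the CPU.

```c
#include <stddef.h>

/* Inner loop touches consecutive addresses of a row-major n x n matrix.
 * The pragma is ignored if the file is compiled without OpenMP support. */
double sum_rows_first(const double *a, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}

/* Same result, but the inner loop jumps n elements per step, so for a
 * large n almost every access lands on a different cache line. */
double sum_cols_first(const double *a, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}
```

With GCC this would be compiled with `-fopenmp`. Both functions return the same sum, so any difference in the perf output comes from the access pattern alone.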
Related
Is there a programmatic way to disable and enable the smart cache capability of an Intel CPU through C or C++, or maybe assembly code? I would like to measure algorithm performance with and without smart cache. Is such an option available or not? I searched a lot but did not find anything useful. My CPU is an Intel 6700HQ.
Smart cache is an architectural feature and relies on a certain hardware structure being present (in detail, the L2/L3 caches of individual cores not being separated, as well as certain optimizations in the data prefetch logic, etc.). As such, it is highly unlikely that this feature can be disabled (although I was unable to find any reference on this).
As part of my research I am looking for alternatives to profile an OpenMP code with explicit tasks (as per OpenMP 3.0). My main objective is to study the amount of overhead incurred when tasks are lying idle at a global barrier (such as a taskwait), prior to being scheduled and executed.
I looked into using the latest version of TAU, which has support for Opari, which in turn instruments the source code to produce profiling statistics. Unfortunately, since it instruments the source code, it adds a large amount of overhead to program execution.
Tools like Gprof and PGprof do not provide the detail I am looking for. I have already tried and tested with them.
I am looking for a tool that can help me profile an OpenMP program with tasks while imposing minimal overhead. I am tempted to look into HPCToolkit and Scalasca, but I am not sure if they provide support for OpenMP tasks.
Looking for directions and your suggestions.
Thanks!!
Try LIKWID = Like I Knew What I’m Doing.
It is very reliable and free.
Is there a way I can access the (Intel) hardware counters for each core programmatically? (That is, no perf, perfmon, or valgrind, and I should add "simple", so no PAPI either.) I'd like to know, for each core, how many L1-LLC cache hits/misses a certain program running on that core incurred. This is for Linux 3.2.0-32, C, and using GCC.
The performance counters in the processor cannot normally be read from "user-mode" code, so you need some sort of kernel module to do this. Once you have that, it's not terribly hard; there are a number of MSRs.
You can perhaps also use /dev/cpu/core-number/msr to read the values without writing a kernel module of your own. This relies on the kernel's msr driver being loaded and usually requires root.
To describe all the details of how you do this is a little too much for an answer (unless I copy'n'paste the entire section of Intel's programmers manual (Vol3) - which I don't think is quite what we want here...)
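For the /dev/cpu/core-number/msr route, the read itself is small: the msr device expects an 8-byte read at a file offset equal to the register address. Below is a minimal sketch of just that step. The helper name `read_msr` is made up for this example, and which register addresses to pass (and how to program the event-select registers beforehand) has to come from the Intel manual for your CPU model; nothing here fills that in.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Read one 64-bit register from an msr-style device file.
 * path: for example "/dev/cpu/0/msr" (needs the msr driver and, usually, root).
 * reg:  register address, used as the file offset.
 * Returns 0 on success, -1 on any failure. */
int read_msr(const char *path, uint32_t reg, uint64_t *value)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The device only accepts reads of exactly 8 bytes. */
    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

To get per-core numbers you would call this once per /dev/cpu/N/msr file, before and after the code of interest, and subtract the two readings.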
I need to see the load balancing characteristics of my multithreaded program. Is there any tool that will give me the information to, e.g., plot this? I need something simple that will give me information per core, for example, but not Intel VTune and the like... that is so bloated it hurts to even look at it.
Take a look at Linux Trace Toolkit - next generation (LTTng). You can also use GNU gprof; it's not sexy but it does the job :)
EDIT :
You can use gprof in a threaded environment: Using gprof with pthreads
EDIT2 : Oprofile may help also
I've only scratched the surface of the capabilities of AMD's CodeAnalyst, but what I have found so far is impressive, especially all the performance counters and getting them into the detailed picture. As to per-thread profiling, I mostly write massively parallel applications running for extended periods of time on dedicated cores, which may not be applicable to your case.
It appears quite frugal with respect to its own CPU needs. I don't know if it will profile on Intel CPUs. There is a Linux version.
Give it a spin!
You can also use perf, the official implementation for supporting performance counters in the Linux kernel. In addition to reading performance counters, it also gives access to other metrics such as context switches, CPU migrations, page faults, etc.
Unfortunately the official wiki does not contain much information. But you can check this page for more information on how to use the different tools included in perf.
To investigate this subject, I used the following command:
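The perf tools sit on top of the perf_event_open system call, so the same counters can also be read from inside a program. The following is a minimal sketch modelled on the pattern in the perf_event_open(2) man page. The names `count_cache_misses` and `sample_work` are made up for illustration, and the sketch only counts the calling thread; it has not been tuned for multithreaded use.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical workload, present only so the sketch has something to measure. */
void sample_work(void)
{
    static volatile char buf[1 << 20];
    for (size_t i = 0; i < sizeof buf; i += 64)
        buf[i]++;
}

/* Count hardware cache misses of the calling thread while work() runs.
 * Returns -1 if the counter cannot be opened or read, for example when
 * perf_event_paranoid forbids it or the machine exposes no PMU. */
long long count_cache_misses(void (*work)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* user-space activity only */
    attr.exclude_hv = 1;

    /* glibc has no wrapper; pid = 0 and cpu = -1 mean "this thread, any CPU". */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = -1;
    if (read(fd, &count, sizeof count) != (ssize_t)sizeof count)
        count = -1;
    close(fd);
    return count;
}
```

Calling `count_cache_misses(sample_work)` returns the miss count for that one call, or -1 when counters are unavailable, so the caller should check for that before using the value.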
ps -AL -o lwp,fname,psr | grep ammp
The application under study was ammp; it uses as many threads as there are cores. The command reports which core each thread is currently on. By executing this command several times you will see how a given thread moves between cores and how the load balancing algorithm works.
I hope you find this useful.
Is there a Linux library that can run performance profiling within a running process?
I have a rather large Linux program that is heavily script-based. Depending on the scripts, the program can have wildly different behaviors (and performance problems). What would be nice is a low-overhead performance library that I can embed in the same process, one that monitors the process and gives it real-time feedback about its own performance.
Oprofile would be fantastic, if I could start it within the program and keep it isolated to only that program. From the documentation I've read, it doesn't appear possible.
Does anyone know of any such library?
Thanks!
Andrew Klofas
Check out gprof - it should do what you want.
I think gperftools works well for profiling. The runtime performance penalty for CPU profile data is very small.