How to get CPU instruction count for a thread? (C)

I know that getrusage() can provide per-thread CPU utilization, but only the time spent on the CPU. Is there any way to get the number of executed CPU instructions, or the number of cycles spent on the CPU?
Basically, I need to find a reproducible measure of how much the thread spends on the CPU. Any suggestions to do this in C?
UPDATE (to respond to comments):
Ideally I'd need this in a platform independent way, but Linux would be the most useful.
Reproducibility is the most important for me, even if that means the actual runtime may be slightly different.
I know vTune (and have used it), but I'd like to have this info programmatically while my code is running. So vTune is out, as well as the suggestions made in the post linked by Craig Estey.
I did look at the Intel Intrinsics Guide, but did not find anything useful...

Take a look at Google's Filament engine; they are doing exactly that. Look at their profiler:
https://github.com/google/filament/blob/master/libs/utils/src/Profiler.cpp
You can also get more info from this link:
https://www.youtube.com/watch?v=Lcq_fzet9Iw
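
For concreteness, here is a minimal sketch of the underlying Linux mechanism that profilers like Filament's use: open a per-thread hardware counter for retired instructions with perf_event_open(2) and read it around the code of interest. The event configuration follows the perf_event_open man page and is an assumption, not Filament's exact code:

    /* Minimal sketch: count retired instructions for the calling thread
     * on Linux via perf_event_open(2). */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static int open_instruction_counter(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;  /* count user-space instructions only */
        attr.exclude_hv = 1;
        /* pid = 0, cpu = -1: the calling thread, on whatever CPU it runs */
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int fd = open_instruction_counter();
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile uint64_t sum = 0;                /* the work to measure */
        for (int i = 0; i < 1000000; i++) sum += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("instructions retired: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

Since this counts retired instructions rather than elapsed time, the result is largely reproducible from run to run, which fits the requirement above. Swapping in PERF_COUNT_HW_CPU_CYCLES would count cycles instead, but cycle counts vary more with frequency scaling and cache behavior.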

Related

Ultra-low latency programming on Linux, where to begin?

I have heard there are ways to modify Linux so that a particular application can obtain very low latency: whenever it asks for resources, the OS will try to give it those resources as soon as possible, kind of overriding the default preemptive multitasking mechanism. I don't have a CS background, but the application I am working on is very latency-sensitive. Can anyone tell me whether there are any docs or other material on this specific matter? Many thanks.
Guaranteed low-latency response is called real-time capability: it means that realistic timing goals are guaranteed to be met.
There is a project for it called RTLinux. See the Real-Time Linux Wiki: https://rt.wiki.kernel.org/index.php/Main_Page
There are two real-time models:
soft real-time system: you get it by applying the RT-Preempt kernel patches, which I think guarantee a context switch within 10 ms; the goal of this project is to conform to hard real-time requirements (a minimal scheduling sketch follows this list).
hard real-time system: has stricter guarantees (a response within 1 ms). Some libraries (like Xenomai) claim they provide a hard real-time system.
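
As an illustration of what opting in to this behavior looks like from C (my addition, not part of the answer above), a latency-sensitive application typically requests a real-time scheduling policy; the priority value 80 is an arbitrary choice:

    /* Minimal sketch: put the calling thread under the real-time
     * SCHED_FIFO policy on Linux. Requires root or CAP_SYS_NICE. */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 80 }; /* arbitrary RT priority */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return 1;
        }
        /* From here on, this thread preempts all normal SCHED_OTHER tasks
         * and runs until it blocks or a higher-priority RT task arrives. */
        puts("running with SCHED_FIFO");
        return 0;
    }

On a kernel with the RT-Preempt patches this is the usual way applications get the low-latency behavior described above; on a stock kernel it still helps, but with weaker guarantees.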

Can you check performance of a program running with Qemu Simulator?

Say I am running an ARM simulator using QEMU: is it possible to find the execution time of a program as it would be on a real ARM processor? In other words, if I use functions such as gettimeofday in a program running on the simulator to check the elapsed time, will the elapsed time be reported accurately, as if the simulation were cycle-accurate?
Investigation into this issue at our company concluded that QEMU (for ARM) is not cycle-accurate. If I remember correctly, cycle accuracy is not a goal of QEMU; instead, it aims at fast emulation. Beware also that exact timing depends on quite unpredictable things like cache hits and misses. It will also depend on the actual architecture chosen: note that ARM is merely an instruction-set IP, and several different implementations exist. If, in addition, an operating system is emulated, things get even more unpredictable.
We use the simulator from ARM to evaluate performance, but even that one is not fully cycle accurate for the latest versions of the ARM architecture.
GEM5
I have seen a researcher use gem5 for this. This paper evaluates how accurate it is, and I've created an easy-to-get-started setup on GitHub.
As Bryan mentioned, QEMU is designed for speed: only valid architectural behavior must be reproduced, not necessarily with the right number of cycles or in the same pipeline order. This is also called functional emulation.
Furthermore, DRAM memory accesses are assumed to be immediate, and therefore it makes no sense to emulate caches either. And as we know, current CPUs are basically memory latency hiding machines.
Cycle accurate emulators on the other hand, also emulate CPU internals, and are therefore way slower.
The root of the problem is, of course, the under-documented performance features of processors, which vendors don't release to prevent intellectual-property leakage.
gem5 appears to implement a generic version of common CPU internals, so it should be more cycle-accurate than functional emulators, but truly cycle-accurate emulation is likely impossible without insider knowledge.
Third-party emulation implementors must then reverse-engineer CPU performance from experiments and existing documentation.
Some of the key "internals" are the caches, the pipeline, and branch prediction.
Related:
Question that asks how cycle accurate emulators are possible at all: How can CAS simulators like PTLsim achieve cycle accurate simulation of x86 hardware?
ARM Cycle-Accurate Simulator

Is QueryPerformanceCounter guaranteed to give you time since boot?

Is it safe to assume that the count returned from QueryPerformanceCounter relates to the time since the last system boot? Or could it be reset while the system is running? The MSDN article itself doesn't guarantee this; however, I've seen some third-party information (such as this) that says it is the case.
It's meant to be used for relative times. But I don't think it can be used to measure time since boot.
From what I hear, it's implemented using the rdtsc instruction which measures "pseudo" CPU cycles since the CPU was powered on. In that case, yes, it probably does give the time since boot, but I don't think this is specified.
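
To make the "relative times" point concrete, here is a minimal sketch of the intended usage, which never relies on the counter's zero point being boot time:

    /* Minimal sketch: measure an elapsed interval with QPC on Windows.
     * Only the difference between two readings is meaningful. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);   /* ticks per second, fixed at boot */
        QueryPerformanceCounter(&start);

        Sleep(100);                         /* the work to be timed */

        QueryPerformanceCounter(&end);
        double ms = (double)(end.QuadPart - start.QuadPart) * 1000.0
                    / (double)freq.QuadPart;
        printf("elapsed: %.3f ms\n", ms);
        return 0;
    }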

What core is a given thread running on?

Is there a function, or any other way, to know programmatically which core of which processor a given thread of my program (pid) is running on? Both OpenMP and Pthreads solutions would help me, if possible. Thanks.
I think on Linux one can try sched_getcpu().
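A minimal sketch of that approach (glibc on Linux; _GNU_SOURCE is needed for the declaration):

    /* Minimal sketch: report which CPU the calling thread is on right now.
     * Note the answer may already be stale by the time it is printed. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        int cpu = sched_getcpu();
        if (cpu < 0) { perror("sched_getcpu"); return 1; }
        printf("running on CPU %d\n", cpu);
        return 0;
    }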
This is going to be platform-specific, I would think. On Windows you can use NtGetCurrentProcessorNumber, but this is caveated as possibly disappearing.
I expect this is hard to do, because there's nothing to stop the thread being moved to a new core at any time (in most apps, anyway). As soon as you get the result, it could be out of date.
For pthreads, I think sched_getaffinity() is at least part of the solution. Not sure exactly how pthreads names the CPUs and cores, though.
This is hard to do portably, as the answer depends both on hardware and OS.
The hardware locality library is a new tool which allows you to query CPU/core/thread etc information (and set affinity bindings) in an OS/hardware agnostic way. It supports a huge list of hardware and OSes, and so should add a lot of portability to these sorts of queries. Once you map out your system's topology, hwloc_get_last_cpu_location will return the CPU the thread last ran on, where CPU can mean core or hardware thread.
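A minimal sketch of that query, based on hwloc's documented API (exact flags and link options may vary by version; link with -lhwloc):

    /* Minimal sketch: ask hwloc where the calling thread last ran. */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);      /* discover the machine's topology */

        hwloc_bitmap_t set = hwloc_bitmap_alloc();
        if (hwloc_get_last_cpu_location(topo, set, HWLOC_CPUBIND_THREAD) == 0) {
            /* The bitmap holds processing units (hardware threads). */
            printf("last ran on PU #%d\n", hwloc_bitmap_first(set));
        }

        hwloc_bitmap_free(set);
        hwloc_topology_destroy(topo);
        return 0;
    }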

OpenMP debug newbie questions

I am starting to learn OpenMP, running examples (with gcc 4.3) from https://computing.llnl.gov/tutorials/openMP/exercise.html in a cluster. All the examples work fine, but I have some questions:
How do I know on which nodes (or cores of each node) the different threads have run?
In the case of nodes, what is the average transfer time, in microseconds or nanoseconds, for sending the info and getting it back?
What are the best tools for debugging OpenMP programs?
Best advices for speeding up real programs?
Typically your OpenMP program does not know, nor does it care, on which cores it is running. If you have a job-management system, it may provide the information you want in its log files. Failing that, you could probably insert calls inside your threads and check the value of some environment variable. What that is called and how you do this is platform-dependent; I'll leave figuring it out up to you.
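
On Linux, one concrete way to make that check from inside the threads (my suggestion, using sched_getcpu() rather than an environment variable) looks like this; compile with -fopenmp:

    /* Minimal sketch: print which core each OpenMP thread is running on. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            printf("thread %d of %d on CPU %d\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   sched_getcpu());
        }
        return 0;
    }

Note that OpenMP threads live within a single node, so this only tells you about cores, not cluster nodes.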
How the heck should I (or any other SOer) know? For an educated guess you'd have to tell us a lot more about your hardware, OS, run-time system, etc., etc. The best answer to the question is the one you determine from your own measurements. I fear that you may also be mistaken in thinking that information is sent around the computer: in shared-memory programming, variables usually stay in one place (or at least you should think of them as staying in one place; the reality may be a lot messier, but it is also impossible to discern) and are not sent or received.
Parallel debuggers such as TotalView or DDT are probably the best tools. I haven't yet used Intel's debugger's parallel capabilities but they look promising. I'll leave it to less well-funded programmers than me to recommend FOSS options, but they are out there.
i) Select the fastest parallel algorithm for your problem. This is not necessarily the fastest serial algorithm made parallel.
ii) Test and measure. You can't optimise without data, so you have to profile the program and understand where the performance bottlenecks are. Don't believe any advice along the lines of 'X is faster than Y'. Such statements are usually based on very narrow, and often outdated, cases and have become, in the minds of their promoters, 'truths'. It's almost always possible to find counter-examples. It's YOUR code YOU want to make faster; there's no substitute for YOUR investigations.
iii) Know your compiler inside out. The rate of return (measured in code-speed improvements) on the time you spend adjusting compilation options is far higher than the rate of return from modifying the code 'by hand'.
iv) One of the 'truths' that I cling to is that compilers are not terrifically good at optimising for use of the memory hierarchy on current processor architectures. This is one area where code modification may well be worthwhile, but you won't know this until you've profiled your code.
You cannot know; the placement of threads on different cores is handled entirely by the OS. You speak about nodes, but OpenMP is a multi-thread (not multi-process) parallelization model that works within one machine containing several cores. If you need parallelization across different machines, you have to use a multi-process system like OpenMPI.
The orders of magnitude of communication times are:
huge bandwidth for communications between cores inside the same CPU; it can be considered instantaneous
~10 GB/s for communications between two CPUs across a motherboard
~100-1000 MB/s for network communications between nodes, depending on the hardware
All the theoretical speeds should be specified in your hardware's documentation. You should also run small benchmarks to know what you will really get.
For OpenMP, gdb does the job well, even with many threads.
I work on extreme physics simulations on supercomputers; here are our daily aims:
use as little communication as possible between the threads/processes; 99% of the time it is communication that kills performance in parallel jobs
split the tasks optimally; the machine load should be as close as possible to 100% all the time
test, tune, re-test, re-tune... Parallelization is not at all a generic "miracle solution"; it generally needs some practical work to be efficient.
