How to get detailed Nvidia GPU usage? - benchmarking

Nvidia-smi only provides a few metrics to measure GPU utilization. Most importantly, utilization.gpu represents the percent of time over the past sample period during which one or more kernels was executing on the GPU. Thus, it seems that a value of 100% does not at all indicate "full" GPU usage.
Alternatively, Nsight Compute provides many detailed metrics, but I found it to run very slowly even on small neural networks, so this does not seem to be its intended use case. Another option seems to be DLProf, but this again only provides rather coarse metrics such as "GPU utilization" and "Tensor Core Efficiency", whose definitions I could not find.
Therefore, is there another tool (or parameter) which provides detailed GPU usage metrics?
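To make the point about utilization.gpu concrete, here is a minimal sketch in C that models the definition quoted above (fraction of the sample period with at least one kernel running) next to a hypothetical "fraction of SM-time used" figure. The struct, the function names and the 80-SM device are assumptions made up for illustration; this is not how the driver samples internally.

```c
#include <assert.h>

/* One kernel execution inside a sample period (hypothetical record). */
typedef struct {
    double start_ms;
    double end_ms;
    int sms_used;   /* number of SMs the kernel actually occupied */
} kernel_run;

/* Fraction of the period during which at least one kernel was running.
 * Assumes runs are sorted by start time and start at or after 0. */
double time_utilization(const kernel_run *runs, int n, double period_ms)
{
    assert(period_ms > 0.0);
    double busy = 0.0, covered_until = 0.0;
    for (int i = 0; i < n; i++) {
        double s = runs[i].start_ms > covered_until ? runs[i].start_ms : covered_until;
        double e = runs[i].end_ms < period_ms ? runs[i].end_ms : period_ms;
        if (e > s) {            /* count only time not already covered */
            busy += e - s;
            covered_until = e;
        }
    }
    return busy / period_ms;
}

/* Fraction of available SM-time used: sum(duration * SMs) / (period * total SMs). */
double sm_time_fraction(const kernel_run *runs, int n, double period_ms, int total_sms)
{
    assert(period_ms > 0.0 && total_sms > 0);
    double sm_ms = 0.0;
    for (int i = 0; i < n; i++) {
        double e = runs[i].end_ms < period_ms ? runs[i].end_ms : period_ms;
        if (e > runs[i].start_ms)
            sm_ms += (e - runs[i].start_ms) * runs[i].sms_used;
    }
    return sm_ms / (period_ms * total_sms);
}
```

With a single kernel that runs for the whole period on 1 of 80 SMs, time_utilization reports 1.0 (that is, 100%) while sm_time_fraction reports 0.0125, which is the gap the question describes.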

Have you considered trying DCGM (NVIDIA Data Center GPU Manager)?


How to get CPU instruction count for a thread?

I know that getrusage() can provide per-thread CPU utilization, but only the time spent on the CPU. Is there any way to get the number of executed CPU instructions, or the number of cycles spent on the CPU?
Basically, I need a reproducible measure of how much CPU work the thread does. Any suggestions for doing this in C?
UPDATE (to respond to comments):
Ideally I'd need this in a platform independent way, but Linux would be the most useful.
Reproducibility is the most important for me, even if that means the actual runtime may be slightly different.
I know VTune (and have used it), but I'd like to have this info programmatically while my code is running. So VTune is out, as well as the suggestions made in the post linked by Craig Estey.
I did look at the Intel Intrinsics Guide, but did not find anything useful...
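On Linux, one way to read a per-thread instruction count from inside the program is the perf_event_open system call. The sketch below is a minimal example under stated assumptions: it is Linux-only, it counts user-space instructions of the calling thread, and the helper names (open_instruction_counter, count_instructions, busy_loop) are made up for illustration. The call can be refused depending on the kernel's perf_event_paranoid setting or in virtual machines without hardware counters, in which case the helpers return -1.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a counter of retired user-space instructions for the calling thread.
 * Returns a file descriptor, or -1 if the kernel refuses the request. */
int open_instruction_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* user-space instructions only */
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: the calling thread, on whichever CPU it runs */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Run work(arg) and return the instructions retired meanwhile, or -1 on failure. */
long long count_instructions(void (*work)(void *), void *arg)
{
    assert(work != NULL);
    int fd = open_instruction_counter();
    if (fd < 0)
        return -1;
    uint64_t value = 0;
    long long result = -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &value, sizeof value) == (ssize_t)sizeof value)
        result = (long long)value;
    close(fd);
    return result;
}

/* Example workload: *arg is the number of loop iterations. */
void busy_loop(void *arg)
{
    long n = *(const long *)arg;
    volatile long sink = 0;   /* volatile keeps the loop from being optimized away */
    for (long i = 0; i < n; i++)
        sink += i;
}
```

Calling count_instructions(busy_loop, &iters) with a long iters returns the count for that call. Changing attr.config to PERF_COUNT_HW_CPU_CYCLES gives cycles instead; instruction counts tend to be more repeatable than cycles or wall time, although they can still vary slightly between runs.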
Take a look at Google's Filament engine. They are doing exactly that.
Look at their profiler.
Also you can get more info from this link:

When does using more than one stream gain benefit in CUDA?

I have written a CUDA program which already gets a speedup of 40 compared to a serial version (2600K vs GTX 780). Now I am thinking about using several streams to run several kernels in parallel. My questions are: how can I measure the free resources on my GPU (because if I have no free resources on my GPU, the use of streams would make no sense, am I right?), and in which cases does the use of streams make sense?
If asked I can provide my code of course, but at the moment I think that it is not needed for the question.
Running kernels concurrently will only happen if the resources are available for it. A single kernel call that "uses up" the GPU will prevent other kernels from executing in a meaningful way, as you've already indicated, until that kernel has finished executing.
The key resources to think about initially are SMs, registers, shared memory, and threads. Most of these are also related to occupancy, so studying occupancy (both theoretical, i.e. occupancy calculator, as well as measured) of your existing kernels will give you a good overall view of opportunities for additional benefit through concurrent kernels.
In my opinion, concurrent kernels are only likely to show much overall benefit in your application if you are launching a large number of very small kernels, i.e. kernels that encompass only one or a small number of threadblocks, and which make very limited use of shared memory, registers, and other resources.
The best optimization approach (in my opinion) is analysis-driven optimization. This tends to avoid premature or possibly misguided optimization strategies, such as "I heard about concurrent kernels, I wonder if I can make my code run faster with it?" Analysis-driven optimization starts out by asking basic utilization questions, using the profiler to answer those questions, and then focusing your optimization effort on improving metrics such as memory utilization or compute utilization. Concurrent kernels and various other techniques are some of the strategies you might use to address the findings from profiling your code.
You can get started with analysis-driven optimization with presentations such as this one.
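As a rough illustration of the occupancy reasoning in this answer, here is a back-of-the-envelope estimate in C of how many blocks of a kernel can be resident on one SM. It is a simplified sketch, not the official occupancy calculator: it ignores register and shared-memory allocation granularity, and the per-SM limits are my assumptions for a compute capability 3.5 device such as the GTX 780. All names are made up for illustration.

```c
#include <assert.h>

/* Per-SM resource limits (assumed values, see KEPLER_SM35 below). */
typedef struct {
    int max_threads_per_sm;
    int max_blocks_per_sm;
    int regs_per_sm;     /* 32-bit registers */
    int smem_per_sm;     /* bytes of shared memory */
} sm_limits;

/* Assumed limits for compute capability 3.5 (e.g. GTX 780). */
const sm_limits KEPLER_SM35 = { 2048, 16, 65536, 49152 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Blocks that fit on one SM, limited by threads, registers, shared memory
 * and the hard cap on resident blocks. Ignores allocation granularity. */
int resident_blocks_per_sm(const sm_limits *sm, int threads_per_block,
                           int regs_per_thread, int smem_per_block)
{
    assert(threads_per_block > 0 && regs_per_thread > 0 && smem_per_block >= 0);
    int by_threads = sm->max_threads_per_sm / threads_per_block;
    int by_regs = sm->regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_block > 0 ? sm->smem_per_sm / smem_per_block
                                     : sm->max_blocks_per_sm;
    return min_int(min_int(by_threads, by_regs),
                   min_int(by_smem, sm->max_blocks_per_sm));
}

/* Theoretical occupancy: resident threads / maximum resident threads. */
double occupancy(const sm_limits *sm, int threads_per_block,
                 int regs_per_thread, int smem_per_block)
{
    int blocks = resident_blocks_per_sm(sm, threads_per_block,
                                        regs_per_thread, smem_per_block);
    return (double)(blocks * threads_per_block) / sm->max_threads_per_sm;
}
```

For example, with 256 threads per block and 64 registers per thread, registers limit the SM to 4 resident blocks, so the estimate is 0.5. A kernel that already fills every SM this way leaves little room for a second kernel to run concurrently, whereas a kernel launched with only a handful of small blocks leaves most SMs idle.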
If you specify no stream, stream 0 is used. According to Wikipedia (you can also find it in the cudaDeviceProp structure), your GTX 780 GPU has 12 streaming multiprocessors, which means there could be an improvement if you use multiple streams. The asyncEngineCount property will tell you how many concurrent asynchronous memory copies can run.
The idea of using streams is to use an asynchronous memory copy engine (a.k.a. DMA engine) to overlap kernel execution with device-to-host transfers. The number of streams you should use for best performance is hard to guess because it depends on the number of DMA engines you have, the number of SMs, and the balance between synchronization and concurrency. To get an idea you can read this presentation (for instance, slides 5 and 6 explain the idea very well).
Edit: I agree that using a profiler is needed as a first step.

Is there a difference between a real time system and one that is just deterministic?

At work we're discussing the design of a new platform and one of the upper management types said it needed to run our current code base (C on Linux) but be real time because it needed to respond in less than a second to various inputs. I pointed out that:
1. That point doesn't mean it needs to be "real time", just that it needs a faster clock and more streamlined interrupt handling.
2. One of the key points to consider is the OS that's being used. They wanted to stick with embedded Linux; I pointed out that we need an RTOS. Using Linux will prevent "real time" because of the kernel/user space memory split: I/O is done via files and sockets, which introduces a delay.
3. What we really need to determine is whether it needs to be deterministic (for example, it needs to respond to input in <200ms 90% of the time).
Really in my mind if point 3 is true, then it needs to be a real time system, and then point 2 is the biggest consideration.
I felt confident answering, but then I was thinking about it later... What do others think? Am I on the right track here or am I missing something?
Is there any difference that I'm missing between a "real time" system and one that is just "deterministic"? And besides a RTC and a RTOS, am I missing anything major that is required to execute a true real time system?
Look forward to some great responses!
Got some good responses so far, looks like there's a little curiosity about my system and requirements so I'll add a few notes for those who are interested:
My company sells units in the tens of thousands, so I don't want to go overkill on the price
Typically we sell a main processor board and an independent display. There's also an attached network of other CAN devices.
The board (currently) runs the devices and also acts as a webserver sending basic XML docs to the display for end users
The requirements come in here where management wants the display to be updated "quickly" (<1s), however the true constraints IMO come from the devices that can be attached over CAN. These devices are frequently motor controlled devices with requirements including "must respond in less than 200ms".
You need to distinguish between:
Hard realtime: there is an absolute limit on response time that must not be breached (a breach counts as a failure). This is appropriate, for example, when you are controlling robotic motors or medical devices, where failure to meet a deadline could be catastrophic.
Soft realtime: there is a requirement to respond quickly most of the time (perhaps 99.99%+), but it is acceptable for the time limit to be occasionally breached provided the response on average is very fast. This is appropriate, for example, when performing realtime animation in a computer game: missing a deadline might cause a skipped frame but won't fundamentally ruin the gaming experience.
Soft realtime is readily achievable in most systems as long as you have adequate hardware and pay sufficient attention to identifying and optimising the bottlenecks. With some tuning, it's even possible to achieve in systems that have non-deterministic pauses (e.g. the garbage collection in Java).
Hard realtime requires dedicated OS support (to guarantee scheduling) and deterministic algorithms (so that once scheduled, a task is guaranteed to complete within the deadline). Getting this right is hard and requires careful design over the entire hardware/software stack.
It is important to note that most business apps don't require either: in particular I think that targeting a <1sec response time is far away from what most people would consider a "realtime" requirement. Having said that, if a response time is explicitly specified in the requirements then you can regard it as soft realtime with a fairly loose deadline.
From the definition of the real-time tag:
A task is real-time when the timeliness of the activities' completion is a functional requirement and correctness condition, rather than merely a performance metric. A real-time system is one where some (though perhaps not all) of the tasks are real-time tasks.
In other words, if something bad will happen if your system responds too slowly to meet a deadline, the system needs to be real-time and you will need an RTOS.
A real-time system does not need to be deterministic: if the response time randomly varies between 50ms and 150ms but the response time never exceeds 150ms then the system is non-deterministic but it is still real-time.
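The distinctions drawn in these answers (hard deadline, soft deadline, determinism) can be expressed as three separate numbers computed from measured response times. The following C sketch is only an illustration; the latency_report type and analyse_latencies function are hypothetical names. Note that measurements can show that a deadline was missed, but they cannot prove a hard guarantee, which needs a worst-case analysis of the whole system.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double min_ms;
    double max_ms;
    double jitter_ms;   /* max - min; 0 would mean fully deterministic timing */
    double miss_rate;   /* fraction of responses later than the deadline (soft view) */
    int meets_hard;     /* 1 if no observed response exceeded the deadline */
} latency_report;

/* Summarise n measured response times against one deadline. */
latency_report analyse_latencies(const double *ms, size_t n, double deadline_ms)
{
    assert(ms != NULL && n > 0);
    latency_report r = { ms[0], ms[0], 0.0, 0.0, 0 };
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (ms[i] < r.min_ms) r.min_ms = ms[i];
        if (ms[i] > r.max_ms) r.max_ms = ms[i];
        if (ms[i] > deadline_ms) misses++;
    }
    r.jitter_ms = r.max_ms - r.min_ms;
    r.miss_rate = (double)misses / (double)n;
    r.meets_hard = (misses == 0);
    return r;
}
```

For the samples 50, 120, 150 and 80 ms with a 150 ms deadline, meets_hard is 1 even though the jitter is 100 ms: the behaviour is non-deterministic yet within the deadline, as in the 50ms to 150ms example above. With a 100 ms deadline the same samples give a miss rate of 0.5, which might or might not be acceptable for a soft requirement.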
Maybe you could try RTLinux or RTAI if you have sufficient time to experiment. With these, you can keep the non-realtime applications on Linux, while the realtime applications are moved to the RTOS part. In that case, you might achieve a <1 second response time.
The advantages are -
Large amount of code can be re-used
You can manually partition realtime and non-realtime tasks and try to achieve the response <1s as you desire.
I think migration time will not be very high, since most of the code will be in linux
Just on a sidenote be careful about the hardware drivers that you might need to run on the realtime part.
The following architecture of RTLinux might help you to understand how this can be possible.
It sounds like you're on the right track with the RTOS. Different RTOSs prioritize different things, such as robustness or speed. You will need to figure out whether you need a hard or soft RTOS and, based on what you need, how your scheduler is going to be driven. One thing is for sure: there is a serious difference between using a regular OS and an RTOS.
Note: perhaps for the truest real-time system you will need hard event-based resolution so that you can guarantee that your processes will execute when you expect them to.
An RTOS (real-time operating system) is designed for embedded applications. In a multitasking system that handles critical applications, the operating system must be:
1. deterministic in memory allocation,
2. able to allocate CPU time to different threads, tasks and processes,
3. the kernel must be preemptive, which means a higher-priority task can take the CPU as soon as it becomes ready instead of waiting for the running task to finish, etc.
So standard Windows or Linux cannot be used.
Examples of embedded systems that use an RTOS: satellites, Formula 1 cars, car navigation systems.
Embedded system: a system designed to perform one or a few dedicated functions.
System with an RTOS: can also be an embedded system, but an RTOS is naturally used in a real-time system that needs to perform many functions.
Real-time system: a system that can provide its output within a definite, predictable amount of time. This does not mean that real-time systems are faster.
Differences between the two:
1. Ordinary embedded systems are not real-time systems.
2. Systems with an RTOS are real-time systems.

Benchmark for cache-to-cache latency

I'm looking for a benchmark that can measure the cache latencies and bandwidth of the processors. In particular I need the measurement for cache-to-cache times from one core to another (including different die and different socket).
Something which runs on linux is required.
A web page showing the results of such tests on the most recent CPUs would also be a good compromise for now.
Try lmbench3; it has all kinds of benchmarks, including the ones you want.
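For a rough idea of what a core-to-core measurement looks like, here is a minimal ping-pong sketch in C: two threads pinned to different CPUs bounce one shared flag back and forth, and the average round-trip time approximates two cache-line transfers. This is not lmbench and not a calibrated benchmark; pingpong_ns and the other names are made up for illustration, pinning is best effort, and it assumes Linux with glibc and pthreads (build with -pthread).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int flag;   /* 0: waiting for ping, 1: waiting for pong */

typedef struct { int cpu; int rounds; } pong_args;

/* Best-effort pinning of the calling thread; errors are ignored. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *pong(void *p)
{
    const pong_args *a = p;
    pin_to_cpu(a->cpu);
    for (int i = 0; i < a->rounds; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) { }
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

static void round_trip(void)
{
    atomic_store_explicit(&flag, 1, memory_order_release);
    while (atomic_load_explicit(&flag, memory_order_acquire) != 0) { }
}

/* Average round-trip time in ns between cpu_a and cpu_b, or -1.0 on failure.
 * Side effect: the calling thread stays pinned to cpu_a afterwards. */
double pingpong_ns(int cpu_a, int cpu_b, int rounds)
{
    assert(rounds > 0);
    pong_args args = { cpu_b, rounds + 1 };   /* +1 for the warm-up trip */
    pthread_t t;
    struct timespec t0, t1;
    atomic_store(&flag, 0);
    if (pthread_create(&t, NULL, pong, &args) != 0)
        return -1.0;
    pin_to_cpu(cpu_a);
    round_trip();   /* untimed, so thread start-up is not measured */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds; i++)
        round_trip();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / rounds;
}
```

Calling pingpong_ns(0, 1, 1000000) and then repeating it for CPU pairs on different dies or sockets gives comparable numbers for each pair. Which CPU numbers share a die or socket is system-specific, so the pairs have to be chosen from the machine's topology.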

OpenMP debug newbie questions

I am starting to learn OpenMP, running examples (with gcc 4.3) in a cluster. All the examples work fine, but I have some questions:
1. How do I know on which nodes (or cores of each node) the different threads have been run?
2. In the case of nodes, what is the average transfer time, in microseconds or nanoseconds, for sending the information and getting it back?
3. What are the best tools for debugging OpenMP programs?
4. What is the best advice for speeding up real programs?
Typically your OpenMP program does not know, nor does it care, on which cores it is running. If you have a job management system, that may provide the information you want in its log files. Failing that, you could probably insert calls to the environment inside your threads and check the value of some environment variable. What that is called and how you do this is platform dependent; I'll leave figuring it out up to you.
How the heck should I (or any other SOer) know? For an educated guess you'd have to tell us a lot more about your hardware, o/s, run-time system, etc, etc, etc. The best answer to the question is the one you determine from your own measurements. I fear that you may also be mistaken in thinking that information is sent around the computer: in shared-memory programming, variables usually stay in one place (or at least you should think of them as staying in one place; the reality may be a lot messier but also impossible to discern) and are not sent or received.
Parallel debuggers such as TotalView or DDT are probably the best tools. I haven't yet used Intel's debugger's parallel capabilities but they look promising. I'll leave it to less well-funded programmers than me to recommend FOSS options, but they are out there.
i) Select the fastest parallel algorithm for your problem. This is not necessarily the fastest serial algorithm made parallel.
ii) Test and measure. You can't optimise without data so you have to profile the program and understand where the performance bottlenecks are. Don't believe any advice along the lines that 'X is faster than Y'. Such statements are usually based on very narrow, and often out-dated, cases and have become, in the minds of their promoters, 'truths'. It's almost always possible to find counter-examples. It's YOUR code YOU want to make faster, there's no substitute for YOUR investigations.
iii) Know your compiler inside out. The rate of return (measured in code speed improvements) on the time you spent adjusting compilation options is far higher than the rate of return from modifying the code 'by hand'.
iv) One of the 'truths' that I cling to is that compilers are not terrifically good at optimising for use of the memory hierarchy on current processor architectures. This is one area where code modification may well be worthwhile, but you won't know this until you've profiled your code.
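On the first question, one platform-dependent way for a thread to ask where it is currently running is sched_getcpu() on Linux with glibc. The sketch below is an assumption-laden example rather than the answer's method: record_thread_cpus is a made-up name, the OpenMP calls are guarded so the file still compiles without -fopenmp (in which case it reports a single thread), and a thread may migrate to another core right after the call unless threads are pinned, for example through the OMP_PROC_BIND environment variable in newer OpenMP versions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Store in cpu_of[t] the CPU on which OpenMP thread t was running when asked.
 * Returns the number of entries filled (team size, capped at capacity). */
int record_thread_cpus(int cpu_of[], int capacity)
{
    assert(cpu_of != NULL && capacity > 0);
    int nthreads = 1;
    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();
#endif
        if (tid < capacity)
            cpu_of[tid] = sched_getcpu();   /* -1 if the call is unsupported */
    }
    return nthreads < capacity ? nthreads : capacity;
}
```

Printing the filled entries after a call such as record_thread_cpus(cpus, 64) shows one CPU number per thread. All threads of one OpenMP program run on the same node, so this only distinguishes cores, not nodes.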
You cannot know in advance; the distribution of threads across cores is handled entirely by the OS. You speak of nodes, but OpenMP is a multi-threaded (not multi-process) model that parallelizes work within one machine containing several cores. If you need parallelization across different machines, you have to use a multi-process system such as MPI (for example Open MPI).
The orders of magnitude of communication speeds are:
very high for communication between cores inside the same CPU; it can be considered instantaneous
~10 GB/s for communication between two CPUs across a motherboard
~100-1000 MB/s for network communication between nodes, depending on the hardware
All the theoretical speeds should be listed in your hardware specifications. You should also run small benchmarks to see what you actually get.
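To turn the bandwidth figures above into transfer times, a simple first-order model is time = latency + size / bandwidth. The C sketch below applies it; the function name and the latency parameter are my own additions for illustration (the list above only gives bandwidths), so treat the results as rough estimates to be checked against your own benchmarks.

```c
#include <assert.h>

/* Estimated time in seconds to move `bytes` over a link:
 * fixed start-up latency plus size divided by bandwidth. */
double transfer_seconds(double bytes, double bandwidth_bytes_per_s, double latency_s)
{
    assert(bytes >= 0.0 && bandwidth_bytes_per_s > 0.0 && latency_s >= 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}
```

With zero latency, moving 1 GB takes about 0.1 s at 10 GB/s but about 10 s at 100 MB/s, a factor of 100 between the CPU-to-CPU and the slow-network figures. For small messages the latency term dominates, which is why many small exchanges hurt more than a few large ones.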
For OpenMP, gdb does the job well, even with many threads.
I work on extreme physics simulations on supercomputers; here are our daily aims:
use as little communication as possible between the threads/processes; 99% of the time it is communication that kills performance in parallel jobs
split the tasks optimally; machine load should be as close as possible to 100% all the time
test, tune, re-test, re-tune... Parallelization is not at all a generic "miracle solution"; it generally needs some practical work to be efficient.