Do you know of any profiler that reports the total number of CPU operations a C/C++ program executes? I need something like Valgrind's callgrind on Linux...
Intel has some tools such as VTune. They also provide a performance counter library which you can use to instrument your code manually, by reading the hardware perf counter registers before and after a piece of code.
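To illustrate the "read the counter before and after a piece of code" idea, here is a minimal sketch. It does not use Intel's library; it swaps in the Linux `perf_event_open` system call instead, so it only applies on Linux. The helper names `open_instruction_counter`, `count_instructions` and the workload `sum_to_n` are made up for this example, and the counter may be unavailable (for instance inside a VM or with restrictive `perf_event_paranoid` settings), in which case the sketch just returns -1.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* for syscall() */
#endif
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Opens a counter for user-space instructions retired by the calling
   thread. Returns a file descriptor, or -1 if the counter is unavailable. */
static int open_instruction_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;           /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: this thread on any CPU; no group leader, no flags. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Hypothetical workload: sums 0..n-1. The volatile accumulator keeps the
   compiler from optimizing the loop away. */
static void sum_to_n(void *arg)
{
    unsigned long n = *(unsigned long *)arg;
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++)
        sum += i;
}

/* Counts instructions retired while work(arg) runs.
   Returns -1 if the counter could not be opened or read. */
static long long count_instructions(void (*work)(void *), void *arg)
{
    assert(work != NULL);
    int fd = open_instruction_counter();
    if (fd < 0)
        return -1;

    uint64_t count = 0;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);   /* "before" */
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  /* "after" */
    ssize_t got = read(fd, &count, sizeof count);
    close(fd);
    return got == (ssize_t)sizeof count ? (long long)count : -1;
}
```

Calling `count_instructions(sum_to_n, &n)` with an `unsigned long n` gives the instruction count for just that call, which is roughly the number callgrind would report for the same region, without the simulation slowdown.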
Visual Studio has an instrumented profiler but I don't know if it gets down to the "instructions retired" level of detail.
You should ask yourself what information you really want: do you want to count the number of cycles spent in a function, or do you really want to know how much wall-clock time your app spends in each function overall? The latter is more useful in most cases, and you can get it more easily by sampling. (See also Mike Dunlavey's simple do-it-by-hand method, which works for big hotspots.)
Counting actual instructions retired and branch mispredicts and so on is only useful if you really understand the details of the CPU pipeline and how to optimize around it. Microseconds-per-function is typically what you really want to optimize instead.
Related
I'm a high-school student doing some C things where I'd like to profile my code to see where the actual performance bottlenecks are. I don't have much money, so I'd prefer free tools.
I like to use the MinGW/GCC compiler toolchain. This is not something I'm stuck with, but I'd prefer tools that are capable of working with this.
Features I need:
See how much total time is spent in a certain function.
Features I'd like:
See how much time a line of code takes.
Cross-platform (being able to use the same software on Linux & Mac)
See how often a function gets called (and how long each call takes on average).
See what causes the time spent (cache misses, branch mispredictions, etc).
I've tried using gprof, but I couldn't get it to work (it only shows main in the profile), and I've heard bad things about it, so what are my options?
If you want a free time-based profiler (TBP) for Windows and Linux (it also does event-based and some other metric-based forms of profiling), then AMD's CodeAnalyst should do the job nicely. It works even on Intel CPUs, though I'm not sure of the quality or reliability of the branching and cache analysis there. It also has a nice UI built in Qt that shows per-line time breakdowns for source and assembly, and an API for embedding events for the profiler to catch, for more targeted profiling.
Is there any extension of Valgrind that can be used from the command line and would tell me the time, in seconds, spent in each function in my C code?
thanks =)
For machine instruction profiling use valgrind's callgrind (also, cachegrind can do cache and branch prediction profiling which is quite nice).
For time measurements use Google's CPU profiler; it gives far better results than gprof. You can set the sampling frequency, and it can show the output as a nice annotated call graph.
Valgrind isn't suited for measuring time, because running an application under it distorts the results (overall slowdown, and a skewed balance between CPU and I/O). That is why Valgrind's profiling tool, callgrind, counts CPU instructions instead of measuring time. Callgrind is only useful if your bottleneck is CPU-bound, in which case the instruction counts are roughly proportional to the time spent. It's not useful if heavy I/O or multiple processes are involved. In that case use a sampling profiler such as sysprof or gprof (Edit 2020: perf), which checks at intervals which function the process is in and gives less distorted results.
Use this link. I would think something like Callgrind should do the trick.
I'm working on a complex network software and I have trouble determining how to improve the systems performance.
Specifically in one part of the software which is using blocking synchronous calls. Since this part of the system is doing heavy computations it's nearly impossible to determine whether the slowness of this component is caused by these computations or the waiting for the other parts of the system.
Are there any light-weight profilers that can capture this information? I can't use heavy duty profile like valgrind since that would completely skew the results (although valgrind would be perfect, since it captures all the required information).
I tried using oProfile but I just wasn't able to get any meaningful results out of it (perhaps if there is a concise tutorial somewhere...).
What you need is something that gives you stack samples, on wall-clock time (not just CPU time like gprof), and reports by line (not just by function) the percent of samples containing the line.
Zoom will do it,
but I just do random-pausing. Here's why it works.
Here's a blow-by-blow example.
Here's another explanation.
Comment out your "heavy computations" and see if it's still slow. That will tell you if it's waiting on other systems over the network or the computations. The answer may not be either/or and may just be an accumulation of things.
You could also do some old fashioned printf debugging and print the time before and after executing the function to standard output or syslog. That is about as light-weight as profiling gets.
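A minimal sketch of that timestamp-before-and-after approach, assuming a POSIX system with `clock_gettime`. `heavy_computation` is a hypothetical stand-in for whatever you want to measure, and `elapsed_seconds` and `time_heavy_computation` are names invented for this example. It uses the monotonic clock, so the result is wall-clock time and includes any time spent blocked on other parts of the system.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Difference between two timestamps, in seconds. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical stand-in for the code being measured. */
static unsigned long heavy_computation(unsigned long n)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++)
        sum += i % 7;
    return sum;
}

/* Runs the computation once, logs its wall-clock duration to stderr,
   and returns that duration in seconds. */
static double time_heavy_computation(unsigned long n)
{
    struct timespec t0, t1;
    int rc = clock_gettime(CLOCK_MONOTONIC, &t0);
    assert(rc == 0);
    unsigned long result = heavy_computation(n);
    rc = clock_gettime(CLOCK_MONOTONIC, &t1);
    assert(rc == 0);
    (void)rc;

    double secs = elapsed_seconds(t0, t1);
    fprintf(stderr, "heavy_computation(%lu) = %lu took %.6f s\n",
            n, result, secs);
    return secs;
}
```

Wrapping the blocking synchronous calls the same way, separately from the computation, lets you compare the two totals in the log and see which one dominates.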
I have a program running extremely slow. Is there a way to use valgrind to find out which function I need to optimize?
Thanks.
You can use the callgrind tool for valgrind, which should be part of each valgrind distribution. It runs the program in the valgrind "virtual machine" and counts the number of instructions spent in each function/line of code.
The best UI for visualizing the results is kcachegrind (part of KDE).
Advantage: It works quite well if your bottleneck is CPU-bound, as it completely simulates the application, so you get very accurate and detailed results if CPU instructions are what interest you. If not, the results might be distorted.
Disadvantage: It's slow (like valgrind). If your problem is I/O-bound, the slow execution speed will distort the results (making I/O faster in comparison) and also influence the behavior. In such cases, a profiler taking samples is the better approach.
No, Valgrind is a dynamic analysis tool used to flush out memory allocation errors and thread race conditions (among other things).
You're looking for a code profiler, such as Luke Stackwalker. I don't know of any for *NIX systems off the top of my head, sorry.
Not as far as I know. oprofile is the best tool for what you want.
I hope not everyone is using Rational Purify.
So what do you do when you want to measure:
time taken by a function
peak memory usage
code coverage
At the moment, we do it manually (using log statements with timestamps and another script to parse the log and output to Excel. Phew...)
What would you recommend? Pointing to tools or any techniques would be appreciated!
EDIT: Sorry, I didn't specify the environment first. It's plain C on a proprietary mobile platform.
I've done this a lot. If you have an IDE, or an ICE, there is a technique that takes some manual effort, but works without fail.
Warning: modern programmers hate this, and I'm going to get downvoted. They love their tools. But it really works, and you don't always have the nice tools.
I assume in your case the code is something like DSP or video that runs on a timer and has to be fast. Suppose what you run on each timer tick is subroutine A. Write some test code to run subroutine A in a simple loop, say 1000 times, or long enough to make you wait at least several seconds.
While it's running, randomly halt it with a pause key and sample the call stack (not just the program counter) and record it. (That's the manual part.) Do this some number of times, like 10. Once is not enough.
Now look for commonalities between the stack samples. Look for any instruction or call instruction that appears on at least 2 samples. There will be many of these, but some of them will be in code that you could optimize.
Do so, and you will get a nice speedup, guaranteed. The 1000 iterations will take less time.
The reason you don't need a lot of samples is you're not looking for small things. Like if you see a particular call instruction on 5 out of 10 samples, it is responsible for roughly 50% of the total execution time. More samples would tell you more precisely what the percentage is, if you really want to know. If you're like me, all you want to know is where it is, so you can fix it, and move on to the next one.
Do this until you can't find anything more to optimize, and you will be at or near your top speed.
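The arithmetic behind "5 out of 10 samples is roughly 50%" can be written down as a small sketch. It assumes you have typed the recorded stacks into a table by hand; the struct layout and the names `stack_sample`, `sample_contains` and `fraction_on_stack` are invented for this example, and the frame labels are whatever you wrote down (function names, or function plus line number).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 8

/* One manually recorded stack sample, outermost frame first.
   Unused slots are NULL. */
struct stack_sample {
    const char *frames[MAX_DEPTH];
};

/* Returns 1 if the call site appears anywhere on this sample's stack.
   A site that appears several times (recursion) still counts once. */
static int sample_contains(const struct stack_sample *s, const char *site)
{
    for (size_t i = 0; i < MAX_DEPTH && s->frames[i] != NULL; i++)
        if (strcmp(s->frames[i], site) == 0)
            return 1;
    return 0;
}

/* Fraction of samples whose stack contains the call site: a rough
   estimate of the share of wall-clock time that site is responsible for. */
static double fraction_on_stack(const struct stack_sample *samples,
                                size_t count, const char *site)
{
    assert(count > 0);
    size_t hits = 0;
    for (size_t i = 0; i < count; i++)
        hits += (size_t)sample_contains(&samples[i], site);
    return (double)hits / (double)count;
}
```

Note that it counts samples containing the site anywhere on the stack, not just at the top. That is why a call instruction high up in the stack can show up as expensive even though the program counter is never in that function itself.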
You probably want different tools for performance profiling and code coverage.
For profiling I prefer Shark on MacOSX. It is free from Apple and very good. If your app is vanilla C you should be able to use it, if you can get hold of a Mac.
For profiling on Windows you can use LTProf. Cheap, but not great:
http://successfulsoftware.net/2007/12/18/optimising-your-application/
(I think Microsoft are really shooting themselves in the foot by not providing a decent profiler with the cheaper versions of Visual Studio.)
For coverage I prefer Coverage Validator on Windows:
http://successfulsoftware.net/2008/03/10/coverage-validator/
It updates the coverage in real time.
For complex applications I am a great fan of Intel's VTune. It requires a slightly different mindset from a traditional profiler that instruments the code. It works by sampling the processor to see where the instruction pointer is 1,000 times a second. It has the huge advantage of not requiring any changes to your binaries, which as often as not would change the timing of what you are trying to measure.
Unfortunately it is no good for .NET or Java, since there isn't a way for VTune to map the instruction pointer to a symbol like there is with traditional code.
It also allows you to measure all sorts of other processor/hardware centric metrics, like clocks per instruction, cache hits/misses, TLB hits/misses, etc which let you identify why certain sections of code may be taking longer to run than you would expect just by inspecting the code.
If you're doing an 'on the metal' embedded 'C' system (I'm not quite sure what 'mobile' implied in your posting), then you usually have some kind of timer ISR, in which it's fairly easy to sample the code address at which the interrupt occurred (by digging back in the stack or looking at link registers or whatever). Then it's trivial to build a histogram of addresses at some combination of granularity/range-of-interest.
It's usually then not too hard to concoct some combination of code/script/Excel sheets which merges your histogram counts with addresses from your linker symbol/list file to give you profile information.
If you're very RAM limited, it can be a bit of a pain to collect enough data for this to be both simple and useful, but you would need to tell us more about your platform.
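A rough sketch of the histogram idea above, under several assumptions: the bucket size and count are arbitrary, how you obtain the interrupted address inside the ISR is platform-specific and not shown, and the symbol table stands in for whatever you extract from your linker map file. All names here (`pc_histogram`, `histogram_record`, `symbol_for`, `samples_in_symbol`) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUCKET_SHIFT 4u     /* 16-byte buckets: an assumed granularity */
#define NUM_BUCKETS  1024u  /* covers 16 KiB of code starting at base */

struct pc_histogram {
    uintptr_t base;                 /* start of the code range of interest */
    uint32_t  counts[NUM_BUCKETS];
    uint32_t  out_of_range;
};

/* Call from the timer ISR with the address at which the interrupt hit. */
static void histogram_record(struct pc_histogram *h, uintptr_t pc)
{
    if (pc < h->base) { h->out_of_range++; return; }
    uintptr_t idx = (pc - h->base) >> BUCKET_SHIFT;
    if (idx >= NUM_BUCKETS) { h->out_of_range++; return; }
    h->counts[idx]++;
}

/* One entry from the linker map: start address and symbol name. */
struct symbol { uintptr_t start; const char *name; };

/* Offline step: binary search a table sorted by start address for the
   last symbol starting at or below addr. Returns NULL if none does. */
static const char *symbol_for(const struct symbol *syms, size_t n,
                              uintptr_t addr)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (syms[mid].start <= addr) lo = mid + 1;
        else hi = mid;
    }
    return lo == 0 ? NULL : syms[lo - 1].name;
}

/* Sums the samples attributed to one symbol. Each bucket is attributed by
   its start address, so a bucket straddling two symbols is approximate. */
static uint32_t samples_in_symbol(const struct pc_histogram *h,
                                  const struct symbol *syms, size_t n,
                                  const char *name)
{
    uint32_t total = 0;
    for (size_t i = 0; i < NUM_BUCKETS; i++) {
        if (h->counts[i] == 0) continue;
        uintptr_t addr = h->base + ((uintptr_t)i << BUCKET_SHIFT);
        const char *sym = symbol_for(syms, n, addr);
        if (sym != NULL && strcmp(sym, name) == 0)
            total += h->counts[i];
    }
    return total;
}
```

On a RAM-limited target you would shrink `NUM_BUCKETS` or widen the buckets, and probably do the symbol lookup on a host machine after dumping the counts.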
nProf - Free, does that for .NET.
Gets the job done, at least enough to see the 80/20. (20% of the code, taking 80% of the time)
Windows (.NET and Native Exes): AQTime is a great tool for the money. Standalone or as a Visual Studio plugin.
Java: I'm a fan of JProfiler. Again, can run standalone or as an Eclipse (or various other IDEs) plugin.
I believe both have trial versions.
The Google Perftools are extremely useful in this regard.
I use DevPartner with MSVC 6 and XP.
How are any tools going to work if your platform is a proprietary OS? I think you're doing the best you can right now