I recently used Mike Dunlavey's "Poor Man's Profiling" (How can I profile C++ code running on Linux? ) to find some surprising sources of overhead in my code. Yay.
To summarize, run your program in a debugger and manually capture some backtraces. If you find the same stack traces keep showing up, you have a pretty good idea of where the overheads are in your program.
The "identify" step requires eyeballing the stack traces. Humans are pretty good at pattern matching but I'd like to automate this process a bit.
Surely there is a tool that can take N backtraces and show me which backtraces were in which routines?
There is something called Stack Trace Analysis Tool (https://github.com/LLNL/STAT) but it operates on executables itself -- very much in the High Performance Computing domain where you might wonder why/where a parallel (MPI) program is hung. Not quite what I'm looking for: I have already collected the backtraces.
TL;DR at end in bold if you don't want rationale/context (which I'm providing since it's always good to explain the core issue and not simply ask for help with Method X which might not be the best approach)
I frequently do software performance analysis on older hardware, which shows up race errors, single-frame graphical glitches and other issues more readily than more modern silicon.
Often, it would be really cool to be able to take screenshots of a misbehaving application that might render garbage for one or two frames or display erroneous values for a few fractions of a second. Unfortunately, problems most frequently arise when the systems in question are swapping heavily to disk, making it consistently unlikely that the screenshots I try to take will contain the bugs I'm trying to capture.
The obvious solution would be a capture device, and I definitely want to explore pixel-perfect image and video recording in the future when I have the resources for that (it sounds like a hugely fun opportunity to explore FPGAs).
I recently realized, however, that the kernel is what is performing the swapping, and that if I move screenshotting into kernelspace, well, I don't have to wait for my screenshot keystroke to make its way through the X input layer, into the screenshot program, wait for that to do its XSHM dance and get the screenshot data, all while the system is heavily I/O loaded (eg, 5-second system load of >10) - I can simply have the kernel memcpy() the displayed area of video memory to a preallocated buffer at the exact fraction of a second I hit PrtSc!
TL;DR: Where should I start looking to figure out how to "portably" (within the sense of Linux having different graphics drivers, each with different architectural designs) access the currently-displayed area of video memory?
I get the impression I should be looking at libdrm, possibly within KMS, but I would really appreciate some pointers to knowing what actually accesses video memory.
I'm also guessing there are probably some caveats and gotchas to reading video memory directly on certain chipsets? I don't expect my code to make it into the Linux kernel (who knows, but I doubt it) but I'd still like whatever I build to be fairly portable across computers for convenience.
NOTE: I am not using compositing with the systems in question, in case this changes anything. I'm interested to know whether I could write a compositing-compatible system; I suspect this would be nontrivial.
I need to profile a piece of software written in C. Now the problem is that while gprof or my own begin timer/end timer function calls would provide me time spent in each function, I would have no information about which is the most time consuming part within each function. Some may term it as micro-optimization but that is what the need of the hour is!
One of achieving this to "manually" place begin/end timer calls across for loops (there could be more than one of these). In this case, a smarter thing would be to allow enabling/disabling these calls using macros.
But I want to automate this instrumentation?
Can you tell me if a good tool exists to achieve the same? It would be ideal if I could invoke the instrumented program repeatedly from a script and then find the average of time spent in each "section" of the code. For now section is a loosely defined term but that "tool" could have a more specific definition of what a section is.
It would also be helpful if I could somehow learn which tools would be
You might try using Callgrind (one of Valgrind tools) in conjunction with KCachegrind. See also this question.
I have not used it myself, but I have heard that the Valgrind instrumentation framework (http://www.valgrind.org/) has tools that enable the very fine-grained profiling necessary for what you are trying to accomplish.
You want the code to run as fast as possible, right?
gprof is a measuring tool. It can help in evaluating alternative implementations, as the original authors wrote.
They did not say it is effective for locating the code needing an alternative implementation, and it isn't, even though nearly everyone thinks it is.
The fallacy is that measuring locates, but if you want to find an elephant in the room, do you need to measure it to know it's there?
No, you open your eyes.
Here's a way to open your eyes to what your program is doing.
I've used code from CUDA C Best Practices to implement an execution timer. However their is something strange and I don't know if it's an anomaly or if that's normal. I get different read outs each time I run my CUDA app.
Could these readings by related to design or is that something I should expect.
I'm not running any graphic intensive applications on my machine, other than Windows 7.
Well it depends how big the differences are. One thing you can see anomalies caused by is the kernel scheduler. It may just happen that the scheduler is giving some extra timeslices to kernel functions (because graphics API calls have error checking involved) which shows more execution time. If the differences are very large I would say check your code but if it's very low in orders of milliseconds I wouldn't worry about it +- 10msecs is the usual for the timeslicing quantum in most OS's (windows probably included).
Also Aero is kind of intensive so that may be adding to the discrepancies you are seeing.
I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take you a bit of hassle to set up and figure out which options you want to use, but it's worth it.
I hope not everyone is using Rational Purify.
So what do you do when you want to measure:
time taken by a function
peak memory usage
code coverage
At the moment, we do it manually [using log statements with timestamps and another script to parse the log and output to excel. phew...)
What would you recommend? Pointing to tools or any techniques would be appreciated!
EDIT: Sorry, I didn't specify the environment first, Its plain C on a proprietary mobile platform
I've done this a lot. If you have an IDE, or an ICE, there is a technique that takes some manual effort, but works without fail.
Warning: modern programmers hate this, and I'm going to get downvoted. They love their tools. But it really works, and you don't always have the nice tools.
I assume in your case the code is something like DSP or video that runs on a timer and has to be fast. Suppose what you run on each timer tick is subroutine A. Write some test code to run subroutine A in a simple loop, say 1000 times, or long enough to make you wait at least several seconds.
While it's running, randomly halt it with a pause key and sample the call stack (not just the program counter) and record it. (That's the manual part.) Do this some number of times, like 10. Once is not enough.
Now look for commonalities between the stack samples. Look for any instruction or call instruction that appears on at least 2 samples. There will be many of these, but some of them will be in code that you could optimize.
Do so, and you will get a nice speedup, guaranteed. The 1000 iterations will take less time.
The reason you don't need a lot of samples is you're not looking for small things. Like if you see a particular call instruction on 5 out of 10 samples, it is responsible for roughly 50% of the total execution time. More samples would tell you more precisely what the percentage is, if you really want to know. If you're like me, all you want to know is where it is, so you can fix it, and move on to the next one.
Do this until you can't find anything more to optimize, and you will be at or near your top speed.
You probably want different tools for performance profiling and code coverage.
For profiling I prefer Shark on MacOSX. It is free from Apple and very good. If your app is vanilla C you should be able to use it, if you can get hold of a Mac.
For profiling on Windows you can use LTProf. Cheap, but not great:
http://successfulsoftware.net/2007/12/18/optimising-your-application/
(I think Microsoft are really shooting themself in the foot by not providing a decent profiler with the cheaper versions of Visual Studio.)
For coverage I prefer Coverage Validator on Windows:
http://successfulsoftware.net/2008/03/10/coverage-validator/
It updates the coverage in real time.
For complex applications I am a great fan of Intel's Vtune. It is a slightly different mindset to a traditional profiler that instruments the code. It works by sampling the processor to see where instruction pointer is 1,000 times a second. It has the huge advantage of not requiring any changes to your binaries, which as often as not would change the timing of what you are trying to measure.
Unfortunately it is no good for .net or java since there isn't a way for the Vtune to map instruction pointer to symbol like there is with traditional code.
It also allows you to measure all sorts of other processor/hardware centric metrics, like clocks per instruction, cache hits/misses, TLB hits/misses, etc which let you identify why certain sections of code may be taking longer to run than you would expect just by inspecting the code.
If you're doing an 'on the metal' embedded 'C' system (I'm not quite sure what 'mobile' implied in your posting), then you usually have some kind of timer ISR, in which it's fairly easy to sample the code address at which the interrupt occurred (by digging back in the stack or looking at link registers or whatever). Then it's trivial to build a histogram of addresses at some combination of granularity/range-of-interest.
It's usually then not too hard to concoct some combination of code/script/Excel sheets which merges your histogram counts with addresses from your linker symbol/list file to give you profile information.
If you're very RAM limited, it can be a bit of a pain to collect enough data for this to be both simple and useful, but you would need to tell us a more about your platform.
nProf - Free, does that for .NET.
Gets the job done, at least enough to see the 80/20. (20% of the code, taking 80% of the time)
Windows (.NET and Native Exes): AQTime is a great tool for the money. Standalone or as a Visual Studio plugin.
Java: I'm a fan of JProfiler. Again, can run standalone or as an Eclipse (or various other IDEs) plugin.
I believe both have trial versions.
The Google Perftools are extremely useful in this regard.
I use devpartner with MSVC 6 and XP
How are any tools going to work if your platform is a proprietary OS? I think you're doing the best you can right now