microsecond profiler for C code - c

Does any body know of C code profiler like gprof which gives function call times in microseconds instead of milliseconds?

Take a look at Linux perf. You will need a pretty recent kernel though.

Let me just suggest how I would handle this, assuming you have the source code.
Knowing how long a function takes inclusively per invocation (including I/O), on average, multiplied by the number of invocations, divided by the total running time, would give you the fraction of time under the control of that function. That fraction is how you know if the function is a sufficient time-taker to bother optimizing. That is not easy information to get from gprof.
Another way to learn what fraction of inclusive time is spent under the control of each function is timed or random sampling of the call stack. If a function appears on a fraction X of the samples (even if it appears more than once in a sample), then X is the time-fraction it takes (within a margin of error). What's more, this gives you per-line fraction of time, not just per-function.
That fraction X is the most valuable information you can get, because that is the total amount of time you could potentially save by optimizing that function or line of code.
The Zoom profiler is a good tool for getting this information.
What I would do is wrap a long-running loop around the top-level code, so that it executes repeatedly, long enough to take at least several seconds. Then I would manually sample the stack by interrupting or pausing it at random. It actually takes very few samples, like 10 or 20, to get a really clear picture of the most time-consuming functions and/or lines of code.
Here's an example.
P.S. If you're worried about statistical accuracy, let me get quantitative. If a function or line of code is on the stack exactly 50% of the time, and you take 10 samples, then the number of samples that show it will be 5 +/- 1.6, for a margin of error of 16%. If the actual time is smaller or larger, the margin of error shrinks. You can also reduce the margin of error by taking more samples. To get 1.6%, take 1000 samples. Actually, once you've found the problem, it's up to you to decide if you need a smaller margin of error.

gprof gives results either in milliseconds or in microseconds. I do not know the exact rationale, but my experience is that it will display results in microseconds when it thinks that there is enough precision for it. To get microsecond output, you need to run the program for longer time and/or do not have any routine that takes too much time to run.

oprofile gets you times in clock resolution, i.e. nanoseconds, it produces output files compatible with gprof so very convenient to use.
http://oprofile.sourceforge.net/news/

Related

gprof is showing 0 computational time [duplicate]

I am trying to profile a c++ function using gprof, I am intrested in the %time taken. I did more than one run and for some reason I got a large difference in the results. I don't know what is causing this, I am assuming the sampling rate or I read in other posts that I/O has something to do with it. So is there a way to make it more accurate and generate somehow almost constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler but I want it to generate results in a similar format to grof as function time% function name, I tried Valgrind but it gave me a massive file in size. So maybe I am generating the file with the wrong command.
Waiting for your input
Regards
I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
All of this technology, even if it weren't problematic, begs the question - What is it's purpose? Usually, the purpose is "find bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called a lot of times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And, it completely ignores I/O, as if all I/O that is done is unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.
gprof is not very accurate, particularly for small functions, see http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC11
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer form statistical inaccuracies - if the error is too large then you need to sample for longer and/or with a smaller sampling interval.

Profile C application with mixed CUDA

I have a C program with a major function that takes about 70% of total runtime. I used gprof to profile the application. After that, I rewrote that particular function in CUDA to boost the runtime of the whole application. It's currently giving results correctly but I want to know about the performance.
Is there anyway (or tool) I can use to profile this new application with the runtime of the new kernel as percentage of runtime with respect to the whole new application? I want to see the data relating all other remaining C functions as well. I tried using nvprof but it only outputs the runtimes of the CUDA kernels.
Thanks,
You can use the NVIDIA profiling tools to give you this information.
Running the command line tool nvprof <app> will give you the percentage and you can use additional command line options to optimise your kernel further. The visual profiler (nvvp) will show you the timeline and also the percentage time spent in the kernels, and it will also give you guidance on how to further improve the performance (including linking back to the documentation to explain concepts).
See the documentation for more info.
ADDENDUM
In your comment you say that you want to see the profile of the C functions as well. One way to do that would be to use nvtx to annotate your code, see this blog post for a way to automate that task. Alternatively you could profile in nvprof or nvvp to see the overall timeline and profile in gprof to see time spent in non-GPU code.
Well, you might know that I'm partial to this technique.
It will tell you approximate percentages spent by functions, lines of code, anything you can identify.
I assume your main program at some point has to wait for the CUDA kernels to finish processing, so the fraction of samples ending in that wait gives you an estimate of the time spent in CUDA.
Samples not ending in that wait, but doing other things, indicate the time spent doing those other things.
The statistics are pretty simple. If a line of code or function is on the stack for fraction F of time, then it is responsible for that fraction of time. So if you take N samples, the number of samples showing the line of code or function are, on average, NF. The standard deviation is sqrt(NF(1-F)).
So if F is 50% or 0.5, and you take 20 random stack samples, you can expect to see the code on 10 of them, with a standard deviation of sqrt(20*0.5*0.5) = 2.24 samples, or somewhere between 7 and 13 samples, most likely between 9 and 11.
In other words, you get a very rough measurement of the code's cost, but you know precisely what code has that cost.
If you're concerned about speed, you mainly care about the things that have a big cost, so it's good at pointing those out to you.
If you want to know why gprof doesn't tell you those things, there's a lot more to say about that.

How to figure performance of small expressions?

I want to figure performance of small expressions,to I decide what to use. Consider the below code. Several recursivee calls to it may happen.
void foo(void) {
i++;
if(etc(ch)) {
//..
}
else if(ch == TOKX) {
p=1;
baa();
c=0;
p=0;
}
//more ifs
}
Question:
Recursives calls may happen to foo(),and i should be incremented only if p has non-zero value(it means that it will be used in other part of code) Should I put a if(p) i++; or only leave i++;?
Is to answer(myself) questions like this that I'm looking for some tool. Someonee can believe that it's "loss of time" or say "optmization is root of evil".. but for cases like this,I don't believe that it is applicable to my situation. IMHO. Tell us your opinion if you think otherwise.
A "ideal" tool,could to show how long time each expression take to run.
It make me to think how is software debugging in biggest software's companies like IBM,Microsoft,Sun etc. Maybe it's theme to another thread.. more useful that this here,I think.
Platform: Should be Linux and MS-Windows.
The old adage is something like "don't optimize until you're sure, absolutely positive, you need to".. and there are reasons for that.
That said, here are a few thoughts:
avoid recursion if you can
at a macro level, something like the "time" command in linux can tell you how long your app is running. Put the method in a loop that runs 10k times and measure that, to average out the numbers
if you want to measure time spent in individual functions, profiling is what you want. Visual Studio has some good built-in stuff for this in Windows, but there are many, many options.
http://en.wikipedia.org/wiki/List_of_performance_analysis_tools
First please understand which measurements matter: there's total wall-clock time taken by the program, and there's percent of time each statement is active, where "active" means "on the stack".
Total wall-clock time is easily measured by subtracting system time after from system time before. If it is very short, just loop the code 1000 times, or whatever. You don't need many digits of precision.
Percent of time each statement is active is best measured by means of stack samples taken on wall-clock time (not CPU-only time). Any good profiler based on wall-clock stack sampling will work, such as Zoom or maybe Oprofile. It's not just the taking of samples that's important, but what is presented to you. It is best if it tells you "inclusive percent by line of code", which is simply the percent of stack samples containing the line of code. Again, you don't need many digits of precision, which means you don't need an enormous number of samples.
The reason inclusive percent by line of code is important, as opposed to other measurements (like self-time, function measurements, invocation counts, milliseconds, and so on) is that it represents the fraction of total wall clock time that line is responsible for, and would not be spent if it were not there.
If you could get rid of it, that tells you how much time it would save.

Is there a better way to benchmark a C program than timing?

I'm coding a little program that has to sort a large array (up to 4 million text strings). Seems like I'm doing quite well at it, since a combination of radixsort and mergesort already cut the original q(uick)sort execution time in less than half.
Execution time being the main point, since this is what I'm using to benchmark my piece of code.
My question is:
Is there a better (i. e. more reliable) way of benchmarking a program than just time the execution? It kinda works, but the same program (with the same background processes running) usually has slightly different execution times if run twice.
This kinda defeats the purpose of detecting small improvements. And several small improvements could add up to a big one...
Thanks in advance for any input!
Results:
I managed to get gprof to work under Windows (using gcc and MinGW). gcc behaves poorly (considering execution time) compared to my normal compiler (tcc), but it gave me quite some insight.
Try a profiling tool, that will also show you where the program is spending its time. gprof is the classic C profiling tool, at least on Unix.
Look at the time command. It tracks both the CPU time a process uses and the wall-clock time. You can also use something like gprof for profiling your code to find the parts of your program that are actually taking the most time. You could do a lower-tech version of profiling with timers in your code. Boost has a nice timer class, but it's easy to roll your own.
I don't think it's sufficient to just measure how long a piece of code takes to execute. Your environment is a constantly changing thing, so you have to take a statistical approach to measuring execution time.
Essentially you need to take N measurements, discard outliers, and calculate your average, median and standard deviation running time, with an uncertainty measurement.
Here's a good blog explaining why and how to do this (with code): http://blogs.perl.org/users/steffen_mueller/2010/09/your-benchmarks-suck.html
What do you use for timing execution time so far? There's C89 clock() in time.h for starters. On unixoid systems you might find getitimer() for ITIMER_VIRTUAL to measure process CPU time. See the respective manual pages for details.
You can also use a POSIX shell's times utility to benchmark the processor time used by a process and its children. The resolution is system dependent, like just anything about profiling. Try to wrap your C code in a loop, executing it as many times as necessary to reduce the "jitter" in the time the benchmarking reports.
Call your routine from a test harness, whereby it executes N + 1 times. Ignore the timing for the first iteration and then take the average of iterations 1..N. The reason for ignoring the first time is that is is often slightly inflated due to various effects, e.g. virtual memory, code being paged in, etc. The reason for averaging N iterations is that you get rid of artefacts caused by other processes, the scheduler, etc.
If you're running on Linux or similar You might also want to use taskset to pin your code to a specific CPU core (assuming it's single-threaded), ideally not core 0, since this tends to handle all interrupts.

how to count cycles?

I'm trying to find the find the relative merits of 2 small functions in C. One that adds by loop, one that adds by explicit variables. The functions are irrelevant themselves, but I'd like someone to teach me how to count cycles so as to compare the algorithms. So f1 will take 10 cycles, while f2 will take 8. That's the kind of reasoning I would like to do. No performance measurements (e.g. gprof experiments) at this point, just good old instruction counting.
Is there a good way to do this? Are there tools? Documentation? I'm writing C, compiling with gcc on an x86 architecture.
http://icl.cs.utk.edu/papi/
PAPI_get_real_cyc(3) - return the total number of cycles since some arbitrary starting point
Assembler instruction rdtsc (Read Time-Stamp Counter) retun in EDX:EAX registers the current CPU ticks count, started at CPU reset. If your CPU runing at 3GHz then one tick is 1/3GHz.
EDIT:
Under MS windows the API call QueryPerformanceFrequency return the number of ticks per second.
Unfortunately timing the code is as error prone as visually counting instructions and clock cycles. Be it a debugger or other tool or re-compiling the code with a re-run 10000000 times and time it kind of thing, you change where things land in the cache line, the frequency of the cache hits and misses, etc. You can mitigate some of this by adding or removing some code upstream from the module of code being tested, (to cause a few instructions added and removed changing the alignment of your program and sometimes of your data).
With experience you can develop an eye for performance by looking at the disassembly (as well as the high level code). There is no substitute for timing the code, problem is timing the code is error prone. The experience comes from many experiements and trying to fully understand why adding or removing one instruction made no or dramatic differences. Why code added or removed in a completely different unrelated area of the module under test made huge performance differences on the module under test.
As GJ has written in another answer I also recommend using the "rdtsc" instruction (rather than calling some operating system function which looks right).
I've written quite a few answers on this topic. Rdtsc allows you to calculate the elapsed clock cycles in the code's "natural" execution environment rather than having to resort to calling it ten million times which may not be feasible as not all functions are black boxes.
If you want to calculate elapsed time you might want to shut off energy-saving on the CPUs. If it's only a matter of clock cycles this is not necessary.
If you are trying to compare the performance, the easiest way is to put your algorithm in a loop and run it 1000 or 1000000 times.
Once you are running it enough times that the small differences can be seen, run time ./my_program which will give you the amount of processor time that it used.
Do this a few times to get a sampling and compare the results.
Trying to count instructions won't help you on x86 architecture. This is because different instructions can take significantly different amounts of time to execute.
I would recommend using simulators. Take a look at PTLsim it will give you the number of cycles, other than that maybe you would like to take a look at some tools to count the number of times each assembly line is executing.
Use gcc -S your_program.c. -S tells gcc to generate the assembly listing, that will be named your_program.s.
There are plenty of high performance clocks around. QueryPerformanceCounter is microsofts. The general trick is to run the function 10s of thousands of time and time how long it takes. Then divide the time taken by the number of loops. You'll find that each loop takes a slightly different length of time so this testing over multiple passes is the only way to truly find out how long it takes.
This is not really a trivial question. Let me try to explain:
There are several tools on different OS to do exactly what you want, but those tools are usually part of a bigger environment. Every instruction is translated into a certain number of cycles, depending on the CPU the compiler ran on, and the CPU the program was executed.
I can't give you a definitive answer, because I do not have enough data to pass my judgement on, but I work for IBM in the database area and we use tools to measure cycles and instructures for our code and those traces are only valid for the actual CPU the program was compiled and was running on.
Depending on the internal structure of your CPU's piplining and on the effeciency of your compiler, the resulting code will most likely still have cache misses and other areas you have to worry about. (In that case you may want to look into FDPR...)
If you want to know how many cycles your program needs to run on your CPU (which was compiled with your compiler), you have to understand how the CPU works and how the compiler generarated the code.
I'm sorry, if the answer was not sufficient enough to solve your problem at hand. You said you are using gcc on an x86 arch. I would work with getting the assembly code mapped to your CPU.
I'm sure you will find some areas, where gcc could have done a better job...

Resources