I have a C program with a major function that takes about 70% of total runtime. I used gprof to profile the application. After that, I rewrote that particular function in CUDA to speed up the whole application. It currently gives correct results, but I want to know about the performance.
Is there any way (or tool) to profile this new application and get the runtime of the new kernel as a percentage of the runtime of the whole new application? I want to see the data for all the remaining C functions as well. I tried using nvprof, but it only outputs the runtimes of the CUDA kernels.

You can use the NVIDIA profiling tools to give you this information.
Running the command line tool nvprof <app> will give you the percentage and you can use additional command line options to optimise your kernel further. The visual profiler (nvvp) will show you the timeline and also the percentage time spent in the kernels, and it will also give you guidance on how to further improve the performance (including linking back to the documentation to explain concepts).
See the documentation for more info.
In your comment you say that you want to see the profile of the C functions as well. One way to do that would be to use nvtx to annotate your code; see this blog post for a way to automate that task. Alternatively, you could profile in nvprof or nvvp to see the overall timeline and profile in gprof to see the time spent in non-GPU code.

Well, you might know that I'm partial to this technique.
It will tell you approximate percentages spent by functions, lines of code, anything you can identify.
I assume your main program at some point has to wait for the CUDA kernels to finish processing, so the fraction of samples ending in that wait gives you an estimate of the time spent in CUDA.
Samples not ending in that wait, but doing other things, indicate the time spent doing those other things.
The statistics are pretty simple. If a line of code or function is on the stack for fraction F of time, then it is responsible for that fraction of time. So if you take N samples, the number of samples showing the line of code or function is, on average, NF. The standard deviation is sqrt(NF(1-F)).
So if F is 50% or 0.5, and you take 20 random stack samples, you can expect to see the code on 10 of them, with a standard deviation of sqrt(20*0.5*0.5) = 2.24 samples, or somewhere between 7 and 13 samples, most likely between 9 and 11.
In other words, you get a very rough measurement of the code's cost, but you know precisely what code has that cost.
If you're concerned about speed, you mainly care about the things that have a big cost, so it's good at pointing those out to you.
If you want to know why gprof doesn't tell you those things, there's a lot more to say about that.


I am trying to profile a C++ function using gprof, and I am interested in the %time taken. I did more than one run and, for some reason, got a large difference in the results. I don't know what is causing this; I assume it is the sampling rate, or, as I read in other posts, that I/O has something to do with it. Is there a way to make it more accurate and generate almost constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler, but I want it to generate results in a format similar to gprof's, as function time% and function name. I tried Valgrind, but it gave me a massive output file, so maybe I am generating the file with the wrong command.
I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
All of this technology, even if it weren't problematic, raises the question: what is its purpose? Usually, the purpose is "find bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called a lot of times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And it completely ignores I/O, as if all I/O that is done is unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.
gprof is not very accurate, particularly for small functions, see
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer from statistical inaccuracies. If the error is too large, you need to sample for longer and/or with a smaller sampling interval.

Should I also measure clCreateContext() when profiling OpenCL code?

Recently I have been writing OpenCL code that processes images.
Now that the code is complete, I need to benchmark the OpenCL code against native C (or C++) code that does the same job.
My question arises from that. Specifically, which steps should I include in the time measurement?
The majority of books and questions on StackOverflow measure only the execution time of clEnqueueNDRangeKernel(), using clGetEventProfilingInfo() and clWaitForEvents().
My senior said I need to include the buffer copies (C memory to cl_mem), since native C code doesn't have such steps.
Then should I also include the program creation and kernel build step, the argument setting step, the step that reads the *.cl source file, and (the one I am most curious about) the clCreateContext() step?
According to [this paper], clCreateContext() takes the largest share of time compared with the other steps, as shown below.
The Android OpenCL code example from SONY also measures only the elapsed time of clEnqueueNDRangeKernel(). Check here ->
If the above is right, should I measure only the part of the native C code that does the same job as the OpenCL kernel?
Or are there various perspectives on profiling and comparing OpenCL and native C code?
PLUS: My program is going to handle a continuous stream of images (like video), so there will be frequent memory copies between the GPU and other memory. So I should also time the memory copies in both the OpenCL code and the native C code, right?
I mean, that obviously depends on what you need to measure.
Generally, if you care about the total run time of your program, measure the total runtime, including context creation.
In reality, you usually don't use OpenCL for workloads that, over the whole lifetime of a program, take less time than the context creation. If that is the case, I'd be sure to check whether using OpenCL makes sense at all. OpenCL is a single instruction, much much much data architecture. Hence, I think you might be constructing testbenches with simply too little work to be done to ever get statistically sufficient data.
For example, the timers you use to measure how long something takes to execute have some granularity, typically in the multiples of microseconds. If your workload takes less than, let's say, 500 µs, then what you're measuring is practically unusable as a benchmark. This is a common problem for the performance comparison of many things!

Is there a better way to benchmark a C program than timing?

I'm coding a little program that has to sort a large array (up to 4 million text strings). It seems I'm doing quite well at it, since a combination of radixsort and mergesort has already cut the original q(uick)sort execution time to less than half.
Execution time being the main point, since this is what I'm using to benchmark my piece of code.
My question is:
Is there a better (i.e. more reliable) way of benchmarking a program than just timing the execution? It kinda works, but the same program (with the same background processes running) usually has slightly different execution times if run twice.
This kinda defeats the purpose of detecting small improvements. And several small improvements could add up to a big one...
Thanks in advance for any input!
I managed to get gprof to work under Windows (using gcc and MinGW). gcc behaves poorly (considering execution time) compared to my normal compiler (tcc), but it gave me quite some insight.
Try a profiling tool; it will also show you where the program is spending its time. gprof is the classic C profiling tool, at least on Unix.
Look at the time command. It tracks both the CPU time a process uses and the wall-clock time. You can also use something like gprof for profiling your code to find the parts of your program that are actually taking the most time. You could do a lower-tech version of profiling with timers in your code. Boost has a nice timer class, but it's easy to roll your own.
I don't think it's sufficient to just measure how long a piece of code takes to execute. Your environment is a constantly changing thing, so you have to take a statistical approach to measuring execution time.
Essentially you need to take N measurements, discard outliers, and calculate your average, median and standard deviation running time, with an uncertainty measurement.
Here's a good blog explaining why and how to do this (with code):
What do you use for timing execution time so far? There's C89 clock() in time.h for starters. On unixoid systems you might find getitimer() for ITIMER_VIRTUAL to measure process CPU time. See the respective manual pages for details.
You can also use a POSIX shell's times utility to benchmark the processor time used by a process and its children. The resolution is system dependent, like just about anything in profiling. Try to wrap your C code in a loop, executing it as many times as necessary to reduce the "jitter" in the times the benchmark reports.
Call your routine from a test harness, whereby it executes N + 1 times. Ignore the timing for the first iteration and then take the average of iterations 1..N. The reason for ignoring the first time is that it is often slightly inflated due to various effects, e.g. virtual memory, code being paged in, etc. The reason for averaging N iterations is that you get rid of artefacts caused by other processes, the scheduler, etc.
If you're running on Linux or similar, you might also want to use taskset to pin your code to a specific CPU core (assuming it's single-threaded), ideally not core 0, since this tends to handle all interrupts.

Profiling C code on Windows when using Eclipse

I know I can profile my code with gprof and kprof on Linux. Is there a comparable alternative to these applications on Windows?
Commercial software:
Rational Quantify (expensive, slow, but very detailed)
AQTime (less expensive, less slow, a bit detailed)
Free software:
Very sleepy (
Luke StackWalker (
These commercial alternatives change the compiled code by 'instrumenting' it (adding instructions) and perform the timing within the added instructions. This means that they slow your application down seriously.
These free alternatives use sampling, meaning they are less detailed, but very fast. In practice I found that especially Very Sleepy is very good to have a quick look at performance problems in your application.
There's a MinGW port of gprof that works just about the same as the Linux variant. You can either get a full MinGW installation (I think gprof is included but not sure) or get gprof from the MinGW binutils package.
For Eclipse, there's TPTP but it doesn't support profiling C/C++ as far as I know.
Yes, you can profile code with Visual Studio
What's the reason for profiling? Do you want to a) measure times and get a call graph, or b) find things to change to make the code faster? (These are not the same.)
If (b) you can use this trick, using the Pause button in Eclipse.
Added: Maybe it would help to convey some experience of what performance problems are actually like, and where you can expect to find them. Here are some simple examples:
An insertion sort (order n^2) where the items being sorted are strings, and are compared by a string-compare function. Where is the hot-spot? In string-compare. Where is the problem? In the sort where string-compare is called. If n=10 it's not a problem, but if n=1000, suddenly it takes a long time. The point where string-compare is called is "cold", but that's where the problem is. A small number of samples of the call stack pinpoint it with certainty.
An app that loads plugins takes a long time to start up. A profiler says basically everything in it is "cold". Something that measures I/O time says it is almost all I/O time, which seems like what you might expect, so it might seem hopeless. But stack samples show a large percentage of time is spent with the stack about 20 layers deep in the process of reading the resource part of plugin dlls for the purpose of translating string constants into the local language. Investigating further, you find that most of the strings being translated are not the kind that actually need translation. They were just put in "in case" they might need translation, and were never thought to be something that could cause a performance problem. Fixing that issue brings a hefty time saving.
So it is common to think in terms of "hotspots" and "bottlenecks", but most programs, especially the larger ones, tend to have performance problems in the form of function calls requesting work that doesn't really need to be done. Fortunately they display themselves on the call stack during the time that they are spending.

performance test of functions

linux gcc 4.4.1 C99
I am wondering what is the best way to test the performance of a C program.
I have some functions that I have implemented. However, I could have used a different design for each function.
Basically, I want to test which design gives better performance.
Take a look at this post on code profilers.
I want to test to see which design gives better performance.
Why does it matter? This is not a flip question! You should have a performance target in mind, and if you meet it, your code is fast enough.
How do you know how fast is "fast enough"? It turns out the user-interface people have good data on the effect of response time on your users' experience:
0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result. (Most people have a reaction time of about 0.1 seconds; jet fighter pilots get down to around 0.08s, i.e., 80ms.)
1 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of directly "driving" your application.
10 seconds is about the limit for keeping the user's attention focused on the app. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is hard to predict or varies a lot.
The quantitative results above apply only to interaction, of course, which is measured in seconds of waiting time. But even if your target is network packets sent, pages of RAM allocated, blocks of disk read/written, or just watts of power consumed, the message I am trying to communicate is that you should have a performance target, that target should be quantified, and the target should be connected to the needs of your users. If you don't have a quantifiable target, you're not doing engineering; you're just whistling in the dark. Unless your goal is to educate yourself (or to satisfy idle curiosity), the question you should be asking is "is my code good enough that I can move on?"
If you're not meeting your performance target, or if you are trying to educate yourself, I think the best combination of readable and detailed information comes from using the valgrind profiler (--tool=callgrind --dump-instr=yes) together with the kcachegrind visualizer.
Mostly you would like to use a profiler. The post pointed by Fragsworth is a good start. Personally, I prefer Shark for Mac OS X, and gprof for Linux.
But in your case, you may also call clock() or getrusage(), for example, in this way:
/* needs <stdio.h> and <time.h>; my_func() is the function under test */
int i;
clock_t t = clock();
for (i = 0; i < 1000; ++i) my_func();
printf("time = %f\n", (double)(clock() - t) / CLOCKS_PER_SEC);
A profiler is useful when you want to dig out which part of the code takes the most time. Calling clock()/getrusage() is more convenient (to me) when you want to compare/benchmark different implementations.
You can use gprof, which is a free profiler.
The first thing to find out is whether you need to optimize those functions. Unless they are in the critical path for your code, they may be more than fast enough.
If you have profiled your application and found they are slow, one good way to test performance is to call the function some large number of times and find the average time it takes to run.
You should also try to use CPU-time instead of wallclock-time as that is a more accurate gauge.
In addition to profiling, you need to run the code under test from a harness (driver) to average out the readings. That way your comparisons are not skewed by one-off readings, and you have a large sample population with a mean and standard deviation to compare. There are many multi-threaded frameworks that can do the load driving for you.
