How to measure the performance of small expressions? - C

I want to measure the performance of small expressions so I can decide what to use. Consider the code below; several recursive calls to it may happen.
void foo(void) {
    i++;
    if (etc(ch)) {
        // ..
    }
    else if (ch == TOKX) {
        p = 1;
        baa();
        c = 0;
        p = 0;
    }
    // more ifs
}
Question:
Recursive calls to foo() may happen, and i should be incremented only if p has a non-zero value (meaning it will be used in another part of the code). Should I put if(p) i++; or just leave i++;?
It is to answer questions like this for myself that I'm looking for some tool. Some may think it's a "waste of time" or say "optimization is the root of evil"... but for cases like this, I don't believe that applies to my situation, IMHO. Tell us your opinion if you think otherwise.
A "ideal" tool,could to show how long time each expression take to run.
It makes me wonder how software debugging is done at the biggest software companies like IBM, Microsoft, Sun, etc. Maybe that's a theme for another thread... more useful than this one, I think.
Platform: Should be Linux and MS-Windows.

The old adage is something like "don't optimize until you're sure, absolutely positive, you need to".. and there are reasons for that.
That said, here are a few thoughts:
Avoid recursion if you can.
At a macro level, something like the "time" command on Linux can tell you how long your app is running. Put the method in a loop that runs 10k times and measure that, to average out the numbers (see the sketch just below).
If you want to measure time spent in individual functions, profiling is what you want. Visual Studio has some good built-in support for this on Windows, but there are many, many options.
http://en.wikipedia.org/wiki/List_of_performance_analysis_tools
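As a rough sketch of the loop-and-measure idea above (foo(), sink, and the 10,000 count are placeholders, not anything from the original question), you can wrap the code under test in a loop and time the whole binary:

#include <stdio.h>

static volatile int sink;            /* volatile so the compiler can't optimize the work away */

static void foo(void)                /* stand-in for the code you actually want to measure */
{
    sink += 1;
}

int main(void)
{
    for (int i = 0; i < 10000; ++i)  /* enough iterations to get a measurable total */
        foo();
    printf("done (%d)\n", sink);
    return 0;
}

Running it as time ./a.out and dividing the reported time by 10,000 gives a rough per-call average.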

First please understand which measurements matter: there's total wall-clock time taken by the program, and there's percent of time each statement is active, where "active" means "on the stack".
Total wall-clock time is easily measured by subtracting system time after from system time before. If it is very short, just loop the code 1000 times, or whatever. You don't need many digits of precision.
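A minimal sketch of that subtraction, using gettimeofday for wall-clock time (work_under_test and the loop count are placeholders):

#include <stdio.h>
#include <sys/time.h>

static void work_under_test(void)    /* placeholder for the code being timed */
{
    /* ... */
}

int main(void)
{
    struct timeval before, after;
    gettimeofday(&before, NULL);

    for (int i = 0; i < 1000; ++i)   /* loop if a single run is too short to measure */
        work_under_test();

    gettimeofday(&after, NULL);
    double elapsed = (after.tv_sec - before.tv_sec)
                   + (after.tv_usec - before.tv_usec) / 1e6;
    printf("wall-clock time: %f s\n", elapsed);
    return 0;
}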
Percent of time each statement is active is best measured by means of stack samples taken on wall-clock time (not CPU-only time). Any good profiler based on wall-clock stack sampling will work, such as Zoom or maybe Oprofile. It's not just the taking of samples that's important, but what is presented to you. It is best if it tells you "inclusive percent by line of code", which is simply the percent of stack samples containing the line of code. Again, you don't need many digits of precision, which means you don't need an enormous number of samples.
The reason inclusive percent by line of code is important, as opposed to other measurements (like self-time, function measurements, invocation counts, milliseconds, and so on) is that it represents the fraction of total wall clock time that line is responsible for, and would not be spent if it were not there.
If you could get rid of it, that tells you how much time it would save.

Related

gprof is showing 0 computational time [duplicate]

I am trying to profile a C++ function using gprof; I am interested in the %time taken. I did more than one run and for some reason got a large difference in the results. I don't know what is causing this; I assume it is the sampling rate, or I have read in other posts that I/O has something to do with it. So is there a way to make it more accurate and generate almost constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler, but I want it to generate results in a format similar to gprof's (function time%, function name). I tried Valgrind, but it gave me a massive file, so maybe I am generating it with the wrong command.
Waiting for your input
Regards
I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
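For a concrete example of that arithmetic (with entirely made-up numbers): suppose routine B is a leaf that collected 200 PC samples at 10 ms per sample, so gprof assigns it 2.0 seconds of self time, and the call-count table shows B was called 100 times in total, 40 of them from A. gprof then assumes every call to B costs 2.0 s / 100 = 20 ms and charges A 40 × 20 ms = 0.8 s for its calls to B, on top of A's own self time, even if A's particular calls to B happened to be the cheap ones.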
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
All of this technology, even if it weren't problematic, begs the question: what is its purpose? Usually, the purpose is "find bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called a lot of times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And it completely ignores I/O, as if all the I/O that is done were unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.
gprof is not very accurate, particularly for small functions, see http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC11
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer from statistical inaccuracies - if the error is too large then you need to sample for longer and/or with a smaller sampling interval.

gettimeofday/settimeofday for Making a Function Appear to Take No Time

I've got an auxiliary function that does some operations that are pretty costly.
I'm trying to profile the main section of the algorithm, but this auxiliary function gets called a lot within it. Consequently, the measured time includes the auxiliary function's time.
To solve this, I decided to set and restore the time so that the auxiliary function appears to be instantaneous. I defined the following macros:
#include <sys/time.h>   /* for struct timeval, gettimeofday, settimeofday */
#define TIME_SAVE    struct timeval _time_tv; gettimeofday(&_time_tv, NULL);
#define TIME_RESTORE settimeofday(&_time_tv, NULL);
. . . and used them as the first and last lines of the auxiliary function. For some reason, though, the auxiliary function's overhead is still included!
So, I know this is kind of a messy solution, and so I have since moved on, but I'm still curious as to why this idea didn't work.
Can someone please explain why?
If you insist on profiling this way, do not set the system clock. This will break all sorts of things, if you have permission to do it. Basically you should forget you ever heard of settimeofday. What you want to do is call gettimeofday both before and after the function you want to exclude from measurement, and compute the difference. You can then exclude the time spent in this function from the overall time.
With that said, this whole method of "profiling" is highly flawed, because gettimeofday probably (1) takes a significant amount of time compared to what you're trying to measure, and (2) probably involves a transition into kernelspace, which will do some serious damage to your program's cache coherency. This second problem, whereby in attempting to observe your program's performance characteristics you actually change them, is the most problematic.
What you really should do is forget about this kind of profiling (gettimeofday or even gcc's -pg/gmon profiling) and instead use oprofile or perf or something similar. These modern profiling techniques work based on statistically sampling the instruction pointer and stack information periodically; your program's own code is not modified at all, so it behaves as closely as possible to how it would behave with no profiler running.
There are a couple possibilities that may be occurring. One is that Linux tries to keep the clock accurate and adjustments to the clock may be 'smoothed' or otherwise 'fixed up' to try to keep a smooth sense of time within the system. If you are running NTP, it will also try to maintain a reasonable sense of time.
My approach would have been to not modify the clock but instead track time consumed by each portion of the process. The calls to the expensive part would be accumulated (by getting the difference between gettimeofday on entry and exit, and accumulating) and subtracting that from overall time. There are other possibilities for fancier approaches, I'm sure.
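A rough sketch of that accumulate-and-subtract idea, with placeholder names (now_seconds, auxiliary) that are not from the original question:

#include <stdio.h>
#include <sys/time.h>

static double aux_total;                       /* time accumulated inside the auxiliary function */

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static void auxiliary(void)                    /* the expensive helper you want to exclude */
{
    double t0 = now_seconds();
    /* ... costly work ... */
    aux_total += now_seconds() - t0;
}

int main(void)
{
    double start = now_seconds();
    for (int i = 0; i < 1000; ++i)
        auxiliary();                           /* stands in for the main algorithm calling it */
    double total = now_seconds() - start;
    printf("total %f s, excluding auxiliary: %f s\n", total, total - aux_total);
    return 0;
}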

Can gdb or another tool be used to detect parts of a complex program (e.g. loops) that take more time than expected, for targeting optimization?

As the title implies, basically: say we have a complex program and we want to make it faster however we can. Can we somehow detect which loops or other parts of its structure take most of the time, so that we can target them for optimization?
Edit: Note that the software is assumed to be very complex, and we can't check each loop or other structure one by one, putting timers in them, etc.
You're looking for a profiler. There are several around; since you mention gcc you might want to check gprof (part of binutils). There's also Google Perf Tools although I have never used them.
You can use GDB for that, by this method.
Here's a blow-by-blow example of using it to optimize a realistically complex program.
You may find "hotspots" that you can optimize, but more generally the things that give you the greatest opportunity for saving time are mid-level function calls that you can avoid.
One example is, say, calling a function to extract information from a database, where the function is being called multiple times, when with some extra coding the result from a prior call could be used.
Often such calls are small and innocent-looking, and you're totally surprised to learn how much they're costing, as an overall percent of time.
Another example is doing some low-level I/O that escapes attention, but actually costs a hefty percent of clock time.
Another example is tidal waves of notifications that propagate from seemingly trivial changes to data.
Another good tool for finding these problems is Zoom.
Here's a discussion of the technical issues, but basically what to look for is:
It should tell you inclusive percent of time, at line-level resolution, not just functions.
a) Only knowing that a function is costly still leaves you wondering where the lines are in it that you should look at.
b) Inclusive percent tells the true cost of the line - how much bottom-line time it is responsible for and would not be spent if it were not there.
It should include both I/O (i.e. blocked) time and CPU time, not just CPU time. A tool that only considers CPU time will not see the first two problems mentioned above.
If your program is interactive, the tool should operate only during the time you care about, and not while waiting for user input. You don't want to include head-scratching time in your program's performance statistics.
gprof breaks it down by function. If you have many different loops in one function, it might not tell you which loop is taking the time. This is a clue to refactor ;-)
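One way to act on that hint, sketched here with hypothetical names (process, pass_one, pass_two): pull each loop out into its own function so gprof's per-function breakdown separates them.

/* Instead of one big process() containing two loops, give each loop its own
 * function so gprof reports them separately. Note that at high optimization
 * levels the compiler may inline these back, hiding the split again. */
static void pass_one(int *data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] *= 2;                 /* first loop's work */
}

static void pass_two(int *data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] += data[n - 1 - i];   /* second loop's work */
}

void process(int *data, int n)
{
    pass_one(data, n);
    pass_two(data, n);
}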

microsecond profiler for C code

Does anybody know of a C code profiler, like gprof, which gives function call times in microseconds instead of milliseconds?
Take a look at Linux perf. You will need a pretty recent kernel though.
Let me just suggest how I would handle this, assuming you have the source code.
Knowing how long a function takes inclusively per invocation (including I/O), on average, multiplied by the number of invocations, divided by the total running time, would give you the fraction of time under the control of that function. That fraction is how you know if the function is a sufficient time-taker to bother optimizing. That is not easy information to get from gprof.
Another way to learn what fraction of inclusive time is spent under the control of each function is timed or random sampling of the call stack. If a function appears on a fraction X of the samples (even if it appears more than once in a sample), then X is the time-fraction it takes (within a margin of error). What's more, this gives you per-line fraction of time, not just per-function.
That fraction X is the most valuable information you can get, because that is the total amount of time you could potentially save by optimizing that function or line of code.
The Zoom profiler is a good tool for getting this information.
What I would do is wrap a long-running loop around the top-level code, so that it executes repeatedly, long enough to take at least several seconds. Then I would manually sample the stack by interrupting or pausing it at random. It actually takes very few samples, like 10 or 20, to get a really clear picture of the most time-consuming functions and/or lines of code.
Here's an example.
P.S. If you're worried about statistical accuracy, let me get quantitative. If a function or line of code is on the stack exactly 50% of the time, and you take 10 samples, then the number of samples that show it will be 5 +/- 1.6, for a margin of error of 16%. If the actual time is smaller or larger, the margin of error shrinks. You can also reduce the margin of error by taking more samples. To get 1.6%, take 1000 samples. Actually, once you've found the problem, it's up to you to decide if you need a smaller margin of error.
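For the curious, those figures follow from the binomial standard deviation sqrt(n × p × (1 − p)): with p = 0.5 and n = 10 samples that is sqrt(2.5) ≈ 1.6 samples, or 16% of 10, and with n = 1000 it is about 15.8 samples, roughly 1.6% of 1000.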
gprof gives results either in milliseconds or in microseconds. I do not know the exact rationale, but my experience is that it will display results in microseconds when it thinks there is enough precision for it. To get microsecond output, you need to run the program for a longer time and/or not have any routine that takes too long to run.
oprofile gives you times at clock resolution, i.e. nanoseconds, and it produces output files compatible with gprof, so it is very convenient to use.
http://oprofile.sourceforge.net/news/

performance test of functions

linux gcc 4.4.1 C99
I am wondering what is the best way to test the performance of a C program.
I have some functions that I have implemented. However, I could have used a different design for each function.
Basically, I want to test to see which design gives better performance.
Many thanks,
Take a look at this post on code profilers.
I want to test to see which design gives better performance.
Why does it matter? This is not a flip question! You should have a performance target in mind, and if you meet it, your code is fast enough.
How do you know how fast is "fast enough"? It turns out the user-interface people have good data on the effect of response time on your users' experience:
0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result. (Most people have a reaction time of about 0.1 seconds; jet fighter pilots get down to around 0.08s, i.e., 80ms.)
1 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of directly "driving" your application.
10 seconds is about the limit for keeping the user's attention focused on the app. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is hard to predict or varies a lot.
The quantitative results above apply only to interaction, of course, which is measured in seconds of waiting time. But even if your target is network packets sent, pages of RAM allocated, blocks of disk read/written, or just watts of power consumed, the message I am trying to communicate is that you should have a performance target, that target should be quantified, and the target should be connected to the needs of your users. If you don't have a quantifiable target, you're not doing engineering; you're just whistling in the dark. Unless your goal is to educate yourself (or to satisfy idle curiosity), the question you should be asking is "is my code good enough that I can move on?"
If you're not meeting your performance target, or if you are trying to educate yourself, I think the best combination of readable and detailed information comes from using the valgrind profiler (--tool=callgrind --dump-instr=yes) together with the kcachegrind visualizer.
Mostly you would like to use a profiler. The post Fragsworth pointed to is a good start. Personally, I prefer Shark for Mac OS X, and gprof for Linux.
But in your case, you may also call clock() or getrusage(), for example, in this way:
#include <stdio.h>
#include <time.h>

clock_t t = clock();
for (int i = 0; i < 1000; ++i) my_func();
printf("time = %lf\n", (double)(clock() - t) / CLOCKS_PER_SEC);
A profiler is useful when you want to dig out which part of the code takes the most time. Calling clock()/getrusage() is more convenient (to me) when you want to compare/benchmark different implementations.
You can use gprof, which is a free profiler.
The first thing to find out is whether you need to optimize those functions at all. Unless they are in the critical path for your code, they may be more than fast enough.
If you have profiled your application and found they are slow, one good way to test performance is to call the function a large number of times and find the average time it takes to run.
You should also try to use CPU time instead of wall-clock time, as that is a more accurate gauge.
In addition to profiling, you need to run the code under test from a harness (driver) to average out the readings. In this way your comparisons are not skewed by one-off readings, and you have a large sample population with mean and standard deviation to compare. There are many multi-threaded frameworks that can do the load driving for you.
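A minimal single-threaded sketch of such a harness (run_once and the run count of 30 are hypothetical placeholders), using CPU time via clock() as suggested above and reporting mean and standard deviation:

#include <math.h>
#include <stdio.h>
#include <time.h>

static volatile long sink;

static void run_once(void)           /* placeholder for the implementation under test */
{
    for (int i = 0; i < 100000; ++i)
        sink += i;
}

int main(void)
{
    enum { RUNS = 30 };
    double sum = 0.0, sumsq = 0.0;

    for (int r = 0; r < RUNS; ++r) {
        clock_t start = clock();     /* CPU time, as suggested above */
        run_once();
        double t = (double)(clock() - start) / CLOCKS_PER_SEC;
        sum += t;
        sumsq += t * t;
    }

    double mean = sum / RUNS;
    double var = sumsq / RUNS - mean * mean;
    double sd = sqrt(var > 0 ? var : 0);
    printf("mean %f s, stddev %f s over %d runs\n", mean, sd, RUNS);
    return 0;
}

Link with -lm for sqrt(); a fuller harness would also discard a few warm-up runs before collecting samples.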
