I've got an auxiliary function that does some operations that are pretty costly.
I'm trying to profile the main section of the algorithm, but this auxiliary function gets called a lot within. Consequently, the measured time takes into account the auxillary function's time.
To solve this, I decided to set and restore the time so that the auxillary function appears to be instantaneous. I defined the following macros:
#define TIME_SAVE struct timeval _time_tv; gettimeofday(&_time_tv,NULL);
#define TIME_RESTORE settimeofday(&_time_tv,NULL);
. . . and used them as the first and last lines of the auxiliary function. For some reason, though, the auxiliary function's overhead is still included!
So, I know this is kind of a messy solution, and so I have since moved on, but I'm still curious as to why this idea didn't work.
Can someone please explain why?
If you insist on profiling this way, do not set the system clock. This will break all sorts of things, if you have permission to do it. Basically you should forget you ever heard of settimeofday. What you want to do is call gettimeofday both before and after the function you want to exclude from measurement, and compute the difference. You can then exclude the time spent in this function from the overall time.
With that said, this whole method of "profiling" is highly flawed, because gettimeofday probably (1) takes a significant amount of time compared to what you're trying to measure, and (2) probably involves a transition into kernelspace, which will do some serious damage to your program's cache coherency. This second problem, whereby in attempting to observe your program's performance characteristics you actually change them, is the most problematic.
What you really should do is forget about this kind of profiling (gettimeofday or even gcc's -pg/gmon profiling) and instead use oprofile or perf or something similar. These modern profiling techniques work based on statistically sampling the instruction pointer and stack information periodically; your program's own code is not modified at all, so it behaves as closely as possible to how it would behave with no profiler running.
There are a couple possibilities that may be occurring. One is that Linux tries to keep the clock accurate and adjustments to the clock may be 'smoothed' or otherwise 'fixed up' to try to keep a smooth sense of time within the system. If you are running NTP, it will also try to maintain a reasonable sense of time.
My approach would have been to not modify the clock but instead track time consumed by each portion of the process. The calls to the expensive part would be accumulated (by getting the difference between gettimeofday on entry and exit, and accumulating) and subtracting that from overall time. There are other possibilities for fancier approaches, I'm sure.
Related
In many debates about the inline keyword in function declarations, someone will point that it can actually make your program slower in some cases – mostly due to code explosion, if I am correct. I have never met such an example in practice myself. What is an actual code where the use of inline can be expected to be detrimental to the performance?
Exactly 10 years and one day ago I did this commit in OpenBSD:
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/include/intr.h.diff?r1=1.3;r2=1.4
The commit message was:
deinline splraise, spllower and setsoftint.
Makes the kernel smaller and faster.
deraadt# ok
As far as I remember the kernel binary shrunk by more than 100kB and not a single test case could be produced that became slower and several macro benchmarks (like compiling the kernel) were measurably faster (5-10% if I recall correctly, but don't quote me on that).
Around the same time I went on a quest to actually measure inline functions in the OpenBSD kernel. I found a few that had minimal performance gains, but the majority had 0 measurable impact and several were making things much slower and were killed. At least one more uninlining had a huge impact and that one was the internal malloc macros (where the idea was to inline malloc if it had a size known at compile time) and packet buffer allocators that shrunk the kernel by 150kB and had a significant performance improvement.
One could speculate, although I have no proof, that this is because the kernel is large and we're struggling to stay inside the cache when executing system calls and every little bit helps. So what actually helped in those cases was just the shrinking of the binary, not the number of instructions executed.
Imagine a function that have no parameters, but intensive computation with a consistent number of intermediate values or register usage. Then Inline that function in code having a consistent number of intermediate values or register usage too.
Having no parameters make the call procedure more lightweight because no stack operations, that are time consuming, are required.
When inlined the compiler have to save many registers, and spill other to be used with the new function, reproducing the process of registers and data backup required for a function call possibly in worst way.
If the backup operations are more expansive, in terms of time and machine cycles, compared with the mechanism of function call, especially if the function is extensively called, then you have a detrimental effect.
This seems to be the case of some specific functions largely used in an OS.
I'm coding a little program that has to sort a large array (up to 4 million text strings). Seems like I'm doing quite well at it, since a combination of radixsort and mergesort already cut the original q(uick)sort execution time in less than half.
Execution time being the main point, since this is what I'm using to benchmark my piece of code.
My question is:
Is there a better (i. e. more reliable) way of benchmarking a program than just time the execution? It kinda works, but the same program (with the same background processes running) usually has slightly different execution times if run twice.
This kinda defeats the purpose of detecting small improvements. And several small improvements could add up to a big one...
Thanks in advance for any input!
Results:
I managed to get gprof to work under Windows (using gcc and MinGW). gcc behaves poorly (considering execution time) compared to my normal compiler (tcc), but it gave me quite some insight.
Try a profiling tool, that will also show you where the program is spending its time. gprof is the classic C profiling tool, at least on Unix.
Look at the time command. It tracks both the CPU time a process uses and the wall-clock time. You can also use something like gprof for profiling your code to find the parts of your program that are actually taking the most time. You could do a lower-tech version of profiling with timers in your code. Boost has a nice timer class, but it's easy to roll your own.
I don't think it's sufficient to just measure how long a piece of code takes to execute. Your environment is a constantly changing thing, so you have to take a statistical approach to measuring execution time.
Essentially you need to take N measurements, discard outliers, and calculate your average, median and standard deviation running time, with an uncertainty measurement.
Here's a good blog explaining why and how to do this (with code): http://blogs.perl.org/users/steffen_mueller/2010/09/your-benchmarks-suck.html
What do you use for timing execution time so far? There's C89 clock() in time.h for starters. On unixoid systems you might find getitimer() for ITIMER_VIRTUAL to measure process CPU time. See the respective manual pages for details.
You can also use a POSIX shell's times utility to benchmark the processor time used by a process and its children. The resolution is system dependent, like just anything about profiling. Try to wrap your C code in a loop, executing it as many times as necessary to reduce the "jitter" in the time the benchmarking reports.
Call your routine from a test harness, whereby it executes N + 1 times. Ignore the timing for the first iteration and then take the average of iterations 1..N. The reason for ignoring the first time is that is is often slightly inflated due to various effects, e.g. virtual memory, code being paged in, etc. The reason for averaging N iterations is that you get rid of artefacts caused by other processes, the scheduler, etc.
If you're running on Linux or similar You might also want to use taskset to pin your code to a specific CPU core (assuming it's single-threaded), ideally not core 0, since this tends to handle all interrupts.
As the title implies basically: Say we have a complex program and we want to make it faster however we can. Can we somehow detect which loops or other parts of its structure take most of the time for targeting them for optimizations?
edit: Notice, of importance is that the software is assumed to be very complex and we can't check each loop or other structure one by one, putting timers in them etc..
You're looking for a profiler. There are several around; since you mention gcc you might want to check gprof (part of binutils). There's also Google Perf Tools although I have never used them.
You can use GDB for that, by this method.
Here's a blow-by-blow example of using it to optimize a realistically complex program.
You may find "hotspots" that you can optimize, but more
generally the things that give you the greatest opportunity for saving time are mid-level function calls that you can avoid.
One example is, say, calling a function to extract information from a database, where the function is being called multiple times, when with some extra coding the result from a prior call could be used.
Often such calls are small and innocent-looking, and you're totally surprised to learn how much they're costing, as an overall percent of time.
Another example is doing some low-level I/O that escapes attention, but actually costs a hefty percent of clock time.
Another example is tidal waves of notifications that propagate from seemingly trivial changes to data.
Another good tool for finding these problems is Zoom.
Here's a discussion of the technical issues, but basically what to look for is:
It should tell you inclusive percent of time, at line-level resolution, not just functions.
a) Only knowing that a function is costly still leaves you wondering where the lines are in it that you should look at.
b) Inclusive percent tells the true cost of the line - how much bottom-line time it is responsible for and would not be spent if it were not there.
It should include both I/O (i.e. blocked) time and CPU time, not just CPU time. A tool that only considers CPU time will not see the first two problems mentioned above.
If your program is interactive, the tool should operate only during the time you care about, and not while waiting for user input. You don't want to include head-scratching time in your program's performance statistics.
gprof breaks it down by function. If you have many different loops in one function, it might not tell you which loop is taking the time. This is a clue to refactor ;-)
I have programmed an embedded software (using C of course) and now I'm considering ways to improve the running time of the system. The most important single module in my system is one very large nested for loop module.
That module consists of two nested for loops that loops max 122500 times. That's not very much yet, but the problem is that inside that nested for loop I have a function call to a function that is in another source file. That specific function consists mostly of two another nested for loops which loops always 22500 times. So now I have to make a function call 122500 times.
I have made that function that is to be called a lot lighter and shorter (yet still works as it should) and now I started to think that would it be faster to rip off that function call and write that process directly inside those first two for loops?
The processor in that system is ARM7TDMI and its frequency is 55MHz. The system itself isn't very time critical so it doesn't have to be real time capable. However the faster it can process its duties the better.
Also would it be also faster to use while loops instead of fors? And any piece of advice about how to improve the running time is appreciated.
-zaplec
TRY IT AND SEE!!
It'll almost certainly make a difference. Function call overhead isn't usually that much of an issue, but at over 100K repetitions it starts to add up.
...But whether or not it makes any real-world difference is something only you can answer, after trying it and timing the results.
As for for vs while... it shouldn't matter unless you actually change the behavior when changing the loop. If in doubt, make your compiler spit out assembler code for both and compare... or just change it and time it.
You need to be careful in the optimizations you make because you aren't always clear on which optimizations the compiler is making for you. Pre-optimization is a common mistake people make. Is it important that your code is readable and easily maintained or slightly faster? Like others have suggested, the best approach is to benchmark the different ways and see if there is a noticeable difference.
If you don't believe your compiler does much in the way of optimization I would look at some older concepts in optimizing C (searches on SO or google should provide some good links).
The ARM processor has an instruction pipeline (cache). When the processor encounters a branch (call) instruction, it must clear the pipeline and reload, thus wasting some time. One objective when optimizing for speed is to reduce the number of reloads to the instruction pipeline. This means reducing branch instructions.
As others have stated in SO, compile your code with optimization set for speed, and profile. I prefer to look at the assembly language listing as well (either printed from the compiler or displayed interwoven in the debugger). Use this as a baseline. If you can't profile, you can use assembly instruction counting as a rough estimate.
The next step is to reduce the number of branches; or the number times a branch is taken. Unrolling loops helps to reduce the number of times a branch is taken. Inlining helps reduce the number of branches. Before applying this fine-tuning techniques, review the design and code implementation to see if branches can be reduced. For example, reduce the number of "if" statements by using Boolean arithmetic or using Karnaugh Maps. My favorite is reducing requirements and eliminating code that doesn't need to be executed.
In the code implementation, move code that doesn't change outside of the for or while loops. Some loops may be reduce to equations (example, replacing a loop of additions with a multiplication). Also, reduce the quantity of iterations, by asking "does this loop really need to be executed this many times").
Another technique is to optimize for Data Oriented Design. Also check this reference.
Just remember to set a limit for optimizing. This is where you decide any more optimization is not generating any ROI or customer satisfaction. Also, apply optimizations in stages; which will allow you to have a deliverable when your manager asks for one.
Run a profiler on your code. If you are just guessing at where you are spending your time, you are probably wrong. A profiler will show what function is taking the most time and you can focus on that. You could be doing something in the function that takes longer than the function call itself. Did you look to see if you can change floating operations to integer, or integer math to shifts? You can spend a lot of time fiddling with things that don't make much difference. Run a profiler on your code and know for sure that the things you are changing will make a difference.
For function vs. inline, unfortunately there is no easy answer. I.e. it depends. See this FAQ. For "for" vs. "while", I wouldn't think there is any significant difference in performance.
In general, a function call should have more overhead than inlining. You really should profile however, as this can be affected quite a bit by your compiler (especially the compile/optimization settings). Some compilers will automatically inline code for example.
I'm doing some prototyping work in C, and I want to compare how long a program takes to complete with various small modifications.
I've been using clock; from K&R:
clock returns the processor time used by the program since the beginning of execution, or -1 if unavailable.
This seems sensible to me, and has been giving results which broadly match my expectations. But is there something better to use to see what modifications improve/worsen the efficiency of my code?
Update: I'm interested in both Windows and Linux here; something that works on both would be ideal.
Update 2: I'm less interested in profiling a complex problem than total run time/clock cycles used for a simple program from start to finish—I already know which parts of my program are slow. clock appears to fit this bill, but I don't know how vulnerable it is to, for example, other processes running in the background and chewing up processor time.
Forget time() functions, what you need is:
Valgrind!
And KCachegrind is the best gui for examining callgrind profiling stats. In the past I have ported applications to linux just so I could use these tools for profiling.
For a rough measurement of overall running time, there's time ./myprog.
But for performance measurement, you should be using a profiler. For GCC, there is gprof.
This is both assuming a Unix-ish environment. I'm sure there are similar tools for Windows, but I'm not familiar with them.
Edit: For clarification: I do advise against using any gettime() style functions in your code. Profilers have been developed over decades to do the job you are trying to do with five lines of code, and provide a much more powerful, versatile, valuable, and fool-proof way to find out where your code spends its cycles.
I've found that timing programs, and finding things to optimize, are two different problems, and for both of them I personally prefer low-tech.
For timing, the trick is to make it take long enough by wrapping a loop around it. For example, if you iterate an operation 1000 times and time it with a stopwatch, then seconds become milliseconds when you remove the loop.
For finding things to optimize, there are pieces of code (terminal instructions and function calls) that are responsible for various fractions of the time. During that time, they are exposed on the stack. So you can wrap a loop around the program to make it take long enough, and then take stackshots. The code to optimize will jump out at you.
In POSIX (e.g. on Linux), you can use gettimeofday() to get higher-precision timing values (microseconds).
In Win32, QueryPerformanceCounter() is popular.
Beware of CPU clock-changing effects, if your CPU decides to clock down during the test, results may be skewed.
If you can use POSIX functions, have a look at clock_gettime. I found an example from a quick google search on how to use it. To measure processor time taken by your program, you need to pass CLOCK_PROCESS_CPUTIME_ID as the first argument to clock_gettime, if your system supports it. Since clock_gettime uses struct timespec, you can probably get useful nanosecond resolution.
As others have said, for any serious profiling work, you will need to use a dedicated profiler.