How to do good benchmarking of complex functions?

I am about to embark on very detailed benchmarking of a set of complex functions in C. This is "science level" detail. I'm wondering, what would be the best way to do serious benchmarking? I was thinking about running them, say, 10 times each, averaging the timing results, and giving the standard deviation, for instance, just using <time.h>. What would you guys do to obtain good benchmarks?

Reporting an average and standard deviation gives a good description of a distribution when the distribution in question is approximately normal. However, this is rarely true of computational performance measurements. Instead, performance measurements tend to more closely resemble a Poisson distribution. This makes sense, because not many random events on a computer will cause a program to go faster; essentially all of the measurement noise comes from random events that slow it down. (A normal distribution, by contrast, makes no intuitive sense at all; it would require the belief that a program has a non-zero probability of finishing in negative time.)
In light of this, I find it most useful to report the minimum time over many runs of a program, rather than the average; the noise in the distribution is typically noise in the measuring system rather than meaningful information about the algorithm. For complex algorithms that have early-out conditions and other shortcuts, you need to be a little more careful, but the minimum of many runs, where each run handles a representative balance of inputs, usually works well.
"10 times each" sounds like very few iterations to me. I generally do something on the order of thousands (or more, depending on the function/system) of runs unless that's completely infeasible. At a bare minimum, you need to make sure that you run the timing for sufficiently long as to shake out any dependence on system state, some of which may change at fairly large time granularity.
The other thing you should be aware of is that essentially every system has a platform-specific timer available that is much more accurate than what is available in <time.h>. Find out what it is on your target platform(s) and use it instead.
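For example, on POSIX systems clock_gettime(CLOCK_MONOTONIC) gives nanosecond-resolution timestamps. Here is a minimal sketch combining it with the minimum-of-many-runs approach described above; function_under_test and the run count of 10000 are placeholders for your own code:

#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#include <stdio.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

extern void function_under_test(void);   /* placeholder for the code being benchmarked */

int main(void)
{
    long long best = -1;
    for (int run = 0; run < 10000; ++run) {
        long long start = now_ns();
        function_under_test();
        long long elapsed = now_ns() - start;
        if (best < 0 || elapsed < best)
            best = elapsed;               /* keep the minimum, as suggested above */
    }
    printf("best of 10000 runs: %lld ns\n", best);
    return 0;
}

(On older glibc you may need to link with -lrt.)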

I am assuming you are looking at benchmarking pure algorithmic computation in your program, with no user input or output, which can take unpredictable time.
Now, for purely number-crunching programs, your results could vary depending on when your program actually runs, which will be affected by other ongoing activity in the system. There could be other factors which you may choose to ignore depending on the level of accuracy desired, e.g. the impact of cache misses and of different access times through the memory hierarchy.
One of the methods is, as you suggested, calculating an average over a number of runs.
Or you could try to look at the assembly code and see the instructions generated, and then, based on the processor, get the cycle count for those instructions. This method may not be practical depending on the amount of code you are looking to benchmark. If you are particular about the memory-hierarchy impact, then you may want to control the execution environment very carefully, i.e. where the program is loaded, where its data is loaded, etc. But as I mentioned, depending on the accuracy desired, you may absorb the variation caused by the memory hierarchy into your statistical variation.
You may need to carefully design the test input for your functions to ensure path coverage, and you may choose to publish performance statistics as a function of test input. This will show how the function behaves across a range of inputs.

Related

Energy consumption of an algorithm in c code

I need to calculate the energy consumption of an algorithm in c code. Any ideas how this can be done and if there are pre-defined functions for that?
Thanks in advance,
It is nigh impossible to do it from C code alone. The multiple ways a C compiler is allowed to translate a chunk of C code do not allow for a precise calculation of the energy consumption. You need to know how the compiler translates which code for which architecture.
It is much simpler (for certain degrees of "simple") to count the assembler instructions and multiply them by their respective latencies (listed in the technical specs, e.g. for a randomly chosen MCU from the Wiki list: http://www.atmel.com/Images/doc32002.pdf). This is still not exact; division, for example, might take a different number of CPU cycles depending on the input, CPU architecture, and implementation in the hardware. But it comes quite close and is rather simple, although a bit tedious.
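As a rough worked example (all numbers hypothetical): suppose an MCU runs at 16 MHz and draws 10 mA at 3.3 V while active. Each cycle then costs about (3.3 V × 0.010 A) / 16,000,000 Hz ≈ 2.1 nJ, so a routine you count at 5,000 cycles consumes roughly 10 µJ per invocation.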
And then there are the loops with a completely unknown number of iterations, inputs taking different paths with different runtimes, and much more. The people writing encryption software know more about it, especially how to avoid it. You might not like their solutions.
Otherwise: check what you expect for input (you do know which path each input takes, don't you?), write a test program, go to e.g. https://www.rohde-schwarz.com/ to get a good meter, and measure the power consumption. You also need an engineer who knows how to do that; it's not easy!

Profiling floating point usage in C

Is there an easy way to count the number of multiplications actually executed by a piece of standard C code? The code I have in mind basically just does additions and multiplications, and it's the multiplications that are of primary interest, but it wouldn't hurt to get counts of the other operations as well.
If it were an option, I suppose I could go around replacing 'a * b' with 'multiply(a, b)' and write a cover function for the native * operator, because I really don't care about time performance during this test, but the primary objection to doing that is having to re-work a pile of source code just to run the test.
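Such a cover function might look like the following sketch (the names multiply and mul_count are illustrative, not from the actual code):

static unsigned long mul_count = 0;   /* incremented on every multiplication */

static double multiply(double a, double b)
{
    ++mul_count;
    return a * b;
}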
I have no objection to re-compiling the source, perhaps against some library or with obscure (afaik) options. Valgrind came to mind, but if I understand valgrind's purpose, that's more about tracing values than counting operations.
Compile the source code into assembly language and then search for the multiply instructions.
Note that the optimization level can greatly affect the number of multiply instructions that appear. For loops, you would have to determine how many times the multiplies inside each loop execute and factor that into the result, but if the code is fairly constrained or limited in extent, that should be straightforward.
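For example, with GCC you might do something like this (code.c is a placeholder; the mnemonics to search for vary by target, e.g. floating-point multiplies on x86-64 typically appear as mulsd/mulss):

$ gcc -S -O2 code.c
$ grep -c 'mul' code.s

Since grep -c counts matching lines (and will also match integer multiplies like imul), inspect the assembly by eye before trusting the number.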
Note: a shameless extrapolation of my comment for as much rep as I can skim.
PAPI has two high-level API functions called PAPI_flips and PAPI_flops which can be used to record the FLOPS as well as the number of floating point operations. Additionally, PAPI offers lots of other performance counter monitoring capability, depending on your processor architecture... cache, bus, memory, branches, etc. I think there is support or support is emerging for graphics accelerators and CUDA/GPGPU.
PAPI will need to be installed on your system, but I think it's widespread enough that installation wouldn't be too painful, if you know what you're doing.
The nice thing about PAPI is that you don't need to know anything about the code; just instrument it (the interface is the same as a stopwatch for FLOPS) and run it. It's based on the actual dynamic execution of your program, so it takes into account things that are hard to account for analytically, such as (pseudo-)random behavior, user/variable input, and related branches.
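For illustration, here is a minimal sketch using PAPI's traditional high-level call; this signature comes from the classic high-level API and has changed in recent PAPI releases, so treat it as illustrative, and compute() is a placeholder workload:

#include <stdio.h>
#include <papi.h>                 /* link with -lpapi */

extern void compute(void);        /* placeholder for the code under test */

int main(void)
{
    float rtime, ptime, mflops;
    long long flpops;

    /* The first call starts the counters. */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK)
        return 1;
    compute();
    /* The second call reads real time, process time, FP op count, and MFLOPS. */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK)
        return 1;
    printf("%lld floating-point ops, %.1f MFLOPS\n", flpops, mflops);
    return 0;
}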
If your compiler supports soft-float (i.e. using functions with integer implementations to emulate floating-point), you could compile your program in that mode (-msoft-float in GCC) and use your favorite profiling tool to measure how many times those helpers are invoked (in libgcc, for example, __muldf3 performs double-precision multiplication).
Many processors also have performance counters that can count the number of floating-point operations that have been retired. Depending on the hardware and OS, you may or may not need some amount of kernel support to take advantage of them.
The best that I can think of is (assuming you're running gdb):
If you could identify the points where multiplications are occurring, you could then set tracepoints just prior to the multiplications (or perhaps just after them, depending on the details), then run the program and count the number of tracepoint dumps.
Yes, it is very crude. Certainly there are other solutions; however, I would hesitate to trash my stack for something as simple as a count.

Randomness in Artificial Intelligence & Machine Learning

This question came to my mind while working on 2 projects in AI and ML. What if I'm building a model (e.g. a classification neural network, k-NN, etc.) and this model uses some function that includes randomness? If I don't fix the seed, then I'm going to get different accuracy results every time I run the algorithm on the same training data. However, if I fix it, then some other setting might give better results.
Is averaging a set of accuracies enough to say that the accuracy of this model is xx % ?
I'm not sure If this is the right place to ask such a question/open such a discussion.
Simple answer, yes, you randomize it and use statistics to show the accuracy. However, it's not sufficient to just average a handful of runs. You need, at a minimum, some notion of the variability as well. It's important to know whether "70%" accurate means "70% accurate for each of 100 runs" or "100% accurate once and 40% accurate once".
If you're just trying to play around a bit and convince yourself that some algorithm works, then you can just run it 30 or so times and look at the mean and standard deviation and call it a day. If you're going to convince anyone else that it works, you need to look into how to do more formal hypothesis testing.
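As a sketch of that informal approach (the per-run accuracies below are made up; link with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical accuracies from runs with different seeds. */
    double acc[] = {0.71, 0.68, 0.74, 0.70, 0.69};
    int n = sizeof acc / sizeof acc[0];
    double sum = 0.0, sq = 0.0;

    for (int i = 0; i < n; ++i)
        sum += acc[i];
    double mean = sum / n;

    for (int i = 0; i < n; ++i)
        sq += (acc[i] - mean) * (acc[i] - mean);
    double sd = sqrt(sq / (n - 1));   /* sample standard deviation */

    printf("mean accuracy = %.3f, sd = %.3f\n", mean, sd);
    return 0;
}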
There are models which are naturally dependent on randomness (e.g., random forests) and models which only use randomness as part of exploring the space (e.g., initialisation of values for neural networks), but actually have a well-defined, deterministic, objective function.
For the first case, you will want to use multiple seeds and report average accuracy, std. deviation, and the minimum you obtained. It is often good if you have a way to reproduce this, so just use multiple fixed seeds.
For the second case, you can always tell, just on the training data, which run is best (although it might actually not be the one which gives you the best test accuracy!). Thus, if you have the time, it is good to do, say, 10 runs, and then evaluate the one with the best training error (or validation error; just never evaluate on the test set for this decision). You can go a level up and repeat this whole procedure several times to get a standard deviation too. However, if you find that this variation is significant, it probably means you weren't trying enough initialisations or that you are not using the right model for your data.
Stochastic techniques are typically used to search very large solution spaces where exhaustive search is not feasible. So it's almost inevitable that you will be trying to iterate over a large number of sample points with as even a distribution as possible. As mentioned elsewhere, basic statistical techniques will help you determine when your sample is big enough to be representative of the space as a whole.
To test accuracy, it is a good idea to set aside a portion of your input patterns and avoid training against those patterns (assuming you are learning from a data set). Then you can use the set to test whether your algorithm is learning the underlying pattern correctly, or whether it's simply memorizing the examples.
Another thing to think about is the randomness of your random number generator. Standard random number generators (such as rand from <stdlib.h>) may not make the grade in many cases, so look around for a more robust algorithm.
I'll generalize the answer from what I understand of your question:
I suppose accuracy is always the average accuracy of multiple runs together with the standard deviation. If you consider the accuracy you get using different seeds for the random generator, aren't you actually considering a greater range of input (which should be a good thing)? But you have to take the standard deviation into account when reporting the accuracy. Or did I get your question totally wrong?
I believe cross-validation may give you what you ask about: an averaged, and therefore more reliable, estimate of classification performance. It contains no randomness, except in permuting the data set initially. The variation comes from choosing different train/test splits.

Can gdb or another tool be used to detect parts of a complex program (e.g. loops) that take more time than expected, for targeting optimization?

As the title implies, basically: say we have a complex program and we want to make it faster however we can. Can we somehow detect which loops or other parts of its structure take most of the time, so we can target them for optimization?
edit: Note that, importantly, the software is assumed to be very complex, so we can't check each loop or other structure one by one, putting timers in them, etc.
You're looking for a profiler. There are several around; since you mention gcc you might want to check gprof (part of binutils). There's also Google Perf Tools although I have never used them.
You can use GDB for that, by this method.
Here's a blow-by-blow example of using it to optimize a realistically complex program.
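The manual version of the method is just to pause the program at random moments under gdb and record the call stack each time, roughly like this (myprog is a placeholder):

$ gdb ./myprog
(gdb) run
(interrupt at a random moment with Ctrl-C)
(gdb) bt
(gdb) continue
(repeat the pause/backtrace cycle, say, 10 or 20 times)

Any line that appears on a large fraction of the sampled stacks is responsible for roughly that fraction of the run time.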
You may find "hotspots" that you can optimize, but more generally the things that give you the greatest opportunity for saving time are mid-level function calls that you can avoid.
One example is, say, calling a function to extract information from a database, where the function is being called multiple times, when with some extra coding the result from a prior call could be used.
Often such calls are small and innocent-looking, and you're totally surprised to learn how much they're costing, as an overall percent of time.
Another example is doing some low-level I/O that escapes attention, but actually costs a hefty percent of clock time.
Another example is tidal waves of notifications that propagate from seemingly trivial changes to data.
Another good tool for finding these problems is Zoom.
Here's a discussion of the technical issues, but basically what to look for is:
It should tell you inclusive percent of time, at line-level resolution, not just functions.
a) Only knowing that a function is costly still leaves you wondering where the lines are in it that you should look at.
b) Inclusive percent tells the true cost of the line - how much bottom-line time it is responsible for and would not be spent if it were not there.
It should include both I/O (i.e. blocked) time and CPU time, not just CPU time. A tool that only considers CPU time will not see the first two problems mentioned above.
If your program is interactive, the tool should operate only during the time you care about, and not while waiting for user input. You don't want to include head-scratching time in your program's performance statistics.
gprof breaks it down by function. If you have many different loops in one function, it might not tell you which loop is taking the time. This is a clue to refactor ;-)

performance test of functions

linux gcc 4.4.1 C99
I am wondering what is the best way to test the performance of a C program.
I have some functions that I have implemented. However, I could have used a different design for each function.
Basically, I want to test to see which design gives better performance.
Many thanks,
Take a look at this post on code profilers.
I want to test to see which design gives better performance.
Why does it matter? This is not a flip question! You should have a performance target in mind, and if you meet it, your code is fast enough.
How do you know how fast is "fast enough"? It turns out the user-interface people have good data on the effect of response time on your users' experience:
0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result. (Most people have a reaction time of about 0.1 seconds; jet fighter pilots get down to around 0.08s, i.e., 80ms.)
1 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of directly "driving" your application.
10 seconds is about the limit for keeping the user's attention focused on the app. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is hard to predict or varies a lot.
The quantitative results above apply only to interaction, of course, which is measured in seconds of waiting time. But even if your target is network packets sent, pages of RAM allocated, blocks of disk read/written, or just watts of power consumed, the message I am trying to communicate is that you should have a performance target, that target should be quantified, and the target should be connected to the needs of your users. If you don't have a quantifiable target, you're not doing engineering; you're just whistling in the dark. Unless your goal is to educate yourself (or to satisfy idle curiosity), the question you should be asking is "is my code good enough that I can move on?"
If you're not meeting your performance target, or if you are trying to educate yourself, I think the best combination of readable and detailed information comes from using the valgrind profiler (--tool=callgrind --dump-instr=yes) together with the kcachegrind visualizer.
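For instance (myprog is a placeholder; callgrind writes its data to a file named callgrind.out.<pid>):

$ valgrind --tool=callgrind --dump-instr=yes ./myprog
$ kcachegrind callgrind.out.<pid>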
Mostly you would like to use a profiler. The post pointed by Fragsworth is a good start. Personally, I prefer Shark for Mac OS X, and gprof for Linux.
But in your case, you may also call clock() or getrusage(), for example, in this way:
#include <stdio.h>
#include <time.h>   /* clock(), CLOCKS_PER_SEC */

clock_t t = clock();
for (int i = 0; i < 1000; ++i) my_func();   /* repeat to amortize timer granularity */
printf("time = %f\n", (double)(clock() - t) / CLOCKS_PER_SEC);
A profiler is useful when you want to dig out which part of the code takes the most time. Calling clock()/getrusage() is more convenient (to me) when you want to compare/benchmark different implementations.
You can use gprof, which is a free profiler.
The first thing to find out is whether you need to optimize those functions. Unless they are in the critical path for your code, they may be more than fast enough.
If you have profiled your application and found they are slow, one good way to test performance is to call the function some large number of times and find out the average time it takes to run.
You should also try to use CPU time instead of wall-clock time, as that is a more accurate gauge.
In addition to profiling, you need to run the code under test from a harness (driver) to average out the readings. That way your comparisons are not skewed by one-off readings, and you have a large sample population with a mean and standard deviation to compare. There are many multi-threaded frameworks that can do the load driving for you.
