I am measuring how much time it takes to execute function compute_secret(). The pseudo code below is how I do my measurement.
The problem is that, there is a quite large different in between total_inside (the time its take to execute the function body, measure inside the callee) vs. total_outside (measure inside the caller). For example, total_inside = 127ms, total_outside = 238ms.
I understand that it takes sometime to call and return from function call. But it should not takes "that" long.
So what are the common reasons when calling and returning from function cost too much time ?
int encrypt(args){
time start_inside = get_current_time();
/**
actually compute stuff
**/
time end_inside = get_current_time();
time total_inside = end_inside - start_inside;
}
int main(){
time start_outside = get_current_time();
encrypt(args);
time end_outside = get_current_time();
time total_outside = end_outside - start_outside;
}
From a glance at the psudo-code, I see a number of things:
the first is that it's pseudo-code, which rather hampers our ability to do a full analysis :-) So comments below are based on what I see as most likely options.
you haven't actually posted the code for get_current_time(), the execution time of that may be irrelevant but that's not necessarily the case. Sometimes timing functions get in the way of accuarte timing.
the reason I mention that last point is that the outer time calculation will consist of the function calculations itself, the stack setup and teardown, and the inner time calculation. You should probably just run with the outer time to see if the timings are reduced.
because of the vagaries of code execution, it's often better to time something in a loop and average the time taken. For all we know, the first call may take 250ms and subsequent calls take 1ms.
If you're really concerned about stack setup and teardown (it's usually not too bad but it's definitely not free and, for very short-lived functions, it can often swamp the execution time of the actual work being done), there are options.
For example, you can choose higher optimisation levels, or force your function inline such as by making it a function macro, assuming you protect against all the usual macro-related issues, of course.
Related
I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
doSomething()
drawSomething()
doSomething2()
sendSomething()
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval for every entrance and exit of each procedure,
It would incur huge performance overhead.
I am curious what is an idiomatic way of measuring the performance.
Printing of logging is good enough?
Generally: For repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; easy to distort results unless you understand the implications of doing that; for very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program.
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but different throughput if run repeatedly with independent inputs is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.
I am measuring the cycle count of different C functions which I try to make constant time in order to mitigate side channel attacks (crypto).
I am working with a microcontroller (aurix from infineon) which has an onboard cycle counter which gets incremented each clock tick and which I can read out.
Consider the following:
int result[32], cnt=0;
int secret[32];
/** some other code***/
reset_and_startCounter(); //resets cycles to 0 and starts the counter
int tmp = readCycles(); //read cycles before function call
function(secret) //I want to measure this function, should be constant time
result[cnt++] = readCycles() - tmp; //read out cycles and subtract to get correct result
When I measure the cycles like shown above, I will sometimes receive a different amount of cycles depending on the input given to the function. (~1-10 cycles difference, function itself takes about 3000 cycles).
I was now wondering if it not yet is perfectly constant time, and that the calculations depend on some input. I looked into the function and did the following:
void function(int* input){
reset_and_startCounter();
int tmp = readCycles();
/*********************************
******calculations on input******
*********************************/
result[cnt++] = readCycles() - tmp;
}
and I received the same amount of cycles no matter what input is given.
I then also measured the time needed to call the function only, and to return from the function. Both measurements were the same no matter what input.
I was always using the gcc compiler flags -O3,-fomit-frame-pointer. -O3 because the runtime is critical and I need it to be fast. And also important, no other code has been running on the microcontroller (no OS etc.)
Does anyone have a possible explanation for this. I want to be secure, that my code is constant time, and those cycles are arbitrary...
And sorry for not providing a runnable code here, but I believe not many have an Aurix lying arround :O
Thank you
The Infineon Aurix microcontroller you're using is designed for hard real-time applications. It has intentionally been designed to provide consistent runtime performance -- it lacks most of the features that can lead to inconsistent performance on more sophisticated CPUs, like cache memory or branch prediction.
While showing that your code has constant runtime on this part is a start, it is still possible for your code to have variable runtime when run on other CPUs. It is also possible that a device containing this CPU may leak information through other channels, particularly through power analysis. If making your application resistant to sidechannel analysis is critical, you may want to consider using a part designed for cryptographic applications. (The Aurix is not such a part.)
I want to figure performance of small expressions,to I decide what to use. Consider the below code. Several recursivee calls to it may happen.
void foo(void) {
i++;
if(etc(ch)) {
//..
}
else if(ch == TOKX) {
p=1;
baa();
c=0;
p=0;
}
//more ifs
}
Question:
Recursives calls may happen to foo(),and i should be incremented only if p has non-zero value(it means that it will be used in other part of code) Should I put a if(p) i++; or only leave i++;?
Is to answer(myself) questions like this that I'm looking for some tool. Someonee can believe that it's "loss of time" or say "optmization is root of evil".. but for cases like this,I don't believe that it is applicable to my situation. IMHO. Tell us your opinion if you think otherwise.
A "ideal" tool,could to show how long time each expression take to run.
It make me to think how is software debugging in biggest software's companies like IBM,Microsoft,Sun etc. Maybe it's theme to another thread.. more useful that this here,I think.
Platform: Should be Linux and MS-Windows.
The old adage is something like "don't optimize until you're sure, absolutely positive, you need to".. and there are reasons for that.
That said, here are a few thoughts:
avoid recursion if you can
at a macro level, something like the "time" command in linux can tell you how long your app is running. Put the method in a loop that runs 10k times and measure that, to average out the numbers
if you want to measure time spent in individual functions, profiling is what you want. Visual Studio has some good built-in stuff for this in Windows, but there are many, many options.
http://en.wikipedia.org/wiki/List_of_performance_analysis_tools
First please understand which measurements matter: there's total wall-clock time taken by the program, and there's percent of time each statement is active, where "active" means "on the stack".
Total wall-clock time is easily measured by subtracting system time after from system time before. If it is very short, just loop the code 1000 times, or whatever. You don't need many digits of precision.
Percent of time each statement is active is best measured by means of stack samples taken on wall-clock time (not CPU-only time). Any good profiler based on wall-clock stack sampling will work, such as Zoom or maybe Oprofile. It's not just the taking of samples that's important, but what is presented to you. It is best if it tells you "inclusive percent by line of code", which is simply the percent of stack samples containing the line of code. Again, you don't need many digits of precision, which means you don't need an enormous number of samples.
The reason inclusive percent by line of code is important, as opposed to other measurements (like self-time, function measurements, invocation counts, milliseconds, and so on) is that it represents the fraction of total wall clock time that line is responsible for, and would not be spent if it were not there.
If you could get rid of it, that tells you how much time it would save.
I've got an auxiliary function that does some operations that are pretty costly.
I'm trying to profile the main section of the algorithm, but this auxiliary function gets called a lot within. Consequently, the measured time takes into account the auxillary function's time.
To solve this, I decided to set and restore the time so that the auxillary function appears to be instantaneous. I defined the following macros:
#define TIME_SAVE struct timeval _time_tv; gettimeofday(&_time_tv,NULL);
#define TIME_RESTORE settimeofday(&_time_tv,NULL);
. . . and used them as the first and last lines of the auxiliary function. For some reason, though, the auxiliary function's overhead is still included!
So, I know this is kind of a messy solution, and so I have since moved on, but I'm still curious as to why this idea didn't work.
Can someone please explain why?
If you insist on profiling this way, do not set the system clock. This will break all sorts of things, if you have permission to do it. Basically you should forget you ever heard of settimeofday. What you want to do is call gettimeofday both before and after the function you want to exclude from measurement, and compute the difference. You can then exclude the time spent in this function from the overall time.
With that said, this whole method of "profiling" is highly flawed, because gettimeofday probably (1) takes a significant amount of time compared to what you're trying to measure, and (2) probably involves a transition into kernelspace, which will do some serious damage to your program's cache coherency. This second problem, whereby in attempting to observe your program's performance characteristics you actually change them, is the most problematic.
What you really should do is forget about this kind of profiling (gettimeofday or even gcc's -pg/gmon profiling) and instead use oprofile or perf or something similar. These modern profiling techniques work based on statistically sampling the instruction pointer and stack information periodically; your program's own code is not modified at all, so it behaves as closely as possible to how it would behave with no profiler running.
There are a couple possibilities that may be occurring. One is that Linux tries to keep the clock accurate and adjustments to the clock may be 'smoothed' or otherwise 'fixed up' to try to keep a smooth sense of time within the system. If you are running NTP, it will also try to maintain a reasonable sense of time.
My approach would have been to not modify the clock but instead track time consumed by each portion of the process. The calls to the expensive part would be accumulated (by getting the difference between gettimeofday on entry and exit, and accumulating) and subtracting that from overall time. There are other possibilities for fancier approaches, I'm sure.
As the title implies basically: Say we have a complex program and we want to make it faster however we can. Can we somehow detect which loops or other parts of its structure take most of the time for targeting them for optimizations?
edit: Notice, of importance is that the software is assumed to be very complex and we can't check each loop or other structure one by one, putting timers in them etc..
You're looking for a profiler. There are several around; since you mention gcc you might want to check gprof (part of binutils). There's also Google Perf Tools although I have never used them.
You can use GDB for that, by this method.
Here's a blow-by-blow example of using it to optimize a realistically complex program.
You may find "hotspots" that you can optimize, but more
generally the things that give you the greatest opportunity for saving time are mid-level function calls that you can avoid.
One example is, say, calling a function to extract information from a database, where the function is being called multiple times, when with some extra coding the result from a prior call could be used.
Often such calls are small and innocent-looking, and you're totally surprised to learn how much they're costing, as an overall percent of time.
Another example is doing some low-level I/O that escapes attention, but actually costs a hefty percent of clock time.
Another example is tidal waves of notifications that propagate from seemingly trivial changes to data.
Another good tool for finding these problems is Zoom.
Here's a discussion of the technical issues, but basically what to look for is:
It should tell you inclusive percent of time, at line-level resolution, not just functions.
a) Only knowing that a function is costly still leaves you wondering where the lines are in it that you should look at.
b) Inclusive percent tells the true cost of the line - how much bottom-line time it is responsible for and would not be spent if it were not there.
It should include both I/O (i.e. blocked) time and CPU time, not just CPU time. A tool that only considers CPU time will not see the first two problems mentioned above.
If your program is interactive, the tool should operate only during the time you care about, and not while waiting for user input. You don't want to include head-scratching time in your program's performance statistics.
gprof breaks it down by function. If you have many different loops in one function, it might not tell you which loop is taking the time. This is a clue to refactor ;-)