How can I tell Google Benchmark to not benchmark a line of code? - benchmarking

I am using Google Benchmark to benchmark a library. I have the benchmark set up as follows:
for (auto _ : state) {
    run_function(first, last, v);
}
What I would like is for v to be randomly generated every iteration so that I could get a range of benchmark values and obtain the statistics from them. I can do this via:
std::random_device rand_dev;
std::mt19937 generator(rand_dev());
std::uniform_int_distribution<int> distr(min, max);
for (auto _ : state) {
    v = distr(generator);
    run_function(first, last, v);
}
Some of the functions I am testing are on the order of 10-100ns, so adding in the generator has a significant effect on the results. Is there any way to tell Google Bench to skip a line/block of code?

You can use the PauseTiming and ResumeTiming methods on the State object to stop and restart the clock around the code you want excluded. Be aware, though, that the pause/resume calls add their own overhead to the timing loop; if the function under benchmark is very fast, you may notice it.
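As a rough sketch of how that looks (the run_function body, the data vector, and the 0..100 bounds below are stand-ins for the code in the question, not part of this answer):
#include <benchmark/benchmark.h>
#include <random>
#include <vector>

// Stand-in for the function and iterator range in the question.
static int run_function(std::vector<int>::iterator first,
                        std::vector<int>::iterator last, int v) {
    int count = 0;
    for (auto it = first; it != last; ++it) count += (*it == v);
    return count;
}

static void BM_WithPause(benchmark::State& state) {
    std::vector<int> data(1024, 42);
    std::mt19937 generator(std::random_device{}());
    std::uniform_int_distribution<int> distr(0, 100);   // arbitrary min/max
    for (auto _ : state) {
        state.PauseTiming();                 // clock stopped while generating v
        int v = distr(generator);
        state.ResumeTiming();                // clock running again for the call
        int r = run_function(data.begin(), data.end(), v);
        benchmark::DoNotOptimize(r);         // keep the result from being optimized away
    }
}
BENCHMARK(BM_WithPause);
BENCHMARK_MAIN();
For a 10-100ns function the pause/resume overhead will dominate the numbers, as the next answer explains.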

No, it times multiple iterations of the whole loop, instead of inserting timer start/stop around individual statements or even whole iterations.
Timing is expensive, too, e.g. even rdtsc on x86 takes ~20 clock cycles, and getting useful results from it would require serializing out-of-order execution (e.g. with lfence), destroying performance for many kinds of loops where each iteration does independent work. So it's really not very viable or realistic to time short functions on their own. Google Benchmark leaves it up to you to construct a loop that benchmarks on throughput (independent iterations) or latency (feed the output of one call into an input of the next iteration).
You have to figure out how to create a whole loop you want to benchmark if you want meaningful results for very small amounts of work. You can get results with PauseTiming and ResumeTiming, but they will massively distort things if the timed portion of each iteration isn't at least a few hundred asm instructions, preferably much larger than the CPU's out-of-order exec window.
In this case, use a very cheap PRNG like xorshift+.
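For illustration, a minimal xorshift128+ sketch (the struct name, the seed constants, and the modulo mapping onto [min, max] are my own choices, not from this answer; modulo bias is usually irrelevant when generating benchmark inputs):
#include <cstdint>

struct XorShift128Plus {
    uint64_t s[2] = {0x9E3779B97F4A7C15ULL, 0xBF58476D1CE4E5B9ULL};  // arbitrary non-zero seeds
    uint64_t next() {                        // standard xorshift128+ step: a few shifts, xors, one add
        uint64_t x = s[0];
        const uint64_t y = s[1];
        s[0] = y;
        x ^= x << 23;
        s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
        return s[1] + y;
    }
};

// Inside the timed loop, this replaces the mt19937 + uniform_int_distribution call:
//   XorShift128Plus rng;
//   for (auto _ : state) {
//       int v = min + (int)(rng.next() % (uint64_t)(max - min + 1));
//       run_function(first, last, v);
//   }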
See also Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths re: the impact of serializing execution.
Also Idiomatic way of performance evaluation? re: benchmark pitfalls.

Related

Idiomatic way of performance evaluation?

I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
    doSomething();
    drawSomething();
    doSomething2();
    sendSomething();
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval at every entry and exit of each procedure,
it would incur huge performance overhead.
I am curious what the idiomatic way of measuring the performance is.
Is printing or logging good enough?
Generally: for repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous, or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program,
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
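A minimal sketch of that approach, assuming the doSomething()/drawSomething() calls from the question are the work being measured (the iteration count is illustrative):
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    using clock = std::chrono::steady_clock;
    const int iterations = 1000;
    std::vector<clock::duration> samples;
    samples.reserve(iterations);               // no allocation inside the measured loop

    for (int i = 0; i < iterations; ++i) {
        auto start = clock::now();
        // doSomething(); drawSomething(); doSomething2(); sendSomething();
        samples.push_back(clock::now() - start);
    }

    for (auto d : samples)                     // heavy-weight printing happens after the loop
        std::printf("%lld ns\n", (long long)
            std::chrono::duration_cast<std::chrono::nanoseconds>(d).count());
}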
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
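As a sketch of the fix (buffer size and fill value are arbitrary): allocate, then write every page once before the timed region, so the measurement doesn't include page faults or reads that hit the shared zero page.
#include <cstddef>
#include <cstring>

int main() {
    const size_t n = 64 * 1024 * 1024;
    char* buf = new char[n];        // uninitialized: pages not yet faulted in / backed by the zero page
    std::memset(buf, 1, n);         // dirty every page once, outside the timed region
    // ... timed reads/writes of buf go here ...
    delete[] buf;
}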
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but its throughput is higher if it's run repeatedly with independent inputs. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
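To make the latency vs. throughput distinction concrete, here is a sketch (my example, not from the original answer): the same floating-point adds, once as a single dependency chain and once split across four independent accumulators. Compiled with -O2 and without -ffast-math (so the compiler can't reassociate the sums), the second version is typically several times faster on a modern x86.
#include <cstdio>
#include <vector>

double sum_serial(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;           // each add depends on the previous one: latency-bound
    return s;
}

double sum_parallel(const std::vector<double>& v) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {  // four independent chains keep the FP adder busy
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}

int main() {
    std::vector<double> v(1 << 20, 1.0);
    std::printf("%f %f\n", sum_serial(v), sum_parallel(v));
}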
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
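With Google Benchmark, benchmark::DoNotOptimize() is that function; a minimal sketch (the square() function and the Arg value are placeholders):
#include <benchmark/benchmark.h>

static int square(int x) { return x * x; }     // stand-in for the code under test

static void BM_Square(benchmark::State& state) {
    int input = (int)state.range(0);            // a runtime value, not a compile-time constant
    for (auto _ : state) {
        benchmark::DoNotOptimize(input);        // compiler must treat the input as unknown
        int result = square(input);
        benchmark::DoNotOptimize(result);       // compiler must actually materialize the result
    }
}
BENCHMARK(BM_Square)->Arg(42);
BENCHMARK_MAIN();
Passing the input through DoNotOptimize keeps the compiler from constant-propagating it; passing the result through keeps the call from being dead-code-eliminated.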
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.

In which scenario would "unroll-loops" not make the resulting code faster?

Taken from GCC manual:
-funroll-loops
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
-funroll-loops implies -frerun-cse-after-loop. This option makes code larger, and may or may not make it run faster.
According to my understanding, unrolling loops will get rid of branching instructions in the resulting code, which I presume is healthier for CPU pipelines.
But why would it "may not make it run faster"?
First of all, it may not make any difference; if your condition is "simple" and executed many times, the branch predictor should quickly pick it up and correctly predict the branch every time until the end of the loop, making the "rolled" code run almost as fast as the unrolled code.
Also, on non-pipelined CPUs the cost of a branch is quite small, so such optimization may not be relevant and code size considerations may be much more important (e.g. when compiling for a microcontroller - remember that gcc targets range from AVR micros to supercomputers).
Another case where unrolling can't speed up a loop is when the loop body is much slower than the looping itself - if e.g. you have a syscall in the loop body, the loop overhead will be negligible compared to the system call.
As for when it may make your code run slower, making the code bigger can slow it down - if your code doesn't fit anymore in cache/memory page/... you'll have a cache/page/... fault and the processor will have to wait for the memory to fetch the code before executing it.
The answers so far are very good, but I'll add one thing that hasn't been touched on yet: eating up branch predictor slots. If your loop contains a branch, and it's not unrolled, it only consumes one branch predictor slot, so it won't evict other predictions the cpu has made in the outer loops, sister loops, or caller. However, if the loop body is duplicated many times via unrolling, each copy will contain a separate branch which consumes a predictor slot. This kind of performance hit is easily unnoticed, because, like cache eviction issues, it will not be visible in most isolated, artificial measurements of the loop performance. Instead, it will manifest as hurting the performance of other code.
As a great example, the fastest strlen on x86 (even better than the best asm I've seen) is an insanely unrolled loop that simply does:
if (!s[0]) return s-s0;
if (!s[1]) return s-s0+1;
if (!s[2]) return s-s0+2;
/* ... */
if (!s[31]) return s-s0+31;
However, this will tear through branch predictor slots, so for real-world purposes, some sort of vectorized approach is preferable.
I don't think it's common to fill an unrolled loop with conditional exits. That breaks most of the instruction scheduling which unrolling allows. What's more common is to check beforehand that the loop has at least n iterations remaining before entering into the unrolled section.
To achieve this the compiler may generate elaborate preamble and postamble code to align the loop data for better vectorisation or better instruction scheduling, and to handle the remainder of the iterations which do not divide evenly into the unrolled section of the loop.
It can turn out (worst possible case) that the loop only runs zero or one time, or maybe twice in exceptional circumstances. Then only a small part of the loop would be executed, but many extra tests would be performed to get there. Worse; the alignment preamble might mean that different branch conditions occur in different calls, causing additional branch misprediction stalls.
These are all meant to cancel out over a large number of iterations, but for short loops this doesn't happen.
On top of this, you have the increased code size, where all of these unrolled loops together contribute to reducing icache efficiency.
And some architectures special-case very short loops to use their internal buffers without even referring to the cache.
And modern architectures have fairly extensive instruction reordering, even around memory accesses, which means that the compiler's reordering of the loop might offer no additional benefits even in the best case.
For example, when the unrolled function body becomes larger than the cache. Reading code from memory is obviously slower.
Say you have a loop with 25 instructions and iterates 1000 times. The extra resources required to handle the 25,000 instructions could very well override the pain caused by branching.
It is also important to note that many kinds of looping branches are very painless, as the CPU has gotten quite good at branch prediction for simpler situations. For instance, 8 iterations is probably more efficient unrolled, but even 50 is probably better left to the CPU. Note that the compiler is probably better at guessing which is superior than you are.
Unrolling loops should always make the code faster. The trade-off is between faster code and larger code footprint. Tight loops (relatively small amounts of code executed in the body of the loop) which are executed a significant number of times can benefit from unrolling by removing all the loop overhead, and allowing the pipelining to do its thing. Loops which go through many iterations may unroll to a large amount of extra code - faster but maybe unacceptably larger footprint for the performance gain. Loops with a lot going on in the body may not benefit significantly from unrolling - the loop overhead becomes small compared to everything else.
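For concreteness, a hand-written sketch of roughly what 4x unrolling turns a simple reduction into, including the remainder ("postamble") loop discussed above; this is my illustration, not code from any of the answers, and whether it actually wins depends on the iteration count, the branch predictor, and code-size pressure.
#include <cstddef>

long long sum_rolled(const int* a, size_t n) {
    long long s = 0;
    for (size_t i = 0; i < n; ++i) s += a[i];      // one compare+branch per element
    return s;
}

long long sum_unrolled4(const int* a, size_t n) {
    long long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {                   // one compare+branch per 4 elements
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; ++i) s += a[i];                  // remainder ("postamble") loop
    return s;
}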

For loop run time

I was having trouble understanding the following concepts of how processor speed affects how long a certain loop runs for.
For a computer with a 3GHz processor that can do 64-bit arithmetic each cycle, how long will the following loop run?
long long int x;
for (x = 0; x <= 0; x--) {}
The compiler may optimize this loop out entirely because it may detect that no result is ever used.
But if the loop is actually compiled, a guess at an upper bound on speed might be two cycles per iteration. Yes, the processor is probably superscalar, so it can sometimes execute more than one instruction in a cycle, but on the other hand one instruction is a branch, which tends to break the pipeline.
So, if we guess two cycles, then it will take about a century to run that loop.
irb> 2**63/(3*10**9)/60/60/24/7/52 # => 97 years
I'm tempted to say that the loop will never finish, as this is much longer than the MTBF for servers, UPS equipment, and power grids, but perhaps you could run it in a VM and checkpoint it periodically. :-)
Of course, there is that Greek fable on the folly of speculation when empirical evidence is available. Why not run the loop for a small amount of time and then calculate the actual result for 2^63 iterations? The speculation is difficult because few people other than the designers really understand today's complex microarchitectures. There are also many practical problems: does the compiler get to unroll the loop? Perhaps you should just write it in assembly so you can measure something specific?
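In that spirit, a rough sketch of "measure a slice, then extrapolate": time a much smaller iteration count and scale up to 2^63. The volatile store is only there to stop the compiler from deleting the loop; it makes each iteration slower than a bare counting loop, so treat the result as a crude upper bound.
#include <chrono>
#include <cstdio>

int main() {
    const long long sample_iters = 1LL << 30;     // ~10^9 iterations, a second or two of runtime
    volatile long long sink = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (long long x = sample_iters; x > 0; --x) sink = x;   // stand-in for the empty loop
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double scaled = secs * ((double)(1ULL << 63) / (double)sample_iters);
    std::printf("sample: %.2f s; extrapolated to 2^63 iterations: %.0f years\n",
                secs, scaled / (3600.0 * 24 * 365));
}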

how to count cycles?

I'm trying to find the relative merits of 2 small functions in C. One that adds by loop, one that adds by explicit variables. The functions are irrelevant themselves, but I'd like someone to teach me how to count cycles so as to compare the algorithms. So f1 will take 10 cycles, while f2 will take 8. That's the kind of reasoning I would like to do. No performance measurements (e.g. gprof experiments) at this point, just good old instruction counting.
Is there a good way to do this? Are there tools? Documentation? I'm writing C, compiling with gcc on an x86 architecture.
http://icl.cs.utk.edu/papi/
PAPI_get_real_cyc(3) - return the total number of cycles since some arbitrary starting point
The assembler instruction rdtsc (Read Time-Stamp Counter) returns in the EDX:EAX registers the current CPU tick count, counted from CPU reset. If your CPU is running at 3GHz then one tick is 1/3GHz.
EDIT:
Under MS Windows the API call QueryPerformanceFrequency returns the number of ticks per second.
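From C or C++ the usual way to read the TSC is the compiler intrinsic rather than hand-written inline asm; a sketch (GCC/Clang header shown, MSVC has the same __rdtsc() in <intrin.h>). Keep in mind that on modern x86 the TSC counts reference cycles at a fixed rate, not actual core clock cycles.
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

int main() {
    uint64_t start = __rdtsc();
    // ... code to measure ...
    uint64_t end = __rdtsc();
    std::printf("elapsed TSC ticks: %llu\n", (unsigned long long)(end - start));
}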
Unfortunately, timing the code is as error prone as visually counting instructions and clock cycles. Whether you use a debugger, some other tool, or re-compile the code in a "run it 10000000 times and time it" harness, you change where things land in the cache lines, the frequency of the cache hits and misses, etc. You can mitigate some of this by adding or removing some code upstream from the module of code being tested (to add or remove a few instructions and change the alignment of your program and sometimes of your data).
With experience you can develop an eye for performance by looking at the disassembly (as well as the high level code). There is no substitute for timing the code; the problem is that timing the code is error prone. The experience comes from many experiments and trying to fully understand why adding or removing one instruction made no difference or a dramatic one, and why code added or removed in a completely different, unrelated area of the module under test made huge performance differences on the module under test.
As GJ has written in another answer I also recommend using the "rdtsc" instruction (rather than calling some operating system function which looks right).
I've written quite a few answers on this topic. Rdtsc allows you to calculate the elapsed clock cycles in the code's "natural" execution environment rather than having to resort to calling it ten million times which may not be feasible as not all functions are black boxes.
If you want to calculate elapsed time you might want to shut off energy-saving on the CPUs. If it's only a matter of clock cycles this is not necessary.
If you are trying to compare the performance, the easiest way is to put your algorithm in a loop and run it 1000 or 1000000 times.
Once you are running it enough times that the small differences can be seen, run time ./my_program which will give you the amount of processor time that it used.
Do this a few times to get a sampling and compare the results.
Trying to count instructions won't help you on x86 architecture. This is because different instructions can take significantly different amounts of time to execute.
I would recommend using simulators. Take a look at PTLsim; it will give you the number of cycles. Other than that, maybe you would like to take a look at some tools to count the number of times each assembly line is executed.
Use gcc -S your_program.c. The -S flag tells gcc to generate the assembly listing, which will be named your_program.s.
There are plenty of high performance clocks around. QueryPerformanceCounter is Microsoft's. The general trick is to run the function tens of thousands of times and time how long it takes. Then divide the time taken by the number of loops. You'll find that each loop takes a slightly different length of time, so this testing over multiple passes is the only way to truly find out how long it takes.
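A Windows-only sketch of that recipe (the repetition count and the dummy work are placeholders for the function under test):
#include <windows.h>
#include <cstdio>

int main() {
    const int reps = 100000;
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);          // ticks per second
    QueryPerformanceCounter(&start);
    volatile int sink = 0;
    for (int i = 0; i < reps; ++i)
        sink = sink + i;                       // stand-in for the function under test
    QueryPerformanceCounter(&end);
    double seconds = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    std::printf("%.3f ns per call\n", seconds * 1e9 / reps);
}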
This is not really a trivial question. Let me try to explain:
There are several tools on different OSes to do exactly what you want, but those tools are usually part of a bigger environment. Every instruction is translated into a certain number of cycles, depending on the CPU the compiler generated code for and the CPU the program is executed on.
I can't give you a definitive answer, because I do not have enough data to pass judgement on, but I work for IBM in the database area and we use tools to measure cycles and instructions for our code, and those traces are only valid for the actual CPU the program was compiled for and was running on.
Depending on the internal structure of your CPU's pipelining and on the efficiency of your compiler, the resulting code will most likely still have cache misses and other areas you have to worry about. (In that case you may want to look into FDPR...)
If you want to know how many cycles your program needs to run on your CPU (having been compiled with your compiler), you have to understand how the CPU works and how the compiler generated the code.
I'm sorry if the answer is not sufficient to solve your problem at hand. You said you are using gcc on an x86 arch. I would work with getting the assembly code mapped to your CPU.
I'm sure you will find some areas, where gcc could have done a better job...

measure time to execute single instruction

Is there a way using C or assembler or maybe even C# to get an accurate measure of how long it takes to execute an ADD instruction?
Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.
On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.
On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:
.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
Add eax, ebx
; end of sequence under test
CPUID
RDTSC
Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a few details, of course -- at minimum you need to:
set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract the result of the first RDTSC from the second
Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.
Then execute the same loop again, this time with the code under test in the body. Time for this loop, minus the overhead (from the empty loop case) is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.
Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If it's a significant chunk of code, a few tens of thousands might suffice.
In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for how to get a measurement w/o the repetition, but the repetition technique is always available, for example, an embedded processor that doesn't have the nice timing instructions mentioned by others.
Note however, that on modern pipelined processors, instruction level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of a given instruction take N times as long as a single one.
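A sketch of the empty-loop-subtraction technique in C++ (the iteration count and the "work" statement are placeholders; on an out-of-order CPU the caveat above still applies, so the per-operation number is only approximate):
#include <chrono>
#include <cstdio>

volatile long long sink;    // the volatile store keeps both loops from being optimized away

int main() {
    const long long iters = 10000000;
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    for (long long i = 0; i < iters; ++i) sink = i;           // overhead-only baseline loop
    auto t1 = clock::now();
    for (long long i = 0; i < iters; ++i) sink = i * 3 + 1;   // same loop plus the work under test
    auto t2 = clock::now();

    auto ns = [](clock::duration d) {
        return std::chrono::duration<double, std::nano>(d).count();
    };
    std::printf("~%.2f ns per operation\n", (ns(t2 - t1) - ns(t1 - t0)) / (double)iters);
}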
Okay, the problem you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS or any of the others is that there are lots of processes already running on your machine in the background, which will impact performance. The only real way of calculating the actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply want to find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and those results are available to the public. All you need to do is ask them and they'll send you a free CD-ROM (it might be a DVD, but that's pedantry) with the results. You can do it yourself, but be warned that especially Intel processors contain many redundant instructions that are no longer desirable, let alone necessary. This will take up a lot of your time, but I can absolutely see the fun in doing it. PS. If it's purely to help push your own machine's hardware to its theoretical maximum in a personal project, then Just Jeff's answer above is excellent for generating tidy instruction-speed averages under real-world conditions.
No, but you can calculate it based upon the number of clock cycles the add instruction requires multiplied by the clock rate of the CPU. Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
That said, why do you care?
