In which scenario would "unroll-loops" not make the resulting code faster? - c

Taken from GCC manual:
-funroll-loops
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
-funroll-loops implies -frerun-cse-after-loop. This option makes code larger, and may or may not make it
run faster.
According to my understanding, unrolling loops gets rid of branching instructions in the resulting code, which I presume is healthier for CPU pipelines.
But why might it "not make it run faster"?

First of all, it may not make any difference; if your condition is "simple" and executed many times, the branch predictor should quickly pick it up and predict the branch correctly every time until the end of the loop, making the "rolled" code run almost as fast as the unrolled code.
Also, on non-pipelined CPUs the cost of a branch is quite small, so such optimization may not be relevant and code size considerations may be much more important (e.g. when compiling for a microcontroller - remember that gcc targets range from AVR micros to supercomputers).
Another case where unrolling can't speed up a loop is when the loop body is much slower than the looping itself - if e.g. you have a syscall in the loop body, the loop overhead will be negligible compared to the system call.
As for when it may make your code run slower: making the code bigger can slow it down - if your code no longer fits in the cache/memory page/... you'll get a cache/page/... fault and the processor will have to wait for memory to fetch the code before executing it.
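For reference, the transformation itself is mechanically simple. A hand-written sketch of what a 4x unroll does to a trivial loop (the inc_all functions are hypothetical and just for illustration; the unrolled version assumes n is a multiple of 4):
void inc_all(int *a, int n)              /* rolled: one branch per element */
{
    for (int i = 0; i < n; i++)
        a[i] += 1;
}
void inc_all_unrolled4(int *a, int n)    /* unrolled by 4: one branch per four elements */
{
    for (int i = 0; i < n; i += 4) {     /* assumes n % 4 == 0 */
        a[i]     += 1;
        a[i + 1] += 1;
        a[i + 2] += 1;
        a[i + 3] += 1;
    }
}
Larger unroll factors remove proportionally more loop overhead, but they also multiply the code size, which is where the cache/page concerns above come from.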

The answers so far are very good, but I'll add one thing that hasn't been touched on yet: eating up branch predictor slots. If your loop contains a branch, and it's not unrolled, it only consumes one branch predictor slot, so it won't evict other predictions the cpu has made in the outer loops, sister loops, or caller. However, if the loop body is duplicated many times via unrolling, each copy will contain a separate branch which consumes a predictor slot. This kind of performance hit is easily unnoticed, because, like cache eviction issues, it will not be visible in most isolated, artificial measurements of the loop performance. Instead, it will manifest as hurting the performance of other code.
As a great example, the fastest strlen on x86 (even better than the best asm I've seen) is an insanely unrolled loop that simply does:
if (!s[0]) return s-s0;
if (!s[1]) return s-s0+1;
if (!s[2]) return s-s0+2;
/* ... */
if (!s[31]) return s-s0+31;
However, this will tear through branch predictor slots, so for real-world purposes, some sort of vectorized approach is preferable.
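One branch-light alternative that doesn't need SIMD is the classic word-at-a-time (SWAR) trick. Below is a minimal sketch, not the actual implementation referenced above: it checks 8 bytes per iteration with a single conditional branch. Real implementations first advance byte-by-byte to an aligned word so the 8-byte load can never cross a page boundary; this sketch glosses over that and uses memcpy to stay within standard C.
#include <stddef.h>
#include <stdint.h>
#include <string.h>
size_t strlen_swar(const char *s)
{
    const char *p = s;
    for (;;) {
        uint64_t v;
        memcpy(&v, p, sizeof v);                 /* load 8 bytes */
        if ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL)
            break;                               /* some byte in this word is zero */
        p += sizeof v;
    }
    while (*p) p++;                              /* locate the exact terminator */
    return (size_t)(p - s);
}
It uses only one conditional branch per 8 bytes (plus a short tail), so it puts far less pressure on the branch predictor than the 32-way unrolled version.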

I don't think it's common to fill an unrolled loop with conditional exits. That breaks most of the instruction scheduling which unrolling allows. What's more common is to check beforehand that the loop has at least n iterations remaining before entering the unrolled section.
To achieve this the compiler may generate an elaborate preamble and postamble to align the loop data for better vectorisation or better instruction scheduling, and to handle the remainder of the iterations which do not divide evenly into the unrolled section of the loop - roughly the shape sketched below.
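A hand-written C sketch of that shape (hypothetical scale function; a real compiler's preamble may additionally peel iterations to reach an aligned address):
#include <stddef.h>
void scale(float *a, float k, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* guarded unrolled section: only entered with >= 4 iterations left */
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
    for (; i < n; i++)             /* postamble: the 0-3 iterations that don't divide evenly */
        a[i] *= k;
}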
It can turn out (worst possible case) that the loop only runs zero or one time, or maybe twice in exceptional circumstances. Then only a small part of the loop would be executed, but many extra tests would be performed to get there. Worse, the alignment preamble might mean that different branch conditions occur in different calls, causing additional branch misprediction stalls.
These are all meant to cancel out over a large number of iterations, but for short loops this doesn't happen.
On top of this there is the increased code size: all of these unrolled loops together contribute to reducing icache efficiency.
And some architectures special-case very short loops to use their internal buffers without even referring to the cache.
And modern architectures have fairly extensive instruction reordering, even around memory accesses, which means that the compiler's reordering of the loop might offer no additional benefits even in the best case.

For example, when the unrolled function body becomes larger than the cache. Reading from memory is obviously slower.

Say you have a loop with 25 instructions that iterates 1000 times. The extra resources required to handle the 25,000 instructions could very well outweigh the pain caused by branching.
It is also important to note that many kinds of looping branches are very painless, as CPUs have become quite good at branch prediction in simpler situations. For instance, 8 iterations is probably more efficient unrolled, but even 50 is probably better left to the CPU. Note that the compiler is probably better than you are at guessing which is superior.

Unrolling loops should always make the code faster. The trade-off is between faster code and larger code footprint. Tight loops (relatively small amounts of code executed in the body of the loop) which are executed a significant number of times can benefit from unrolling by removing all the loop overhead, and allowing the pipelining to do its thing. Loops which go through many iterations may unroll to a large amount of extra code - faster but maybe unacceptably larger footprint for the performance gain. Loops with a lot going on in the body may not benefit significantly from unrolling - the loop overhead becomes small compared to everything else.

Related

Idiomatic way of performance evaluation?

I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
doSomething()
drawSomething()
doSomething2()
sendSomething()
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval at every entrance and exit of each procedure,
it will incur a huge performance overhead.
I am curious what the idiomatic way of measuring the performance is.
Is printing or logging good enough?
Generally: for repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program;
e.g. loop unrolling and lookup tables often look good because there's no pressure on the I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
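A minimal sketch of that pattern, assuming a POSIX system with clock_gettime(CLOCK_MONOTONIC); the frame count and the commented-out work are placeholders:
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>
#define NFRAMES 1000
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}
int main(void)
{
    static double frame_time[NFRAMES];       /* record now, print later */
    for (int i = 0; i < NFRAMES; i++) {
        double t0 = now_sec();
        /* doSomething(); drawSomething(); doSomething2(); sendSomething(); */
        frame_time[i] = now_sec() - t0;
    }
    for (int i = 0; i < NFRAMES; i++)        /* heavy-weight I/O stays outside the timed loop */
        printf("%f\n", frame_time[i]);
    return 0;
}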
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
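A minimal sketch of that workaround (the buffer size is arbitrary):
#include <stdlib.h>
#include <string.h>
int main(void)
{
    size_t n = 1 << 20;
    double *buf = malloc(n * sizeof *buf);
    if (!buf) return 1;
    memset(buf, 1, n * sizeof *buf);   /* non-zero fill: forces distinct physical pages, warms TLB/caches */
    /* ... timed work on buf goes here ... */
    free(buf);
    return 0;
}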
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root, which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. The exceptions are subnormal (aka denormal) floating point, which is very slow, and, in some cases (e.g. legacy x87 but not SSE2), producing NaN or Inf, which can also be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully; see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of a mispredict or cache miss). It has latency from inputs to outputs, but its throughput when run repeatedly with independent inputs is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (where the address of the next load is easy to calculate) is often much faster than walking a linked list (where the address of the next load isn't available until the previous load completes).
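A sketch of those two shapes in C (foo is just a stand-in for some short pure function):
static double foo(double x) { return x * 1.0000001 + 1.0; }
double chained(double x, long n)               /* latency-bound: each call waits for the previous result */
{
    for (long i = 0; i < n; i++)
        x = foo(x);
    return x;
}
double independent(const double *in, long n)   /* throughput-bound: iterations are independent */
{
    volatile double sink = 0;                  /* volatile keeps the calls from being hoisted or discarded */
    for (long i = 0; i < n; i++)
        sink = foo(in[i]);
    return sink;
}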
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
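Putting the volatile-sink and argc tricks together, a minimal sketch (compute is a hypothetical function under test, and the array size is arbitrary):
static double compute(double x) { return x * x + 3.0; }
int main(int argc, char **argv)
{
    (void)argv;
    enum { N = 1000 };
    static double out[N];
    double input = (double)argc;          /* runtime value: defeats constant-propagation */
    for (int i = 0; i < N; i++)
        out[i] = compute(input + i);
    volatile double sink = out[argc % N]; /* unknown index: the whole array must actually be produced */
    (void)sink;
    return 0;
}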
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.

C loop unrolling limitations?

I am performing optimizations on C for-loops, and I just read up on unrolling and accumulators. If the data in the loop are not dependent on each other, using unrolling and accumulators really takes advantage of parallelism, and the code finishes faster.
So my naive thought was, why not add more accumulators and unroll more times?
I did this, and noticed diminishing returns in the reduction of average cycles per element.
My question is why?
A: Is it because we are running out of registers to work with simultaneously, and information needs to be stored in memory?
B: Or is it because the 'cleanup loop' has to process more elements after the unrolled loop?
Is it a combination of A and B?
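(For reference, "unrolling with accumulators" as described above looks roughly like this sketch, here 4-way for a simple array sum:)
#include <stddef.h>
double sum4(const double *a, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* four independent dependency chains can overlap in flight */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)             /* the cleanup loop from option B */
        s += a[i];
    return s;
}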
I'm not sure if I'm just stating the obvious here, but the main reason why you're seeing diminishing returns from unrolling is simply because you've largely eliminated the overhead from the loop, and the remaining time on the CPU is spent almost entirely in the "useful" work that you're doing.
The benefit of unrolling is that you're eliminating the overhead of the loop itself -- that is, index increment, comparisons, branching, &c. -- not that it makes the useful work of the loop any faster. When you've reached the point where the loop overhead is mostly eliminated, it should be obvious that you aren't going to see further improvements from more unrolling.
On the other hand, there are certainly some aspects of further unrolling that makes performance worse, such as registers spilling to memory, the I-cache working less efficiently, the loop being too large for the trace-cache (on processors that sport such), &c.
More likely A. I ran into this not long ago; I asked myself the same question, and the conclusion I reached was that I had run out of registers, so there were no more fast accumulators available. The clean-up code that processes the remaining elements not covered by the unrolled loop ran for much less time than the main unrolled loop.

For loop run time

I was having trouble understanding the following concepts of how processor speed affects how long a certain loop runs for.
For a computer with a 3 GHz processor that can do one 64-bit arithmetic operation per cycle, how long will the following loop run?
long long int x;
for (x = 0; x <= 0; x--) {}
The compiler may optimize this loop out entirely because it may detect that no result is ever used.
But if the loop is actually compiled, a guess at an upper bound on speed might be two cycles per iteration. Yes, the processor is probably superscalar, so it can sometimes execute more than one instruction in a cycle, but on the other hand one instruction is a branch, which tends to break the pipeline.
So, if we guess two cycles per iteration, then it will take roughly a couple of centuries to run that loop.
irb> 2**63/(3*10**9)/60/60/24/7/52 # => 97 years at one cycle per iteration, so about double that at two
I'm tempted to say that the loop will never finish, as this is much longer than the MTBF for servers, UPS equipment, and power grids, but perhaps you could run it in a VM and checkpoint it periodically. :-)
Of course, there is that Greek fable on the folly of speculation when empirical evidence is available. Why not run the loop for a small amount of time and then calculate the actual result for 2^63 iterations? The speculation is difficult because few people other than the designers really understand today's complex microarchitectures. There are also many practical problems: does the compiler get to unroll the loop? Perhaps you should just write it in assembly so you can measure something specific?

Is a pointer indirection more costly than a conditional?

Is a pointer indirection (to fetch a value) more costly than a conditional?
I've observed that most decent compilers can precompute a pointer indirection to varying degrees--possibly removing most branching instructions--but what I'm interested in is whether the cost of an indirection is greater than the cost of a branch point in the generated code.
I would expect that if the data referenced by the pointer is not in a cache at runtime that a cache flush might occur, but I don't have any data to back that.
Does anyone have solid data (or a justifiable opinion) on the matter?
EDIT: Several posters noted that there is no "general case" on the cost of branching: it varies wildly from chip to chip.
If you happen to know of a notable case where branching would be cheaper (with or without branch prediction) than an in-cache indirection, please mention it.
This is very much dependent on the circumstances.
1 How often is the data in cache (L1, L2, L3), and how often must it be fetched all the way from RAM?
A fetch from RAM will take around 10-40ns. Of course, that will fill a whole cache-line in little more than that, so if you then use the next few bytes as well, it will definitely not "hurt as bad".
2 What processor is it?
Older Intel Pentium 4s were famous for their very long pipeline, and would take 25-30 clock cycles (~15 ns at 2 GHz) to "recover" from a branch that was mispredicted.
3 How "predictable" is the condition?
Branch prediction really helps in modern processors, and they can cope quite well with "unpredictable" branches too, but it does hurt a little bit.
4 How "busy" and "dirty" is the cache?
If you have to throw out some dirty data to fill the cache-line, it will take another 15-50ns on top of the "fetch the data in" time.
The indirection itself will be a fast instruction, but of course, if the next instruction uses the data immediately after, you may not be able to execute that instruction immediately - even if the data is in L1 cache.
On a good day (well predicted, target in cache, wind in the right direction, etc), a branch, on the other hand, takes 3-7 cycles.
And finally, of course, the compiler USUALLY knows quite well what works best... ;)
In summary, it's hard to say for sure, and the only way to tell what is better IN YOUR case would be to benchmark alternative solutions. I would think that an indirect memory access is faster than a jump, but without seeing what code your source compiles to, it's quite hard to say.
It would really depend on your platform. There is no one right answer without looking at the innards of the target CPU. My advice would be to measure it both ways in a test app to see if there is even a noticeable difference.
My gut instinct would be that on a modern CPU, branching through a function pointer and conditional branching both rely on the accuracy of the branch predictor, so I'd expect similar performance from the two techniques if the predictor is presented with similar workloads. (i.e. if it always ends up branching the same way, expect it to be fast; if it's hard to predict, expect it to hurt.) But the only way to know for sure is to run a real test on your target platform.
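(To make that comparison concrete, the two shapes in question, with hypothetical handlers f and g:)
typedef int (*handler_t)(int);
int via_pointer(handler_t h, int x)     /* indirect branch: the CPU must predict the call target */
{
    return h(x);
}
int via_conditional(int use_f, int x, handler_t f, handler_t g)
{
    return use_f ? f(x) : g(x);         /* conditional branch: the CPU must predict the direction taken */
}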
It varies from processor to processor, but depending on the set of data you're working with, a pipeline flush caused by a mispredicted branch (or badly ordered instructions in some cases) can be more damaging to the speed than a simple cache miss.
In the PowerPC case, for instance, branches not taken (but predicted to be taken) cost about 22 cycles (the time taken to re-fill the pipeline), while an L1 cache miss may cost 600 or so memory cycles. However, if you're going to access contiguous data, it may be better to not branch and let the processor cache-miss your data at the cost of 3 cycles (branches predicted to be taken and taken) for every set of data you're processing.
It all boils down to: test it yourself. The answer is not definitive for all problems.
Since the processor has to predict the outcome of the conditional in order to plan which instructions are most likely to be executed next, I would say that the actual cost of the instructions themselves is not what matters.
Conditional instructions are bad efficiency-wise because they make the program flow unpredictable.

Programming without jumps

I'm trying to find articles, books, or anything about programming without jumps (x86 arch). I know that in general it is impossible, but I try to avoid jumps, yet gcc uses jumps many times even with inline functions. Coding only in assembly is some sort of solution, but writing the equivalent of 1000 lines of C that way looks like a hell of a party to my eyes.
Unless your jumps are really random, branch prediction should eliminate most of the overhead involved.
I would dedicate more effort to optimizing memory access patterns in order to improve locality and reduce cache misses. These days, memory latency is the major bottleneck to performance.
Another good direction is improving parallelism (using both vectorized SIMD instructions and, if possible, more than one core).
Optimize only performance critical code, and only once you really know it is performance critical. Do not try to optimize jumps just because you read that they cause a performance hit. Everything causes a performance hit, and the fastest possible code is the code which does nothing. There are other things much worse than jumps.
If you show a particular example of a jump in the generated code, chances are there will be some way to avoid it, but it is more likely that the code you show will contain more serious issues.
One particular way to avoid branches is to use "conditional move" instructions. They can be used e.g. to compute max or min. If you allow the compiler to use the SSE architecture, it assumes the CPU also supports the CMOV/FCOMI/FCOMIP/FUCOMI/FUCOMIP instructions and will use them (beware: sometimes it may be tricky to make the compiler do what you want, see e.g. this gamedev.net discussion).
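(A tiny example of code that compilers typically turn into a conditional move rather than a branch; it's worth checking the generated assembly to confirm on your target:)
static inline int max_int(int a, int b)
{
    return a > b ? a : b;   /* gcc/clang at -O2 usually emit cmov here instead of a jump */
}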
I think you may mean branching. In C there are bit twiddling tricks you can use to speed up certain operations.
See bit hacks:
http://www-graphics.stanford.edu/~seander/bithacks.html
It is not impossible to code without jumps but it seems pointless to try.
In the end if you need to do something more than once then your choices are:
Loop unrolling (i.e. repeating the code instead of looping).
Somehow get the instruction pointer to visit the same code more than once.
The first approach requires knowing the number of iterations in advance and doesn't scale, and the second involves some sort of jump.
Not knowing what your code looks like, it's hard to give any advice. But I will give it a try.
Before you start optimizing, run a profiling tool to locate the problem areas. After optimizing, run the profiling tool again to see if you actually made it faster.
It's hard to actually remove branches, but you can minimize them by doing loop unrolling.
Someone mentioned conditional move instructions; there are plenty of conditional instructions on the ARM architecture, but if they're not executed they translate to a NOP and take one cycle each. I'm not sure how they work on x86. It might actually be slower than using a simple branch, depending on how long the pipeline is.
There are a lot of other optimization tricks you could try before removing branches.

Resources