I am performing optimizations on C for-loops, and I just read up on unrolling and accumulators. If the data in the loop is not dependent on each other, using unrolling and multiple accumulators really takes advantage of parallelism, and the code finishes faster.
So my naive thought was, why not add more accumulators and unroll more times?
I did this, and noticed that there was diminishing returns in the reduction of average cycles per element.
My question is why?
A: Is it because we are running out of registers to work with simultaneously, and information needs to be stored in memory?
B: Or is it because the 'cleanup loop' has to process more elements after the unrolled loop?
Is it a combination of A and B?
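For concreteness, here is a minimal sketch of the kind of transformation I mean (the 4x factor and the names are purely illustrative):

/* plain reduction: every add depends on the previous sum, so it's latency-bound */
float sum_plain(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* 4x unrolled with four independent accumulators (note: this changes the FP rounding order) */
float sum_unrolled4(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* cleanup loop for the leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}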
I'm not sure if I'm just stating the obvious here, but the main reason why you're seeing diminishing returns from unrolling is simply because you've largely eliminated the overhead from the loop, and the remaining time on the CPU is spent almost entirely in the "useful" work that you're doing.
The benefit of unrolling is that you're eliminating the overhead of the loop itself -- that is, index increment, comparisons, branching, &c. -- not that it makes the useful work of the loop any faster. When you've reached the point where the loop overhead is mostly eliminated, it should be obvious that you aren't going to see further improvements from more unrolling.
On the other hand, there are certainly some aspects of further unrolling that make performance worse, such as registers spilling to memory, the I-cache working less efficiently, the loop being too large for the trace-cache (on processors that sport such), &c.
More likely, A. I ran into this not long ago: I asked myself the same question, and the conclusion I reached was that I had run out of registers, so there were no more fast accumulators to use. The clean-up code that processes the remaining elements not covered by the unrolled loop runs for much less time than the main unrolled loop.
Related
I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
    doSomething();
    drawSomething();
    doSomething2();
    sendSomething();
}
The main loop runs more than 60 times per second.
I want to see a performance breakdown: how much time each procedure takes.
My concern is that if I print the time interval at every entrance and exit of each procedure, it would incur a huge performance overhead.
I am curious what the idiomatic way of measuring this performance is. Is printing or logging good enough?
Generally: For repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure them separately by making one iteration use the result of the previous one or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program.
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
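A minimal sketch of that idea (clock_gettime, the now_sec helper, and the 1000-sample buffer are my own illustrative choices; drawSomething stands in for whichever procedure you're timing):

#include <stdio.h>
#include <time.h>

#define SAMPLES 1000                     /* arbitrary: how many iterations to record */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static double draw_time[SAMPLES];        /* one slot per timed iteration */

int main(void) {
    for (int i = 0; i < SAMPLES; i++) {
        double t0 = now_sec();
        /* drawSomething(); */           /* the procedure under test */
        draw_time[i] = now_sec() - t0;
    }
    for (int i = 0; i < SAMPLES; i++)    /* print after the loop, not inside it */
        printf("%d %.9f\n", i, draw_time[i]);
    return 0;
}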
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
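For example, a minimal sketch of dirtying the buffer before the timed region (the 64 MiB size is arbitrary):

#include <stdlib.h>
#include <string.h>

int main(void) {
    size_t bytes = 64u << 20;       /* 64 MiB, arbitrary size for illustration */
    char *buf = malloc(bytes);
    if (!buf) return 1;
    memset(buf, 1, bytes);          /* write every page up front, so the timed code below
                                       doesn't pay for soft page faults / copy-on-write
                                       of the shared zero page */
    /* ... timed benchmark over buf goes here ... */
    free(buf);
    return 0;
}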
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, and a different (higher) throughput when run repeatedly with independent inputs. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
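Putting the last two points together, a rough GNU C sketch (argc as the opaque input, a volatile sink for the output; the empty asm statement is a GNU-specific optimization barrier):

int main(int argc, char **argv) {
    (void)argv;
    double out[1024];
    double x = argc;                        /* runtime-variable input: defeats constant-propagation */
    for (int i = 0; i < 1024; i++)
        out[i] = x * i + 1.0;               /* the "work" being timed */

    volatile double sink = out[argc];       /* compiler must materialize the array to produce this */
    (void)sink;
#ifdef __GNUC__
    __asm__ volatile("" ::: "memory");      /* optimization barrier; compiles to no instructions */
#endif
    return 0;
}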
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.
This question arose in the context of optimizing code to remove potential branch-prediction failures; in fact, removing branches altogether.
For my example, a typical for-loop uses the following syntax:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

#define N 1000000   /* example bound; any compile-time constant will do */

int main()
{
    bool *S = calloc(N + 1, sizeof(bool));
    int p = 2;
    for (int i = p * p; i <= N; i += p)
        S[i] = 1;
    return 0;
}
As I understand it, the assembly code produced would contain some sort of JMP instruction to check whether i <= N.
The only way I can conceive doing this in assembly would be to just repeat the same assembly instructions n times with i being incremented in each repetition. However, for large n, the binary produced would be huge.
So, I'm wondering, is there some loop construct that repeats n times without calculating some conditional?
Fully unrolling is the only practical way, and you're right that it's not appropriate for huge iteration counts (and usually not for loops where the iteration count isn't a compile-time constant). It's great for small loops with a known iteration count.
For your specific case, of setting memory to 1 over a given distance, x86 has rep stosb, which implements memset in microcode. It doesn't suffer from branch misprediction on current Intel CPUs, because microcode branches can't be predicted / speculatively executed and stall instead (or something), which means it has about 15 cycles of startup overhead before doing any stores. See What setup does REP do? For large aligned buffers, rep stosb/d/q is pretty close to an optimized AVX loop.
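If you want to play with it from C, a minimal GNU C inline-asm wrapper might look like this (a sketch only; in practice the compiler / libc will usually pick rep stosb or SIMD for you when you call memset):

#include <stddef.h>

/* store `value` into len bytes at dst using rep stosb (x86-64, GNU C inline asm) */
static void rep_stosb_fill(void *dst, unsigned char value, size_t len) {
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(len)   /* rdi = destination, rcx = count (both clobbered) */
                     : "a"(value)             /* al = byte to store */
                     : "memory");
}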
So you don't get a mispredict, but you do get a fixed startup overhead. (I think this stalls issue / execute of instructions after the rep stos, because the microcode sequencer takes over the front end until it's done issuing all the uops. And it can't know when it's done until it's executed some that look at rcx. Issue happens in-order, so later independent instructions can't even get into the out-of-order part of the core until after rep stos figures out which uops to issue. The stores don't have to execute, but the microcode branches do.) Icelake is supposed to have "fast short REP MOV", which may finally solve the startup-overhead issue. Possibly by adding a dedicated hardware state-machine for rep movs and rep stos, like #KrazyGlew wishes he had when designing fast-strings in Intel's P6 uarch (ppro/PII/PIII), the ancestor of current Intel CPUs which still use very similar microcoded implementations. AFAIK, AMD is similar, but I haven't seen numbers or details for their rep stos startup overhead.
Most architectures don't have a single-instruction memset like that, so x86 is definitely a special case in computer architecture. But some old computers (like Atari ST) could have a "blitter" chip to offload copying memory, especially for graphics purposes. They'd use DMA to do the copying separately from the CPU altogether. This would be very similar to building a hardware state machine on-chip to run rep movs.
Misprediction of normal loop branches
Consider a normal asm loop, like
.looptop:                  # do {
    # loop body, including a pointer increment
    # (may be unrolled some, to do multiple stores per conditional branch)
    cmp  rdi, rdx
    jb   .looptop          # } while (dst < end_dst);
The branch ends up strongly predicted-taken after the loop has run a few times.
For large iteration counts, one branch mispredict is amortized over all the loop iterations and is typically negligible. (The conditional branch at the bottom of the loop is predicted taken, so the loop branch mispredicts the one time it's not taken.)
Some branch predictors have special support for loop branches, with a pattern predictor that can count patterns like taken 30 times / not taken. With some large limit for the max iteration count they can correctly predict.
Or a modern TAGE predictor (like in Intel Sandybridge-family) uses branch history to index the entry, so it "naturally" gives you some pattern prediction. For example, Skylake correctly predicts the loop-exit branch for inner loops up to about 22 iterations in my testing. (When there's a simple outer loop that re-runs the inner loop with the same iteration count repeatedly.) I tested in asm, not C, so I had control of how much unrolling was done.
A medium-length loop is the worst case for this: too long for the exit to be predicted correctly, but short enough that the mispredict on loop exit happens frequently. For example, an inner loop that repeatedly runs ~30 iterations on a CPU that can't predict that pattern.
The cost of one mispredict on the last iteration can be quite low. If out-of-order execution can "see" a few iterations ahead for a simple loop counter, it can have the branch itself executed before spending much time on real work beyond that mispredicted branch. With fast recovery for branch misses (not flushing the whole out-of-order pipeline by using something like a Branch Order Buffer), you still lose the chance to get started on independent work after the loop for a few cycles. This is most likely if the loop bottlenecks on latency of a dependency chain, but the counter itself is a separate chain. This paper about the real cost of branch misses is also interesting, and mentions this point.
(Note that I'm assuming that branch prediction is already "primed", so the first iteration correctly predicts the loop branch as taken.)
Impractical ways:
@Hadi linked an amusing idea: instead of running code normally, compile in a weird way where control and instructions are all data, for example x86 using only mov instructions with x86 addressing modes + registers (and an unconditional branch at the bottom of a big block). Is it possible to make decisions in assembly without using `jump` and `goto` at all? This is hilariously inefficient, but doesn't actually have any conditional branches: everything is a data dependency.
It uses different forms of MOV (between registers, and immediate to register, as well as load/store), so it's not a one-instruction-set computer.
A less-insane version of this is an interpreter: different instructions / operations in the code being interpreted turn into control dependencies in the interpreter (creating a hard-to-solve efficiency problem for the interpreter; see Darek Mihocka's "The Common CPU Interpreter Loop Revisited" article), but data and control flow in the guest code are both just data in the interpreter. The guest program counter is just another piece of data in the interpreter.
Taken from GCC manual:
-funroll-loops
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
-funroll-loops implies -frerun-cse-after-loop. This option makes code larger, and may or may not make it run faster.
According to my understanding, unrolling loops will get rid of branching instructions in the resulting code; I presume that is healthier for CPU pipelines.
But why would it "may not make it run faster"?
First of all, it may not make any difference; if your condition is "simple" and executed many times, the branch predictor should quickly pick it up and correctly predict the branch every time until the end of the loop, making the "rolled" code run almost as fast as the unrolled code.
Also, on non-pipelined CPUs the cost of a branch is quite small, so such optimization may not be relevant and code size considerations may be much more important (e.g. when compiling for a microcontroller - remember that gcc targets range from AVR micros to supercomputers).
Another case where unrolling can't speed up a loop is when the loop body is much slower than the looping itself - if e.g. you have a syscall in the loop body, the loop overhead will be negligible compared to the system call.
As for when it may make your code run slower, making the code bigger can slow it down - if your code doesn't fit anymore in cache/memory page/... you'll have a cache/page/... fault and the processor will have to wait for the memory to fetch the code before executing it.
The answers so far are very good, but I'll add one thing that hasn't been touched on yet: eating up branch predictor slots. If your loop contains a branch, and it's not unrolled, it only consumes one branch predictor slot, so it won't evict other predictions the cpu has made in the outer loops, sister loops, or caller. However, if the loop body is duplicated many times via unrolling, each copy will contain a separate branch which consumes a predictor slot. This kind of performance hit is easily unnoticed, because, like cache eviction issues, it will not be visible in most isolated, artificial measurements of the loop performance. Instead, it will manifest as hurting the performance of other code.
As a great example, the fastest strlen on x86 (even better than the best asm I've seen) is an insanely unrolled loop that simply does:
if (!s[0]) return s-s0;
if (!s[1]) return s-s0+1;
if (!s[2]) return s-s0+2;
/* ... */
if (!s[31]) return s-s0+31;
However, this will tear through branch predictor slots, so for real-world purposes, some sort of vectorized approach is preferable.
I don't think it's common to fill an unrolled loop with conditional exits. That breaks most of the instruction scheduling which unrolling allows. What's more common is to check beforehand that the loop has at least n iterations remaining before entering into the unrolled section.
To achieve this the compiler may generate an elaborate preamble and postamble to align the loop data for better vectorisation or better instruction scheduling, and to handle the remainder of the iterations which do not divide evenly into the unrolled section of the loop.
It can turn out (worst possible case) that the loop only runs zero or one time, or maybe twice in exceptional circumstances. Then only a small part of the loop would be executed, but many extra tests would be performed to get there. Worse, the alignment preamble might mean that different branch conditions occur in different calls, causing additional branch-misprediction stalls.
These are all meant to cancel out over a large number of iterations, but for short loops this doesn't happen.
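To make the preamble/postamble point concrete, this is roughly the shape of code a compiler emits for a 4x unroll (a sketch, not actual compiler output):

void copy4(float *dst, const float *src, int n) {
    int i = 0;
    if (n >= 4) {                        /* preamble: only enter the unrolled body if it's worth it */
        for (; i <= n - 4; i += 4) {
            dst[i]     = src[i];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
        }
    }
    for (; i < n; i++)                   /* postamble: the remainder iterations */
        dst[i] = src[i];
}

For n < 4 the function does the guard test, skips the unrolled body, and falls through to the scalar loop, which is exactly the "many extra tests for little work" situation described above.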
On top of this, you have the increased code size, where all of these unrolled loops together contribute to reducing icache efficiency.
And some architectures special-case very short loops to use their internal buffers without even referring to the cache.
And modern architectures have fairly extensive instruction reordering, even around memory accesses, which means that the compiler's reordering of the loop might offer no additional benefits even in the best case.
For example, the unrolled loop body may become larger than the instruction cache; reading code from memory is obviously slower than running it from cache.
Say you have a loop with 25 instructions that iterates 1000 times. The extra resources required to handle the 25,000 instructions could very well outweigh the cost of the branching.
It is also important to note that many kinds of looping branches are very painless, as the CPU has gotten quite good at branch prediction for simple situations. For instance, 8 iterations is probably more efficient unrolled, but even 50 is probably better left to the CPU. Note that the compiler is probably better than you are at guessing which is superior.
Unrolling loops should always make the code faster. The trade-off is between faster code and larger code footprint. Tight loops (relatively small amounts of code executed in the body of the loop) which are executed a significant number of times can benefit from unrolling by removing all the loop overhead, and allowing the pipelining to do its thing. Loops which go through many iterations may unroll to a large amount of extra code - faster but maybe unacceptably larger footprint for the performance gain. Loops with a lot going on in the body may not benefit significantly from unrolling - the loop overhead becomes small compared to everything else.
I am having trouble understanding how processor speed determines how long a certain loop runs for.
For a computer with a 3 GHz processor that can do one 64-bit arithmetic operation per cycle, how long will the following loop run?
long long int x;
for (x = 0; x <= 0; x--) {}
The compiler may optimize this loop out entirely because it may detect that no result is ever used.
But if the loop is actually compiled, a guess at an upper bound on speed might be two cycles per iteration. Yes, the processor is probably superscalar, so it can sometimes execute more than one instruction in a cycle, but on the other hand one instruction is a branch, which tends to break the pipeline.
So even at one cycle per iteration it would take about a century to run that loop; at our guess of two cycles, roughly double that.
irb> 2**63/(3*10**9)/60/60/24/7/52 # => 97 years
I'm tempted to say that the loop will never finish, as this is much longer than the MTBF for servers, UPS equipment, and power grids, but perhaps you could run it in a VM and checkpoint it periodically. :-)
Of course, there is that Greek fable on the folly of speculation when empirical evidence is available. Why not run the loop for a small amount of time and then calculate the actual result for 2^63 iterations? The speculation is difficult because few people other than the designers really understand today's complex microarchitectures. There are also many practical problems: does the compiler get to unroll the loop? Perhaps you should just write it in assembly so you can measure something specific?
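If you do go the empirical route, something like this sketch would do (the 10^9-iteration sample is arbitrary; the volatile counter keeps the compiler from deleting the empty loop, at the cost of some store/reload overhead, so treat the result as a rough upper bound):

#include <stdio.h>
#include <time.h>

int main(void) {
    const long long trial = 1000000000LL;     /* sample 1e9 iterations */
    struct timespec t0, t1;
    volatile long long x;                     /* volatile: keeps the empty loop from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (x = 0; x > -trial; x--) {}           /* count down, like the loop in the question */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double full = secs * (9223372036854775808.0 / trial);    /* scale up to 2^63 iterations */
    printf("%.3f s for %lld iterations -> ~%.0f years for 2^63\n",
           secs, trial, full / (365.25 * 24 * 3600));
    return 0;
}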
I am trying to figure out whether my code's inner loop is hitting a hardware design barrier or just a barrier in my own understanding. There's a bit more to it, but the simplest question I can come up with to answer is as follows:
If I have the following code:
float px[32768], py[32768], pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;
int i, j;

initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);

for (i = 0; i < 32768 - 1; i++) {
    xref = px[i];
    yref = py[i];
    zref = pz[i];
    for (j = 0; j < 32768 - 1; j++) {
        deltax = xref - px[j];
        deltay = yref - py[j];
        deltaz = zref - pz[j];
    }
}
What kind of maximum theoretical speedup would I be able to see by going to SSE instructions, in a situation where I have complete control over the code (assembly, intrinsics, whatever) but no control over the runtime environment other than the architecture (i.e. it's a multi-user environment, so I can't do anything about how the OS kernel assigns time to my particular process)?
Right now I'm seeing a speedup of 3x with my code, when I would have thought SSE would give me much more vector depth than the 3x speedup indicates (presumably the 3x speedup tells me I have a 4x maximum theoretical throughput). (I've tried things such as letting deltax/deltay/deltaz be arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only a 3x speedup.) I'm using the Intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics, obviously.
It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.
Most CPUs can do at least one floating-point scalar instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.
I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector, and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separate it after, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.
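Something along these lines with SSE intrinsics (just a sketch: the function name and the unaligned loads are my own choices, and the remainder when n isn't a multiple of 4 is omitted):

#include <xmmintrin.h>

/* broadcast the reference point, then compute four j-deltas per iteration */
void deltas_sse(const float *px, const float *py, const float *pz,
                float xref, float yref, float zref,
                float *dx, float *dy, float *dz, int n)
{
    __m128 vx = _mm_set1_ps(xref);
    __m128 vy = _mm_set1_ps(yref);
    __m128 vz = _mm_set1_ps(zref);
    for (int j = 0; j + 4 <= n; j += 4) {
        _mm_storeu_ps(dx + j, _mm_sub_ps(vx, _mm_loadu_ps(px + j)));
        _mm_storeu_ps(dy + j, _mm_sub_ps(vy, _mm_loadu_ps(py + j)));
        _mm_storeu_ps(dz + j, _mm_sub_ps(vz, _mm_loadu_ps(pz + j)));
    }
}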
Consider: how wide is a float? How wide is an SSE register? The ratio should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipes play havoc with getting good estimates of speedup.
You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.
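A rough sketch of what tiling the inner loop could look like (TILE is a made-up size you'd tune so the j-block of all three arrays fits in L1):

#define N    32768
#define TILE 2048    /* hypothetical: 3 arrays * 2048 floats * 4 bytes = 24 KB, small enough for a 32 KB L1 */

/* for each block of j, sweep all i, so the px/py/pz[j] block stays hot in L1 across the whole i sweep */
void compute_tiled(const float *px, const float *py, const float *pz)
{
    for (int jj = 0; jj < N; jj += TILE) {
        int jend = jj + TILE < N ? jj + TILE : N;
        for (int i = 0; i < N; i++) {
            float xref = px[i], yref = py[i], zref = pz[i];
            for (int j = jj; j < jend; j++) {
                float deltax = xref - px[j];
                float deltay = yref - py[j];
                float deltaz = zref - pz[j];
                (void)deltax; (void)deltay; (void)deltaz;   /* the real code would go on to use these */
            }
        }
    }
}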