Loop unrolling & optimization - c

Given the code:
for (int i = 0; i < n; ++i)
{
    A(i);
    B(i);
    C(i);
}
And the optimized version:
for (int i = 0; i < (n - 2); i += 3)
{
    A(i);
    A(i+1);
    A(i+2);
    B(i);
    B(i+1);
    B(i+2);
    C(i);
    C(i+1);
    C(i+2);
}
Something is not clear to me: which is better? I can't see how anything runs faster in one version than the other. Am I missing something here?
All I see is that each instruction depends on the previous instruction, meaning that
I need to wait for the previous instruction to finish before starting the one after it...
Thanks

In the high-level view of a language, you're not going to see the optimization. The speed enhancement comes from what the compiler does with what you have.
In the first case, it's something like:
LOCATION_FLAG;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
In the second it's something like:
LOCATION_FLAG;
DO_SOMETHING;
DO_SOMETHING;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
You can see in the latter case, the overhead of testing and jumping is only 1 instruction per 3. In the first it's 1 instruction per 1; so it happens a lot more often.
Therefore, if you have invariants you can rely on (an array whose length is a multiple of 3, to use your example) then it is more efficient to unroll loops, because the underlying assembly is written more directly.

Loop unrolling is used to reduce the number of jump & branch instructions which could potentially make the loop faster but will increase the size of the binary. Depending on the implementation and platform, either could be faster.

Well, whether this code is "better" or "worse" totally depends on the implementations of A, B and C, which values of n you expect, which compiler you are using and which hardware you are running on.
Typically the benefit of loop unrolling is that the overhead of doing the loop (that is, incrementing i and comparing it with n) is reduced. In this case, it could be reduced by a factor of 3.
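One caveat worth noting: the unrolled loop in the question stops at i < n - 2, so when n is not a multiple of 3 the last one or two iterations are silently skipped. A minimal sketch of a remainder-safe version, using hypothetical stub bodies for A, B and C:

```c
static int calls; /* counts body executions; A, B, C are stand-in stubs */
static void A(int i) { (void)i; calls++; }
static void B(int i) { (void)i; calls++; }
static void C(int i) { (void)i; calls++; }

/* unrolled by 3, plus an epilogue loop so any n is handled correctly */
static void run_all(int n)
{
    int i = 0;
    for (; i + 2 < n; i += 3) {   /* main body: one test/jump per 3 iterations */
        A(i); A(i + 1); A(i + 2);
        B(i); B(i + 1); B(i + 2);
        C(i); C(i + 1); C(i + 2);
    }
    for (; i < n; i++) {          /* remainder: 0, 1 or 2 iterations */
        A(i); B(i); C(i);
    }
}
```

This is the shape -funroll-loops typically produces for you; writing it by hand mainly matters when you know the trip count's divisibility in advance.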

As long as the functions A(), B() and C() don't modify the same datasets, the second version provides more parallelization options.
In the first version, the three functions could run simultaneously, assuming no interdependencies. In the second version, all three functions could be run with all three datasets at the same time, assuming you had enough execution units to do so and again, no interdependencies.

Generally it's not a good idea to try to "invent" optimizations, unless you have hard evidence that you will gain an improvement, because many times you may end up introducing a degradation. Typically the best way to obtain such evidence is with a good profiler. I would test both versions of this code with a profiler to see the difference.
Also, many times loop unrolling isn't very portable; as mentioned previously, it depends greatly on the platform, compiler, etc.
You can additionally play with the compiler options. An interesting gcc option is "-floop-optimize", which you get automatically with "-O, -O2, -O3, and -Os".
EDIT Additionally, look at the "-funroll-loops" compiler option.

Related

Effect of non-divisible loop sizes at runtime on openMP SIMD

After reading several different articles and not finding an answer I am going to introduce the problem and then ask the question.
I have a section of code that can be reduced down to a series of loops that look like the following.
#pragma omp parallel for simd
for(int i = 0; i < a*b*c; i++)
{
array1[i] += array2[i] * array3[i];
}
Now most examples of SIMD use that I have encountered have a, b and c fixed at compile time, allowing the optimisation to take place. However, my code requires that the values of a, b and c are determined at run time.
Let's say that, for the computer I am using, the register can fit 4 values, and that the value of a*b*c is 127. My understanding of compilation for this is that the compiler will vectorise everything that is wholly divisible by 4, then serialise the rest (please correct this if I am wrong). However, this is when the compiler has full knowledge of the problem. If I were to now allow a run-time choice of a, b and c and came to the value of 127, how would vectorisation proceed? Naively I would assume that the code behind the scenes is intelligent enough to understand this might happen, and has both a serial and a vector code path and calls the most suitable one. However, as this is an assumption, I would appreciate someone more knowledgeable on the subject enlightening me further, as I don't want accidental overflows, or non-processing of data, due to a misunderstanding.
On the off chance this matters, I am using OpenMP 4.0 with gcc as my C compiler, although I am hoping this will not change your answer, as I will always attempt to use the latest OpenMP version and unfortunately may need to routinely change compiler.
Typically, a compiler will unroll beyond the simd length. For optimum results, particularly with gcc, you would specify this unroll factor, e.g. --param max-unroll-times=2 (if you don't expect much longer loops). With a simd length of 4, the loop would then consume 8 iterations at a time, leaving a remainder. gcc would build a remainder loop, somewhat like Duff's device, which might have 15 iterations, and would calculate where to jump in at run time.

The Intel compiler handles a vectorized remainder loop in a different way. Supposing you have 2 simd widths available, the remainder loop would use the shorter width without unrolling, so that the serial part is as short as possible. When compiling for the general case of unaligned data, there is a remainder loop at both ends, with the one at the beginning limited to the length required for alignment of the stored values.

With the combination omp parallel for simd, the situation gets more complicated: normally, the loop chunks must vary in size, and one might argue that the interior chunks should be set up for alignment, with the end chunks smaller (this is not normally done).
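The main-loop-plus-remainder structure described above can be sketched by hand in plain C (the simd width of 4 and the function name fma_arrays are assumptions for illustration; a real compiler generates this shape itself):

```c
/* hand-written sketch of vector-body-plus-remainder code generation */
void fma_arrays(float *a1, const float *a2, const float *a3, int n)
{
    int i;
    int limit = n - (n % 4);            /* largest multiple of the width */
    for (i = 0; i < limit; i += 4) {    /* "vector" body: 4 lanes per step */
        a1[i]     += a2[i]     * a3[i];
        a1[i + 1] += a2[i + 1] * a3[i + 1];
        a1[i + 2] += a2[i + 2] * a3[i + 2];
        a1[i + 3] += a2[i + 3] * a3[i + 3];
    }
    for (; i < n; i++)                  /* scalar remainder: 0..3 elements */
        a1[i] += a2[i] * a3[i];
}
```

Nothing overflows and no element is skipped for any n; the only cost of an "awkward" trip count like 127 is that the last few elements run serially.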

Is using multiple loops same performance as using one that does multiple things?

Is this:
int x=0;
for (int i=0;i<100;i++)
x++;
for (int i=0;i<100;i++)
x--;
for (int i=0;i<100;i++)
x++;
return x;
Same as this:
int x=0;
for (int i=0;i<100;i++){
x++;
x--;
x++;
}
return x;
Note: This is just an example, the real loop would be much more complex.
So are these two loops the same or is the second one faster?
EDIT: Java or C++. I was wondering about the both.
I didn't know that compiler would actually optimize the code.
Unoptimized: three loops take longer, since there are three sets of loop opcodes.
Optimized, it depends on the optimizer. A good optimizer might be smart enough to realize that the x++;x--; statements in the single-loop version cancel each other out, and eliminate them. A really smart optimizer might be able to do the same thing with the separate loops. A ridiculously smart optimizer might figure out what the code is doing, and just replace the whole block with return 100; (see added note below)
But the real-world answer for optimization is usually: fuhgeddaboutit. If your code gets its job done correctly, and fast enough to be useful, leave it alone. Only if actual tests show it's too slow should you profile to identify the bottlenecks and replace them with more efficient code. (Or a better algorithm entirely.)
Programmers are expensive, CPU cycles are cheap, and there are plenty of other tasks with bigger payoffs. And more fun to write, too.
About the "ridiculously smart optimizer" bit: the D language offers Compile-Time Function Evaluation. CTFE allows you to use virtually the full capability of the language to compute something at build time, then insert only the computed answer into the runtime code. In other words, you can explicitly turn the entire compiler into your optimizer for selected chunks of code.
If you count each increment, decrement, assignment and comparison as one operation, your first example has some 900 operations, while your second example has ~500. That is, if the code is executed as is and not optimized. It should be obvious which is more performant.
In reality the code may or may not be optimized by a compiler, and different compilers for different languages will do quite a different job at optimization.
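To make the equivalence concrete, here are both shapes from the question as functions; since they are semantically identical, an optimizer is free to fold either one into a plain constant (the "return 100" mentioned above):

```c
/* the two shapes from the question; an optimizing compiler may reduce
   either one to a constant, since the result never varies */
int three_loops(void)
{
    int x = 0;
    for (int i = 0; i < 100; i++) x++;
    for (int i = 0; i < 100; i++) x--;
    for (int i = 0; i < 100; i++) x++;
    return x;
}

int one_loop(void)
{
    int x = 0;
    for (int i = 0; i < 100; i++) { x++; x--; x++; }
    return x;
}
```

Inspecting the generated assembly (e.g. gcc -O2 -S) is the only reliable way to see which form your compiler actually folds.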

Array access/write performance differences?

This is probably going to be language dependent, but in general, what is the performance difference between accessing and writing to an array?
For example, if I am trying to write a prime sieve and am representing the primes as a boolean array.
Upon finding a prime, I can say
for(int i = 2; n * i < end; i++)
{
prime[n * i] = false;
}
or
for(int i = 2; n * i < end; i++)
{
if(prime[n * i])
{
prime[n * i] = false;
}
}
The intent in the latter case is to check the value before writing it to avoid having to rewrite many values that have already been checked. Is there any realistic gain in performance here, or are access and write mostly equivalent in speed?
Impossible to answer such a generic question without the specifics of the machine/OS this is running on, but in general the latter is going to be slower because:
In the second example you have to get the value from RAM into the L2/L1 cache and read it into a register, make a change to the value and write it back. In the first case you might very well get away with simply writing a value to the L1/L2 caches. It can be written back to RAM from the caches later, while your program is doing something else.
The second form has much more code to execute per iteration. For large enough number of iterations, the difference gets big real fast.
In general this depends much more on the machine than the programming language. The writes often will take a few more clock cycles because, depending on the machine, more cache values need to be updated in memory.
However, your second segment of code will be WAY slower, and it's not just because there's "more code". The big reason is that anytime you use an if-statement on most machines the CPU uses a branch predictor. The CPU literally predicts which way the if-statement will run ahead of time, and if it's wrong it has to backtrack. See http://en.wikipedia.org/wiki/Pipeline_%28computing%29 and http://en.wikipedia.org/wiki/Branch_predictor to understand why.
If you want to do some optimization, I would recommend the following:
Profile! See what's really taking up time.
Multiplication is much harder than addition. Try rewriting the loop so that the index advances by n each iteration (i += n), and use this for your array index.
The loop condition "should" be totally reevaluated at every iteration unless the compiler optimizes it away. So try avoiding multiplication in there.
Use -O2 or -O3 as a compiler option
You might find that some values of n are faster than others because of cache locality. You might think of some clever ways to rewrite your code to take advantage of this.
Disassemble the code and look at what it's actually doing on your processor
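As a sketch of the strength-reduction suggestion above (the multiply n * i replaced by an index that advances by n each iteration; the function name is hypothetical):

```c
/* sieve inner loop with the multiply strength-reduced away: the index j
   walks 2n, 3n, 4n, ... by repeated addition instead of computing n * i */
void mark_multiples(char *prime, int n, int end)
{
    for (int j = 2 * n; j < end; j += n)
        prime[j] = 0;   /* mark composite */
}
```

With -O2 most compilers perform this transformation themselves, but writing it explicitly makes the loop condition cheap regardless of optimization level.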
It's a hard question and it heavily depends on your hardware, OS and compiler. But for the sake of theory, you should consider two things: branching and memory access. As branching is generally evil, you want to avoid it. I wouldn't even be surprised if some compiler optimization took place and your second snippet were reduced to the first one (compilers love avoiding branches; they probably consider it a hobby, but they have a reason). So in these terms the first example is much cleaner and easier to deal with.
There are also CPU caches and other memory-related issues. I believe that in both examples you have to actually load the memory into the CPU cache before you can either read or update it. While reading is not a problem, writes have to propagate the changes up. I wouldn't be worried if you use the function in a single thread (as #gby pointed out, the OS can push the changes a little bit later).
There is only one scenario I can come up with that would make me consider the solution from your second example: if I shared the table between threads to work on it in parallel (without locking) and had separate caches for different CPUs. Then, every time you amend a cache line from one thread, the other thread has to update its copy before reading or writing to the same memory block. This is known as cache coherence, and it actually may hurt your performance badly; in such a case I could consider conditional writes. But wait, it's probably far away from your question...
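A minimal sketch of that conditional-write idea (the flag table and function name are hypothetical): the read lets you skip the store, and therefore the cache-line invalidation seen by other cores, when nothing would change:

```c
/* conditional write: only dirty the cache line when the value actually
   changes; in a shared-table scenario this avoids needless invalidations */
void clear_flag(char *flags, int idx)
{
    if (flags[idx])         /* read first ... */
        flags[idx] = 0;     /* ... store only if the flag was still set */
}
```

In single-threaded code the unconditional store is usually the better choice, as the answers above explain.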

Use two loop bodies or one (result identical)?

I have long wondered what is more efficient with regards to making better use of CPU caches (which are known to benefit from locality of reference) - two loops each iterating over the same mathematical set of numbers, each with a different body statement (e.g. a call to a function for each element of the set), or having one loop with a body that does the equivalent of two (or more) body statements. We assume identical application state after all the looping.
In my opinion, having two loops would introduce fewer cache misses and evictions because more instructions and data used by the loop fit in the cache. Am I right?
Assuming:
Cost of a f and g call is negligible compared to cost of the loop
f and g use most of the cache each by itself, and so the cache would be spilled when one is called after another (the case with a single-loop version)
Intel Core Duo CPU
C language source code
The GCC compiler, "no extra switches"
I want answers outside the "premature optimization is evil" character, if possible.
An example of the two-loops version that I am advocating for:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
j += f(i);
}
for(int i = 0; i < 1000000; i++)
{
k += g(i);
}
To measure is to know.
I can see three variables (even in a seemingly simple chunk of code):
What do f() and g() do? Can one of them invalidate all of the instruction cache lines (effectively pushing the other one out)? Can that happen in L2 instruction cache too (unlikely)? Then keeping only one of them in it might be beneficial. Note: The inverse does not imply "have a single loop", because:
Do f() and g() operate on large amounts of data, according to i? Then, it'd be nice to know if they operate on the same set of data - again you have to consider whether operating on two different sets screws you up via cache misses.
If f() and g() are indeed as primitive as you first state (and I'm assuming that both in code size and in running time and code complexity), cache locality issues won't arise in little chunks of code like this: your biggest concern would be if some other process were scheduled with actual work to do, and invalidated all the caches until it was your process' turn to run.
A final thought: given that such processes like above might be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both your functions inline, and let the compiler unroll the loop. That is because for the instruction cache, faulting back to L2 is no big deal, and the probability that the single cache line that'd contain i, j, k would be invalidated in that loop doesn't look so horrible. However, if that's not the case, some more details would be useful.
Intuitively one loop is better: you increment i a million fewer times and all the other operation counts remain the same.
On the other hand it completely depends on f and g. If both are sufficiently large that each of their code or cacheable data that they use nearly fills a critical cache then swapping between f and g may completely swamp any single loop benefit.
As you say: it depends.
Your question is not clear enough to give a remotely accurate answer, but I think I understand where you are headed. The data you are iterating over is large enough that before you reach the end you will start to evict data so that the second time (second loop) you iterate over it some if not all will have to be read again.
If the two loops were joined so that each element/block is fetched for the first operation and then is already in cache for the second operation, then no matter how large the data is relative to the cache most if not all of the second operations will take their data from the cache.
Various things, such as the nature of the cache, or the loop code itself being evicted by data and then fetched again (evicting data in turn), may cause some misses on the second operation. On a PC with an operating system, lots of evictions will occur as other programs get time slices. But assuming an ideal world, the first operation on index i of the data will fetch it from memory, and the second operation will grab it from cache.
Tuning for a cache is difficult at best. I regularly demonstrate that even with an embedded system, no interrupts, a single task, and the same source code, execution time/performance can vary dramatically simply by changing compiler optimization options, changing compilers, or changing brands or versions of a compiler, gcc 2.x vs 3.x vs 4.x (gcc is not necessarily producing faster code with newer versions, btw, and a compiler that is pretty good at a lot of targets is not really good at any one particular target). The same code with different compilers or options can change execution time by several times: 3 times faster, 10 times faster, etc.

Once you get into testing with or without a cache, it gets even more interesting. Add a single nop in your startup code so that your whole program moves over one instruction in memory, and your cache lines now hit in different places. Same compiler, same code. Repeat this with two nops, three nops, etc. Same compiler, same code, and you can see differences of tens of percent, better and worse (for the tests I ran that day on that target with that compiler).

That doesn't mean you can't tune for a cache; it just means that trying to figure out whether your tuning is helping or hurting can be difficult. The normal answer is just "time it and see", but that doesn't work anymore: you might get great results on your computer that day with that program with that compiler, but tomorrow on your computer, or any day on someone else's computer, you may be making things slower, not faster. You need to understand why this or that change made it faster; maybe it had nothing to do with your code, and your email program may have been downloading a lot of mail in the background during one test and not during the other.
Assuming I understood your question correctly I think the single loop is probably faster in general.
Breaking the loops into smaller chunks is a good idea. It could improve the cache-hit ratio quite a lot and can make a lot of difference to the performance...
From your example:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
j += f(i);
}
for(int i = 0; i < 1000000; i++)
{
k += g(i);
}
I would either fuse the two loops into one loop like this:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
j += f(i);
k += g(i);
}
Or, if this is not possible, do the optimization called loop tiling:
#define TILE_SIZE 1000 /* or whatever you like - pick a number that keeps */
                       /* the working-set below your first-level cache size */
int j = 0, k = 0;
int i = 0;
int elements = 1000000;
do {
    int n = i + TILE_SIZE;
    if (n > elements) n = elements;
    // perform loop A on the current tile
    for (int a = i; a < n; a++)
    {
        j += f(a);
    }
    // perform loop B on the same tile, while its data is still cached
    for (int a = i; a < n; a++)
    {
        k += g(a);
    }
    i = n;
} while (i < elements);
The trick with loop tiling is that, if the loops share an access pattern, the second loop body has the chance to re-use the data that has already been read into the cache by the first loop body. This won't happen if you execute loop A a million times, because the cache is not large enough to hold all this data.
Breaking the loop into smaller chunks and executing them one after another will help a lot here. The trick is to limit the working-set of memory below the size of your first-level cache. I aim for half the size of the cache, so other threads that get executed in between don't mess up my cache so much.
If I came across the two-loop version in code, with no explanatory comments, I would wonder why the programmer did it that way, and probably consider the technique to be of dubious quality, whereas a one-loop version would not be surprising, commented or not.
But if I came across the two-loop version along with a comment like "I'm using two loops because it runs X% faster in the cache on CPU Y", at least I'd no longer be puzzled by the code, although I'd still question if it was true and applicable to other machines.
This seems like something the compiler could optimize for you, so instead of trying to figure it out yourself and making it fast, use whatever method makes your code more clear and readable. If you really must know, time both methods for the input size and calculation type that your application uses (try the code you have now, but repeat your calculations many, many times and disable optimization).

What techniques to avoid conditional branching do you know?

Sometimes a loop where the CPU spends most of the time has a branch prediction miss (misprediction) very often (near .5 probability). I've seen a few techniques in very isolated threads but never a list. The ones I know of already fix situations where the condition can be turned into a bool, and that 0/1 is then used in some way to make the change. Are there other conditional branches that can be avoided?
e.g. (pseudocode)
loop () {
if (in[i] < C )
out[o++] = in[i++]
...
}
Can be rewritten, arguably losing some readability, with something like this:
loop() {
out[o] = in[i] // copy anyway, just don't increment
inc = in[i] < C // increment counters? (0 or 1)
o += inc
i += inc
}
Also, I've seen techniques in the wild changing && to & in the conditional, in certain contexts escaping my mind right now. I'm a rookie at this level of optimization, but it sure feels like there's got to be more.
Using Matt Joiner's example:
if (b > a) b = a;
You could also do the following, without having to dig into assembly code:
bool if_else = b > a;
b = a * if_else + b * !if_else;
I believe the most common way to avoid branching is to leverage bit parallelism in reducing the total jumps present in your code. The longer the basic blocks, the less often the pipeline is flushed.
As someone else has mentioned, if you want to do more than unrolling loops, and providing branch hints, you're going to want to drop into assembly. Of course this should be done with utmost caution: your typical compiler can write better assembly in most cases than a human. Your best hope is to shave off rough edges, and make assumptions that the compiler cannot deduce.
Here's an example of the following C code:
if (b > a) b = a;
In assembly without any jumps, by using bit-manipulation (and extreme commenting):
sub eax, ebx ; = a - b (assuming a is in eax, b in ebx)
sbb edx, edx ; = (b > a) ? 0xFFFFFFFF : 0
and edx, eax ; = (b > a) ? a - b : 0
add ebx, edx ; b = (b > a) ? b + (a - b) : b + 0
Note that while conditional moves are immediately jumped on by assembly enthusiasts, that's only because they're easily understood and provide a higher level language concept in a convenient single instruction. They are not necessarily faster, not available on older processors, and by mapping your C code into corresponding conditional move instructions you're just doing the work of the compiler.
The generalization of the example you give is "replace conditional evaluation with math"; conditional-branch avoidance largely boils down to that.
What's going on with replacing && with & is that, since && is short-circuit, it constitutes conditional evaluation in and of itself. & gets you the same logical results if both sides are either 0 or 1, and isn't short-circuit. Same applies to || and | except you don't need to make sure the sides are constrained to 0 or 1 (again, for logic purposes only, i.e. you're using the result only Booleanly).
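A small illustration, assuming both operands are comparisons (which in C always yield exactly 0 or 1):

```c
/* && short-circuits, which the compiler may implement as a branch;
   & evaluates both sides unconditionally and gives the same logical
   result when each operand is exactly 0 or 1 */
int in_range_branchy(int x, int lo, int hi)    { return (x >= lo) && (x <= hi); }
int in_range_branchless(int x, int lo, int hi) { return (x >= lo) &  (x <= hi); }
```

The trade-off: & always evaluates both sides, so it only pays off when the right-hand side is cheap and side-effect-free.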
At this level things are very hardware-dependent and compiler-dependent. Is the compiler you're using smart enough to compile < without control flow? gcc on x86 is smart enough; lcc is not. On older or embedded instruction sets it may not be possible to compute < without control flow.
Beyond this Cassandra-like warning, it's hard to make any helpful general statements. So here are some general statements that may be unhelpful:
Modern branch-prediction hardware is terrifyingly good. If you could find a real program where bad branch prediction costs more than 1%-2% slowdown, I'd be very surprised.
Performance counters or other tools that tell you where to find branch mispredictions are indispensable.
If you actually need to improve such code, I'd look into trace scheduling and loop unrolling:
Loop unrolling replicates loop bodies and gives your optimizer more control flow to work with.
Trace scheduling identifies which paths are most likely to be taken, and among other tricks, it can tweak the branch directions so that the branch-prediction hardware works better on the most common paths. With unrolled loops, there are more and longer paths, so the trace scheduler has more to work with.
I'd be leery of trying to code this myself in assembly. When the next chip comes out with new branch-prediction hardware, chances are excellent that all your hard work goes down the drain. Instead I'd look for a feedback-directed optimizing compiler.
An extension of the technique demonstrated in the original question applies when you have to do several nested tests to get an answer. You can build a small bitmask from the results of all the tests, and then "look up" the answer in a table.
if (a) {
if (b) {
result = q;
} else {
result = r;
}
} else {
if (b) {
result = s;
} else {
result = t;
}
}
If a and b are nearly random (e.g., from arbitrary data), and this is in a tight loop, then branch prediction failures can really slow this down. Can be written as:
// assuming a and b are bools and thus exactly 0 or 1 ...
static const int table[] = { t, s, r, q };
unsigned index = (a << 1) | b;
result = table[index];
You can generalize this to several conditionals. I've seen it done for 4. If the nesting gets that deep, though, you want to make sure that testing all of them is really faster than doing just the minimal tests suggested by short-circuit evaluation.
GCC is already smart enough to replace conditionals with simpler instructions. For example, newer Intel processors provide cmov (conditional move). If you can use SSE2, it provides some instructions to compare 4 integers (or 8 shorts, or 16 chars) at a time.
Additionally, to compute the minimum you can use (see these magic tricks):
min(x, y) = x+(((y-x)>>(WORDBITS-1))&(y-x))
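The formula as a C function. Two caveats: it relies on right-shifting a negative signed value being an arithmetic shift, which is implementation-defined in C (though true on common compilers), and y - x must not overflow:

```c
#include <limits.h>

#define WORDBITS (sizeof(int) * CHAR_BIT)

/* branchless minimum via the sign bit of (y - x):
   if y < x, the shift yields all ones and the result is x + (y - x) = y;
   otherwise the shift yields 0 and the result is x */
static int branchless_min(int x, int y)
{
    return x + (((y - x) >> (WORDBITS - 1)) & (y - x));
}
```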
However, pay attention to things like:
c[i][j] = min(c[i][j], c[i][k] + c[j][k]); // from Floyd-Warshal algorithm
which, even though no jumps are implied, is much slower than
int tmp = c[i][k] + c[j][k];
if (tmp < c[i][j])
c[i][j] = tmp;
My best guess is that in the first snippet you pollute the cache more often, while in the second you don't.
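A sketch of the SSE2 route mentioned above, using intrinsics (this assumes an x86/x86-64 target; SSE2 has no blend instruction, so the per-lane select is built from the compare mask):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* branchless element-wise minimum of 4 ints: one compare produces a
   per-lane mask, which then selects each lane from a or b */
static void min4(const int *a, const int *b, int *out)
{
    __m128i va   = _mm_loadu_si128((const __m128i *)a);
    __m128i vb   = _mm_loadu_si128((const __m128i *)b);
    __m128i gt   = _mm_cmpgt_epi32(va, vb);          /* all-ones where a > b */
    __m128i vmin = _mm_or_si128(_mm_and_si128(gt, vb),
                                _mm_andnot_si128(gt, va));
    _mm_storeu_si128((__m128i *)out, vmin);
}
```

On SSE4.1 and later the and/andnot/or triple collapses to a single blend, and compilers with auto-vectorization enabled often produce this pattern from the plain if version by themselves.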
In my opinion if you're reaching down to this level of optimization, it's probably time to drop right into assembly language.
Essentially you're counting on the compiler generating a specific pattern of assembly to take advantage of this optimization in C anyway. It's difficult to guess exactly what code a compiler is going to generate, so you'd have to look at it anytime a small change is made - why not just do it in assembly and be done with it?
Most processors provide branch prediction that is better than 50%. In fact, if you get a 1% improvement in branch prediction then you can probably publish a paper. There are a mountain of papers on this topic if you are interested.
You're better off worrying about cache hits and misses.
This level of optimization is unlikely to make a worthwhile difference in all but the hottest of hotspots. Assuming it does (without proving it in a specific case) is a form of guessing, and the first rule of optimization is don't act on guesses.
