What techniques to avoid conditional branching do you know? - c

Sometimes a loop where the CPU spends most of the time has some branch prediction miss (misprediction) very often (near .5 probability.) I've seen a few techniques on very isolated threads but never a list. The ones I know already fix situations where the condition can be turned to a bool and that 0/1 is used in some way to change. Are there other conditional branches that can be avoided?
e.g. (pseudocode)
loop () {
if (in[i] < C )
out[o++] = in[i++]
...
}
Can be rewritten, arguably losing some readability, with something like this:
loop() {
out[o] = in[i] // copy anyway, just don't increment
inc = in[i] < C // increment counters? (0 or 1)
o += inc
i += inc
}
Also I've seen techniques in the wild changing && to & in the conditional in certain contexts escaping my mind right now. I'm a rookie at this level of optimization but it sure feels like there's got to be more.

Using Matt Joiner's example:
if (b > a) b = a;
You could also do the following, without having to dig into assembly code:
bool if_else = b > a;
b = a * if_else + b * !if_else;

I believe the most common way to avoid branching is to leverage bit parallelism in reducing the total jumps present in your code. The longer the basic blocks, the less often the pipeline is flushed.
As someone else has mentioned, if you want to do more than unrolling loops, and providing branch hints, you're going to want to drop into assembly. Of course this should be done with utmost caution: your typical compiler can write better assembly in most cases than a human. Your best hope is to shave off rough edges, and make assumptions that the compiler cannot deduce.
Here's an example of the following C code:
if (b > a) b = a;
In assembly without any jumps, by using bit-manipulation (and extreme commenting):
sub eax, ebx ; = a - b
sbb edx, edx ; = (b > a) ? 0xFFFFFFFF : 0
and edx, eax ; = (b > a) ? a - b : 0
add ebx, edx ; b = (b > a) ? b + (a - b) : b + 0
Note that while conditional moves are immediately jumped on by assembly enthusiasts, that's only because they're easily understood and provide a higher level language concept in a convenient single instruction. They are not necessarily faster, not available on older processors, and by mapping your C code into corresponding conditional move instructions you're just doing the work of the compiler.

The generalization of the example you give is "replace conditional evaluation with math"; conditional-branch avoidance largely boils down to that.
What's going on with replacing && with & is that, since && is short-circuit, it constitutes conditional evaluation in and of itself. & gets you the same logical results if both sides are either 0 or 1, and isn't short-circuit. Same applies to || and | except you don't need to make sure the sides are constrained to 0 or 1 (again, for logic purposes only, i.e. you're using the result only Booleanly).

At this level things are very hardware-dependent and compiler-dependent. Is the compiler you're using smart enough to compile < without control flow? gcc on x86 is smart enough; lcc is not. On older or embedded instruction sets it may not be possible to compute < without control flow.
Beyond this Cassandra-like warning, it's hard to make any helpful general statements. So here are some general statements that may be unhelpful:
Modern branch-prediction hardware is terrifyingly good. If you could find a real program where bad branch prediction costs more than 1%-2% slowdown, I'd be very surprised.
Performance counters or other tools that tell you where to find branch mispredictions are indispensible.
If you actually need to improve such code, I'd look into trace scheduling and loop unrolling:
Loop unrolling replicates loop bodies and gives your optimizer more control flow to work with.
Trace scheduling identifies which paths are most likely to be taken, and among other tricks, it can tweak the branch directions so that the branch-prediction hardware works better on the most common paths. With unrolled loops, there are more and longer paths, so the trace scheduler has more to work with
I'd be leery of trying to code this myself in assembly. When the next chip comes out with new branch-prediction hardware, chances are excellent that all your hard work goes down the drain. Instead I'd look for a feedback-directed optimizing compiler.

An extension of the technique demonstrated in the original question applies when you have to do several nested tests to get an answer. You can build a small bitmask from the results of all the tests, and the "look up" the answer in a table.
if (a) {
if (b) {
result = q;
} else {
result = r;
}
} else {
if (b) {
result = s;
} else {
result = t;
}
}
If a and b are nearly random (e.g., from arbitrary data), and this is in a tight loop, then branch prediction failures can really slow this down. Can be written as:
// assuming a and b are bools and thus exactly 0 or 1 ...
static const table[] = { t, s, r, q };
unsigned index = (a << 1) | b;
result = table[index];
You can generalize this to several conditionals. I've seen it done for 4. If the nesting gets that deep, though, you want to make sure that testing all of them is really faster than doing just the minimal tests suggested by short-circuit evaluation.

GCC is already smart enough to replace conditionals with simpler instructions. For example newer Intel processors provide cmov (conditional move). If you can use it, SSE2 provides some instructions to compare 4 integers (or 8 shorts, or 16 chars) at a time.
Additionaly to compute minimum you can use (see these magic tricks):
min(x, y) = x+(((y-x)>>(WORDBITS-1))&(y-x))
However, pay attention to things like:
c[i][j] = min(c[i][j], c[i][k] + c[j][k]); // from Floyd-Warshal algorithm
even no jumps are implied is much slower than
int tmp = c[i][k] + c[j][k];
if (tmp < c[i][j])
c[i][j] = tmp;
My best guess is that in the first snippet you pollute the cache more often, while in the second you don't.

In my opinion if you're reaching down to this level of optimization, it's probably time to drop right into assembly language.
Essentially you're counting on the compiler generating a specific pattern of assembly to take advantage of this optimization in C anyway. It's difficult to guess exactly what code a compiler is going to generate, so you'd have to look at it anytime a small change is made - why not just do it in assembly and be done with it?

Most processors provide branch prediction that is better than 50%. In fact, if you get a 1% improvement in branch prediction then you can probably publish a paper. There are a mountain of papers on this topic if you are interested.
You're better off worrying about cache hits and misses.

This level of optimization is unlikely to make a worthwhile difference in all but the hottest of hotspots. Assuming it does (without proving it in a specific case) is a form of guessing, and the first rule of optimization is don't act on guesses.

Related

Influencing branchiness when branch behaviour is known

Before I begin, yes, I'm aware of the compiler built-ins __builtin_expect and __builtin_unpredictable (Clang). They do solve the issue to some extent, but my question is about something neither completely solves.
As a very simple example, suppose we have the following code.
void highly_contrived_example(unsigned int * numbers, unsigned int count) {
unsigned int * const end = numbers + count;
for (unsigned int * iterator = numbers; iterator != end; ++ iterator)
foo(* iterator % 2 == 0 ? 420 : 69);
}
Nothing complicated at all. Just calls foo() with 420 whenever the current number is even, and with 69 when it isn't.
Suppose, however, that it is known ahead of time that the data is guaranteed to look a certain way. For example, if it were always random, then a conditional select (csel (ARM), cmov (x86), etc) possibly would be better than a branch.⁰ If it were always in highly predictable patterns (e.g. always a lengthy stream of evens/odds before a lengthy stream of the other, and so on), then a branch would be better.⁰ __builtin_expect would not really solve the issue if the number of evens/odds were about equal, and I'm not sure whether the absence of __builtin_unpredictable would influence branchiness (plus, it's Clang-only).
My current "solution" is to lie to the compiler and use __builtin_expect with a high probability of whichever side, to influence the compiler to generate a branch in the predictable case (for simple cases like this, all it seems to do is change the ordering of the comparison to suit the expected probability), and __builtin_unpredictable to influence it to not generate a branch, if possible, in the unpredictable case.¹ Either that or inline assembly. That's always fun to use.
⁰ Although I have not actually done any benchmarks, I'm aware that even using a branch may not necessarily be faster than a conditional select for the given example. The example is only for illustrative purposes, and may not actually exhibit the problem described.
¹ Modern compilers are smart. More often than not, they can determine reasonably well which approach to actually use. My question is for the niche cases in which they cannot reasonably figure that out, and in which the performance difference actually matters.

elegant (and fast!) way to rearrange columns and rows in an ADC buffer

Abstract:
I am looking for an elegant and fast way to "rearrange" the values in my ADC Buffer for further processing.
Introduction:
on an ARM Cortex M4 Processor I am using 3 ADCs to sample analog values, with DMA and "Double Buffer Technique". When I get a "half buffer complete Interrupt" the data in the 1D array are arranged like this:
Ch1S1, Ch2S1, Ch3S1, Ch1S2, Ch2S2, Ch3S2, Ch1S3 ..... Ch1Sn-1, Ch2Sn-1, Ch3Sn-1, Ch1Sn, Ch2Sn, Ch3Sn
Where Sn stands for Sample# and CHn for Channel Number.
As I do 2x Oversampling n equals 16, the channel count is 9 in reality, in the example above it is 3
Or written in an 2D-form
Ch1S1, Ch2S1, Ch3S1,
Ch1S2, Ch2S2, Ch3S2,
Ch1S3 ...
Ch1Sn-1, Ch2Sn-1, Ch3Sn-1,
Ch1Sn, Ch2Sn, Ch3Sn
Where the rows represent the n samples and the colums represent the channels ...
I am using CMSIS-DSP to calculate all the vector stuff, like shifting, scaling, multiplication, once I have "sorted out" the channels. This part is pretty fast.
Issue:
But the code I am using for "reshaping" the 1-D Buffer array to an accumulated value for each channel is pretty poor and slow:
for(i = 0; i < ADC_BUFFER_SZ; i++) {
for(j = 0; j < MEAS_ADC_CHANNELS; j++) {
if(i) *(ADC_acc + j) += *(ADC_DMABuffer + bP); // sum up all elements
else *(ADC_acc + j) = *(ADC_DMABuffer + bP); // initialize new on first run
bP++;
}
}
After this procedure I get a 1D array with one (accumulated) U32 value per Channel, but this code is pretty slow: ~4000 Clock cycles for 16 Samples per channel / 9 Channels or ~27 Clock cycles per sample. In order to archive higher Sample rates, this needs to be many times faster, than it is right now.
Question(s):
What I am looking for is: some elegant way, using the CMSIS-DPS functions to archive the same result as above, but much faster. My gut says that I am thinking in the wrong direction, that there must be a solution within the CMSIS-DSP lib, as I am most probably not the first guy who stumbles upon this topic and I most probably won't be the last. So I'm asking for a little push in the right direction, I as guess this could be a severe case of "work-blindness" ...
I was thinking about using the dot-product function "arm_dot_prod_q31" together with an array filled with ones for the accumulation task, because I could not find the CMSIS function which would simply sum up an 1D array? But this would not solve the "reshaping" issue, I still had to copy data around and create new buffers to prepare the vectors for the "arm_dot_prod_q31" call ...
Besides that it feels somehow awkward using a dot-product, where I just want to sum up array elements …
I also thought about transforming the ADC Buffer into a 16 x 9 or 9 x 16 Matrix, but then I could not find anything where I could easily (=fast & elegant) access rows or columns, which would leave me with another issue to solve, which would eventually require to create new buffers and copying data around, as I am missing a function where I could multiply a matrix with a vector ...
Maybe someone has a hint for me, that points me in the right direction?
Thanks a lot and cheers!
ARM is a risk device, so 27 cycles is roughly equal to 27 instructions, IIRC. You may find that you're going to need a higher clock rate to meet your timing requirements. What OS are you running? Do you have access to the cache controller? You may need to lock data buffers into the cache to get high enough performance. Also, keep your sums and raw data physically close in memory as your system will allow.
I am not convinced your perf issue is entirely the consequence of how you are stepping through your data array, but here's a more streamlined approach than what you are using:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS];
for (int idxRaw = 0, int idxSum = 0; idxRaw < ADC_BUFFER_SZ; idxRaw++)
{
sums[idxSum++] += raw[idxRaw];
if (idxSum == MEAS_ADC_CHANNELS) idxSum = 0;
}
Note that I have not tested the above code, nor even tried to compile it. The algorithm is simple enough, you should be able to get working quickly.
Writing pointer math in your code, will not make it any faster. The compiler will convert array notation to efficient pointer math for you. You definitely don't need two loops.
That said, I often use a pointer for iteration:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS];
int *itRaw = raw;
int *itRawEnd = raw + ADC_BUFFER_SZ;
int *itSums = sums;
int *itSumsEnd = itSums + MEAS_ADC_CHANNELS;
while(itRaw != itEnd)
{
*itSums += *itRaw;
itRaw++;
itSums++;
if (itSums == itSumsEnd) itSums = sums;
}
But almost never, when I am working with a mathematician or scientist, which is often the case with measurement/metrological device development. It's easier to explain the array notation to non-C reviewers, than the iterator form.
Also, if I have an algorithm description that uses the phrase "for each...", I tend to prefer the for loop form, but when the description uses "while ...", then of course I will probably use the while... form, unless I can skip one or more variable assignment statements by rearranging it to a do..while. But I often stick as close as possible to the original description until after I've passed all the testing criteria, then do rearrangement of loops for code hygiene purposes. It's easier to argue with a domain expert that their math is wrong, when you can easily convince them that you implemented what they described.
Always get it right first, then measure and make the determination whether to further hone the code. Decades ago, some C compilers for embedded systems could do a better job of optimizing one kind of loop than another. We used to have to keep a warry eye on the machine code they generated, and often developed habits that avoided those worst case scenarios. That is uncommon today, and almost certainly not the case for you ARM tool chain. But you may have to look into how your compilers optimization features work and try something different.
Do try to avoid doing value math on the same line as your pointer math. It's just confusing:
*(p1 + offset1) += *(p2 + offset2); // Can and should be avoided.
*(p1++) = *(p2++); // reasonable, especially for experienced coders/reviewers.
p1[offset1] += p2[offset2]; // Okay. Doesn't mix math notation with pointer notation.
p1[offset1 + A*B/C] += p2...; // Very bad.
// But...
int offset1 += A*B/C; // Especially helpful when stepping in the debugger.
p1[offset1]... ; // Much better.
Hence the iterator form mentioned earlier. It may reduce the lines of code, but does not reduce the complexity and definitely does increase the odds of introducing a bug at some point.
A purist could argue that p1[x] is in fact pointer notation in C, but array notation has almost, if not completely universal binding rules across languages. Intentions are obvious, even to non programmers. While the examples above are pretty trivial and most C programmers would have no problems reading any of them, it's when the number of variables involved and the complexity of the math increases, that mixing your value math with pointer math quickly becomes problematic. You'll almost never do it for anything non-trivial, so for consistency's sake, just get in the habit of avoiding it all-together.

Faster way to copy C array with calculation between

I want to copy an C array data to another, but with a calculation between (i.e. not just copying the same content from one to another, but having a modification in the data):
int aaa;
int src[ARRAY_SIZE];
int dest[ARRAY_SIZE];
//fill src with data
for (aaa = 0; aaa < ARRAY_SIZE; aaa++)
{
dest[aaa] = src[aaa] * 30;
}
This is done in buffers of size 520 or higher, so the for loop is considerable.
Is there any way to improve performance here in what comes to coding?
I did some research on the topic, but I couldn't find anything specific about this case, only about simple copy buffer to buffer (examples: here, here and here).
Environment: GCC for ARM using Embedded Linux. The specific code above, though, is used inside a C project running inside a dedicated processor for DSP calculations. The general processor is an OMAP L138 (the DSP processor is included in the L138).
You could try techniques such as loop-unrolling or duff's device, but if you switch on compiler optimisation it will probably do that for you in any case if it is advantageous without making your code unreadable.
The advantage of relying on compiler optimisation is that it is architecture specific; a source-level technique that works on one target may not work so well on another, but compiler generated optimisations will be specific to the target. For example there is no way to code specifically for SIMD instructions in C, but the compiler may generate code to take advantage of them, and to do that, it is best to keep the code simple and direct so that the compiler can spot the idiom. Writing weird code to "hand optimise" can defeat the optimizer and stop it doing its job.
Another possibility that may be advantageous on some targets (if you are only ever coding for desktop x86 targets, this may be irrelevant), is to avoid the multiply instruction by using shifts:
Given that x * 30 is equivalent to x * 32 - x * 2, the expression in the loop can be replaced with:
input[aaa] = (output[aaa] << 5) - (output[aaa] << 1) ;
But again the optimizer may well do that for you; it will also avoid the repeated evaluation of output[aaa], but if that were not the case, the following may be beneficial:
int i = output[aaa] ;
input[aaa] = (i << 5) - (i << 1) ;
The shifting technique is likely to be more advantageous for division operations which are far more expensive on most targets, and it is only applicable to constants.
These techniques are likely to improve the performance of unoptimized code, but compiler optimisations will likely do far better, and the original code may optimise better than "hand optimised" code.
In the end if it is important, you have to experiment and perform timing tests or profiling.

what is the fastest algorithm for permutations of three condition?

Can anybody help me regarding quickest method for evaluating three conditions in minimum steps?
I have three conditions and if any of the two comes out to be true,then whole expression becomes true else false.
I have tried two methods:
if ((condition1 && condition2) ||
(condition1 && condition3) ||
(condition2 && condition3))
Another way is to by introducing variable i and
i = 0;
if (condition1) i++;
if (condition2) i++;
if (condition3) i++;
if (i >= 2)
//do something
I want any other effective method better than the above two.
I am working in a memory constrained environment (Atmeta8 with 8 KB of flash memory) and need a solution that works in C.
This can be reduced to:
if((condition1 && (condition2 || condition3)) || (condition2 && condition3))
//do something
Depending on the likelihood of each condition, you may be able to optimize the ordering to get faster short-circuits (although this would probably be premature optimization...)
It is always hard to give a just "better" solution (better in what regard -- lines of code, readability, execution speed, number of bytes of machine code instructions, ...?) but since you are asking about execution speed in this case, we can focus on that.
You can introduce that variable you suggest, and use it to reduce the conditions to a simple less-than condition once the answer is known. Less-than conditions trivially translate to two machine code instructions on most architectures (for example, CMP (compare) followed by JL (jump if less than) or JNL (jump if not less than) on Intel IA-32). With a little luck, the compiler will notice (or you can do it yourself, but I prefer the clarity that comes with having the same pattern everywhere) that trues < 2 will always be true in the first two if() statements, and optimize it out.
int trues = 0;
if (trues < 2 && condition1) trues++;
if (trues < 2 && condition2) trues++;
if (trues < 2 && condition3) trues++;
// ...
if (trues >= 2)
{
// do something
}
This, once an answer is known, reduces the possibly complex evaluation of conditionN to a simple less-than comparison, because of the boolean short-circuiting behavior of most languages.
Another possible variant, if your language allows you to cast a boolean condition to an integer, is to take advantage of that to reduce the number of source code lines. You will still be evaluating each condition, however.
if( (int)(condition1)
+ (int)(condition2)
+ (int)(condition3)
>= 2)
{
// do something
}
This works based on the assumption that casting a boolean FALSE value to an integer results in 0, and casting TRUE results in 1. You can also use the conditional operator for the same effect, although be aware that it may introduce additional branching.
if( ((condition1) ? 1 : 0)
+ ((condition2) ? 1 : 0)
+ ((condition3) ? 1 : 0)
>= 2)
{
// do something
}
Depending on how smart the compiler's optimzer is, it may be able to determine that once any two conditions have evaluated to true the entire condition will always evaluate to true, and optimize based on that.
Note that unless you have actually profiled your code and determined this to be the culprit, this is likely a case of premature optimization. Always strive for code to be readable by human programmers first, and fast to execute by the computer second, unless you can show definitive proof that the particular piece of code you are looking at is an actual performance bottleneck. Learn how that profiler works and put it to good use. Keep in mind that in most cases, programmer time is an awful lot more expensive than CPU time, and clever techniques take longer for the maintenance programmer to parse.
Also, compilers are really clever pieces of software; sometimes they will actually detect the intent of the code written and be able to use specific constructs meant to make those operations faster, but that relies on it being able to determine what you are trying to do. A perfect example of this is swapping two variables using an intermediary variable, which on IA-32 can be done using XCHG eliminating the intermediary variable, but the compiler has to be able to determine that you are actually doing that and not something clever which may give another result in some cases.
Since the vast majority of the non-explicitly-throwaway software written spends the vast majority of its lifetime in maintenance mode (and lots of throwaway software written is alive and well long past its intended best before date), it makes sense to optimize for maintainability unless that comes at an unacceptable cost in other respects. Of course, if you are evaluating those conditions a trillion times inside a tight loop, targetted optimization very well might make sense. But the profiler will tell you exactly which portions of your code need to be scrutinized more closely from a performance point of view, meaning that you avoid complicating the code unnecessarily.
And the above caveats said, I have been working on code recently making changes that at first glance would almost certainly be considered premature detail optimization. If you have a requirement for high performance and use the profiler to determine which parts of the code are the bottlenecks, then the optimizations aren't premature. (They may still be ill-advised, however, depending on the exact circumstances.)
Depends on your language, I might resort to something like:
$cond = array(true, true, false);
if (count(array_filter($cond)) >= 2)
or
if (array_reduce($cond, function ($i, $k) { return $i + (int)$k; }) >= 2)
There is no absolut answer to this. This depends very much on the underlying architecture. E.g. if you program in VHDL or Verilog some hardware circuit, then for sure the first would give you the fastest result. I assume that your target some kind of CPU, but even here very much will depend on the target cpu, the instruction it supports, and which time they will take. Also you dont specify your target language (e.g. your first approach can be short circuited which can heavily impact speed).
If knowing nothing else I would recommend the second solution - just for the reason that your intentions (at least 2 conditions should be true) are better reflected in the code.
The speed difference of the two solutions would be not very high - if this is just some logic and not the part of some innermost loop that is executed many many many times, I would even guess for premature optimization and try to optimize somewhere else.
You may consider simpy adding them. If you use masroses from standart stdbool.h, then true is 1 and (condition1 + condition2 + condition3) >= 2 is what you want.
But it is still a mere microoptimization, usually you wouldn't get a lot of productivity with this kind of tricks.
Since we're not on a deeply pipelined architecture there's probably no value in branch avoidance, which would normally steer the optimisations offered by desktop developers. Short-cuts are golden, here.
If you go for:
if ((condition1 && (condition2 || condition3)) || (condition2 && condition3))
then you probably have the best chance, without depending on any further information, of getting the best machine code out of the compiler. It's possible, in assembly, to do things like have the second evaluation of condition2 branch back to the first evaluation of condition3 to reduce code size, but there's no reliable way to express this in C.
If you know that you will usually fail the test, and you know which two conditions usually cause that, then you might prefer to write:
if ((rare1 || rare2) && (common3 || (rare1 && rare2)))
but there's still a fairly good chance the compiler will completely rearrange that and use its own shortcut arrangement.
You might like to annotate things with __builtin_expect() or _Rarely() or whatever your compiler provides to indicate the likely outcome of a condition.
However, what's far more likely to meaningfully improve performance is recognising any common factors between the conditions or any way in which the conditions can be tested in a way that simplifies the overall test.
For example, if the tests are simple then in assembly you could almost certainly do some basic trickery with carry to accumulate the conditions quickly. Porting that back to C is sometimes viable.
You seem wiling to evaluate all the conditions, as you proposed such a solution yourself in your question. If the conditions are very complex formulas that take many CPU cycles to compute (like on the order of hundreds of milliseconds), then you may consider evaluating all three conditions simultaneously with threads to get a speed-up. Something like:
pthread_create(&t1, detached, eval_condition1, &status);
pthread_create(&t2, detached, eval_condition2, &status);
pthread_create(&t3, detached, eval_condition3, &status);
pthread_mutex_lock(&status.lock);
while (status.trues < 2 && status.falses < 2) {
pthread_cond_wait(&status.cond, &status.lock);
}
pthread_mutex_unlock(&status.lock);
if (status.trues > 1) {
/* do something */
}
Whether this gives you a speed up depends on how expensive it is to compute the conditions. The compute time has to dominate the thread creation and synchronization overheads.
Try this one:
unsigned char i;
i = condition1;
i += condition2;
i += condition3;
if (i & (unsigned char)0x02)
{
/*
At least 2 conditions are True
0b00 - 0 conditions are true
0b01 - 1 conditions are true
0b11 - 3 conditions are true
0b10 - 2 conditions are true
So, Checking 2nd LS bit is good enough.
*/
}

Loop unrolling & optimization

Given the code :
for (int i = 0; i < n; ++i)
{
A(i) ;
B(i) ;
C(i) ;
}
And the optimization version :
for (int i = 0; i < (n - 2); i+=3)
{
A(i)
A(i+1)
A(i+2)
B(i)
B(i+1)
B(i+2)
C(i)
C(i+1)
C(i+2)
}
Something is not clear to me : which is better ? I can't see anything that works any faster using the other version . Am I missing something here ?
All I see is that each instruction is depending on the previous instruction , meaning that
I need to wait that the previous instruction would finish in order to start the one after ...
Thanks
In the high-level view of a language, you're not going to see the optimization. The speed enhancement comes from what the compiler does with what you have.
In the first case, it's something like:
LOCATION_FLAG;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
In the second it's something like:
LOCATION_FLAG;
DO_SOMETHING;
DO_SOMETHING;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
You can see in the latter case, the overhead of testing and jumping is only 1 instruction per 3. In the first it's 1 instruction per 1; so it happens a lot more often.
Therefore, if you have invariants you can rely on (an array of mod 3, to use your example) then it is more efficient to unwind loops because the underlying assembly is written more directly.
Loop unrolling is used to reduce the number of jump & branch instructions which could potentially make the loop faster but will increase the size of the binary. Depending on the implementation and platform, either could be faster.
Well, whether this code is "better" or "worse" totally depends on implementations of A, B and C, which values of n you expect, which compiler you are using and which hardware you are running on.
Typically the benefit of loop unrolling is that the overhead of doing the loop (that is, increasing i and comparing it with n) is reduced. In this case, could be reduced by a factor of 3.
As long as the functions A(), B() and C() don't modify the same datasets, the second verion provides more parallelization options.
In the first version, the three functions could run simultaneously, assuming no interdependencies. In the second version, all three functions could be run with all three datasets at the same time, assuming you had enough execution units to do so and again, no interdependencies.
Generally its not a good idea to try to "invent" optimizations, unless you have hard evidence that you will gain an increase, because many times you may end up introducing a degradation. Typically the best way to obtain such evidence is with a good profiler. I would test both versions of this code with a profiler to see the difference.
Also, many times loop unrolling isnt very protable, as mentioned previously, it depends greatly on the platform, compiler, etc.
You can additionally play with the compiler options. An interesting gcc option is "-floop-optimize", that you get automatically with "-O, -O2, -O3, and -Os"
EDIT Additionally, look at the "-funroll-loops" compiler option.

Resources