What effect will branch prediction have on the following C loop?

My experience with C is relatively modest, and I lack good understanding of its compiled output on modern CPUs. The context: I'm working on image processing for an Android app. I have read that branch-free machine code is preferred for inner loops, so I'd like to know whether there could be a significant performance difference between something like this:
if (p) { double for loop, computing f() }
else if (q) { double for loop, computing g() }
else { double for loop, computing h() }
Versus the less verbose version which does the condition checking within the loop:
for (int i = 0; i < xRes; i++)
{
for (int j = 0; j < yRes; j++)
{
image[i][j] = p ? f() : (q ? g() : h());
}
}
In this code, p and q are expressions like mode == 3, where mode is passed into the function and never changed within it. I have three simple questions:
(1) Would the first, more verbose version compile to more efficient code than the second version?
(2) For the second version, would performance improve if I evaluate and store the results of p and q above the loop, so I can replace the boolean expressions in the loop with variables?
(3) Should I even be worried about this, or will branch prediction (or some other optimization) ensure the boolean expressions in the loop(s) are almost never evaluated anyway?
Finally, I'd be delighted if someone can say whether the answers to these 3 questions depend on the architecture. I'm interested in the main Android NDK platforms: ARM, MIPS, x86 etc. My thanks in advance!

It looks like the question was already well-answered here: the compiler probably performs loop unswitching, removing the conditional from the loop and automatically generating 3 copies of the loop, just like stark suggested. Moreover, from comments given there and above, it seems branch prediction works very well for loops like these.
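As a rough sketch of what loop unswitching amounts to, the two functions below compute the same image. The helpers f, g and h, the mode values 3 and 4, and the flat image layout are assumptions made for illustration; they are not taken from the original code.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the question's f(), g() and h(). */
static int f(void) { return 1; }
static int g(void) { return 2; }
static int h(void) { return 3; }

/* Version 2 from the question: the mode test sits inside the inner loop. */
void fill_branchy(int *image, size_t x_res, size_t y_res, int mode)
{
    for (size_t i = 0; i < x_res; i++) {
        for (size_t j = 0; j < y_res; j++) {
            image[i * y_res + j] =
                (mode == 3) ? f() : ((mode == 4) ? g() : h());
        }
    }
}

/* Version 1: the loop is unswitched by hand, one copy per mode. */
void fill_unswitched(int *image, size_t x_res, size_t y_res, int mode)
{
    size_t n = x_res * y_res;
    if (mode == 3) {
        for (size_t k = 0; k < n; k++) image[k] = f();
    } else if (mode == 4) {
        for (size_t k = 0; k < n; k++) image[k] = g();
    } else {
        for (size_t k = 0; k < n; k++) image[k] = h();
    }
}
```

Because mode never changes inside the loops, an optimizing compiler is allowed to turn the first form into something like the second. Whether it actually does so depends on the compiler and its flags, so checking the generated assembly is the only way to be sure.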

Related

Is array value as loop stop condition read every time?

Example:
for (int i = 0; i < a[index]; i++) {
// do stuff
}
Would a[index] be read every time? If no, what if someone wanted to change the value at a[index] in the loop? I've never seen it myself, but does the compiler make such an assumption?
If the condition was instead i < val-2, would it be evaluated every time?

The compiler normally performs such optimizations only when it can see that the value is not affected by other parts of the program. So if you change the condition operand inside the for loop, the compiler will not optimize the read away.
As mentioned, the compiler should read the array and check it before each iteration in your code snippet. You can optimize your code as follows, so that it reads the array only once for the loop condition check.
int cond = a[index];
for (int i = 0; i < cond; i++) {
// do stuff
}

Well, maybe.
A standards-compliant compiler will produce code that behaves as if it is read every time.
If index and/or the array are volatile-qualified, then they will be re-evaluated every time.
If they are not, and the loop's content doesn't use them in a way that can be expected to modify their value, the optimiser may decide to use a cached result instead.
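To make the "as if it is read every time" rule concrete, here is a small sketch in which the loop body itself changes a[index]. The function names and the values 10 and 3 are made up for the example.

```c
/* Re-reads a[index] before every iteration, as the language rules require. */
int iterations_rereading(int *a, int index)
{
    int count = 0;
    for (int i = 0; i < a[index]; i++) {
        if (i == 1) {
            a[index] = 3;   /* the body shrinks the bound mid-loop */
        }
        count++;
    }
    return count;
}

/* Copies the bound once; later writes to a[index] do not affect the loop. */
int iterations_cached(int *a, int index)
{
    int count = 0;
    int cond = a[index];
    for (int i = 0; i < cond; i++) {
        if (i == 1) {
            a[index] = 3;
        }
        count++;
    }
    return count;
}
```

Starting from a[index] == 10, the first function runs 3 iterations and the second runs 10. The manual caching is therefore only a safe rewrite when the body does not modify the bound.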

C does not store the results of expressions in temporary variables, so all expressions are evaluated in place. Note that any for loop can be changed to a while loop:
for ( def_or_expr1 ; expr2 ; expr3 ) {
...
}
becomes:
def_or_expr1;
while ( expr2 ) {
...
cont:
expr3;
}
Update: continue in the for loop would be the same as goto cont; in the while loop above, i.e. expr3 is evaluated on every iteration.
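The for/while equivalence described above can be checked with a small example. The two hypothetical functions below sum the odd numbers below n, once with continue in a for loop and once with goto cont in the rewritten while loop.

```c
/* Sums the odd numbers below n with a for loop that uses continue. */
int sum_odd_for(int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (i % 2 == 0)
            continue;       /* jumps to the i++ step */
        sum += i;
    }
    return sum;
}

/* The same loop rewritten as a while loop; continue becomes goto cont. */
int sum_odd_while(int n)
{
    int sum = 0;
    int i = 0;
    while (i < n) {
        if (i % 2 == 0)
            goto cont;
        sum += i;
cont:
        i++;
    }
    return sum;
}
```

Both functions return the same value for any n, which shows that the increment expression runs on every iteration, including the ones cut short by continue.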
The compiler can basically apply any optimization that it can prove does not change the program's observable behaviour. Describing the full details would go too far here, but in general it can (and will) optimize:
a[index] is not changed in the loop: read once before loop and keep in a temp (e.g. register).
a[index] is changed in the loop: update the temp (register) with the new value, avoiding memory access (and the index calculations).
For this, the compiler must assume the array is not changed outside the visible control flow. This is typically the file being compiled (with all included files). For modern systems using link time optimization (LTO), this can be the whole final program - minus dynamic libraries.
Note this is a very brief description. Actually, the C standard defines quite clearly how a program has to be executed, and therefore what and how the compiler may optimize.
If the array is changed, for example by an interrupt handler or another thread, things become complicated. Depending on your target, you need anything from volatile or atomic operations (stdatomic.h, since C11) up to thread locks, mutexes, semaphores, etc. to control accesses to the shared resource.
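For the case where the bound may be changed by another thread, a minimal C11 sketch could look like the following. The variable shared_limit and both function names are hypothetical, and a real program would still need to decide what it means for the limit to change in the middle of the loop.

```c
#include <stdatomic.h>

/* Hypothetical shared bound that another thread might update. */
static atomic_int shared_limit;

/* Publishes a new bound. */
void set_limit(int value)
{
    atomic_store(&shared_limit, value);
}

/* Counts iterations; the bound is atomically re-read on every loop test. */
int count_to_limit(void)
{
    int count = 0;
    for (int i = 0; i < atomic_load(&shared_limit); i++) {
        count++;
    }
    return count;
}
```

Because the load is atomic, the compiler may not replace it with a value cached before the loop, unlike the plain a[index] case.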

How to tell the compiler to unroll this loop [duplicate]

This question already has answers here:
Tell gcc to specifically unroll a loop
(3 answers)
Closed 9 years ago.
I have the following loop that I am running on an ARM processor.
// pin here is pointer to some part of an array
for (i = 0; i < v->numelements; i++)
{
pe = pptr[i];
peParent = pe->parent;
SPHERE *ps = (SPHERE *)(pe->data);
pin[0] = FLOAT2FIX(ps->rad2);
pin[1] = *peParent->procs->pe_intersect == &SphPeIntersect;
fixifyVector( &pin[2], ps->center ); // Is an inline function
pin = pin + 5;
}
By the slow performance of the loop, I can judge that the compiler was unable to unroll this loop, as when I manually do the unrolling, it becomes quite fast. I think the compiler is getting confused by the pin pointer. Can we use the restrict keyword to help the compiler here, or is restrict reserved for function parameters only? In general, how can we tell the compiler to unroll the loop and not worry about the pin pointer?

To tell gcc to unroll all loops you can use the optimization flag -funroll-loops.
To unroll only the loops in a specific function, you can mark that function with:
__attribute__((optimize("unroll-loops")))
see this answer for more details.
Edit
If the compiler cannot determine the number of iterations of the loop upon entry you will need to use -funroll-all-loops. Note that from the documentation: "Unroll all loops, even if their number of iterations is uncertain when the loop is entered. This usually makes programs run more slowly."
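As an illustration of where the attribute goes, here is a sketch that applies it to a whole function. The macro UNROLL_LOOPS and the function scale_values are names invented for this example, and the guard is there because the optimize attribute is GCC-specific.

```c
#include <stddef.h>

/* The optimize attribute is GCC-specific, so it is guarded here. */
#if defined(__GNUC__) && !defined(__clang__)
#define UNROLL_LOOPS __attribute__((optimize("unroll-loops")))
#else
#define UNROLL_LOOPS
#endif

/* Hypothetical example function: scales n values in place. */
UNROLL_LOOPS
void scale_values(int *values, size_t n, int factor)
{
    for (size_t i = 0; i < n; i++) {
        values[i] *= factor;
    }
}
```

The attribute only requests the optimization for that function. Whether the loop is actually unrolled, and whether that helps, still has to be confirmed by inspecting the generated code and measuring.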

If you extend pptr by one element, you can use the pld instruction to prefetch the next entry.
__asm__ __volatile__("pld\t[%0]" :: "r" (pptr[i+1]));
Or alternatively you may need to pre-load the next peParent and SPHERE *ps. The loop overhead on an ARM is very small. It is unlikely that un-rolling the loop will be a significant benefit. There are no loop variable constants. It is more likely that the compiler's scheduler is able to fetch advanced data before it is used when you have un-rolled the loop.
You have not presented all of the code, so the data dependencies cannot be seen. There may be other variables that would benefit from being pre-loaded. Giving a complete example would probably help everyone answer your question.
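A more portable way to try the same prefetching idea is the __builtin_prefetch builtin available in GCC and Clang, used here in place of the inline pld assembly. The Element type and the function name are assumptions for the sketch, and the bounds check replaces the trick of extending pptr by one element.

```c
#include <stddef.h>

/* Hypothetical element type standing in for the question's structures. */
typedef struct {
    int value;
} Element;

/* Sums the values, prefetching the next element while handling the current one. */
long sum_with_prefetch(Element *const *pptr, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n) {
            /* A hint only: it does not change the computed result. */
            __builtin_prefetch(pptr[i + 1]);
        }
        sum += pptr[i]->value;
    }
    return sum;
}
```

Whether the hint helps depends on the target and on the memory access pattern, so it should be benchmarked against the plain loop.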

Should I remove unnecessary `else` in `else if`?

Compare the two:
if (strstr(a, "earth")) // A1
return x;
if (strstr(a, "ear")) // A2
return y;
and
if (strstr(a, "earth")) // B1
return x;
else if (strstr(a, "ear")) // B2
return y;
Personally, I feel that the else is redundant and prevents the CPU from doing branch prediction.
In the first one, when executing A1, it's possible to pre-decode A2. And in the second one, it will not interpret B2 until B1 is evaluated to false.
I found that a lot of (maybe most?) sources use the latter form.
That said, the latter form is easier to understand, because without the else clause it is not so obvious that it will return y only if a =~ /ear(?!th)/.

Your compiler probably knows that both these examples mean exactly the same thing. CPU branch prediction doesn't come into it.
I usually would choose the first option for symmetry.

(The following answers the original version of the question.)
Do you realize that the two code snippets are NOT semantically equivalent???
Consider what happens if a is "earth".
The first snippet calls foo() and then bar().
The second snippet calls foo() and skips the bar() call.
And this explains why the generated machine code is different. It has to be to implement the different semantics of the respective code fragments!
Personally, I feel that else is redundant ...
Unfortunately, your feeling is incorrect.
Lesson - write your code simply and clearly and leave optimization to the compiler ... which is going to do a far more accurate job than you can achieve.
FOLLOWUP
The snippets in the updated version of the question are now semantically identical, and the else is redundant. However:
any half decent optimizing compiler will generate identical code for the two snippets, and
it is a matter of opinion (i.e. subjective) which of the snippets is easier to understand.
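The equivalence of the two updated snippets can be checked directly. In this sketch the function names and the fallback return value 0 are additions for the sake of a complete example; the tests themselves are the ones from the question.

```c
#include <string.h>

/* Form A: two independent if statements, each returning. */
int classify_a(const char *a, int x, int y)
{
    if (strstr(a, "earth"))
        return x;
    if (strstr(a, "ear"))
        return y;
    return 0;   /* hypothetical fallback, not in the question */
}

/* Form B: the same tests chained with else if. */
int classify_b(const char *a, int x, int y)
{
    if (strstr(a, "earth"))
        return x;
    else if (strstr(a, "ear"))
        return y;
    return 0;   /* hypothetical fallback, not in the question */
}
```

Because the first branch returns, the second test is only ever reached when the first one failed, with or without the else. Both functions give the same result for every input.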

Use else if to state your intentions clearly. Code is meant to be read by humans.
Let the compiler optimize this, and don't worry about optimization until your code is 1) working 2) crystal clear 3) profiled (do this in that order). When doing step 3, you'll notice that the bottlenecks are not where you supposed they would be.
Any attempt to control branch prediction or whatever low level stuff is silly: compilers are very good at optimizing and they use sophisticated methods to yield a fast code on your particular machine.
Look at output from LLVM based compilers to see what I mean: sometimes you can't even remotely understand what it does.

It's usually better to use the second form if you want exactly one branch to be taken for a given value of a. If you write two separate if statements whose bodies do not return, both bodies can run and you can get two different results.
For example, with conditions like the ones you have there, let's say a = -2:
A: if (a < 0)
return x; // -2 is less than 0, so x is returned and the chain stops.
else if (a < 100)
return y; // not evaluated
B: if (a < 0)
return x; // -2 is less than 0, so x is returned; because of the return, the next if is never reached.
if (a < 100)
return y; // -2 is also less than 100, so without the return above this would run too

Why not simply write
char *str = strstr(a, "ear");
if (str != NULL)
{
foo();
if (strstr(str, "earth") != NULL)
{
bar();
}
}

What is the most elegant way to loop TWICE in C

Many times I need to do things TWICE in a for loop. Simply I can set up a for loop with an iterator and go through it twice:
for (i = 0; i < 2; i++)
{
// Do stuff
}
Now I am interested in doing this as SIMPLY as I can, perhaps without an initializer or iterator? Are there any other, really simple and elegant, ways of achieving this?

This is elegant because it looks like a triangle; and triangles are elegant.
i = 0;
here: dostuff();
i++; if ( i == 1 ) goto here;

Encapsulate it in a function and call it twice.
void do_stuff() {
// Do Stuff
}
// .....
do_stuff();
do_stuff();
Note: if you use variables or parameters of the enclosing function in the stuff logic, you can pass them as arguments to the extracted do_stuff function.

If it's only twice, and you want to avoid a loop, just write the darn thing twice.
statement1;
statement1; // (again)

If the loop is too verbose for you, you can also define an alias for it:
#define TWICE for (int _index = 0; _index < 2; _index++)
This would result in code like this:
TWICE {
// Do Stuff
}
// or
TWICE
func();
I would only recommend this macro if you have to do this very often; otherwise I think the plain for loop is more readable.

Unfortunately, this is not for C, but for C++ only, but does exactly what you want:
Just include the header, and you can write something like this:
10 times {
// Do stuff
}
I'll try to rewrite it for C as well.

So, after some time, here's an approach that enables you to write the following in pure C:
2 times {
do_something()
}
Example:
You'll have to include this little thing as a simple header file (I always called the file extension.h). Then, you'll be able to write programs in the style of:
#include<stdio.h>
#include"extension.h"
int main(int argc, char** argv){
3 times printf("Hello.\n");
3 times printf("Score: 0 : %d\n", _);
2 times {
printf("Counting: ");
9 times printf("%d ", _);
printf("\n");
}
5 times {
printf("Counting up to %d: ", _);
_ times printf("%d ", _);
printf("\n");
}
return 0;
}
Features:
Simple notation of simple loops (in the style depicted above)
Counter is implicitly stored in a variable called _ (a simple underscore).
Nesting of loops allowed.
Restrictions (and how to (partially) circumvent them):
Works only for a certain number of loops (which is - "of course" - reasonable, since you only would want to use such a thing for "small" loops). Current implementation supports a maximum of 18 iterations (higher values result in undefined behaviour). Can be adjusted in header file by changing the size of array _A.
Only a certain nesting depth is allowed. Current implementation supports a nesting depth of 10. Can be adjusted by redefining the macro _Y.
Explanation:
You can see the full (=de-obfuscated) source-code here. Let's say we want to allow up to 18 loops.
Retrieving upper iteration bound: The basic idea is to have an array of chars that are initially all set to 0 (this is the array counterarray). If we issue a call to e.g. 2 times {do_it;}, the macro times shall set the element of counterarray at index 2 to 1 (i.e. counterarray[2] = 1). In C, it is possible to swap index and array name in such an assignment, so we can write 2[counterarray] = 1 to achieve the same. This is exactly what the macro times does as its first step. Then, we can later scan the array counterarray until we find an element that is not 0, but 1. The corresponding index is then the upper iteration bound. It is stored in the variable searcher. Since we want to support nesting, we have to store the upper bound for each nesting depth separately; this is done by searchermax[depth]=searcher+1.
Adjusting current nesting depth: As said, we want to support nesting of loops, so we have to keep track of the current nesting depth (done in the variable depth). We increment it by one if we start such a loop.
The actual counter variable: We have a "variable" called _ that implicitly gets assigned the current counter. In fact, we store one counter for each nesting depth (all stored in the array counter). Then, _ is just another macro that retrieves the proper counter for the current nesting depth from this array.
The actual for loop: We break the for loop into parts:
We initialize the counter for the current nesting depth to 0 (done by counter[depth] = 0).
The iteration step is the most complicated part: We have to check if the loop at the current nesting depth has reached its end. If so, we have to update the nesting depth accordingly. If not, we have to increment the current nesting depth's counter by 1. The variable lastloop is 1 if this is the last iteration, otherwise 0, and we adjust the current nesting depth accordingly. The main problem here is that we have to write this as a sequence of expressions, all separated by commas, which requires us to write all these conditions in a very non-straightforward way.
The "increment step" of the for loop consists of only one assignment, that increments the appropriate counter (i.e. the element of counter of the proper nesting depth) and assigns this value to our "counter variable" _.
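The index-swapping trick from the explanation above can be tried in isolation. This is only a sketch of that one idea, not the code from the header; the function names are invented, while counterarray and searcher follow the names used in the explanation.

```c
/* Marks slot n of a zeroed array using the swapped form n[array]. */
void mark_bound(char *counterarray, int n)
{
    n[counterarray] = 1;    /* identical to counterarray[n] = 1 */
}

/* Scans for the first non-zero slot and returns its index, or -1 if none. */
int find_bound(const char *counterarray, int size)
{
    for (int searcher = 0; searcher < size; searcher++) {
        if (counterarray[searcher] != 0)
            return searcher;
    }
    return -1;
}
```

The swap works because a[i] is defined as *(a + i) and the addition is commutative, so i[a] names the same element.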

What about this??
void DostuffFunction(){}
for (unsigned i = 0; i < 2; ++i, DostuffFunction());
Regards,
Pablo.

What abelenky said.
And if your { // Do stuff } is multi-line, make it a function, and call that function -- twice.

Many people suggest writing out the code twice, which is fine if the code is short. There is, however, a size of code block which would be awkward to copy but is not large enough to merit its own function (especially if that function would need an excessive number of parameters). My own normal idiom to run a loop 'n' times is
i = number_of_reps;
do
{
... whatever
} while(--i);
In some measure because I'm frequently coding for an embedded system where the up-counting loop is often inefficient enough to matter, and in some measure because it's easy to see the number of repetitions. Running things twice is a bit awkward because the most efficient coding on my target system
bit rep_flag;
rep_flag = 0;
do
{
...
} while(rep_flag ^= 1); /* Note: if loop runs to completion, leaves rep_flag clear */
doesn't read terribly well. Using a numeric counter suggests the number of reps can be varied arbitrarily, which in many instances won't be the case. Still, a numeric counter is probably the best bet.

As Edsger W. Dijkstra himself put it : "two or more, use a for". No need to be any simpler.

Another attempt:
for(i=2;i--;) /* Do stuff */
This solution has many benefits:
Shortest form possible, I claim (13 chars)
Still, readable
Includes initialization
The amount of repeats ("2") is visible in the code
Can be used as a toggle (1 or 0) inside the body e.g. for alternation
Works with single instruction, instruction body or function call
Flexible (doesn't have to be used only for "doing twice")
Dijkstra compliant ;-)
From comment:
for (i=2; i--; "Do stuff");

Use function:
func();
func();
Or use macro (not recommended):
#define DO_IT_TWICE(A) A; A
DO_IT_TWICE({ x+=cos(123); func(x); })

If your compiler supports this just put the declaration inside the for statement:
for (unsigned i = 0; i < 2; ++i)
{
// Do stuff
}
This is as elegant and efficient as it can be. Modern compilers can do loop unrolling and all that stuff, trust them. If you don't trust them, check the assembler.
And it has one small advantage over all the other solutions: to everybody it simply reads "do it twice".

Assuming C++0x lambda support:
template <typename T> void twice(T t)
{
t();
t();
}
twice([](){ /*insert code here*/ });
Or:
twice([]()
{
/*insert code here*/
});
Which doesn't help you since you wanted it for C.

Good rule: three or more, do a for.
I think I read that in Code Complete, but I could be wrong. So in your case you don't need a for loop.

This is the shortest possible without preprocessor/template/duplication tricks:
for(int i=2; i--; ) /*do stuff*/;
Note that the decrement happens once right at the beginning, which is why this will loop precisely twice with the indices 1 and 0 as requested.
Alternatively you can write
for(int i=2; i--; /*do stuff*/) ;
But that's purely a difference of taste.
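To see which values i takes in this form, the following sketch records them. The function name and the output array are assumptions added for the example.

```c
/* Records the values i takes inside for (int i = 2; i--; ).
   Returns the number of iterations. */
int record_indices(int *out)
{
    int n = 0;
    for (int i = 2; i--; ) {
        out[n++] = i;   /* i has already been decremented by the loop test */
    }
    return n;
}
```

The body runs twice and sees i equal to 1 and then 0, because the post-decrement in the condition tests the old value and leaves the new one for the body.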

If what you are doing is somewhat complicated wrap it in a function and call that function twice? (This depends on how many local variables your do stuff code relies on).
You could do something like
void do_stuff(int i){
// do stuff
}
do_stuff(0);
do_stuff(1);
But this may get extremely ugly if you are working on a whole bunch of local variables.

//dostuff
stuff;
//dostuff (Attention I am doing the same stuff for the :**2nd** time)
stuff;

First, use a comment
/* Do the following stuff twice */
then,
1) use the for loop
2) write the statement twice, or
3) write a function and call the function twice
do not use macros, as earlier stated, macros are evil.
(My answer's almost a triangle)

What is elegance? How do you measure it? Is someone paying you to be elegant? If so how do they determine the dollar-to-elegance conversion?
When I ask myself, "how should this be written," I consider the priorities of the person paying me. If I'm being paid to write fast code, control-c, control-v, done. If I'm being paid to write code fast, well.. same thing. If I'm being paid to write code that occupies the smallest amount of space on the screen, I short the stock of my employer.

A jump instruction can be relatively slow, so writing the lines one after the other may run faster than a loop. But modern compilers are very, very smart and their optimizations are great (if they are allowed, of course). If you have turned on your compiler's optimizations, it doesn't matter which way you write it, with a loop or not (:
EDIT : http://en.wikipedia.org/wiki/compiler_optimizations just take a look (:

Close to your example, elegant and efficient:
for (i = 2; i; --i)
{
/* Do stuff */
}
Here's why I'd recommend that approach:
It initializes the iterator to the number of iterations, which makes intuitive sense.
It uses decrement over increment so that the loop test expression is a comparison to zero (the "i;" can be interpreted as "is i true?" which in C means "is i non-zero"), which may optimize better on certain architectures.
It uses pre-decrement as opposed to post-decrement in the counting expression for the same reason (may optimize better).
It uses a for loop instead of do/while or goto or XOR or switch or macro or any other trick approach because readability and maintainability are more elegant and important than clever hacks.
It doesn't require you to duplicate the code for "Do stuff" so that you can avoid a loop. Duplicated code is an abomination and a maintenance nightmare.
If "Do stuff" is lengthy, move it into a function and give the compiler permission to inline it if beneficial. Then call the function from within the for loop.

I like Chris Case's solution (up here), but C language doesn't have default parameters.
My solution:
bool cy = false;
do {
// Do stuff twice
} while (cy = !cy);
If you want, you could do different things in the two iterations by checking the boolean variable (maybe with the ternary operator).

void loopTwice (bool first = true)
{
// Recursion is your friend
if (first) {loopTwice(false);}
// Do Stuff
...
}
I'm sure there's a more elegant way, but this is simple to read, and pretty simple to write. There might even be a way to eliminate the bool parameter, but this is what I came up with in 20 seconds.

Efficiency of boolean comparisons? In C

I'm writing a loop in C, and I am just wondering on how to optimize it a bit. It's not crucial here as I'm just practicing, but for further knowledge, I'd like to know:
In a loop, for example the following snippet:
int i = 0;
while (i < 10) {
printf("%d\n", i);
i++;
}
Does the processor check both (i < 10) and (i == 10) for every iteration? Or does it just check (i < 10) and, if it's true, continue?
If it checks both, wouldn't:
int i = 0;
while (i != 10) {
printf("%d\n", i);
i++;
}
be more efficient?
Thanks!

Both will be translated into a single assembly instruction. Most CPUs have comparison instructions for LESS THAN, for LESS THAN OR EQUAL, for EQUAL and for NOT EQUAL.

One of the interesting things about these optimization questions is that they often show why you should code for clarity/correctness before worrying about the performance impact of these operations (which oh-so often don't have any difference).
Your 2 example loops do not have the same behavior:
int i = 0;
/* this will print 11 lines (0..10) */
while (i <= 10) {
printf("%d\n", i);
i++;
}
And,
int i = 0;
/* This will print 10 lines (0..9) */
while (i != 10) {
printf("%d\n", i);
i++;
}
To answer your question though, it's nearly certain that the performance of the two constructs would be identical (assuming that you fixed the problem so the loop counts were the same). For example, if your processor could only check for equality and whether one value were less than another in two separate steps (which would be a very unusual processor), then the compiler would likely transform the (i <= 10) to an (i < 11) test - or maybe an (i != 11) test.

This is a clear example of premature optimization. IMHO, that is something that programmers new to their craft are way too prone to worry about. If you must worry about it, learn to benchmark and profile your code so that your worries are based on evidence rather than supposition.
Speaking to your specific questions. First, a <= is not implemented as two operations testing for < and == separately in any C compiler I've met in my career. And that includes some monumentally stupid compilers. Notice that for integers, a <= 5 is the same condition as a < 6 and if the target architecture required that only < be used, that is what the code generator would do.
Your second concern, that while (i != 10) might be more efficient raises an interesting issue of defensive programming. First, no it isn't any more efficient in any reasonable target architecture. However, it raises a potential for a small bug to cause a larger failure. Consider this: if some line of code within the body of the loop modified i, say by making it greater than 10, what might happen? How long would it take for the loop to end, and would there be any other consequences of the error?
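The defensive-programming point can be made concrete with a loop whose counter steps past 10. In this sketch the step parameter and the max_iters safety cap are hypothetical additions so that the example always terminates.

```c
/* Counts iterations of a loop guarded by i < 10, stepping by step (step > 0). */
int iterations_less_than(int step)
{
    int count = 0;
    for (int i = 0; i < 10; i += step) {
        count++;
    }
    return count;
}

/* The same loop guarded by i != 10. max_iters is a hypothetical safety cap
   so the example terminates even when i jumps past 10. */
int iterations_not_equal(int step, int max_iters)
{
    int count = 0;
    for (int i = 0; i != 10 && count < max_iters; i += step) {
        count++;
    }
    return count;
}
```

With a step of 1 both loops run 10 times. With a step of 3 the first stops after 4 iterations, while the second never sees i equal to 10 and only stops because of the cap.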
Finally, when wondering about this kind of thing, it often is worthwhile to find out what code the compiler you are using actually generates. Most compilers provide a mechanism to do this. For GCC, learn about the -S option which will cause it to produce the assembly code directly instead of producing an object file.

The operators <= and < are a single instruction in assembly, there should be no performance difference.
Note that tests for 0 can be a bit faster on some processors than to test for any other constant, therefore it can be reasonable to make a loop run backward:
int i = 10;
while (i != 0)
{
printf("%d\n", i);
i--;
}
Note that micro-optimizations like these usually gain you very little performance; your time is better spent choosing efficient algorithms.

Does the processor check both (i < 10) and (i == 10) for every iteration? Or does it just check (i < 10) and, if it's true, continue?
Neither, it will most likely check (i < 11). The <= 10 is just there for you to give better meaning to your code since 11 is a magic number which actually means (10+1).

Depends on the architecture and compiler. On most architectures, there is a single instruction for <= or the opposite, which can be negated, so if it is translated into a loop, the comparison will most likely be only one instruction. (On x86 or x86_64 it is one instruction)
The compiler might unroll the loop into a sequence of ten i++ steps; when only constant expressions are involved, it will even optimize the ++ away and leave only constants.
And Ira is right: the cost of the comparison vanishes when a printf is involved, whose execution time might be millions of clock cycles.

I'm writing a loop in C, and I am just wondering on how to optimize it a bit.
If you compile with optimizations turned on, the biggest optimization will be from unrolling that loop.
It's going to be hard to profile that code with -O2, because for trivial functions the compiler will unroll the loop and you won't be able to benchmark actual differences in compares. You should be careful when profiling test cases that use constants that might make the code trivial when optimized by the compiler.

Disassemble. Depending on the processor, the optimization level and a number of other things, this simple example code may actually be unrolled or transformed in ways that do not reflect your real question. Compiling with gcc -O1, though, both example loops you provided resulted in the same assembler (for ARM).
A less-than in your C code often turns into a branch-if-greater-than-or-equal to the far side of the loop. If your processor doesn't have a greater-than-or-equal branch, it may have a branch-if-greater-than and a branch-if-equal, which takes two instructions.
Typically, though, there will be a register holding i and an instruction to increment i, then an instruction to compare i with 10. Equal to, greater than or equal, and less than are generally each done in a single instruction, so you should not normally see a difference.
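To repeat that experiment, the two loop forms can be put into one file and compiled to assembly. The function names and the file name loops.c are assumptions; running gcc -O1 -S loops.c writes the assembly to loops.s for comparison.

```c
/* Loop guarded by i < 10. */
int count_less_than(void)
{
    int n = 0;
    for (int i = 0; i < 10; i++) {
        n++;
    }
    return n;
}

/* Loop guarded by i != 10. */
int count_not_equal(void)
{
    int n = 0;
    for (int i = 0; i != 10; i++) {
        n++;
    }
    return n;
}
```

With a body this simple, the optimizer may reduce both functions to returning a constant, which is itself a reminder that tiny test cases often do not measure the comparison at all.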

// Case I
int i = 0;
while (i < 10) {
printf("%d\n", i);
i++;
printf("%d\n", i);
i++;
}
// Case II
int i = 0;
while (i < 10) {
printf("%d\n", i);
i++;
}
Case I takes more space but is faster; Case II takes less space but is slower than Case I.
This is because space and time often trade off against each other in programming, so you may have to compromise on one of them.
In that way you can often optimize for time or for space, but not both.
And both of your loops behave the same.
