Which loop has better performance? Increment or decrement? [duplicate] - c

Possible Duplicate:
Is it faster to count down than it is to count up?
Which loop has better performance? I have learned somewhere that the second is better, but I want to know the reason why.
for(int i=0;i<=10;i++)
{
/*This is better ?*/
}
for(int i=10;i>=0;i--)
{
/*This is better ?*/
}

The second "may" be better, because it's easier to compare i with 0 than to compare i with 10 but I think you can use any one of these, because compiler will optimize them.

I do not think there is much difference between the performance of the two loops.
I suppose it becomes a different situation when the loops look like this:
for(int i = 0; i < getMaximum(); i++)
{
}
for(int i = getMaximum() - 1; i >= 0; i--)
{
}
In the first loop the getMaximum() function is called on every iteration, while in the second it is called only once (assuming it is not an inline function).
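If the bound is expensive to compute, a common fix, assuming getMaximum() has no side effects, is simply to hoist the call out of the condition instead of switching to the decrement form. A minimal sketch:
const int max = getMaximum(); /* call it once, outside the loop */
for(int i = 0; i < max; i++)
{
    /* loop body */
}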

Decrement loops down to zero can sometimes be faster if testing against zero is optimised in hardware. But it's a micro-optimisation, and you should profile to see whether it's really worth doing. The compiler will often make the optimisation for you, and given that the decrement loop is arguably a worse expression of intent, you're often better off just sticking with the 'normal' approach.

Incrementing and decrementing (INC and DEC, when translated into assembler instructions) have the same speed of 1 CPU cycle.
However, the second can theoretically be faster on some (e.g. SPARC) architectures, because no 10 has to be fetched from memory (or cache): most architectures have instructions that compare against the special value 0 in an optimized fashion (usually having a special hardwired zero register to use as an operand, so no register has to be "wasted" on holding the 10 for each iteration's comparison).
A smart compiler (especially if the target instruction set is RISC) will detect this itself and (if your counter variable is not used in the loop) apply the second, "decrement down to 0", form.
Please see answers https://stackoverflow.com/a/2823164/1018783 and https://stackoverflow.com/a/2823095/1018783 for further details.
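In C terms, the transformation described above looks roughly like the following sketch (do_work() is a placeholder for your loop body; the rewrite is only valid when i itself is not used inside it):
extern void do_work(void); /* placeholder body */

/* As written: i is compared against n on every iteration. */
void count_up(int n)
{
    for(int i = 0; i < n; i++)
        do_work();
}

/* After the "decrement down to 0" rewrite: the comparison is against 0,
   which many instruction sets can test more cheaply. */
void count_down(int n)
{
    for(int i = n; i > 0; i--)
        do_work();
}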

The compiler should optimize both pieces of code to the same assembly, so it doesn't make a difference: both take the same time.
A more valid discussion would be whether
for(int i=0;i<10;++i) //preincrement
{
}
would be faster than
for(int i=0;i<10;i++) //postincrement
{
}
Because, theoretically, post-increment does an extra operation (it returns a copy of the old value). However, even this should be optimized to the same assembly.
Without optimizations, the code would look like this:
for ( int i = 0; i < 10 ; i++ )
0041165E mov dword ptr [i],0
00411665 jmp wmain+30h (411670h)
00411667 mov eax,dword ptr [i]
0041166A add eax,1
0041166D mov dword ptr [i],eax
00411670 cmp dword ptr [i],0Ah
00411674 jge wmain+68h (4116A8h)
for ( int i = 0; i < 10 ; ++i )
004116A8 mov dword ptr [i],0
004116AF jmp wmain+7Ah (4116BAh)
004116B1 mov eax,dword ptr [i]
004116B4 add eax,1
004116B7 mov dword ptr [i],eax
004116BA cmp dword ptr [i],0Ah
004116BE jge wmain+0B2h (4116F2h)
for ( int i = 9; i >= 0 ; i-- )
004116F2 mov dword ptr [i],9
004116F9 jmp wmain+0C4h (411704h)
004116FB mov eax,dword ptr [i]
004116FE sub eax,1
00411701 mov dword ptr [i],eax
00411704 cmp dword ptr [i],0
00411708 jl wmain+0FCh (41173Ch)
so even in this case, the speed is the same.

Again, the answer to all micro-performance questions is measure, measure in context of use and don't extrapolate to other contexts.
Counting instruction execution time hasn't been possible without extraordinary sophistication for quite a long time.
The mismatch between processor and memory speeds, and the introduction of caches to hide part of the latency (but not the bandwidth), make the execution of a group of instructions very sensitive to memory access patterns. That is something you can still optimize for with quite high-level thinking. But it also means that something apparently worse when one doesn't take the memory access pattern into account can turn out better once that is done.
Then superscalar execution (the fact that the processor can do several things at once) and out-of-order execution (the fact that the processor can execute an instruction before a previous one in the flow) make basic counting meaningless even if you ignore memory access. You have to know which instructions need to be executed (so ignoring part of the structure isn't wise) and how the processor can group instructions if you want a good a priori estimate.
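If you do want to measure, a minimal harness along these lines at least measures the loop as compiled with optimization enabled (a sketch assuming a POSIX system with clock_gettime; the body and iteration count are placeholders for the loop you actually care about):
#include <stdio.h>
#include <time.h>

static volatile int sink; /* the volatile store keeps the loop from being deleted */

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for(int i = 0; i < 100000000; i++)
    {
        sink = i; /* stand-in for the real loop body */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %.3f s\n", secs);
    return 0;
}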

Related

Does multiplying a 1-100 int by -1 or setting said int to zero take more time?

This is for C, if the language matters. If it goes down to assembly language, it sets things to negative using two's complement. And with the assignment, you're storing the value 0 inside the int variable, and I'm not entirely sure what happens there.
I got: 1.90s user 0.01s system 99% cpu 1.928 total for the code below, and I'm guessing most of the runtime was in adding up the counter variables.
int i;
int n;
i = 0;
while (i < 999999999)
{
n = 0;
i++;
n++;
}
I got: 4.56s user 0.02s system 99% cpu 4.613 total for the code below.
int i;
int n;
i = 0;
n = 5;
while (i < 999999999)
{
n *= -1;
i++;
n++;
}
return (0);
I don't particularly understand much about assembly, but it doesn't seem intuitive that the two's complement operation takes more time than setting one thing to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how quick it'll actually be in practice?
If it seems pointless, the reason for it is that I can easily implement a "checklist" by simply multiplying an integer on a map by -1, meaning it's already been checked (but I need to keep the value, so when I do the check, I can just negate whatever I'm comparing it to). But I was wondering if that's too slow; I could make a separate boolean 2D array to track whether each value was checked, or change my data structure into an array of structures so it could hold an int 1/0. I'm wondering what the best implementation will be: doing the -1 operation itself a billion times already totals around 5 seconds, not counting the rest of my program, but making a separate billion-entry int array or a billion-entry struct array doesn't seem to be the best way either.
Assigning zero is very cheap.
But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.
Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
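If you do go the bitmap route, a minimal sketch (the names and size here are hypothetical, not from your code):
#include <stdint.h>

#define NENTRIES 1000000u /* hypothetical map size */

static uint32_t checked[(NENTRIES + 31) / 32]; /* one bit per entry */

static inline void set_checked(uint32_t i)
{
    checked[i / 32] |= (uint32_t)1 << (i % 32);
}

static inline int is_checked(uint32_t i)
{
    return (checked[i / 32] >> (i % 32)) & 1u;
}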
And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.
You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.
Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
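For reference, one way to benchmark a loop like yours with optimization enabled is to declare n volatile, so the stores can't be deleted. This is a sketch of how you might restructure the test, not your original code, and note that volatile also changes what you measure (every access now really goes through memory):
volatile int n; /* every store must happen, so the loop survives -O2 */

int main(void)
{
    for(int i = 0; i < 999999999; i++)
    {
        n = 0; /* write-only store */
        n++;   /* read-modify-write of the volatile */
    }
    return 0;
}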
That said, your results are easy to explain, since -O0 compiles each C statement separately and predictably:
n = 0; is write-only, and breaks the dependency on the old value.
n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.
The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the x86 tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.
So the n*=-1 version has a loop-carried dependency chain involving n, with an n++ and a negate.
The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).
This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.
About the code itself
I compiled your pieces of code into x86_64 assembly outputs using GCC 7.2 without any optimization. I also shortened each piece of code without changing the assembly output. Here are the results.
Code 1:
// C
int main() {
int n;
for (int i = 0; i < 999999999; i++) {
n = 0;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-8], 1
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 999999998
jle .L3
mov eax, 0
pop rbp
ret
Code 2:
// C
int main() {
int n = 5;
for (int i = 0; i < 999999999; i++) {
n *= -1;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 5
mov DWORD PTR [rbp-8], 0
jmp .L2
.L3:
neg DWORD PTR [rbp-4]
add DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 999999998
jle .L3
mov eax, 0
pop rbp
ret
The C instructions inside the loop are, in the assembly, located between the two labels (.L3: and .L2:). In both cases, that's three instructions, among which only the first one is different. In the first code, it is a mov, corresponding to n = 0;. In the second code however, it is a neg, corresponding to n *= -1;.
According to this manual, these two instructions have different execution speed depending on the CPU. One can be faster than the other on one chip while being slower on another.
Thanks to aschepler in the comments for the input.
This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.
About your intent
Your reason for asking about the performance of these short pieces of code is faulty. What you want is to implement a checklist structure, and you have two conflicting ideas on how to build it. One uses a special value, -1, to give special meaning to variables in a map. The other uses additional data, either an external boolean array or a boolean for each variable, to add the same meaning without changing the purpose of the existing variables.
The choice you have to make should be a design decision rather than be motivated by unclear performance issues. Personally, whenever I am facing this kind of choice between a special value or additional data with precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but it's only my opinion.
My advice would be to go for the solution you can maintain better, namely the one you are most comfortable with and won't harm future code, and ask about performance when it matters, or rather if it even matters.
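For completeness, here is a minimal sketch of the special-value approach the question describes (it assumes all stored values are strictly positive, since 0 cannot carry the flag; the helper names are made up):
/* Flag an entry as checked by negating it; the magnitude keeps the value. */
static inline void mark_checked(int *v) { if (*v > 0) *v = -*v; }
static inline int is_checked(int v) { return v < 0; }
static inline int stored_value(int v) { return v < 0 ? -v : v; }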

IF statement ASM and CPU branching

Just using the disassembly window in VS2012:
if(p == 7){
00344408 cmp dword ptr [p],7
0034440C jne main+57h (0344417h)
j = 2;
0034440E mov dword ptr [j],2
}
else{
00344415 jmp main+5Eh (034441Eh)
j = 3;
00344417 mov dword ptr [j],3
}
Am I correct in saying a jump table has been implemented? If so, does this still cause CPU branching problems because the assembly still has to execute the cmp command?
I am looking at the performance costs of IF statements and was wondering if the compiler optimizing to a jump-table means no more CPU branching problems.
There is no jump table here: the two jump instructions go to fixed absolute addresses:
jne main+57h (0344417h)
jmp main+5Eh (034441Eh)
There is no indirection. And using a jump table doesn't solve the "CPU branching problems" at all: the branch prediction cost with or without a jump table should be similar.
I wouldn't call that a jump table. A jump table is an array of destination addresses, into which an index is computed dynamically from the user data you're switching on. The code you showed is just simple control flow with two alternative branches, and the control flow is entirely statically encoded.
As a typical example, if (X) foo() else bar() becomes (in pseudo-code):
jump_if(!X, Label), foo(), jump(End), Label: bar(), End:
The closest way to express a jump table in pure C or C++ is using an array of function pointers.
switch constructs often become jump tables, although unlike the array of function pointers, those are an indirect branch within a function instead of an indirect call to a new function.
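As an illustration of that last point, a jump table in C via an array of function pointers might look like this sketch (the handler names and the opcode range are made up for the example):
#include <stdio.h>

static void handle_add(void) { puts("add"); }
static void handle_sub(void) { puts("sub"); }
static void handle_nop(void) { puts("nop"); }

int main(void)
{
    /* The jump table: an array of destination addresses indexed by runtime data. */
    void (*table[])(void) = { handle_add, handle_sub, handle_nop };
    int opcode = 1; /* imagine this comes from user input */
    if (opcode >= 0 && opcode < 3)
        table[opcode](); /* indirect call through the table */
    return 0;
}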

possible to do if (!boolvar) { ... in 1 asm instruction?

This question is more out of curiosity than necessity:
Is it possible to write the C code if ( !boolvar ) { ... in such a way that it compiles to 1 CPU instruction?
I've tried thinking about this on a theoretical level and this is what I've come up with:
if ( !boolvar ) { ...
would need to first negate the variable and then branch depending on that -> 2 instructions (negate + branch)
if ( boolvar == false ) { ...
would need to load the value of false into a register and then branch depending on that -> 2 instructions (load + branch)
if ( boolvar != true ) { ...
would need to load the value of true into a register and then branch ("branch-if-not-equal") depending on that -> 2 instructions (load + "branch-if-not-equal")
Am I wrong with my assumptions? Is there something I'm overlooking?
I know I can produce intermediate asm versions of programs, but I wouldn't know how to use that in a way that lets me turn on compiler optimization and at the same time not have an empty if statement optimized away (or have the if statement optimized together with its contents, giving some non-generic answer).
P.S.: Of course I also searched Google and SO for this, but with such short search terms I couldn't really find anything useful.
P.P.S.: I'd be fine with a semantically equivalent version that is not syntactically equivalent, e.g. not using if.
Edit: feel free to correct me if my assumptions about the emitted asm instructions are wrong.
Edit2: I actually learned asm about 15 years ago, and relearned it about 5 years ago for the Alpha architecture, but I hope my question is still clear enough to figure out what I'm asking. Also, you're free to assume any kind of processor extension common in consumer CPUs up to AVX2 (current Haswell CPUs as of the time of writing) if it helps in finding a good answer.
At the end of my post I'll explain why you should not aim for this behaviour (on x86).
As Jerry Coffin has written, most jumps in x86 depend on the flags register.
There is one exception though: the j*cxz set of instructions, which jump if the ecx/rcx register is zero. To achieve this you need to make sure that your boolvar uses the ecx register. You can achieve that by specifically assigning it to that register:
register int boolvar asm ("ecx");
By far not all compilers use the j*cxz instructions, though. There is a flag for icc to make it do that, but it is generally not advisable. The Intel manual states that the two instructions
test ecx, ecx
jz ...
are faster on the processor.
The reason for this is that x86 is a CISC (complex) instruction set. In the actual hardware, though, the processor will split up complex instructions that appear as one instruction in the asm into multiple micro-instructions, which are then executed in a RISC style. This is why not all instructions require the same execution time, and why sometimes multiple small ones are faster than one big one.
test and jz are single micro-instructions, but jecxz will be decomposed into those two anyway.
The only reason the j*cxz set of instructions exists is for making a conditional jump without modifying the flags register.
Yes, it's possible -- but doing so will depend on the context in which this code takes place.
Conditional branches in an x86 depend upon the values in the flags register. For this to compile down to a single instruction, some other code will already need to set the correct flag, so all that's left is a single instruction like jnz wherever.
For example:
boolvar = x == y;
if (!boolvar) {
do_something();
}
...could end up rendered as something like:
mov eax, x
cmp eax, y ; `boolvar = x == y;`
jz @f
call do_something
@@:
Depending on your viewpoint, it could even compile down to only part of an instruction. For example, quite a few instructions can be "predicated", so they're executed only if some previously defined condition is true. In this case, you might have one instruction for setting "boolvar" to the correct value, followed by one to conditionally call a function, so there's no one (complete) instruction that corresponds to the if statement itself.
Although you're unlikely to see it in decently written C, a single assembly language instruction could include even more than that. For an obvious example, consider something like:
x = 10;
looptop:
-- x;
boolvar = x == 0;
if (!boolvar)
goto looptop;
This entire sequence could be compiled down to something like:
mov ecx, 10
looptop:
loop looptop
Am I wrong with my assumptions
You are wrong in several assumptions. First, you should know that 1 instruction is not necessarily faster than multiple ones. For example, on newer µarchs test can macro-fuse with jcc, so 2 instructions will run as one. Or a division is so slow that in the same time tens or hundreds of simpler instructions may already have finished. Compiling the if block to a single instruction isn't worth it if that instruction is slower than the multiple-instruction alternative.
Besides, if ( !boolvar ) { ... doesn't need to first negate the variable and then branch on the result. Most jumps in x86 are based on flags, and the conditions exist in both "yes" and "no" forms, so there is no need to negate the value: we can simply jump on non-zero instead of jump on zero.
Similarly, if ( boolvar == false ) { ... doesn't need to load the value of false into a register and then branch on it. false is a constant equal to 0, which can be embedded as an immediate in the instruction (like cmp reg, 0). But for checking against zero a simple test reg, reg is enough. Then jnz or jz will jump on non-zero/zero, and will be fused with the preceding test instruction into one.
It's possible to make an if header or body that compiles to a single instruction, but it depends entirely on what you need to do and on which condition is used. The flags for boolvar may already be available from the previous statement, so the if block on the next line can use them to jump directly, as in Jerry Coffin's answer.
Moreover, x86 has conditional moves, so if the body of the if is a simple assignment then it may be done in 1 instruction. Below is an example and its output:
int f(bool condition, int x, int y)
{
int ret = x;
if (!condition)
ret = y;
return ret;
}
f(bool, int, int):
test dil, dil ; if(!condition)
mov eax, edx ; ret = y
cmovne eax, esi ; if(condition) ret = x
ret
In some other cases you don't even need a conditional move or jump. For example,
bool f(bool condition)
{
bool ret = false;
if (!condition)
ret = true;
return ret;
}
compiles to a single xor without any jump at all
f(bool):
mov eax, edi
xor eax, 1
ret
The ARM architecture (v7 and below) can run any instruction conditionally, so this may translate to only one instruction.
For example the following loop
while (i != j)
{
if (i > j)
{
i -= j;
}
else
{
j -= i;
}
}
can be translated to ARM assembly as
loop: CMP Ri, Rj ; set condition "NE" if (i != j),
; "GT" if (i > j),
; or "LT" if (i < j)
SUBGT Ri, Ri, Rj ; if "GT" (Greater Than), i = i-j;
SUBLT Rj, Rj, Ri ; if "LT" (Less Than), j = j-i;
BNE loop ; if "NE" (Not Equal), then loop

Use a "for" or a "while" loop when only the stop condition is utilized? [duplicate]

Possible Duplicate:
Why use a for loop instead of a while loop?
I am currently using embedded C. The software I am using is Keil uVision.
So I have a question regarding which loop you would use.
Both loops do the exact same thing: as long as signal == 0, i increases by 1.
Firstly,
for(;signal==0;)
{
i++;
}
The next program:
while(signal==0)
{
i++;
}
So which loop would you use, and why? What is the difference between the two?
Does it make any difference in terms of time taken to execute, or is it purely based on your preference?
Generally speaking, for loops are preferred when the number of iterations is known (i.e. for each element in an array), and while loops are better for more general conditions when you don't know how many times you'll run the loop. However, a for loop can do anything a while loop can, and vice versa; it all depends on which one makes your code more readable
In this case, a while loop would be preferable, since you're waiting for signal == 0 to become false, and you don't know when that will occur.
Any for loop can be written as a while loop and vice versa. Which you use is a mixture of preference, convention, and readability.
Normally, for loops are used for counting and while loops are sort of waiting for a certain condition to be met (like the end of a file). There is no performance difference.
Boilerplate for and while loops:
for(int i = 0; i < someArraysLength; i++)
{
// Modify contents of array
}
while(lineLeftInFile)
{
// Read and parse the line
}
Whichever is easiest to read and understand.
Keep in mind that someone (other than you) might at some point try to read your code.
My opinion: while
Execution time is irrelevant. Any compiler that's worth a damn will generate the exact same code.
Now, as for semantics and such...computationally,
for (init; test; increment) { /* do stuff */ }
is exactly equivalent to
init;
while (test) {
/* do stuff */
increment;
}
And without init and increment, becomes just
while (test) {
/* do stuff */
}
So computationally, the two are identical. Semantically, though, a for loop is for when you have a setup and/or increment stage (particularly if they make for a predictable number of iterations). Since you don't, stick with while.
I agree with mariusnn: whichever is easiest to read, and the while would seem easier to read to me as well.
Looking at the assembler produced by Visual Studio 2005 in a Debug build, the instructions look to be the same for both of these loops. And actually, if you write the same loop using an if statement and a label, with the true branch incrementing i and then jumping back to the if, it looks like that also generates the same assembler.
for (; signal == 0; ) {
0041139C cmp dword ptr [signal],0
004113A0 jne wmain+3Dh (4113ADh)
i++;
004113A2 mov eax,dword ptr [i]
004113A5 add eax,1
004113A8 mov dword ptr [i],eax
}
004113AB jmp wmain+2Ch (41139Ch)
while (signal == 0) {
004113AD cmp dword ptr [signal],0
004113B1 jne loop (4113BEh)
i++;
004113B3 mov eax,dword ptr [i]
004113B6 add eax,1
004113B9 mov dword ptr [i],eax
}
004113BC jmp wmain+3Dh (4113ADh)
loop: if (signal == 0) {
004113BE cmp dword ptr [signal],0
004113C2 jne loop+11h (4113CFh)
i++;
004113C4 mov eax,dword ptr [i]
004113C7 add eax,1
004113CA mov dword ptr [i],eax
goto loop;
004113CD jmp loop (4113BEh)
}
So which loop would you use, and why?
If I had to choose between the two, I would probably use the while loop: it's simpler and cleaner, and it clearly conveys to other developers that the following block of code will be executed continuously until signal is updated.
Then again, one could do this: for(; signal == 0; i++); which seems more concise, though that's assuming this will indeed be production code.
Still, all these methods seem a bit dangerous because of overflow: even on embedded devices most clocks are quite fast, and i will probably reach the upper bound of its underlying data type quite soon. Then again, I don't know whether this will be production code, nor whether that is an acceptable outcome.
What is the difference between the two?
Like you said, both achieve the same goal, though I'm sure there are other ways too, as I've shown and as others have mentioned. The biggest difference between for and while is that one is usually used when we know the number of iterations and the other when we don't, though for better or worse I've seen some very creative uses of for; it's quite flexible.
Does it make any difference in terms of time taken to execute? Or is it purely based on your preference?
As for performance, it is fundamentally up to your compiler to decide how to translate it; it may or may not produce the same binaries, and hence the same execution time. You could ask it to produce the assembly, or do some profiling.
The uVision 4 IDE (http://www.keil.com/product/brochures/uv4.pdf) does indeed support disassembly, as noted on page 94, and profiling, as noted on page 103, if that is the version you are using.
Though if the difference is small enough, please don't sacrifice readability just to squeeze out a couple of extra nanoseconds. That's just my opinion; I'm sure there are others who would disagree.
The best advice I can give you is this: try, as best you can, to write clear code, meaning code most people can understand without much effort. That is efficient and maintainable.

For Loop vs While Loop

In my Design and Analysis of Algorithms lecture, the instructor said the for loop will take less time than the while loop for the following sample algorithms:
1. for(int i=0;i<5;i++)
{
2. print(i);
}
1. int i=0;
2. while(i<5)
{
3. print(i);
4. i++;
}
He said that the compiler will read line 1 of the for loop 5 times and line 2 4 times, so the total time is 5 + 4 = 9.
But in the case of the while loop, the compiler will read line 1 once, line 2 5 times, line 3 4 times, and line 4 4 times, so the total time is 1 + 5 + 4 + 4 = 14.
Please tell me, is this right? Is the for loop faster than the while loop?
Thanks.
At least with MSVC 16 (VS 2010) the code is pretty much the same in both cases:
for
; Line 5
xor esi, esi
$LL3@main:
; Line 6
push esi
push OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
call _printf
inc esi
add esp, 8
cmp esi, 5
jl SHORT $LL3@main
while
; Line 4
xor esi, esi
$LL2@main:
; Line 6
push esi
push OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
call _printf
; Line 7
inc esi
add esp, 8
cmp esi, 5
jl SHORT $LL2@main
Code in my Subversion repository.
In all modern compilers, loop analysis is done on a lower-level intermediate representation (i.e., after all the high-level loop constructs have been expanded into labels and jumps). To a compiler, both loops are absolutely equivalent.
I'll pass on performance (hint: no difference, check the generated IR or assembly for proof) however there are two important differences in syntax and maintenance.
Syntax
The scope of the i variable is different. In the for case, i is only accessible within the for header and body, while in the while case it is available after the loop. As a general rule, it's better to have tighter scopes: fewer variables in flight means less context to worry about when coding.
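A quick illustration of the scope difference (C99 or later, where declaring the counter in the for header is allowed):
for (int i = 0; i != 10; ++i)
{
    /* i is visible only inside the loop */
}
/* i no longer exists here; the name can be reused */

int j = 0;
while (j != 10)
{
    ++j;
}
/* j is still alive here, whether you want it or not */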
Maintenance
The for loop has the neat advantage of grouping all the iteration operations close together, so they can be inspected in one shot and checked.
Also, there is one important difference when introducing continue statements:
for(int i = 0; i != 10; ++i) {
if (array[i] == nullptr) { continue; }
// act on it
}
int i = 0;
while (i != 10) {
if (array[i] == nullptr) { continue; }
// act on it
++i;
}
In the while case, the introduction of continue has created a bug: an infinite loop, as the counter is no longer incremented.
Impact
for loops are more readable and all-around better for regular iteration patterns. Even better, in C++11 the range-for statement:
for (Item const& item : collection) {
}
where iteration is entirely taken care of by the compiler, so you are sure not to mess it up! (It makes the for_each algorithm somewhat moot... and we can hope the older for form starts retreating.)
As a corollary: while loops should be reserved for irregular iteration patterns; that way they will attract special care during code review and from future maintainers by highlighting the irregularity of the case.
