Cache miss? How can I see that? - c

Given the following code:
for (int i=0; i<n; i++)
{
counter += myArray[i];
}
And the loop-unrolled version:
for (int i=0; i<n; i+=4)
{
counter1 += myArray[i+0];
counter2 += myArray[i+1];
counter3 += myArray[i+2];
counter4 += myArray[i+3];
}
total = counter1+ counter2 + counter3+ counter4;
Why do we have a cache miss in the first version?
Does the second version indeed have better performance than the first? Why?
Regards

Why do we have a cache miss in the first version?
As Oli points out in the comments, this question is unfounded: if the data is already in the cache, then there will be no cache misses.
That aside, there is no difference in memory access between your two examples. So that will not likely be a factor in any performance difference between them.
Does the second version indeed have better performance than the first? Why?
Usually, the thing to do is to actually measure. But in this particular example, I'd say that it will likely be faster. Not because of better cache access, but because of the loop-unrolling.
The optimization that you are doing is called "Node-Splitting", where you separate the counter variable for the purpose of breaking the dependency chain.
However, in this case, you are doing a trivial reduction operation. Many modern compilers are able to recognize this pattern and do this node-splitting for you.
So is it faster? Most likely. But you should check to see if the compiler does it for you.
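For illustration, here is a minimal sketch of the manually node-split reduction, written so it also handles an n that is not a multiple of 4 (a plain int array is assumed, as in the question):
int counter1 = 0, counter2 = 0, counter3 = 0, counter4 = 0;
int i;
/* Four independent accumulators break the single dependency chain on "counter". */
for (i = 0; i + 4 <= n; i += 4)
{
    counter1 += myArray[i + 0];
    counter2 += myArray[i + 1];
    counter3 += myArray[i + 2];
    counter4 += myArray[i + 3];
}
/* Cleanup loop for the remaining n % 4 elements. */
for (; i < n; i++)
    counter1 += myArray[i];
int total = counter1 + counter2 + counter3 + counter4;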
For the record: I just tested this on Visual Studio 2010. And I am quite surprised that it is not able to do this optimization.
; 129 :
; 130 : int counter = 0;
; 131 :
; 132 : for (int i=0; i<n; i++)
mov ecx, DWORD PTR n$[rsp]
xor edx, edx
test ecx, ecx
jle SHORT $LN1#main
$LL3#main:
; 133 : {
; 134 : counter += myArray[i];
add edx, DWORD PTR [rax]
add rax, 4
dec rcx
jne SHORT $LL3#main
$LN1#main:
; 135 : }
Visual Studio 2010 does not seem to be capable of performing "Node Splitting" for this (trivial) example...

Related

Does multiplying a 1-100 int by -1 or setting said int to zero take more time?

This is for C, if the language matters. If it goes down to assembly language, it sets things to negative using two's complement. And with the other version, you're storing the value 0 inside the int variable, and I'm not entirely sure what happens there.
I got 1.90s user 0.01s system 99% cpu 1.928 total for the code below, and I'm guessing most of the runtime was in adding up the counter variables.
int i;
int n;
i = 0;
while (i < 999999999)
{
n = 0;
i++;
n++;
}
I got 4.56s user 0.02s system 99% cpu 4.613 total for the code below.
int i;
int n;
i = 0;
n = 5;
while (i < 999999999)
{
n *= -1;
i++;
n++;
}
return (0);
I don't particularly understand much about assembly, but it doesn't seem intuitive that the two's complement operation takes more time than setting one thing to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how quick it'll actually be in practice?
If it seems pointless, the reason for it is that I can easily implement a "checklist" by simply multiplying an integer on a map by -1, meaning it's already been checked (but I need to keep the value, so when I do the check, I can just negate whatever I'm comparing it to). But I was wondering: if that's too slow, I could make a separate boolean 2D array to record whether the value was checked, or change my data structure into an array of structures so each entry could hold an int 1/0. I'm wondering what the best implementation will be -- doing the -1 operation itself a billion times will already total up to around 5 seconds, not counting the rest of my program. But making a separate billion-entry int array or a billion-entry struct array doesn't seem to be the best way either.
Assigning zero is very cheap.
But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.
Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
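For reference, a separate bitmap is only a few lines of C; here is a minimal sketch (N, checked_bits, mark_checked and is_checked are made-up names for illustration, not anything from your code):
#include <stddef.h>
#include <stdint.h>
#define N 1000000u                              /* hypothetical number of map entries */
static uint64_t checked_bits[(N + 63) / 64];    /* one bit per entry, zero-initialized */
static void mark_checked(size_t i) { checked_bits[i / 64] |= (uint64_t)1 << (i % 64); }
static int  is_checked(size_t i)   { return (int)((checked_bits[i / 64] >> (i % 64)) & 1u); }
This costs one bit per entry instead of one int, and leaves the stored values untouched.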
And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.
You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.
Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
That said, your results are easy to explain, since -O0 compiles each C statement separately and predictably:
n = 0; is write-only, and breaks the dependency on the old value.
n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.
The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the x86 tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.
So the n*=-1 version has a loop-carried dependency chain involving n, with an n++ and a negate.
The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).
This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.
About the code itself
I compiled your pieces of code into x86_64 assembly outputs using GCC 7.2 without any optimization. I also shortened each piece of code without changing the assembly output. Here are the results.
Code 1:
// C
int main() {
int n;
for (int i = 0; i < 999999999; i++) {
n = 0;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-8], 1
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 999999998
jle .L3
mov eax, 0
pop rbp
ret
Code 2:
// C
int main() {
int n = 5;
for (int i = 0; i < 999999999; i++) {
n *= -1;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 5
mov DWORD PTR [rbp-8], 0
jmp .L2
.L3:
neg DWORD PTR [rbp-4]
add DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 999999998
jle .L3
mov eax, 0
pop rbp
ret
The C instructions inside the loop are, in the assembly, located between the two labels (.L3: and .L2:). In both cases, that's three instructions, among which only the first one is different. In the first code, it is a mov, corresponding to n = 0;. In the second code however, it is a neg, corresponding to n *= -1;.
According to this manual, these two instructions have different execution speed depending on the CPU. One can be faster than the other on one chip while being slower on another.
Thanks to aschepler in the comments for the input.
This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.
About your intent
Your reason for asking about the performance of these short pieces of code is faulty. What you want is to implement a checklist structure, and you have two conflicting ideas on how to build it. One uses a special value, -1, to add special meaning onto variables in a map. The other uses additional data, either an external boolean array or a boolean for each variable, to add the same meaning without changing the purpose of the existing variables.
The choice you have to make should be a design decision rather than be motivated by unclear performance issues. Personally, whenever I am facing this kind of choice between a special value or additional data with precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but it's only my opinion.
My advice would be to go for the solution you can maintain better, namely the one you are most comfortable with and won't harm future code, and ask about performance when it matters, or rather if it even matters.

Program dies on casting int to double. Can't figure out why it segfaults [closed]

For IP reasons I cannot post the actual code, but here's the gist:
...
double valueA = 0.0;
double valueB = 0.0;
section_t * section = &some_global_table[counter].section;
if (NULL == section) continue;
else
{
for (subsecnum = 0; subsecnum < section->entries; subsecnum++)
{
valueA = (double) section->subsection[subsecnum].value; //CRASHES HERE
valueB = (double) section->subsection[subsecnum+1].value; // subsecnum + 1 is a valid entry
...//do something with values//...
}
}
...
The above code is called multiple times, depending on the section required.
Recently I was stress testing our application using jmeter - 150 threads on a continuous loop (it's a server app), and it crashed (SIGSEGV). Running it through GDB pointed me to the line marked //CRASHES HERE. I've run it through GDB a few times after and it always crashes at the same point.
However: it does NOT always crash on the values in the table. For example, the first time it crashed:
counter = 2
subsecnum = 21
the second time it crashed:
counter = 19
subsecnum = 10
and so on...
I've checked and double checked the values for out-of-bounds errors, but that is not it. The values are all valid.
NOTE: I found that if I actually copied the entire some_global_table[counter].section to a buffer instead of just using a pointer, there is no crash. However, even using a mutex around the read section did not work...
Any help is really appreciated, and if any more detail is required, please let me know.
EDIT: The global table is loaded in the beginning, and not changed at any point after, therefore the value of section->entries for a particular section will always be the same once the data is loaded.
EDIT2: Structure for section_t
typedef struct
{
int entries;
subsection_t * subsections;
} section_t;
typedef struct
{
int value;
char title[MAX_LEN_TITLE];
} subsection_t;
typedef struct
{
char bookname[MAX_LEN_BOOK_TITLE];
FILE * bookfile;
section_t section;
} global_table_t;
global_table_t some_global_table[MAX_TABLES];
EDIT3:
Dump of assembler code from 0x4132a1 to 0x413321:
0x00000000004132a1 <myfunc+389>: roll 0x0(%rcx)
0x00000000004132a4 <myfunc+392>: mov $0x0,%eax
0x00000000004132a9 <myfunc+397>: callq 0x408382 <log>
0x00000000004132ae <myfunc+402>: jmpq 0x413517 <myfunc+1019>
0x00000000004132b3 <myfunc+407>: mov -0x68(%rbp),%rax
0x00000000004132b7 <myfunc+411>: mov (%rax),%rax
0x00000000004132ba <myfunc+414>: sub $0x1,%eax
0x00000000004132bd <myfunc+417>: mov %eax,-0xc(%rbp)
0x00000000004132c0 <myfunc+420>: movl $0x0,-0x5c(%rbp)
0x00000000004132c7 <myfunc+427>: jmpq 0x413505 <myfunc+1001>
0x00000000004132cc <myfunc+432>: mov -0x68(%rbp),%rax
0x00000000004132d0 <myfunc+436>: mov 0x10(%rax),%rdx
0x00000000004132d4 <myfunc+440>: mov -0x5c(%rbp),%eax
0x00000000004132d7 <myfunc+443>: cltq
0x00000000004132d9 <myfunc+445>: shl $0x4,%rax
0x00000000004132dd <myfunc+449>: lea (%rdx,%rax,1),%rax
=> 0x00000000004132e1 <myfunc+453>: mov 0x8(%rax),%eax
0x00000000004132e4 <myfunc+456>: mov %eax,-0x8(%rbp)
0x00000000004132e7 <myfunc+459>: mov -0x68(%rbp),%rax
0x00000000004132eb <myfunc+463>: mov 0x10(%rax),%rax
0x00000000004132ef <myfunc+467>: lea 0x10(%rax),%rdx
0x00000000004132f3 <myfunc+471>: mov -0x5c(%rbp),%eax
0x00000000004132f6 <myfunc+474>: cltq
0x00000000004132f8 <myfunc+476>: shl $0x4,%rax
0x00000000004132fc <myfunc+480>: lea (%rdx,%rax,1),%rax
0x0000000000413300 <myfunc+484>: mov 0x8(%rax),%eax
0x0000000000413303 <myfunc+487>: mov %eax,-0x4(%rbp)
0x0000000000413306 <myfunc+490>: cvtsi2sdl -0x8(%rbp),%xmm0
0x000000000041330b <myfunc+495>: movsd %xmm0,-0x50(%rbp)
0x0000000000413310 <myfunc+500>: cvtsi2sdl -0x4(%rbp),%xmm0
0x0000000000413315 <myfunc+505>: movsd %xmm0,-0x40(%rbp)
0x000000000041331a <myfunc+510>: mov -0x68(%rbp),%rax
0x000000000041331e <myfunc+514>: mov 0x10(%rax),%rdx
rax 0xa80 2688
rbx 0x7fffc03f9710 140736418780944
rcx 0x4066c00000000000 4640607572284407808
rdx 0x0 0
rsi 0xfffff00000000 4503595332403200
rdi 0x7fffc039e8f0 140736418408688
rbp 0x7fffc039e9f0 0x7fffc039e9f0
rsp 0x7fffc039e950 0x7fffc039e950
r8 0x13 19
r9 0x1 1
r10 0x9 9
r11 0x7fffc039e848 140736418408520
r12 0x7fffedd86d60 140737183772000
r13 0x7fffc03f99d0 140736418781648
r14 0x4 4
r15 0x7 7
rip 0x4132e1 0x4132e1 <myfunc+453>
eflags 0x10202 [ IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
My conjecture, and yes it is a stretch, is that it isn't necessarily the subsection that is wrong; it is the counter argument and the subsequent dereferences that ensue. You have a counter that is looping through what we hope is your global table. One would hope it is not exceeding MAX_TABLES-1, as doing so introduces undefined behavior. Though your sample did not include the loop I can only assume it looks something like this:
size_t counter=0;
for (;counter < some_upper_limit; ++counter)
{
double valueA = 0.0;
double valueB = 0.0;
section_t * section = &some_global_table[counter].section;
if (NULL == section)
continue;
else
{
for (subsecnum = 0; subsecnum < section->entries; subsecnum++)
{
valueA = (double) section->subsection[subsecnum].value; //CRASHES HERE
valueB = (double) section->subsection[subsecnum+1].value; // subsecnum + 1 is a valid entry
...//do something with values//...
}
}
}
Note the check for NULL? The question is: why take the address of a fixed member of a structure in a fixed global array of those structures, then "validate" it against NULL?
This literally looks like you're assuming if counter is an index not within [0..MAX_TABLES-1] then the address of the structure held at the array dereferenced with that index will somehow be NULL. That cannot be guaranteed. Chances are the memory you're referencing is "valid", but certainly not defined.
Therefore, you are now toting around a completely illegitimate pointer, which may go kerboom as soon as it is dereferenced (or spawn a chorus line of sewer rats singing "One"; such is the nature of undefined behavior =).
Everyone has been circling around the idea that subsection[subsecnum] is somehow the root cause of this, but I submit to you that section-> is the real problem, because section is garbage, and section is garbage because an undefined assumption about an out-of-range array index (counter) made it so.
So how could counter be bad? One way would be concurrency. If this indeed is a multi-threaded application and counter is a variable somehow scoped for access from multiple threads concurrently, it is not protected at all. One loop could increment it after another loop tested it, thereby invalidating the latter's test. It may be the very reason you thought putting the NULL-check in was a way to circumvent that concurrency side-effect. I honestly don't know.
But that is where I would start looking. Dump counter to a debug log if it is not used concurrently. Make sure it is within range. If it is concurrently accessed, make sure it is protected.
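If counter (or the index derived from it) really is shared between threads, one way to rule the race in or out is to serialize access to it; a minimal pthreads sketch, with made-up names (shared_counter, counter_lock) purely for illustration:
#include <pthread.h>
static int shared_counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
/* Take a private copy of the index under the lock, then use only the copy,
   so a concurrent increment cannot change it between the bounds check and
   the table dereference. */
static int next_counter_value(void)
{
    int local;
    pthread_mutex_lock(&counter_lock);
    local = shared_counter++;
    pthread_mutex_unlock(&counter_lock);
    return local;
}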
I totally agree with WhozCraig. Additionally:
// OK...
for (subsecnum = 0; subsecnum < section->entries; subsecnum++) {
// Also OK (provided "section" and "subsection" are both allocated and initialized)
valueA = (double) section->subsection[subsecnum].value //CRASHES HERE
// Are you *sure* "subsecnum + 1" is a valid entry?
valueB = (double) section->subsection[subsecnum+1].value;
ALSO:
"Gdb", as it sounds like you already know, is your Friend. It wouldn't help to single-step through your loop, and "print" array and pointer references at various points to make sure everything's OK (and stays OK).
IMHO...
Since you mentioned '150 threads' I would guess you have a race condition -- one thread is modifying (perhaps freeing) the section_t while another thread is accessing it. This would explain why copying things makes the bug appear to go away -- that makes the race hole much smaller.
Since you can get a debugger attached at the crash, try examining the section_t (p *section) and try to figure out what it looks like.
Without the full context, it's hard to say. One thing I would recommend doing is to run your server program under Valgrind to check whether you are indeed running into a memory overrun or not. Since you are doing array accesses, I would suspect something amiss there. As commenters have pointed out, I doubt it's any issue with the casting.

Why is my application not able to reach Core i7 920 peak FP performance?

I have a question about the FP peak performance of my Core i7 920.
I have an application that does a lot of MAC operations (basically a convolution operation), and I am not able to reach the peak FP performance of the CPU by a factor of ~8x when using multi-threading and SSE instructions.
When trying to find out the reason for this, I ended up with a simplified code snippet, running on a single thread and not using SSE instructions, which performs equally badly:
for(i=0; i<49335264; i++)
{
data[i] += other_data[i] * other_data2[i];
}
If I'm correct (the data and other_data arrays are all FP), this piece of code requires:
49335264 * 2 = 98670528 FLOPs
It executes in ~150 ms (I'm very sure this timing is correct, since C timers and the Intel VTune Profiler give me the same result).
This means the performance of this code snippet is:
98670528 / (150 × 10^-3) / 10^9 ≈ 0.66 GFLOPS
whereas the peak performance of this CPU should be 2 × 3.2 GFLOPS (2 FP units, 3.2 GHz processor), right?
Is there any explanation for this huge gap? Because I cannot explain it.
Thanks a lot in advance; I could really use your help!
I would use SSE.
Edit: I ran some more tests myself and discovered that your program is neither limited by memory bandwidth (the theoretical limit is about 3-4 times higher than your result) nor by floating-point performance (with an even higher limit); it is limited by lazy allocation of memory pages by the OS.
#include <chrono>
#include <iostream>
#include <x86intrin.h>
using namespace std::chrono;
static const unsigned size = 49335264;
float data[size], other_data[size], other_data2[size];
int main() {
#if 0
    // Touch the arrays once so the OS backs them with real pages.
    for (unsigned i = 0; i < size; i++) {
        data[i] = i;
        other_data[i] = i;
        other_data2[i] = i;
    }
#endif
    system_clock::time_point start = system_clock::now();
    for (unsigned i = 0; i < size; i++)
        data[i] += other_data[i] * other_data2[i];
    microseconds timeUsed = duration_cast<microseconds>(system_clock::now() - start);
    std::cout << "Used " << timeUsed.count() << " us, "
              << 2*size/(timeUsed.count()/1e6*1e9) << " GFLOPS\n";
}
Compile with g++ -O3 -march=native -std=c++0x. The program gives
Used 212027 us, 0.465368 GFLOPS
as output, although the hot loop translates to
400848: vmovaps 0xc234100(%rdx),%ymm0
400850: vmulps 0x601180(%rdx),%ymm0,%ymm0
400858: vaddps 0x17e67080(%rdx),%ymm0,%ymm0
400860: vmovaps %ymm0,0x17e67080(%rdx)
400868: add $0x20,%rdx
40086c: cmp $0xbc32f80,%rdx
400873: jne 400848 <main+0x18>
This means it is fully vectorized, using 8 floats per iteration and even taking advantage of AVX.
After playing around with streaming instructions like movntdq, which didn't buy anything, I decided to actually initialize the arrays with something - otherwise they will be zero pages, which only get mapped to real memory once they are written to. Changing the #if 0 to #if 1 immediately yields
Used 48843 us, 2.02016 GFLOPS
That comes pretty close to the memory bandwidth of the system (4 floats of 4 bytes each per two FLOPs, so ~2 GFLOPS means ~16 GBytes/s; the theoretical limit is 2 channels of DDR3 at 10.667 GBytes/s each).
The explanation is simple: while your processor can execute (say) 6.4 GFLOPS at peak, your memory sub-system can only feed data in/out at about 1/10th that rate (broad rule-of-thumb for most current commodity CPUs). So achieving a sustained FLOPS rate of 1/8th of the theoretical maximum for your processor is actually very good performance.
Since you seem to be dealing with about 370MB of data, which is probably larger than the caches on your processor, your computation is I/O bound.
As High Performance Mark explained, your test is very likely to be memory bound rather than compute-bound.
One thing I'd like to add is that to quantify this effect, you can modify the test so that it operates on data that fits into the L1 cache:
for(i=0, j=0; i<6166908; i++)
{
data[j] += other_data[j] * other_data2[j]; j++;
data[j] += other_data[j] * other_data2[j]; j++;
data[j] += other_data[j] * other_data2[j]; j++;
data[j] += other_data[j] * other_data2[j]; j++;
data[j] += other_data[j] * other_data2[j]; j++;
data[j] += other_data[j] * other_data2[j]; j++;
data[j] += other_data[j] * other_data2[j]; j++;
data[j] += other_data[j] * other_data2[j]; j++;
if ((j & 1023) == 0) j = 0;
}
The performance of this version of the code should be closer to the theoretical maximum of FLOPS. Of course, it presumably doesn't solve your original problem, but hopefully it can help understand what's going on.
I looked at the assembly code of the multiply-accumulate of the code snippet in my first post and it looks like:
movq 0x80(%rbx), %rcx
movq 0x138(%rbx), %rdi
movq 0x120(%rbx), %rdx
movq (%rcx), %rsi
movq 0x8(%rdi), %r8
movq 0x8(%rdx), %r9
movssl 0x1400(%rsi), %xmm0
mulssl 0x90(%r8), %xmm0
addssl 0x9f8(%r9), %xmm0
movssl %xmm0, 0x9f8(%r9)
I estimated from the total number of cycles that it takes ~10 cycles to execute the multiply-accumulate.
The problem seems to be that the compiler is unable to pipeline the execution of the loop, even though there are no inter-loop dependencies. Am I correct?
Does anybody have any other ideas / solutions for this?
Thanks for the help so far!

For Loop vs While Loop

In my Design and Analysis of Algorithms lecture, the instructor said the for loop will take less time than the while loop for the following sample algorithms.
1. for(int i=0;i<5;i++)
{
2. print(i);
}
1. int i=0;
2. while(i<5)
{
3. print(i);
4. i++;
}
He said that, for the for loop, the compiler will read line 1 five times and line 2 four times, for a total time of 5 + 4 = 9.
But in the case of the while loop, the compiler will read line 1 once, line 2 five times, line 3 four times and line 4 four times, for a total time of 1 + 5 + 4 + 4 = 14.
Please tell me whether this is right. Is the for loop faster than the while loop?
Thanks.
At least with MSVC 16 (VS 2010) the code is pretty much the same in both cases:
for
; Line 5
xor esi, esi
$LL3#main:
; Line 6
push esi
push OFFSET ??_C#_03PMGGPEJJ#?$CFd?6?$AA#
call _printf
inc esi
add esp, 8
cmp esi, 5
jl SHORT $LL3#main
while
; Line 4
xor esi, esi
$LL2#main:
; Line 6
push esi
push OFFSET ??_C#_03PMGGPEJJ#?$CFd?6?$AA#
call _printf
; Line 7
inc esi
add esp, 8
cmp esi, 5
jl SHORT $LL2#main
Code in my Subversion repository.
In all the modern compilers loop analysis is done on a lower level intermediate representation (i.e., when all the high level loop constructs are expanded into labels and jumps). For a compiler both loops are absolutely equivalent.
I'll pass on performance (hint: no difference, check the generated IR or assembly for proof) however there are two important differences in syntax and maintenance.
Syntax
The scope of the i variable is different. In the for case, i is only accessible within the for header and body, while in the while case it is available after the loop. As a general rule, it's better to have tighter scopes: fewer variables in flight means less context to worry about when coding.
Maintenance
The for loop has the neat advantage of grouping all the iteration operations close together, so they can be inspected, and checked, in one glance.
Also, there is one important difference when introducing continue statements:
for(int i = 0; i != 10; ++i) {
if (array[i] == nullptr) { continue; }
// act on it
}
int i = 0;
while (i != 10) {
if (array[i] == nullptr) { continue; }
// act on it
++i;
}
In the while case, the introduction of continue has created a bug: an infinite loop, as the counter is no longer incremented.
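One way to keep the while form correct is to advance the counter on the skip path as well; a small sketch of the repaired loop (same hypothetical array, with NULL so it also reads as plain C):
int i = 0;
while (i != 10) {
    if (array[i] == NULL) { ++i; continue; }  /* advance before skipping */
    // act on it
    ++i;
}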
Impact
for loops are more readable and all-around better for regular iteration patterns. Even better, in C++11 the range-for statement:
for (Item const& item : collection) {
}
where iteration is entirely taken care of by the compiler, so you are sure not to mess it up! (It makes the for_each algorithm somewhat moot, and we can hope the older for form starts retreating.)
As a corollary, while loops should be reserved for irregular iteration patterns; this way they will attract special care during code review and from future maintainers by highlighting the irregularity of the case.

Which loop has better performance? Increment or decrement? [duplicate]

Possible Duplicate:
Is it faster to count down than it is to count up?
Which loop has better performance? I have learnt from somewhere that the second is better, but I want to know the reason why.
for(int i=0;i<=10;i++)
{
/*This is better ?*/
}
for(int i=10;i>=0;i--)
{
/*This is better ?*/
}
The second "may" be better, because it's easier to compare i with 0 than to compare i with 10 but I think you can use any one of these, because compiler will optimize them.
I do not think there is much difference between the performance of both loops.
I suppose, it becomes a different situation when the loops look like this.
for(int i = 0; i < getMaximum(); i++)
{
}
for(int i = getMaximum() - 1; i >= 0; i--)
{
}
This is because the getMaximum() function is called only once in the second case but on every iteration in the first (assuming it is not an inline function).
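If the bound is expensive to compute, hoisting it out of the condition removes the repeated calls regardless of loop direction; a small sketch using the same getMaximum() from the example above:
int max = getMaximum();   /* evaluated once, before the loop */
for (int i = 0; i < max; i++)
{
}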
Decrement loops down to zero can sometimes be faster if testing against zero is optimised in hardware. But it's a micro-optimisation, and you should profile to see whether it's really worth doing. The compiler will often make the optimisation for you, and given that the decrement loop is arguably a worse expression of intent, you're often better off just sticking with the 'normal' approach.
Incrementing and decrementing (INC and DEC, when translated into assembler commands) have the same speed of 1 CPU cycle.
However, the second can theoretically be faster on some (e.g. SPARC) architectures, because no 10 has to be fetched from memory (or cache): most architectures have instructions that compare against the special value 0 in an optimized fashion (usually having a special hardwired 0-register to use as an operand, so no register has to be "wasted" to store the 10 for each iteration's comparison).
A smart compiler (especially if target instruction set is RISC) will itself detect this and (if your counter variable is not used in the loop) apply the second "decrement downto 0" form.
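As a rough sketch of what that transformation looks like at the source level (do_work() is just a placeholder, and it is assumed the body does not use i):
/* What you write */
for (int i = 0; i < n; i++)
    do_work();
/* What a smart compiler may effectively generate: count down and test against zero */
for (int i = n; i != 0; i--)
    do_work();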
Please see answers https://stackoverflow.com/a/2823164/1018783 and https://stackoverflow.com/a/2823095/1018783 for further details.
The compiler should optimize both pieces of code to the same assembly, so it doesn't make a difference. Both take the same time.
A more valid discussion would be whether
for(int i=0;i<10;++i) //preincrement
{
}
would be faster than
for(int i=0;i<10;i++) //postincrement
{
}
Because, theoretically, post-increment does an extra operation (it has to yield a copy of the old value). However, even this should be optimized to the same assembly.
Without optimizations, the code would look like this:
for ( int i = 0; i < 10 ; i++ )
0041165E mov dword ptr [i],0
00411665 jmp wmain+30h (411670h)
00411667 mov eax,dword ptr [i]
0041166A add eax,1
0041166D mov dword ptr [i],eax
00411670 cmp dword ptr [i],0Ah
00411674 jge wmain+68h (4116A8h)
for ( int i = 0; i < 10 ; ++i )
004116A8 mov dword ptr [i],0
004116AF jmp wmain+7Ah (4116BAh)
004116B1 mov eax,dword ptr [i]
004116B4 add eax,1
004116B7 mov dword ptr [i],eax
004116BA cmp dword ptr [i],0Ah
004116BE jge wmain+0B2h (4116F2h)
for ( int i = 9; i >= 0 ; i-- )
004116F2 mov dword ptr [i],9
004116F9 jmp wmain+0C4h (411704h)
004116FB mov eax,dword ptr [i]
004116FE sub eax,1
00411701 mov dword ptr [i],eax
00411704 cmp dword ptr [i],0
00411708 jl wmain+0FCh (41173Ch)
so even in this case, the speed is the same.
Again, the answer to all micro-performance questions is measure, measure in context of use and don't extrapolate to other contexts.
Counting instruction execution time hasn't been possible without extraordinary sophistication for quite a long time.
The mismatch between processor and memory speeds, and the introduction of caches to hide part of the latency (but not the bandwidth), make the execution time of a group of instructions very sensitive to the memory access pattern. That is something you can still optimize for with quite high-level thinking. But it also means that something apparently worse, if one doesn't take the memory access pattern into account, can be better once that is done.
Then superscalar execution (the fact that the processor can do several things at once) and out-of-order execution (the fact that the processor can execute an instruction before a previous one in the flow) make basic counting meaningless even if you ignore memory access. You have to know which instructions need to be executed (so ignoring part of the structure isn't wise) and how the processor can group instructions if you want to get a good a priori estimate.
