When I have a loop like:
for (int i = 0; i < SlowVariable; i++)
{
//
}
I know that in VB6 the SlowVariable is accessed every iteration of the loop, making the following much more efficient:
int cnt = SlowVariable;
for (int i = 0; i < cnt; i++)
{
//
}
Do I need to make the same optimization in GCC? Or does it evaluate SlowVariable only once?
This is called "hoisting" SlowVariable out of the loop.
The compiler can do it only if it can prove that the value of SlowVariable is the same every time, and that evaluating SlowVariable has no side effects.
So for example consider the following code (I assume for the sake of example that accessing through a pointer is "slow" for some reason):
void foo1(int *SlowVariabele, int *result) {
for (int i = 0; i < *SlowVariabele; ++i) {
--*result;
}
}
The compiler cannot (in general) hoist, because for all it knows it will be called with result == SlowVariabele, and so the value of *SlowVariabele is changing during the loop.
On the other hand:
void foo2(int *result) {
int val = 12;
int *SlowVariabele = &val;
for (int i = 0; i < *SlowVariabele; ++i) {
--*result;
}
}
Now at least in principle, the compiler can know that val never changes in the loop, and so it can hoist. Whether it actually does so is a matter of how aggressive the optimizer is and how good its analysis of the function is, but I'd expect any serious compiler to be capable of it.
Similarly, if foo1 was called with pointers that the compiler can determine (at the call site) are non-equal, and if the call is inlined, then the compiler could hoist. That's what restrict is for:
void foo3(int *restrict SlowVariabele, int *restrict result) {
for (int i = 0; i < *SlowVariabele; ++i) {
--*result;
}
}
restrict (introduced in C99) means "you must not call this function with result == SlowVariabele", and allows the compiler to hoist.
Similarly:
void foo4(int *SlowVariabele, float *result) {
for (int i = 0; i < *SlowVariabele; ++i) {
--*result;
}
}
The strict aliasing rules mean that SlowVariable and result must not refer to the same location (or the program has undefined behaviour anyway), and so again the compiler can hoist.
Generally, variables can't be slow (or fast) unless they are mapped to some weird kind of memory (you usually want to declare them volatile in this case).
But indeed, using a local variable creates more opportunities for optimization, and the effect may be very visible. The compiler can "cache" a global variable by itself only if it's able to prove that no function called within a loop can read or write that global variable. When you call an external function within a loop, the compiler probably won't be able to prove such a thing.
This depends on how the compiler optimizes. For example, take this program:
#include <stdio.h>
int main(int argc, char **argv)
{
unsigned int i;
unsigned int z = 10;
for( i = 0 ; i < z ; i++ )
printf("%u\n", i);
return 0;
}
If you compile it with gcc example.c -o example, the resulting code is:
0x0040138c <+0>: push ebp
0x0040138d <+1>: mov ebp,esp
0x0040138f <+3>: and esp,0xfffffff0
0x00401392 <+6>: sub esp,0x20
0x00401395 <+9>: call 0x4018f4 <__main>
0x0040139a <+14>: mov DWORD PTR [esp+0x18],0xa
0x004013a2 <+22>: mov DWORD PTR [esp+0x1c],0x0
0x004013aa <+30>: jmp 0x4013c4 <main+56>
0x004013ac <+32>: mov eax,DWORD PTR [esp+0x1c]
0x004013b0 <+36>: mov DWORD PTR [esp+0x4],eax
0x004013b4 <+40>: mov DWORD PTR [esp],0x403064
0x004013bb <+47>: call 0x401b2c <printf>
0x004013c0 <+52>: inc DWORD PTR [esp+0x1c]
0x004013c4 <+56>: mov eax,DWORD PTR [esp+0x1c] ; (1)
0x004013c8 <+60>: cmp eax,DWORD PTR [esp+0x18] ; (2)
0x004013cc <+64>: jb 0x4013ac <main+32>
0x004013ce <+66>: mov eax,0x0
0x004013d3 <+71>: leave
0x004013d4 <+72>: ret
0x004013d5 <+73>: nop
0x004013d6 <+74>: nop
0x004013d7 <+75>: nop
(1) The value of i is moved from the stack into eax.
(2) Then the CPU compares eax (that is, i) with the value of z, which is also on the stack.
All of this happens on every iteration.
If you optimize the code with gcc -O2 example.c -o example, the result is:
0x00401b70 <+0>: push ebp
0x00401b71 <+1>: mov ebp,esp
0x00401b73 <+3>: push ebx
0x00401b74 <+4>: and esp,0xfffffff0
0x00401b77 <+7>: sub esp,0x10
0x00401b7a <+10>: call 0x4018a8 <__main>
0x00401b7f <+15>: xor ebx,ebx
0x00401b81 <+17>: lea esi,[esi+0x0]
0x00401b84 <+20>: mov DWORD PTR [esp+0x4],ebx
0x00401b88 <+24>: mov DWORD PTR [esp],0x403064
0x00401b8f <+31>: call 0x401ae0 <printf>
0x00401b94 <+36>: inc ebx
0x00401b95 <+37>: cmp ebx,0xa ; (1)
0x00401b98 <+40>: jne 0x401b84 <main+20>
0x00401b9a <+42>: xor eax,eax
0x00401b9c <+44>: mov ebx,DWORD PTR [ebp-0x4]
0x00401b9f <+47>: leave
0x00401ba0 <+48>: ret
0x00401ba1 <+49>: nop
0x00401ba2 <+50>: nop
0x00401ba3 <+51>: nop
(1) The compiler knows that there is no point in checking the value of z, so it turns the code into something like for( i = 0 ; i < 10 ; i++ ).
If the compiler doesn't know the value of z, as in this code:
#include <stdio.h>
void loop(unsigned int z) {
unsigned int i;
for( i = 0 ; i < z ; i++ )
printf("%u\n", i);
}
int main(int argc, char **argv)
{
unsigned int z = 10;
loop(z);
return 0;
}
The result will be:
0x0040138c <+0>: push esi
0x0040138d <+1>: push ebx
0x0040138e <+2>: sub esp,0x14
0x00401391 <+5>: mov esi,DWORD PTR [esp+0x20] ; (1)
0x00401395 <+9>: test esi,esi
0x00401397 <+11>: je 0x4013b1 <loop+37>
0x00401399 <+13>: xor ebx,ebx ; (2)
0x0040139b <+15>: nop
0x0040139c <+16>: mov DWORD PTR [esp+0x4],ebx
0x004013a0 <+20>: mov DWORD PTR [esp],0x403064
0x004013a7 <+27>: call 0x401b0c <printf>
0x004013ac <+32>: inc ebx
0x004013ad <+33>: cmp ebx,esi
0x004013af <+35>: jne 0x40139c <loop+16>
0x004013b1 <+37>: add esp,0x14
0x004013b4 <+40>: pop ebx
0x004013b5 <+41>: pop esi
0x004013b6 <+42>: ret
0x004013b7 <+43>: nop
(1) z ends up in the otherwise unused register esi; registers are the fastest storage there is.
(2) There is no local variable i on the stack; the compiler uses ebx, also a register, to hold the value of i.
In the end it depends on the compiler and the optimization options you use, but in all cases C is still faster, much faster, than VB.
It depends on your compiler, but I believe most of the contemporary compilers will optimize that for you if the value of SlowVariable is constant.
"It" (the language) doesn't say. It must behave as if the variable is evaluated every time, of course.
An optimizing compiler can do a lot of clever things, so it's always best to leave these sorts of micro-optimizations to the compiler.
If you're going down the optimization by hand route, be sure to profile (=measure) and read the generated code.
Actually it depends on "SlowVariable" and on the behavior of your compiler. If your slow variable is, e.g., volatile, the compiler won't make any effort to cache it, because the volatile keyword doesn't permit it. If it's not volatile, there is a good chance that the compiler optimizes consecutive accesses to this variable by loading it into a register once.
I have this C code and its disassembly (AT&T syntax), and I'm confused about two things. First, my understanding is that EBP-4 should hold the first local variable (here, int i) on the stack. But i is clearly at EBP-8 here. Why is this?
Second, is it necessary to move values into registers before performing arithmetic operations on them? (This is a 32-bit x86 machine.) Example:
0x08048402 <+21>: mov 0x8(%ebp),%eax //move parameter a into eax
0x08048405 <+24>: add %eax,-0x4(%ebp) //r += a
Why can't this be:
0x08048405 <+24>: add 0x8(%ebp),-0x4(%ebp) //r += a
C code:
int loop_w (int a, int b){
int i = 0;
int r = a;
while ( i < 256){
r += a;
a -= b;
i += b;
}
return r;
}
Disassembly:
Dump of assembler code for function loop_w:
0x080483ed <+0>: push %ebp
0x080483ee <+1>: mov %esp,%ebp
0x080483f0 <+3>: sub $0x10,%esp
---------------Above is for stack setup-----------------
0x080483f3 <+6>: movl $0x0,-0x8(%ebp) //I=0
0x080483fa <+13>: mov 0x8(%ebp),%eax //move parameter a into eax
0x080483fd <+16>: mov %eax,-0x4(%ebp) //move a into local var (r=a)
0x08048400 <+19>: jmp 0x8048414 <loop_w+39> //start while loop
0x08048402 <+21>: mov 0x8(%ebp),%eax //move parameter a into eax
0x08048405 <+24>: add %eax,-0x4(%ebp) //r += a
0x08048408 <+27>: mov 0xc(%ebp),%eax //move parameter b into eax
0x0804840b <+30>: sub %eax,0x8(%ebp) //a -= parameter b
0x0804840e <+33>: mov 0xc(%ebp),%eax //move parameter b into eax
0x08048411 <+36>: add %eax,-0x8(%ebp) //i+=b
0x08048414 <+39>: cmpl $0xff,-0x8(%ebp) //compare i to 255
0x0804841b <+46>: jle 0x8048402 <loop_w+21> //continue loop while i <= 255 (i.e. i < 256)
=> 0x0804841d <+48>: mov -0x4(%ebp),%eax //move r into eax
0x08048420 <+51>: leave
0x08048421 <+52>: ret //return eax
End of assembler dump.
The compiler can decide wherever it wants to place the local variable. It might have placed it at %ebp - 8 for alignment reasons.
For your second question: whether variables have to be loaded into registers before an operation depends on the operation and on the instruction set the architecture provides.
You mentioned x86. On this architecture, instructions generally cannot take two memory operands (yes, there are a few exceptions). That is why add 0x8(%ebp),-0x4(%ebp) cannot be encoded and the value goes through %eax first.
You can look up, instruction by instruction, which operand combinations are allowed.
I am reading a chapter about assembly code, which has an example. Here is the C program:
int main()
{
int i;
for(i=0; i < 10; i++)
{
puts("Hello, world!\n");
}
return 0;
}
Here is the assembly code provided in the book:
0x08048384 <main+0>: push ebp
0x08048385 <main+1>: mov ebp,esp
0x08048387 <main+3>: sub esp,0x8
0x0804838a <main+6>: and esp,0xfffffff0
0x0804838d <main+9>: mov eax,0x0
0x08048392 <main+14>: sub esp,eax
0x08048394 <main+16>: mov DWORD PTR [ebp-4],0x0
0x0804839b <main+23>: cmp DWORD PTR [ebp-4],0x9
0x0804839f <main+27>: jle 0x80483a3 <main+31>
0x080483a1 <main+29>: jmp 0x80483b6 <main+50>
0x080483a3 <main+31>: mov DWORD PTR [esp],0x80484d4
0x080483aa <main+38>: call 0x80482a8 <_init+56>
0x080483af <main+43>: lea eax,[ebp-4]
0x080483b2 <main+46>: inc DWORD PTR [eax]
0x080483b4 <main+48>: jmp 0x804839b <main+23>
Here is part of my version:
0x0000000000400538 <+8>: mov DWORD PTR [rbp-0x4],0x0
=> 0x000000000040053f <+15>: jmp 0x40054f <main+31>
0x0000000000400541 <+17>: mov edi,0x4005f0
0x0000000000400546 <+22>: call 0x400410 <puts@plt>
0x000000000040054b <+27>: add DWORD PTR [rbp-0x4],0x1
0x000000000040054f <+31>: cmp DWORD PTR [rbp-0x4],0x9
0x0000000000400553 <+35>: jle 0x400541 <main+17>
My question is: why does the book's version assign 0 to the variable (mov DWORD PTR [ebp-4],0x0) and compare right after that with cmp, while my version assigns and then does jmp 0x40054f <main+31>, where the cmp is?
It seems more logical to assign and compare without any jump, because it is like that inside for loop.
Why did your compiler do something different from the compiler used in the book? Because it's a different compiler. No two compilers will compile all code the same way; even very trivial code can be compiled vastly differently by two different compilers, or even by two versions of the same compiler. And it's quite obvious both were compiled without any optimization; with optimization the results would be even more different.
Let's reason about what the for loop does.
for (i = 0; i < 10; i++) {
code;
}
Let's rewrite it a little closer to the assembler generated by the first compiler (the book's).
i = 0;
start: if (i > 9) goto out;
code;
i++;
goto start;
out:
Now the same thing for "my version":
i = 0;
goto cmp;
start: code;
i++;
cmp: if (i < 10) goto start;
The clear difference here is that in "my version" there will only be one jump executed within the loop while the book version has two. It's a quite common way to generate loops in more modern compilers because of how sensitive CPUs are to branches. Many compilers will generate code like this even without any optimizations because it performs better in most cases. Older compilers didn't do this because either they didn't think about it or this trick was performed in an optimization stage which wasn't enabled when compiling the code in the book.
Notice that a compiler with any kind of optimization enabled wouldn't even do that first goto cmp because it would know that it was unnecessary. Try compiling your code with optimization enabled (you say you use gcc, give it the -O2 flag) and see how vastly different it will look after that.
You didn't quote the full assembly-language body of the function from your textbook, but my psychic powers tell me that it looked something like this (also, I've replaced literal addresses with labels, for clarity):
# ... establish stack frame ...
mov DWORD PTR [rbp-4],0x0
cmp DWORD PTR [rbp-4],0x9
jg .L0
.L1:
mov rdi, .Lconst0
call puts
add DWORD PTR [rbp-0x4],0x1
cmp DWORD PTR [rbp-0x4],0x9
jle .L1
.L0:
# ... return from function ...
GCC has noticed that it can eliminate the initial cmp and jg by replacing them with an unconditional jmp down to the cmp at the bottom of the loop, so that is what it did. This is a standard optimization called loop inversion. Apparently it does this even with the optimizer off; with optimization on, it would also have noticed that the initial comparison must be false, hoisted out the address load, placed the loop index in a register, and converted to a count-down loop so it could eliminate the cmp altogether; something like this:
# ... establish stack frame ...
mov ebx, 10
mov r14, .Lconst0
.L1:
mov rdi, r14
call puts
dec ebx
jne .L1
# ... return from function ...
(The above was actually generated by Clang. My version of GCC did something else, equally sensible but harder to explain.)
When I run the following C code, I get different output depending on whether or not the code was run with optimization turned on (gcc -O) or not.
#include <stdio.h>
int main()
{
int b = 55;
int a[2] = {4, 5};
int index;
printf(" index a[index]\n ");
printf("==================\n ");
for(index = 0; index < 6; index++)
{
printf("%2d%12d\n", index, a[index]);
}
return 0;
}
I understand that accessing an index out-of-bounds in C will simply access the stack memory further down from the array (assuming there is enough stack space allocated for that index, otherwise it segfaults) because arrays are just pointers in C. But how does the optimization affect this?
Accessing out-of-bounds is undefined behavior. So the compiler is allowed to do anything it wants and anything is allowed to happen. So there isn't much of a point in trying to "guess" what will happen.
In your case, optimization is probably affecting the ordering and contents of the stack beyond the array. This would give you the varying results.
You're causing undefined behaviour, plain and simple. You can't really say "why does this undefined behaviour cause this result under this circumstance, but a different result under another?"
Use objdump or gdb if you want to see what instructions are causing the stack to be different under optimization.
EDIT: For example, when compiling with the -O flag, there's quite a few differences to the stack at the beginning of main alone before the first printf (compiled as 32-bit for clarity):
Unoptimized:
0x080483c4 <+0>: push ebp
0x080483c5 <+1>: mov ebp,esp
0x080483c7 <+3>: and esp,0xfffffff0
0x080483ca <+6>: sub esp,0x20
0x080483cd <+9>: mov DWORD PTR [esp+0x18],0x37
0x080483d5 <+17>: mov DWORD PTR [esp+0x10],0x4
0x080483dd <+25>: mov DWORD PTR [esp+0x14],0x5
0x080483e5 <+33>: mov eax,0x8048514
0x080483ea <+38>: mov DWORD PTR [esp],eax
0x080483ed <+41>: call 0x80482e0 <printf@plt>
Optimized:
0x080483c4 <+0>: push ebp
0x080483c5 <+1>: mov ebp,esp
0x080483c7 <+3>: push ebx
0x080483c8 <+4>: and esp,0xfffffff0
0x080483cb <+7>: sub esp,0x20
0x080483ce <+10>: mov DWORD PTR [esp+0x18],0x4
0x080483d6 <+18>: mov DWORD PTR [esp+0x1c],0x5
0x080483de <+26>: mov DWORD PTR [esp],0x8048504
0x080483e5 <+33>: call 0x80482e0 <printf@plt>
Why is the array index faster than the pointer?
Isn't a pointer supposed to be faster than an array index?
** I used clock_t from time.h to test two functions, each looping 2 million times.
Pointer time : 0.018995
Index time : 0.017864
void myPointer(int a[], int size)
{
int *p;
for(p = a; p < &a[size]; p++)
{
*p = 0;
}
}
void myIndex(int a[], int size)
{
int i;
for(i = 0; i < size; i++)
{
a[i] = 0;
}
}
No, pointers are never inherently supposed to be faster than array indexing. If one version is faster than the other, it's mostly because some address computations differ. The question should also state the compiler and optimization flags, as they can heavily affect performance.
Array indexing in your context (array bound not known) is exactly identical to the pointer operation. From the compiler's viewpoint, it is just a different expression of pointer arithmetic. Here is an example of optimized x86 code from Visual Studio 2010 with full optimization and no inlining.
3: void myPointer(int a[], int size)
4: {
013E1800 push edi
013E1801 mov edi,ecx
5: int *p;
6: for(p = a; p < &a[size]; p++)
013E1803 lea ecx,[edi+eax*4]
013E1806 cmp edi,ecx
013E1808 jae myPointer+15h (13E1815h)
013E180A sub ecx,edi
013E180C dec ecx
013E180D shr ecx,2
013E1810 inc ecx
013E1811 xor eax,eax
013E1813 rep stos dword ptr es:[edi]
013E1815 pop edi
7: {
8: *p = 0;
9: }
10: }
013E1816 ret
13: void myIndex(int a[], int size)
14: {
15: int i;
16: for(i = 0; i < size; i++)
013E17F0 test ecx,ecx
013E17F2 jle myIndex+0Ch (13E17FCh)
013E17F4 push edi
013E17F5 xor eax,eax
013E17F7 mov edi,edx
013E17F9 rep stos dword ptr es:[edi]
013E17FB pop edi
17: {
18: a[i] = 0;
19: }
20: }
013E17FC ret
At a glance, myIndex looks faster because it has fewer instructions; however, the two pieces of code are essentially the same. Both eventually use rep stos, x86's repeated store-string instruction, which performs the whole loop. The only difference lies in computing the loop bound. The for loop in myIndex uses size directly as its trip count (no computation needed), whereas myPointer needs a few instructions to derive the trip count from the two pointers. The loop operations that matter are the same, so the difference is negligible.
To summarize, the performance of myPointer and myIndex in an optimized code should be identical.
FYI, if the array's bound is known at compile time, e.g., int A[constant_expression], then accesses to this array may be much faster than pointer-based ones. This is mostly because the array accesses are free from the pointer analysis problem. Compilers can compute precise dependency information for computations and accesses on a fixed-size array, so they can apply advanced optimizations, including automatic parallelization.
However, if the computations are pointer based, compilers must perform pointer analysis before optimizing further, and that analysis is quite limited in C/C++. It generally yields conservative results and leaves few optimization opportunities.
Array dereference p[i] is *(p + i). Compilers make use of addressing modes that do the address math and the dereference in a single instruction (e.g. x86 scaled-index operands such as [base + index*4]; LEA performs the same address arithmetic without accessing memory) to optimize for speed.
With the pointer loop, the access and the offset update are split into two separate parts, which the compiler may not combine as well.
It may be the comparison in the for loop that is causing the difference. The termination condition is tested on each iteration, and your "pointer" example has a slightly more complicated termination condition (taking the address of &a[size]). Since &a[size] does not change, you could try setting it to a variable to avoid recalculating it on each iteration of the loop.
I would suggest running each loop 200 million times, and then run each loop 10 times, and take the fastest measurement. That will factor out effects from OS scheduling and so on.
I would then suggest you disassemble the code for each loop.
Oops, on my 64-bit system the results are quite different. I found that this
int i;
for(i = 0; i < size; i++)
{
*(a+i) = 0;
}
is about 100 times (!) slower than this
int i;
int * p = a;
for(i = 0; i < size; i++)
{
*(p++) = 0;
}
when compiling with -O3. This suggests to me that moving to the next address is somehow far cheaper on a 64-bit CPU than calculating the destination address from an offset, but I'm not sure.
EDIT: This really does seem related to the 64-bit architecture, because the same code with the same compile flags doesn't show any real difference in performance on a 32-bit system.
Compiler optimizations are pattern matching.
When your compiler optimizes, it looks for known code patterns, and then transforms the code according to some rule. Your two code snippets seem to trigger different transforms, and thus produce slightly different code.
This is one of the reasons why we always insist on actually measuring the resulting performance when it comes to optimizations: you can never be sure what your compiler turns your code into unless you test it.
If you are really curious, try compiling your code with gcc -S -Os, this produces the most readable, yet optimized assembler code. On your two functions, I get the following assembler with that:
pointer code:
.L2:
cmpq %rax, %rdi
jnb .L5
movl $0, (%rdi)
addq $4, %rdi
jmp .L2
.L5:
index code:
.L7:
cmpl %eax, %esi
jle .L9
movl $0, (%rdi,%rax,4)
incq %rax
jmp .L7
.L9:
The differences are slight, but may indeed trigger a performance difference, most importantly the difference between using addq and incq could be significant.
The times are so close together that if you did them repeatedly, you may not see much of a difference. Both code segments compile to the exact same assembly. By definition, there is no difference.
It looks like the index solution can save a few instructions with the compare in the for loop.
Accessing data through an array index or through a pointer is exactly equivalent. Go through the program below with me.
Each function has a loop that runs 100 times, but the disassembly shows that the pointer version (fun2) needs fewer instructions than the array-index version (fun1). Note that fun2 as written stores through *p without adding i, so its store skips the index arithmetic and always hits the first element; that accounts for the shorter listing.
But that doesn't mean access through a pointer is inherently faster; it depends on the instructions the compiler emits. Both the array index and the pointer start from the array's address and reach an element by adding an offset to it.
#include <stdio.h>
void fun1(int a[],int n);
void fun2(int *p,int n);
int main(void)
{
int a[100];
fun1(a,100);
fun2(&a[0],5);
}
void fun1(int a[],int n)
{
int i;
for(i=0;i<=99;i++)
{
a[i]=0;
printf("%d\n",a[i]);
}
}
void fun2(int *p,int n)
{
int i;
for(i=0;i<=99;i++)
{
*p=0;
printf("%d\n",*(p+i));
}
}
(gdb) disass fun1
Dump of assembler code for function fun1:
0x0804841a <+0>: push %ebp
0x0804841b <+1>: mov %esp,%ebp
0x0804841d <+3>: sub $0x28,%esp
0x08048420 <+6>: movl $0x0,-0xc(%ebp)
0x08048427 <+13>: jmp 0x8048458 <fun1+62>
0x08048429 <+15>: mov -0xc(%ebp),%eax
0x0804842c <+18>: shl $0x2,%eax
0x0804842f <+21>: add 0x8(%ebp),%eax
0x08048432 <+24>: movl $0x0,(%eax)
0x08048438 <+30>: mov -0xc(%ebp),%eax
0x0804843b <+33>: shl $0x2,%eax
0x0804843e <+36>: add 0x8(%ebp),%eax
0x08048441 <+39>: mov (%eax),%edx
0x08048443 <+41>: mov $0x8048570,%eax
0x08048448 <+46>: mov %edx,0x4(%esp)
0x0804844c <+50>: mov %eax,(%esp)
0x0804844f <+53>: call 0x8048300 <printf@plt>
0x08048454 <+58>: addl $0x1,-0xc(%ebp)
0x08048458 <+62>: cmpl $0x63,-0xc(%ebp)
0x0804845c <+66>: jle 0x8048429 <fun1+15>
0x0804845e <+68>: leave
0x0804845f <+69>: ret
End of assembler dump.
(gdb) disass fun2
Dump of assembler code for function fun2:
0x08048460 <+0>: push %ebp
0x08048461 <+1>: mov %esp,%ebp
0x08048463 <+3>: sub $0x28,%esp
0x08048466 <+6>: movl $0x0,-0xc(%ebp)
0x0804846d <+13>: jmp 0x8048498 <fun2+56>
0x0804846f <+15>: mov 0x8(%ebp),%eax
0x08048472 <+18>: movl $0x0,(%eax)
0x08048478 <+24>: mov -0xc(%ebp),%eax
0x0804847b <+27>: shl $0x2,%eax
0x0804847e <+30>: add 0x8(%ebp),%eax
0x08048481 <+33>: mov (%eax),%edx
0x08048483 <+35>: mov $0x8048570,%eax
0x08048488 <+40>: mov %edx,0x4(%esp)
0x0804848c <+44>: mov %eax,(%esp)
0x0804848f <+47>: call 0x8048300 <printf@plt>
0x08048494 <+52>: addl $0x1,-0xc(%ebp)
0x08048498 <+56>: cmpl $0x63,-0xc(%ebp)
0x0804849c <+60>: jle 0x804846f <fun2+15>
0x0804849e <+62>: leave
0x0804849f <+63>: ret
End of assembler dump.
(gdb)
This is a very hard thing to time, because compilers are very good at optimising these things. Still it's better to give the compiler as much information as possible, that's why in this case I'd advise using std::fill, and let the compiler choose.
But... If you want to get into the detail
a) CPUs normally give pointer+value for free, like: mov r1, r2(r3).
b) This means an index operation requires just: mul r3,r1,size
This is just one extra cycle per iteration.
c) CPUs often provide stall/delay slots, meaning you can often hide single-cycle operations.
All in all, even if your loops are very large, the cost of the access is nothing compared to the cost of even a few cache misses. You are best advised to optimise your data structures before you care about loop costs. Try, for example, packing your structures to reduce the memory footprint first.
Say I have a loop that looks like this:
for(int i = 0; i < 10000; i++) {
/* Do something computationally expensive */
if (i < 200 && !(i%20)) {
/* Do something else */
}
}
wherein some trivial task gets stuck behind an if-statement that only runs a handful of times.
I've always heard that "if-statements in loops are slow!" So, in the hopes of (marginally) increased performance, I split the loops apart into:
for(int i = 0; i < 200; i++) {
/* Do something computationally expensive */
if (!(i%20)) {
/* Do something else */
}
}
for(int i = 200; i < 10000; i++) {
/* Do something computationally expensive */
}
Will gcc (with the appropriate flags, like -O3) automatically break the one loop into two, or does it only unroll to decrease the number of iterations?
Why not just disassemble the program and see for yourself? But here we go. This is the test program:
#include <stdio.h>
int main() {
int sum = 0;
int i;
for(i = 0; i < 10000; i++) {
if (i < 200 && !(i%20)) {
sum += 0xC0DE;
}
sum += 0xCAFE;
}
printf("%d\n", sum);
return 0;
}
and this is the interesting part of the disassembled code compiled with gcc 4.3.3 and -O3:
0x08048404 <main+20>: xor ebx,ebx
0x08048406 <main+22>: push ecx
0x08048407 <main+23>: xor ecx,ecx
0x08048409 <main+25>: sub esp,0xc
0x0804840c <main+28>: lea esi,[esi+eiz*1+0x0]
0x08048410 <main+32>: cmp ecx,0xc7
0x08048416 <main+38>: jg 0x8048436 <main+70>
0x08048418 <main+40>: mov eax,ecx
0x0804841a <main+42>: imul esi
0x0804841c <main+44>: mov eax,ecx
0x0804841e <main+46>: sar eax,0x1f
0x08048421 <main+49>: sar edx,0x3
0x08048424 <main+52>: sub edx,eax
0x08048426 <main+54>: lea edx,[edx+edx*4]
0x08048429 <main+57>: shl edx,0x2
0x0804842c <main+60>: cmp ecx,edx
0x0804842e <main+62>: jne 0x8048436 <main+70>
0x08048430 <main+64>: add ebx,0xc0de
0x08048436 <main+70>: add ecx,0x1
0x08048439 <main+73>: add ebx,0xcafe
0x0804843f <main+79>: cmp ecx,0x2710
0x08048445 <main+85>: jne 0x8048410 <main+32>
0x08048447 <main+87>: mov DWORD PTR [esp+0x8],ebx
0x0804844b <main+91>: mov DWORD PTR [esp+0x4],0x8048530
0x08048453 <main+99>: mov DWORD PTR [esp],0x1
0x0804845a <main+106>: call 0x8048308 <__printf_chk#plt>
So as we see, for this particular example, no, it does not. There is only one loop, starting at main+32 and ending at main+85. If you have trouble reading the assembly code: ecx = i and ebx = sum.
Still, your mileage may vary. Who knows which heuristics apply in this particular case, so you'll have to compile the code you have in mind and see how longer or more complicated computations influence the optimizer.
On any modern CPU, though, the branch predictor will do pretty well on such simple code, so you won't see much of a performance loss either way. What does a handful of mispredictions cost if your computation-intensive code needs billions of cycles?