Does gcc automatically "unroll" if-statements? - c

Say I have a loop that looks like this:
for (int i = 0; i < 10000; i++) {
    /* Do something computationally expensive */
    if (i < 200 && !(i % 20)) {
        /* Do something else */
    }
}
wherein some trivial task gets stuck behind an if-statement that only runs a handful of times.
I've always heard that "if-statements in loops are slow!" So, in the hopes of (marginally) increased performance, I split the loops apart into:
for (int i = 0; i < 200; i++) {
    /* Do something computationally expensive */
    if (!(i % 20)) {
        /* Do something else */
    }
}
for (int i = 200; i < 10000; i++) {
    /* Do something computationally expensive */
}
Will gcc (with the appropriate flags, like -O3) automatically break the one loop into two, or does it only unroll to decrease the number of iterations?

Why not just disassemble the program and see for yourself? But here we go. This is the test program:
#include <stdio.h>

int main() {
    int sum = 0;
    int i;
    for (i = 0; i < 10000; i++) {
        if (i < 200 && !(i % 20)) {
            sum += 0xC0DE;
        }
        sum += 0xCAFE;
    }
    printf("%d\n", sum);
    return 0;
}
and this is the interesting part of the disassembled code, compiled with gcc 4.3.3 and -O3:
0x08048404 <main+20>: xor ebx,ebx
0x08048406 <main+22>: push ecx
0x08048407 <main+23>: xor ecx,ecx
0x08048409 <main+25>: sub esp,0xc
0x0804840c <main+28>: lea esi,[esi+eiz*1+0x0]
0x08048410 <main+32>: cmp ecx,0xc7
0x08048416 <main+38>: jg 0x8048436 <main+70>
0x08048418 <main+40>: mov eax,ecx
0x0804841a <main+42>: imul esi
0x0804841c <main+44>: mov eax,ecx
0x0804841e <main+46>: sar eax,0x1f
0x08048421 <main+49>: sar edx,0x3
0x08048424 <main+52>: sub edx,eax
0x08048426 <main+54>: lea edx,[edx+edx*4]
0x08048429 <main+57>: shl edx,0x2
0x0804842c <main+60>: cmp ecx,edx
0x0804842e <main+62>: jne 0x8048436 <main+70>
0x08048430 <main+64>: add ebx,0xc0de
0x08048436 <main+70>: add ecx,0x1
0x08048439 <main+73>: add ebx,0xcafe
0x0804843f <main+79>: cmp ecx,0x2710
0x08048445 <main+85>: jne 0x8048410 <main+32>
0x08048447 <main+87>: mov DWORD PTR [esp+0x8],ebx
0x0804844b <main+91>: mov DWORD PTR [esp+0x4],0x8048530
0x08048453 <main+99>: mov DWORD PTR [esp],0x1
0x0804845a <main+106>: call 0x8048308 <__printf_chk@plt>
So as we see, for this particular example, no, it does not: there is only one loop, starting at main+32 and ending at main+85. If you have trouble reading the assembly: ecx = i and ebx = sum.
But your mileage may still vary; who knows what heuristics are used for this particular case, so you'll have to compile the code you actually have in mind and see how longer or more complicated computations influence the optimizer.
On any modern CPU, though, the branch predictor will do quite well on such easy code, so you won't see much performance loss either way. What is the cost of maybe a handful of mispredictions if your computation-intensive code needs billions of cycles?
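If you would rather hint the branch than split the loop by hand, GCC's __builtin_expect can mark the condition as rarely true. A minimal sketch (the hint only guides block layout and branch prediction, it does not split the loop; work() and rare() are hypothetical stand-ins for the two tasks):
void work(void); /* hypothetical: the computationally expensive part */
void rare(void); /* hypothetical: the rarely executed part */

void loop(void)
{
    for (int i = 0; i < 10000; i++) {
        work();
        if (__builtin_expect(i < 200 && !(i % 20), 0)) { /* hint: rarely taken */
            rare();
        }
    }
}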

Related

F# Breakable Array Iteration With Bounds Checking Elided?

I am interested in trying F# in a high-performance application. I do not want to have a large array's bounds checked during iteration and the lack of break/return statements is concerning.
This is a contrived example that will break upon finding a value, but can someone tell me if bounds checking is also elided?
let innerExists (item: Char) (items: Char array): bool =
    let mutable state = false
    let mutable i = 0
    while not state && i < items.Length do
        state <- item = items.[i]
        i <- i + 1
    state

let exists (input: Char array) (illegalChars: Char array): bool =
    let mutable state = false
    let mutable i = 0
    while not state && i < input.Length do
        state <- innerExists input.[i] illegalChars
        i <- i + 1
    state

exists [|'A'..'z'|] [|'.';',';';'|]
Here is the relevant disassembly:
while not state && i < input.Length do
000007FE6EB4237A cmp dword ptr [rbp-14h],0
000007FE6EB4237E jne 000007FE6EB42383
000007FE6EB42380 nop
000007FE6EB42381 jmp 000007FE6EB42386
000007FE6EB42383 nop
000007FE6EB42384 jmp 000007FE6EB423A9
000007FE6EB42386 nop
000007FE6EB42387 mov r8d,dword ptr [rbp-18h]
000007FE6EB4238B mov rdx,qword ptr [rbp+18h]
000007FE6EB4238F cmp r8d,dword ptr [rdx+8]
000007FE6EB42393 setl r8b
000007FE6EB42397 movzx r8d,r8b
000007FE6EB4239B mov dword ptr [rbp-24h],r8d
000007FE6EB4239F mov r8d,dword ptr [rbp-24h]
000007FE6EB423A3 mov dword ptr [rbp-1Ch],r8d
000007FE6EB423A7 jmp 000007FE6EB423B1
000007FE6EB423A9 nop
000007FE6EB423AA xor r8d,r8d
000007FE6EB423AD mov dword ptr [rbp-1Ch],r8d
000007FE6EB423B1 cmp dword ptr [rbp-1Ch],0
000007FE6EB423B5 je 000007FE6EB42409
state <- innerExists input.[i] illegalChars
000007FE6EB423B7 mov r8d,dword ptr [rbp-18h]
000007FE6EB423BB mov rdx,qword ptr [rbp+18h]
000007FE6EB423BF cmp r8,qword ptr [rdx+8]
000007FE6EB423C3 jb 000007FE6EB423CA
000007FE6EB423C5 call 000007FECD796850
000007FE6EB423CA lea rdx,[rdx+r8*2+10h]
000007FE6EB423CF movzx r8d,word ptr [rdx]
000007FE6EB423D3 mov rdx,qword ptr [rbp+10h]
000007FE6EB423D7 mov rdx,qword ptr [rdx+8]
000007FE6EB423DB mov r9,qword ptr [rbp+20h]
000007FE6EB423DF mov rcx,7FE6EEE0640h
000007FE6EB423E9 call 000007FE6EB41E40
000007FE6EB423EE mov dword ptr [rbp-20h],eax
000007FE6EB423F1 mov eax,dword ptr [rbp-20h]
000007FE6EB423F4 movzx eax,al
000007FE6EB423F7 mov dword ptr [rbp-14h],eax
i <- i + 1
000007FE6EB423FA mov eax,dword ptr [rbp-18h]
Others pointed out that existing functions in FSharp.Core achieve the same result, but I think the OP is asking whether, in loops like these, the array bounds check is elided (since the bound is already checked in the loop condition).
For simple code like the above, the jitter should be able to elide the checks. To see this it is correct to check the assembly code, but it is important not to run with the VS debugger attached, as the jitter doesn't optimize the code then; the reason is that optimized code can make it impossible to show correct values in the debugger.
First let's look at exists optimized x64:
; not state?
00007ff9`1cd37551 85c0 test eax,eax
; if state is true then exit the loop
00007ff9`1cd37553 7521 jne 00007ff9`1cd37576
; i < input.Length?
00007ff9`1cd37555 395e08 cmp dword ptr [rsi+8],ebx
; Seems overly complex but perhaps this is as good as it gets?
00007ff9`1cd37558 0f9fc1 setg cl
00007ff9`1cd3755b 0fb6c9 movzx ecx,cl
00007ff9`1cd3755e 85c9 test ecx,ecx
; if we have reached end of the array then exit
00007ff9`1cd37560 7414 je 00007ff9`1cd37576
; mov i in ebx to rcx, unnecessary but moves like these are very cheap
00007ff9`1cd37562 4863cb movsxd rcx,ebx
; input.[i] (note we don't check the boundary again)
00007ff9`1cd37565 0fb74c4e10 movzx ecx,word ptr [rsi+rcx*2+10h]
; move illegalChars pointer to rdx
00007ff9`1cd3756a 488bd7 mov rdx,rdi
; call innerExists
00007ff9`1cd3756d e8ee9affff call 00007ff9`1cd31060
; i <- i + 1
00007ff9`1cd37572 ffc3 inc ebx
; Jump top of loop
00007ff9`1cd37574 ebdb jmp 00007ff9`1cd37551
; We are done!
00007ff9`1cd37576
So the code looks a bit more complex than it should need to be, but it seems it only checks the array bound once.
Now let's look at innerExists optimized x64:
; let mutable state = false
00007ff9`1cd375a0 33c0 xor eax,eax
; let mutable i = 0
00007ff9`1cd375a2 4533c0 xor r8d,r8d
; not state?
00007ff9`1cd375a5 85c0 test eax,eax
; if state is true then exit the loop
00007ff9`1cd375a7 752b jne 00007ff9`1cd375d4
; i < items.Length
00007ff9`1cd375a9 44394208 cmp dword ptr [rdx+8],r8d
; Seems overly complex but perhaps this is as good as it gets?
00007ff9`1cd375ad 410f9fc1 setg r9b
00007ff9`1cd375b1 450fb6c9 movzx r9d,r9b
00007ff9`1cd375b5 4585c9 test r9d,r9d
; if we have reached end of the array then exit
00007ff9`1cd375b8 741a je 00007ff9`1cd375d4
; mov i in r8d to rax, unnecessary but moves like these are very cheap
00007ff9`1cd375ba 4963c0 movsxd rax,r8d
; items.[i] (note we don't check the boundary again)
00007ff9`1cd375bd 0fb7444210 movzx eax,word ptr [rdx+rax*2+10h]
; mov item in cx to r9d, unnecessary but moves like these are very cheap
00007ff9`1cd375c2 440fb7c9 movzx r9d,cx
; item = items.[i]?
00007ff9`1cd375c6 413bc1 cmp eax,r9d
00007ff9`1cd375c9 0f94c0 sete al
; state <- ?
00007ff9`1cd375cc 0fb6c0 movzx eax,al
; i <- i + 1
00007ff9`1cd375cf 41ffc0 inc r8d
; Jump top of loop
00007ff9`1cd375d2 ebd1 jmp 00007ff9`1cd375a5
; We are done!
00007ff9`1cd375d4 c3 ret
So this also looks overly complex for what it should be, but at least it seems to check the array bound only once.
So finally, it looks like the jitter eliminates the array bounds checks because it can prove they have already been performed successfully in the loop condition, which I believe is what the OP wondered.
The x64 code doesn't look as clean as it could, but from my experimentation, cleaned-up x64 code doesn't perform that much better; I suspect the CPU vendors optimize their CPUs for the crappy code jitters produce.
An interesting exercise would be to code up an equivalent program in C++ and run it through https://godbolt.org/: choose x86-64 gcc (trunk) (gcc seems to do best right now), specify the options -O3 -march=native, and look at the resulting x64 code.
Update
The code, rewritten in https://godbolt.org/ so we can see the assembly generated by a C++ compiler:
template<int N>
bool innerExists(char item, char const (&items)[N]) {
    for (auto i = 0; i < N; ++i) {
        if (item == items[i]) return true;
    }
    return false;
}

template<int N1, int N2>
bool exists(char const (&input)[N1], char const (&illegalCharacters)[N2]) {
    for (auto i = 0; i < N1; ++i) {
        if (innerExists(input[i], illegalCharacters)) return true;
    }
    return false;
}

char const separators[] = { '.', ',', ';' };
char const str[58] = { };

bool test() {
    return exists(str, separators);
}
With x86-64 gcc (trunk) and the options -O3 -march=native, the following code is generated:
; Load the string to test into edx
mov edx, OFFSET FLAT:str+1
.L2:
; Have we reached the end?
cmp rdx, OFFSET FLAT:str+58
; If yes, then jump to the end
je .L7
; Load a character
movzx ecx, BYTE PTR [rdx]
; Comparing the 3 separators are encoded in the assembler
; because the compiler detected the array is always the same
mov eax, ecx
and eax, -3
cmp al, 44
sete al
cmp cl, 59
sete cl
; increase outer i
inc rdx
; Did we find a match?
or al, cl
; If no then loop to .L2
je .L2
; We are done!
ret
.L7:
; No match found, clear result
xor eax, eax
; We are done!
ret
Looks pretty good, but what I am missing in the code above is the use of AVX to test multiple characters at once.
Bounds checks are eliminated by the JIT compiler, so it works the same for F# as for C#. You can expect elimination for code as in your example, as well as for
for i = 0 to data.Length - 1 do
    ...
and also for tail-recursive functions, which compile down to loops.
The built in Array.contains and Array.exists (source code) are written so that the JIT compiler can eliminate bounds checks.
What's wrong with the Array.contains and Array.exists functions?
let exists input illegalChars =
    input |> Array.exists (fun c -> illegalChars |> Array.contains c)

Difference in for loops of old and new GCC's generated assembly code

I am reading a chapter about assembly code, which has an example. Here is the C program:
#include <stdio.h>

int main()
{
    int i;
    for (i = 0; i < 10; i++)
    {
        puts("Hello, world!\n");
    }
    return 0;
}
Here is the assembly code provided in the book:
0x08048384 <main+0>: push ebp
0x08048385 <main+1>: mov ebp,esp
0x08048387 <main+3>: sub esp,0x8
0x0804838a <main+6>: and esp,0xfffffff0
0x0804838d <main+9>: mov eax,0x0
0x08048392 <main+14>: sub esp,eax
0x08048394 <main+16>: mov DWORD PTR [ebp-4],0x0
0x0804839b <main+23>: cmp DWORD PTR [ebp-4],0x9
0x0804839f <main+27>: jle 0x80483a3 <main+31>
0x080483a1 <main+29>: jmp 0x80483b6 <main+50>
0x080483a3 <main+31>: mov DWORD PTR [esp],0x80484d4
0x080483aa <main+38>: call 0x80482a8 <_init+56>
0x080483af <main+43>: lea eax,[ebp-4]
0x080483b2 <main+46>: inc DWORD PTR [eax]
0x080483b4 <main+48>: jmp 0x804839b <main+23>
Here is part of my version:
0x0000000000400538 <+8>: mov DWORD PTR [rbp-0x4],0x0
=> 0x000000000040053f <+15>: jmp 0x40054f <main+31>
0x0000000000400541 <+17>: mov edi,0x4005f0
0x0000000000400546 <+22>: call 0x400410 <puts@plt>
0x000000000040054b <+27>: add DWORD PTR [rbp-0x4],0x1
0x000000000040054f <+31>: cmp DWORD PTR [rbp-0x4],0x9
0x0000000000400553 <+35>: jle 0x400541 <main+17>
My question is: why does the book's version assign 0 to the variable (mov DWORD PTR [ebp-4],0x0) and compare just after that with cmp, while my version assigns and then does jmp 0x40054f <main+31> to where the cmp is?
It seems more logical to assign and compare without any jump, because that is the order inside the for loop.
Why did your compiler do something different than the compiler used in the book? Because it's a different compiler. No two compilers compile all code the same; even very trivial code can be compiled vastly differently by two different compilers, or even two versions of the same compiler. And it's quite obvious both were compiled without any optimization; with optimization the results would be even more different.
Let's reason about what the for loop does.
for (i = 0; i < 10; i++) {
    code;
}
Let's write it a little closer to the assembly that the first compiler generated:
i = 0;
start: if (i > 9) goto out;
code;
i++;
goto start;
out:
Now the same thing for "my version":
i = 0;
goto cmp;
start: code;
i++;
cmp: if (i < 10) goto start;
The clear difference here is that in "my version" only one jump is executed within the loop, while the book version has two. It's quite a common way to generate loops in more modern compilers because of how sensitive CPUs are to branches. Many compilers will generate code like this even without any optimizations because it performs better in most cases. Older compilers didn't do this either because they didn't think of it or because this trick was performed in an optimization stage that wasn't enabled when the code in the book was compiled.
Notice that a compiler with any kind of optimization enabled wouldn't even do that first goto cmp because it would know that it was unnecessary. Try compiling your code with optimization enabled (you say you use gcc, give it the -O2 flag) and see how vastly different it will look after that.
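For illustration, here is the same single-test shape written in structured C (my own sketch, not from the book); this do-while form is effectively what compilers produce when they rotate a loop:
#include <stdio.h>

int main(void)
{
    int i = 0;
    if (i < 10) {               /* guard test; an optimizer can prove it always true */
        do {
            puts("Hello, world!\n");
            i++;
        } while (i < 10);       /* a single conditional branch per iteration */
    }
    return 0;
}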
You didn't quote the full assembly-language body of the function from your textbook, but my psychic powers tell me that it looked something like this (also, I've replaced literal addresses with labels, for clarity):
# ... establish stack frame ...
mov DWORD PTR [rbp-4],0x0
cmp DWORD PTR [rbp-4],0x9
jle .L0
.L1:
mov rdi, .Lconst0
call puts
add DWORD PTR [rbp-0x4],0x1
cmp DWORD PTR [rbp-0x4],0x9
jle .L1
.L0:
# ... return from function ...
GCC has noticed that it can eliminate the initial cmp and jle by replacing them with an unconditional jmp down to the cmp at the bottom of the loop, so that is what it did. This is a standard optimization called loop inversion. Apparently it does this even with the optimizer off; with optimization on, it would also have noticed that the initial comparison must be false, hoisted out the address load, placed the loop index in a register, and converted to a count-down loop so it could eliminate the cmp altogether; something like this:
# ... establish stack frame ...
mov ebx, 10
mov r14, .Lconst0
.L1:
mov rdi, r14
call puts
dec ebx
jne .L1
# ... return from function ...
(The above was actually generated by Clang. My version of GCC did something else, equally sensible but harder to explain.)

Does GCC cache loop variables?

When I have a loop like:
for (int i = 0; i < SlowVariable; i++)
{
    //
}
I know that in VB6 the SlowVariable is accessed every iteration of the loop, making the following much more efficient:
int cnt = SlowVariable;
for (int i = 0; i < cnt; i++)
{
    //
}
Do I need the make the same optimizations in GCC? Or does it evaluate SlowVariable only once?
This is called "hoisting" SlowVariable out of the loop.
The compiler can do it only if it can prove that the value of SlowVariable is the same every time, and that evaluating SlowVariable has no side effects.
So for example consider the following code (I assume for the sake of example that accessing through a pointer is "slow" for some reason):
void foo1(int *SlowVariable, int *result) {
    for (int i = 0; i < *SlowVariable; ++i) {
        --*result;
    }
}
The compiler cannot (in general) hoist, because for all it knows it will be called with result == SlowVariable, and so the value of *SlowVariable would be changing during the loop.
On the other hand:
void foo2(int *result) {
    int val = 12;
    int *SlowVariable = &val;
    for (int i = 0; i < *SlowVariable; ++i) {
        --*result;
    }
}
Now at least in principle, the compiler can know that val never changes in the loop, and so it can hoist. Whether it actually does so is a matter of how aggressive the optimizer is and how good its analysis of the function is, but I'd expect any serious compiler to be capable of it.
Similarly, if foo1 were called with pointers that the compiler can determine (at the call site) are non-equal, and if the call were inlined, then the compiler could hoist. That's what restrict is for:
void foo3(int *restrict SlowVariable, int *restrict result) {
    for (int i = 0; i < *SlowVariable; ++i) {
        --*result;
    }
}
restrict (introduced in C99) means "you must not call this function with result == SlowVariable", and it allows the compiler to hoist.
Similarly:
void foo4(int *SlowVariable, float *result) {
    for (int i = 0; i < *SlowVariable; ++i) {
        --*result;
    }
}
The strict aliasing rules mean that SlowVariable and result must not refer to the same location (or the program has undefined behaviour anyway), and so again the compiler can hoist.
Generally, variables can't be slow (or fast) unless they are mapped to some weird kind of memory (you usually want to declare them volatile in this case).
But indeed, using a local variable creates more opportunities for optimization, and the effect may be very visible. The compiler can "cache" a global variable by itself only if it's able to prove that no function called within a loop can read or write that global variable. When you call an external function within a loop, the compiler probably won't be able to prove such a thing.
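As a minimal sketch of that point (opaque() is a hypothetical function defined in another translation unit): in f the compiler must reload the global on every test, while in g the manual copy gives it a local that the call provably cannot touch:
int SlowVariable;           /* global */
void opaque(void);          /* hypothetical; defined elsewhere, might write the global */

void f(void)
{
    for (int i = 0; i < SlowVariable; i++)
        opaque();           /* may modify SlowVariable: reload it on each test */
}

void g(void)
{
    int cnt = SlowVariable; /* manual hoist: a local that opaque() cannot see */
    for (int i = 0; i < cnt; i++)
        opaque();
}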
This depends on how the compiler optimizes. For example, take this code:
#include <stdio.h>

int main(int argc, char **argv)
{
    unsigned int i;
    unsigned int z = 10;
    for (i = 0; i < z; i++)
        printf("%d\n", i);
    return 0;
}
If you compile it using gcc example.c -o example, the resulting code will be:
0x0040138c <+0>: push ebp
0x0040138d <+1>: mov ebp,esp
0x0040138f <+3>: and esp,0xfffffff0
0x00401392 <+6>: sub esp,0x20
0x00401395 <+9>: call 0x4018f4 <__main>
0x0040139a <+14>: mov DWORD PTR [esp+0x18],0xa
0x004013a2 <+22>: mov DWORD PTR [esp+0x1c],0x0
0x004013aa <+30>: jmp 0x4013c4 <main+56>
0x004013ac <+32>: mov eax,DWORD PTR [esp+0x1c]
0x004013b0 <+36>: mov DWORD PTR [esp+0x4],eax
0x004013b4 <+40>: mov DWORD PTR [esp],0x403064
0x004013bb <+47>: call 0x401b2c <printf>
0x004013c0 <+52>: inc DWORD PTR [esp+0x1c]
0x004013c4 <+56>: mov eax,DWORD PTR [esp+0x1c] ; (1)
0x004013c8 <+60>: cmp eax,DWORD PTR [esp+0x18] ; (2)
0x004013cc <+64>: jb 0x4013ac <main+32>
0x004013ce <+66>: mov eax,0x0
0x004013d3 <+71>: leave
0x004013d4 <+72>: ret
0x004013d5 <+73>: nop
0x004013d6 <+74>: nop
0x004013d7 <+75>: nop
The value of i will be moved from the stack into eax.
Then the CPU will compare eax (that is, i) with the value of z, which is on the stack.
All of this happens on every iteration.
If you optimize the code using gcc -O2 example.c -o example, the result will be:
0x00401b70 <+0>: push ebp
0x00401b71 <+1>: mov ebp,esp
0x00401b73 <+3>: push ebx
0x00401b74 <+4>: and esp,0xfffffff0
0x00401b77 <+7>: sub esp,0x10
0x00401b7a <+10>: call 0x4018a8 <__main>
0x00401b7f <+15>: xor ebx,ebx
0x00401b81 <+17>: lea esi,[esi+0x0]
0x00401b84 <+20>: mov DWORD PTR [esp+0x4],ebx
0x00401b88 <+24>: mov DWORD PTR [esp],0x403064
0x00401b8f <+31>: call 0x401ae0 <printf>
0x00401b94 <+36>: inc ebx
0x00401b95 <+37>: cmp ebx,0xa ; (1)
0x00401b98 <+40>: jne 0x401b84 <main+20>
0x00401b9a <+42>: xor eax,eax
0x00401b9c <+44>: mov ebx,DWORD PTR [ebp-0x4]
0x00401b9f <+47>: leave
0x00401ba0 <+48>: ret
0x00401ba1 <+49>: nop
0x00401ba2 <+50>: nop
0x00401ba3 <+51>: nop
The compiler knows that there is no point in checking the value of z, so it modifies the code to something like for (i = 0; i < 10; i++).
In case the compiler doesn't know the value of z, as in this code:
#include <stdio.h>

void loop(unsigned int z) {
    unsigned int i;
    for (i = 0; i < z; i++)
        printf("%d\n", i);
}

int main(int argc, char **argv)
{
    unsigned int z = 10;
    loop(z);
    return 0;
}
The result will be:
0x0040138c <+0>: push esi
0x0040138d <+1>: push ebx
0x0040138e <+2>: sub esp,0x14
0x00401391 <+5>: mov esi,DWORD PTR [esp+0x20] ; (1)
0x00401395 <+9>: test esi,esi
0x00401397 <+11>: je 0x4013b1 <loop+37>
0x00401399 <+13>: xor ebx,ebx ; (2)
0x0040139b <+15>: nop
0x0040139c <+16>: mov DWORD PTR [esp+0x4],ebx
0x004013a0 <+20>: mov DWORD PTR [esp],0x403064
0x004013a7 <+27>: call 0x401b0c <printf>
0x004013ac <+32>: inc ebx
0x004013ad <+33>: cmp ebx,esi
0x004013af <+35>: jne 0x40139c <loop+16>
0x004013b1 <+37>: add esp,0x14
0x004013b4 <+40>: pop ebx
0x004013b5 <+41>: pop esi
0x004013b6 <+42>: ret
0x004013b7 <+43>: nop
z will end up in an otherwise unused register, esi; registers are the fastest storage class.
There is no local variable i on the stack; the compiler used ebx, another register, to store the value of i.
In the end, it depends on the compiler and the optimization options you use, but in all cases C is still faster, much faster, than VB.
It depends on your compiler, but I believe most contemporary compilers will optimize that for you if the value of SlowVariable is constant.
"It" (the language) doesn't say. It must behave as if the variable is evaluated every time, of course.
An optimizing compiler can do a lot of clever things, so it's always best to leave these sorts of micro-optimizations to the compiler.
If you're going down the optimization by hand route, be sure to profile (=measure) and read the generated code.
Actually it depends on SlowVariable and on the behavior of your compiler. If your slow variable is, e.g., volatile, the compiler won't make any effort to cache it, as the volatile keyword won't permit it. If it's not volatile, there is a good chance that the compiler optimizes consecutive accesses by loading the variable once into a register.
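A minimal sketch of the volatile case (assuming a hypothetical value updated by hardware or another thread): the qualifier forbids caching, so the variable is re-read from memory on every loop test:
volatile int SlowVariable;  /* e.g. updated by hardware or another thread */

void spin(void)
{
    for (int i = 0; i < SlowVariable; i++) {
        /* SlowVariable is re-read from memory on every test */
    }
}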

Do optimizations like this make sense?

You have two arrays and a function that counts differences between them:
for (i = 0; i < len; ++i) {
    int value1 = vector1[i];
    int value2 = vector2[i];
    if (value1 != value2) ++num_differences;
}
Since branching can degrade performance, this can be optimized to:
for (i = 0; i < len; ++i) {
    num_differences += !!(vector1[i] != vector2[i]);
}
// !!(..) is to be sure that the result is boolean 0 or 1
so there is no if clause. But does it practically make sense? With GCC (and other compilers) being so smart, does it make sense to play with such optimizations?
The short answer is: "Trust your Compiler".
In general you're not going to see much benefit from optimisations like this unless you're working with really huge datasets. Even then you really need to benchmark the code to see if there is any improvement.
Unless len is several million, or you're comparing a lot of arrays, then no. The second version is less readable (though not so much to an experienced programmer), so I'd prefer the first variant unless this is the bottleneck (doubtful).
The following code is generated with optimizations:
for( i = 0; i < 4; ++i ) {
int value1 = vector1[i];
int value2 = vector2[i];
if( value1 != value2 ) ++num_differences;
00401000 mov ecx,dword ptr [vector1 (40301Ch)]
00401006 xor eax,eax
00401008 cmp ecx,dword ptr [vector2 (40302Ch)]
0040100E je wmain+15h (401015h)
00401010 mov eax,1
00401015 mov edx,dword ptr [vector1+4 (403020h)]
0040101B cmp edx,dword ptr [vector2+4 (403030h)]
00401021 je wmain+26h (401026h)
00401023 add eax,1
00401026 mov ecx,dword ptr [vector1+8 (403024h)]
0040102C cmp ecx,dword ptr [vector2+8 (403034h)]
00401032 je wmain+37h (401037h)
00401034 add eax,1
00401037 mov edx,dword ptr [vector1+0Ch (403028h)]
0040103D cmp edx,dword ptr [vector2+0Ch (403038h)]
00401043 je wmain+48h (401048h)
00401045 add eax,1
}
for( i = 0; i < 4; ++i ) {
num_differences += !!(vector1[i] != vector2[i]);
00401064 mov edx,dword ptr [vector1+0Ch (403028h)]
0040106A xor eax,eax
0040106C cmp edx,dword ptr [vector2+0Ch (403038h)]
00401072 mov edx,dword ptr [vector1+8 (403024h)]
00401078 setne al
0040107B xor ecx,ecx
0040107D cmp edx,dword ptr [vector2+8 (403034h)]
00401083 mov edx,dword ptr [vector1+4 (403020h)]
00401089 setne cl
0040108C add eax,ecx
0040108E xor ecx,ecx
00401090 cmp edx,dword ptr [vector2+4 (403030h)]
00401096 mov edx,dword ptr [vector1 (40301Ch)]
0040109C setne cl
0040109F add eax,ecx
004010A1 xor ecx,ecx
004010A3 cmp edx,dword ptr [vector2 (40302Ch)]
004010A9 setne cl
004010AC add eax,ecx
}
So, actually, the second version is slightly slower (theoretically). 19 instructions for the second vs. 17 for the first.
You should compare the code the compiler generates. It may be equivalent.
The compiler's very smart, but a good engineer can certainly improve a program's performance.
I don't think you are going to do much better. Your second example is hard to read/understand for the average programmer, which means two things: one, it is hard to understand and maintain; two, you may be creeping into dark, less tested/supported corners of the compiler. Drive down the road between the lines; don't wander about on the shoulder or in the wrong lane.
Go with this
for (i = 0; i < len; ++i) {
    int value1 = vector1[i];
    int value2 = vector2[i];
    if (value1 != value2) ++num_differences;
}
or this
for (i = 0; i < len; ++i) {
    if (vector1[i] != vector2[i]) ++num_differences;
}
If it really is bothering you and you have properly concluded this is your performance bottleneck, then time the difference between them. From the disassembly shown, and the nature of this platform, it is very difficult to time such things properly and draw the right conclusions: too many caches and other factors cloud the results, leading to false conclusions. And no two x86 implementations have the same performance, so if you happen to tune for your computer, you are likely detuning it for another model of x86, or even the same make on a different motherboard with different I/O characteristics.

How come my array index is faster than pointer

Why is the array index faster than the pointer?
Isn't the pointer supposed to be faster than the array index?
I used clock_t from time.h to test the two functions, each looping 2 million times.
Pointer time : 0.018995
Index time : 0.017864
void myPointer(int a[], int size)
{
    int *p;
    for (p = a; p < &a[size]; p++)
    {
        *p = 0;
    }
}

void myIndex(int a[], int size)
{
    int i;
    for (i = 0; i < size; i++)
    {
        a[i] = 0;
    }
}
No, pointers are never supposed to be faster than array indexing. If one piece of code is faster than the other, it's mostly because the address computations differ. The question should also state the compiler and optimization flags, as they can heavily affect performance.
Array indexing in your context (where the array bound is not known) is exactly identical to the pointer operation. From a compiler's viewpoint, it is just a different expression of pointer arithmetic. Here is an example of optimized x86 code from Visual Studio 2010 with full optimization and no inlining.
3: void myPointer(int a[], int size)
4: {
013E1800 push edi
013E1801 mov edi,ecx
5: int *p;
6: for(p = a; p < &a[size]; p++)
013E1803 lea ecx,[edi+eax*4]
013E1806 cmp edi,ecx
013E1808 jae myPointer+15h (13E1815h)
013E180A sub ecx,edi
013E180C dec ecx
013E180D shr ecx,2
013E1810 inc ecx
013E1811 xor eax,eax
013E1813 rep stos dword ptr es:[edi]
013E1815 pop edi
7: {
8: *p = 0;
9: }
10: }
013E1816 ret
13: void myIndex(int a[], int size)
14: {
15: int i;
16: for(i = 0; i < size; i++)
013E17F0 test ecx,ecx
013E17F2 jle myIndex+0Ch (13E17FCh)
013E17F4 push edi
013E17F5 xor eax,eax
013E17F7 mov edi,edx
013E17F9 rep stos dword ptr es:[edi]
013E17FB pop edi
17: {
18: a[i] = 0;
19: }
20: }
013E17FC ret
At a glance, myIndex looks faster because the number of instructions is smaller; however, the two pieces of code are essentially the same. Both eventually use rep stos, x86's repeating (loop) string instruction. The only difference is in the computation of the loop bound: the for loop in myIndex can use the trip count size as it is (no computation is needed), while myPointer needs some computation to derive the trip count. This is the only difference; the important loop operations are just the same. Thus, the difference is negligible.
To summarize: the performance of myPointer and myIndex in optimized code should be identical.
FYI, if the array's bound is known at compile time, e.g., int A[constant_expression], then accesses to this array may be much faster than through a pointer. This is mostly because the array accesses are free from the pointer-analysis problem. Compilers can perfectly compute the dependency information for computations and accesses on a fixed-size array, so they can do advanced optimizations including automatic parallelization.
However, if the computations are pointer-based, the compiler must perform pointer analysis for further optimization, and pointer analysis is pretty limited in C/C++. It generally ends up with conservative results and few optimization opportunities.
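A small illustration of the fixed-size case described above (my own sketch, not from the answer): with a compile-time bound and no pointer parameters, the trip count and every address are known exactly, so the compiler is free to vectorize:
int A[256];                 /* bound known at compile time */

void zero_fixed(void)
{
    for (int i = 0; i < 256; i++)
        A[i] = 0;           /* no aliasing possible: trivially optimizable */
}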
Array dereference p[i] is *(p + i). Compilers make use of instructions that do math + dereference in 1 or 2 cycles (e.g. x86 LEA instruction) to optimize for speed.
With the pointer loop, the access and the offset are split into two separate parts, and the compiler cannot optimize it.
It may be the comparison in the for loop that is causing the difference. The termination condition is tested on each iteration, and your "pointer" example has a slightly more complicated termination condition (computing &a[size]). Since &a[size] does not change, you could store it in a variable to avoid recalculating it on each iteration, as sketched below.
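A minimal sketch of that suggestion applied to the question's myPointer, hoisting the end address out of the loop condition:
void myPointer(int a[], int size)
{
    int *end = &a[size];    /* computed once, before the loop */
    for (int *p = a; p < end; p++)
    {
        *p = 0;
    }
}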
I would suggest running each loop 200 million times, and then run each loop 10 times, and take the fastest measurement. That will factor out effects from OS scheduling and so on.
I would then suggest you disassemble the code for each loop.
Oops, on my 64-bit system the results are quite different. I found that this
int i;
for (i = 0; i < size; i++)
{
    *(a + i) = 0;
}
is about 100 times (!) slower than this
int i;
int *p = a;
for (i = 0; i < size; i++)
{
    *(p++) = 0;
}
when compiling with -O3. This hints to me that moving to the next address is somehow far easier for a 64-bit CPU than calculating the destination address from an offset. But I'm not sure.
EDIT: This really does have something to do with the 64-bit architecture, because the same code with the same compile flags doesn't show any real difference in performance on a 32-bit system.
Compiler optimizations are pattern matching.
When your compiler optimizes, it looks for known code patterns, and then transforms the code according to some rule. Your two code snippets seem to trigger different transforms, and thus produce slightly different code.
This is one of the reasons why we always insist to actually measure the resulting performance when it comes to optimizations: You can never be sure what your compiler turns your code into unless you test it.
If you are really curious, try compiling your code with gcc -S -Os, this produces the most readable, yet optimized assembler code. On your two functions, I get the following assembler with that:
pointer code:
.L2:
cmpq %rax, %rdi
jnb .L5
movl $0, (%rdi)
addq $4, %rdi
jmp .L2
.L5:
index code:
.L7:
cmpl %eax, %esi
jle .L9
movl $0, (%rdi,%rax,4)
incq %rax
jmp .L7
.L9:
The differences are slight, but may indeed trigger a performance difference, most importantly the difference between using addq and incq could be significant.
The times are so close together that if you did them repeatedly, you may not see much of a difference. Both code segments compile to the exact same assembly. By definition, there is no difference.
It looks like the index solution can save a few instructions with the compare in the for loop.
Accessing the data through an array index or through a pointer is exactly equivalent. Walk through the program below with me...
There is a loop that runs 100 times. In the disassembled code, the pointer access takes fewer instructions than the access through an array index.
But that doesn't mean accessing data through a pointer is faster; it actually depends on the instructions the compiler emits. Both the pointer and the array index use the array's address: the index version reads the value at an offset and increments the index, while the pointer version holds the address directly.
#include <stdio.h>

void fun1(int a[], int n);
void fun2(int *p, int n);

int main()
{
    int a[100];
    fun1(a, 100);
    fun2(&a[0], 5);
}

void fun1(int a[], int n)
{
    int i;
    for (i = 0; i <= 99; i++)
    {
        a[i] = 0;
        printf("%d\n", a[i]);
    }
}

void fun2(int *p, int n)
{
    int i;
    for (i = 0; i <= 99; i++)
    {
        *p = 0;
        printf("%d\n", *(p + i));
    }
}
disass fun1
Dump of assembler code for function fun1:
0x0804841a <+0>: push %ebp
0x0804841b <+1>: mov %esp,%ebp
0x0804841d <+3>: sub $0x28,%esp
0x08048420 <+6>: movl $0x0,-0xc(%ebp)
0x08048427 <+13>: jmp 0x8048458 <fun1+62>
0x08048429 <+15>: mov -0xc(%ebp),%eax
0x0804842c <+18>: shl $0x2,%eax
0x0804842f <+21>: add 0x8(%ebp),%eax
0x08048432 <+24>: movl $0x0,(%eax)
0x08048438 <+30>: mov -0xc(%ebp),%eax
0x0804843b <+33>: shl $0x2,%eax
0x0804843e <+36>: add 0x8(%ebp),%eax
0x08048441 <+39>: mov (%eax),%edx
0x08048443 <+41>: mov $0x8048570,%eax
0x08048448 <+46>: mov %edx,0x4(%esp)
0x0804844c <+50>: mov %eax,(%esp)
0x0804844f <+53>: call 0x8048300 <printf@plt>
0x08048454 <+58>: addl $0x1,-0xc(%ebp)
0x08048458 <+62>: cmpl $0x63,-0xc(%ebp)
0x0804845c <+66>: jle 0x8048429 <fun1+15>
0x0804845e <+68>: leave
0x0804845f <+69>: ret
End of assembler dump.
(gdb) disass fun2
Dump of assembler code for function fun2:
0x08048460 <+0>: push %ebp
0x08048461 <+1>: mov %esp,%ebp
0x08048463 <+3>: sub $0x28,%esp
0x08048466 <+6>: movl $0x0,-0xc(%ebp)
0x0804846d <+13>: jmp 0x8048498 <fun2+56>
0x0804846f <+15>: mov 0x8(%ebp),%eax
0x08048472 <+18>: movl $0x0,(%eax)
0x08048478 <+24>: mov -0xc(%ebp),%eax
0x0804847b <+27>: shl $0x2,%eax
0x0804847e <+30>: add 0x8(%ebp),%eax
0x08048481 <+33>: mov (%eax),%edx
0x08048483 <+35>: mov $0x8048570,%eax
0x08048488 <+40>: mov %edx,0x4(%esp)
0x0804848c <+44>: mov %eax,(%esp)
0x0804848f <+47>: call 0x8048300 <printf@plt>
0x08048494 <+52>: addl $0x1,-0xc(%ebp)
0x08048498 <+56>: cmpl $0x63,-0xc(%ebp)
0x0804849c <+60>: jle 0x804846f <fun2+15>
0x0804849e <+62>: leave
0x0804849f <+63>: ret
End of assembler dump.
(gdb)
This is a very hard thing to time, because compilers are very good at optimising these things. Still, it's better to give the compiler as much information as possible; that's why in this case I'd advise using std::fill and letting the compiler choose.
But... if you want to get into the details:
a) CPUs normally give pointer+offset addressing for free, like mov r1, r2(r3).
b) This means an index operation requires just mul r3,r1,size.
This is just one extra cycle per loop iteration.
c) CPUs often provide stall/delay slots, meaning you can often hide single-cycle operations.
All in all, even if your loops are very large, the cost of the access is nothing compared to the cost of even a few cache misses. You are best advised to optimise your structures before you care about loop costs. Try, for example, packing your structures to reduce the memory footprint first.
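For reference, the C analogue of the std::fill advice above would be memset, which hands the whole problem to the library and compiler; a sketch (valid here only because the fill value is zero, so the all-zero bit pattern matches):
#include <string.h>

void myFill(int a[], int size)
{
    /* zero fill: memset is equivalent to the loops above for this value */
    memset(a, 0, (size_t)size * sizeof a[0]);
}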
