How come my array index is faster than pointer

How come my array index is faster than pointer - c

Why the array index is faster than pointer?
Isn't pointer supposed to be faster than array index?
** i used time.h clock_t to tested two functions, each loop 2 million times.
Pointer time : 0.018995
Index time : 0.017864
void myPointer(int a[], int size)
{
int *p;
for(p = a; p < &a[size]; p++)
{
*p = 0;
}
}
void myIndex(int a[], int size)
{
int i;
for(i = 0; i < size; i++)
{
a[i] = 0;
}
}

No, never ever pointers are supposed to be faster than array index. If one of the code is faster than the other, it's mostly because some address computations might be different. The question also should provide information of compiler and optimization flags as it can heavily affect the performance.
Array index in your context (array bound is not known) is exactly identical to the pointer operation. From a viewpoint of compilers, it is just different expression of pointer arithmetic. Here is an example of an optimized x86 code in Visual Studio 2010 with full optimization and no inline.
3: void myPointer(int a[], int size)
4: {
013E1800 push edi
013E1801 mov edi,ecx
5: int *p;
6: for(p = a; p < &a[size]; p++)
013E1803 lea ecx,[edi+eax*4]
013E1806 cmp edi,ecx
013E1808 jae myPointer+15h (13E1815h)
013E180A sub ecx,edi
013E180C dec ecx
013E180D shr ecx,2
013E1810 inc ecx
013E1811 xor eax,eax
013E1813 rep stos dword ptr es:[edi]
013E1815 pop edi
7: {
8: *p = 0;
9: }
10: }
013E1816 ret
13: void myIndex(int a[], int size)
14: {
15: int i;
16: for(i = 0; i < size; i++)
013E17F0 test ecx,ecx
013E17F2 jle myIndex+0Ch (13E17FCh)
013E17F4 push edi
013E17F5 xor eax,eax
013E17F7 mov edi,edx
013E17F9 rep stos dword ptr es:[edi]
013E17FB pop edi
17: {
18: a[i] = 0;
19: }
20: }
013E17FC ret
At a glance, myIndex looks faster because the number of instructions are less, however, the two pieces of the code are essentially the same. Both eventually use rep stos, which is a x86's repeating (loop) instruction. The only difference is because of the computation of the loop bound. The for loop in myIndex has the trip count size as it is (i.e., no computation is needed). But, myPointer needs some computation to get the trip count of the for loop. This is the only difference. The important loop operations are just the same. Thus, the difference is negligible.
To summarize, the performance of myPointer and myIndex in an optimized code should be identical.
FYI, if the array's bound is known at compile time, e.g., int A[constant_expression], then the accesses on this array may be much faster than the pointer one. This is mostly because the array accesses are free from the pointer analysis problem. Compilers can perfectly compute the dependency information on computations and accesses on a fixed-size array, so it can do advanced optimizations including automatic parallelization.
However, if computations are pointer based, compilers must perform pointer analysis for further optimization, which is pretty much limited in C/C++. It generally ends up with conservative results on pointer analysis and results in a few optimization opportunity.

Array dereference p[i] is *(p + i). Compilers make use of instructions that do math + dereference in 1 or 2 cycles (e.g. x86 LEA instruction) to optimize for speed.
With the pointer loop, it splits the access and offset into to separate parts and the compiler cannot optimize it.

It may be the comparison in the for loop that is causing the difference. The termination condition is tested on each iteration, and your "pointer" example has a slightly more complicated termination condition (taking the address of &a[size]). Since &a[size] does not change, you could try setting it to a variable to avoid recalculating it on each iteration of the loop.

I would suggest running each loop 200 million times, and then run each loop 10 times, and take the fastest measurement. That will factor out effects from OS scheduling and so on.
I would then suggest you disassemble the code for each loop.

Oops, on my 64-bit system results are quite different. I've got that this
int i;
for(i = 0; i < size; i++)
{
*(a+i) = 0;
}
is about 100 times !! slower than this
int i;
int * p = a;
for(i = 0; i < size; i++)
{
*(p++) = 0;
}
when compiling with -O3. This hints to me that somehow moving to next address is far easier to achieve for 64-bit cpu, than to calculate destination address from some offset. But i'm not sure.
EDIT: This really has something related with 64-bit architecture because same code with same compile flags doesn't shows any real difference in performance on 32-bit system.

Compiler optimizations are pattern matching.
When your compiler optimizes, it looks for known code patterns, and then transforms the code according to some rule. Your two code snippets seem to trigger different transforms, and thus produce slightly different code.
This is one of the reasons why we always insist to actually measure the resulting performance when it comes to optimizations: You can never be sure what your compiler turns your code into unless you test it.
If you are really curious, try compiling your code with gcc -S -Os, this produces the most readable, yet optimized assembler code. On your two functions, I get the following assembler with that:
pointer code:
.L2:
cmpq %rax, %rdi
jnb .L5
movl $0, (%rdi)
addq $4, %rdi
jmp .L2
.L5:
index code:
.L7:
cmpl %eax, %esi
jle .L9
movl $0, (%rdi,%rax,4)
incq %rax
jmp .L7
.L9:
The differences are slight, but may indeed trigger a performance difference, most importantly the difference between using addq and incq could be significant.

The times are so close together that if you did them repeatedly, you may not see much of a difference. Both code segments compile to the exact same assembly. By definition, there is no difference.

It looks like the index solution can save a few instructions with the compare in the for loop.

Access the data through array index or pointer is exactly equivalent. Go through the below program with me...
There are a loop which continues to 100 times but when we see disassemble code that there are the data which we access through has least instruction comparability to access through array Index
But it doesn't mean that accessing data through pointer is fast actually it's depend on the instruction which performed by compiler.Both the pointer and array index used the address array access the value from offset and increment through it and pointer has address.
int a[100];
fun1(a,100);
fun2(&a[0],5);
}
void fun1(int a[],int n)
{
int i;
for(i=0;i<=99;i++)
{
a[i]=0;
printf("%d\n",a[i]);
}
}
void fun2(int *p,int n)
{
int i;
for(i=0;i<=99;i++)
{
*p=0;
printf("%d\n",*(p+i));
}
}
disass fun1
Dump of assembler code for function fun1:
0x0804841a <+0>: push %ebp
0x0804841b <+1>: mov %esp,%ebp
0x0804841d <+3>: sub $0x28,%esp`enter code here`
0x08048420 <+6>: movl $0x0,-0xc(%ebp)
0x08048427 <+13>: jmp 0x8048458 <fun1+62>
0x08048429 <+15>: mov -0xc(%ebp),%eax
0x0804842c <+18>: shl $0x2,%eax
0x0804842f <+21>: add 0x8(%ebp),%eax
0x08048432 <+24>: movl $0x0,(%eax)
0x08048438 <+30>: mov -0xc(%ebp),%eax
0x0804843b <+33>: shl $0x2,%eax
0x0804843e <+36>: add 0x8(%ebp),%eax
0x08048441 <+39>: mov (%eax),%edx
0x08048443 <+41>: mov $0x8048570,%eax
0x08048448 <+46>: mov %edx,0x4(%esp)
0x0804844c <+50>: mov %eax,(%esp)
0x0804844f <+53>: call 0x8048300 <printf#plt>
0x08048454 <+58>: addl $0x1,-0xc(%ebp)
0x08048458 <+62>: cmpl $0x63,-0xc(%ebp)
0x0804845c <+66>: jle 0x8048429 <fun1+15>
0x0804845e <+68>: leave
0x0804845f <+69>: ret
End of assembler dump.
(gdb) disass fun2
Dump of assembler code for function fun2:
0x08048460 <+0>: push %ebp
0x08048461 <+1>: mov %esp,%ebp
0x08048463 <+3>: sub $0x28,%esp
0x08048466 <+6>: movl $0x0,-0xc(%ebp)
0x0804846d <+13>: jmp 0x8048498 <fun2+56>
0x0804846f <+15>: mov 0x8(%ebp),%eax
0x08048472 <+18>: movl $0x0,(%eax)
0x08048478 <+24>: mov -0xc(%ebp),%eax
0x0804847b <+27>: shl $0x2,%eax
0x0804847e <+30>: add 0x8(%ebp),%eax
0x08048481 <+33>: mov (%eax),%edx
0x08048483 <+35>: mov $0x8048570,%eax
0x08048488 <+40>: mov %edx,0x4(%esp)
0x0804848c <+44>: mov %eax,(%esp)
0x0804848f <+47>: call 0x8048300 <printf#plt>
0x08048494 <+52>: addl $0x1,-0xc(%ebp)
0x08048498 <+56>: cmpl $0x63,-0xc(%ebp)
0x0804849c <+60>: jle 0x804846f <fun2+15>
0x0804849e <+62>: leave
0x0804849f <+63>: ret
End of assembler dump.
(gdb)

This is a very hard thing to time, because compilers are very good at optimising these things. Still it's better to give the compiler as much information as possible, that's why in this case I'd advise using std::fill, and let the compiler choose.
But... If you want to get into the detail
a) CPU's normally give pointer+value for free, like : mov r1, r2(r3).
b) This means an index operation requires just : mul r3,r1,size
This is just one cycle extra, per loop.
c) CPU's often provide stall/delay slots, meaning you can often hide single-cycle operations.
All in all, even if your loops are very large, the cost of the access is nothing compared to the cost of even a few cache-misses. You are best advised to optimise your structures before you care about loop costs. Try for example, packing your structures to reduce the memory footprint first

Related

Does GCC generate suboptimal code for static branch prediction?

From my university course, I heard, that by convention it is better to place more probable condition in if rather than in else, which may help the static branch predictor. For instance:
if (check_collision(player, enemy)) { // very unlikely to be true
doA();
} else {
doB();
}
may be rewritten as:
if (!check_collision(player, enemy)) {
doB();
} else {
doA();
}
I found a blog post Branch Patterns, Using GCC, which explains this phenomenon in more detail:
Forward branches are generated for if statements. The rationale for
making them not likely to be taken is that the processor can take
advantage of the fact that instructions following the branch
instruction may already be placed in the instruction buffer inside the
Instruction Unit.
next to it, it says (emphasis mine):
When writing an if-else statement, always make the "then" block more
likely to be executed than the else block, so the processor can take
advantage of instructions already placed in the instruction fetch
buffer.
Ultimately, there is article, written by Intel, Branch and Loop Reorganization to Prevent Mispredicts, which summarizes this with two rules:
Static branch prediction is used when there is no data collected by the
microprocessor when it encounters a branch, which is typically the
first time a branch is encountered. The rules are simple:
A forward branch defaults to not taken
A backward branch defaults to taken
In order to effectively write your code to take advantage of these
rules, when writing if-else or switch statements, check the most
common cases first and work progressively down to the least common.
As I understand, the idea is that pipelined CPU may follow the instructions from the instruction cache without breaking it by jumping to another address within code segment. I am aware, though, that this may be largly oversimplified in case modern CPU microarchitectures.
However, it looks like GCC doesn't respect these rules. Given the code:
extern void foo();
extern void bar();
int some_func(int n)
{
if (n) {
foo();
}
else {
bar();
}
return 0;
}
it generates (version 6.3.0 with -O3 -mtune=intel):
some_func:
lea rsp, [rsp-8]
xor eax, eax
test edi, edi
jne .L6 ; here, forward branch if (n) is (conditionally) taken
call bar
xor eax, eax
lea rsp, [rsp+8]
ret
.L6:
call foo
xor eax, eax
lea rsp, [rsp+8]
ret
The only way, that I found to force the desired behavior is by rewriting the if condition using __builtin_expect as follows:
if (__builtin_expect(n, 1)) { // force n condition to be treated as true
so the assembly code would become:
some_func:
lea rsp, [rsp-8]
xor eax, eax
test edi, edi
je .L2 ; here, backward branch is (conditionally) taken
call foo
xor eax, eax
lea rsp, [rsp+8]
ret
.L2:
call bar
xor eax, eax
lea rsp, [rsp+8]
ret

The short answer: no, it is not.
GCC does metrics ton of non trivial optimization and one of them is guessing branch probabilities judging by control flow graph.
According to GCC manual:
fno-guess-branch-probability
Do not guess branch probabilities using
heuristics.
GCC uses heuristics to guess branch probabilities if they are not
provided by profiling feedback (-fprofile-arcs). These heuristics are
based on the control flow graph. If some branch probabilities are
specified by __builtin_expect, then the heuristics are used to guess
branch probabilities for the rest of the control flow graph, taking
the __builtin_expect info into account. The interactions between the
heuristics and __builtin_expect can be complex, and in some cases, it
may be useful to disable the heuristics so that the effects of
__builtin_expect are easier to understand.
-freorder-blocks may swap branches as well.
Also, as OP mentioned the behavior might be overridden with __builtin_expect.
Proof
Look at the following listing.
void doA() { printf("A\n"); }
void doB() { printf("B\n"); }
int check_collision(void* a, void* b)
{ return a == b; }
void some_func (void* player, void* enemy) {
if (check_collision(player, enemy)) {
doA();
} else {
doB();
}
}
int main() {
// warming up gcc statistic
some_func((void*)0x1, NULL);
some_func((void*)0x2, NULL);
some_func((void*)0x3, NULL);
some_func((void*)0x4, NULL);
some_func((void*)0x5, NULL);
some_func(NULL, NULL);
return 0;
}
It is obvious that check_collision will return 0 most of the times. So, the doB() branch is likely and GCC can guess this:
gcc -O main.c -o opt.a
objdump -d opt.a
The asm of some_func is:
sub $0x8,%rsp
cmp %rsi,%rdi
je 6c6 <some_func+0x18>
mov $0x0,%eax
callq 68f <doB>
add $0x8,%rsp
retq
mov $0x0,%eax
callq 67a <doA>
jmp 6c1 <some_func+0x13>
But for sure, we can enforce GCC from being too smart:
gcc -fno-guess-branch-probability main.c -o non-opt.a
objdump -d non-opt.a
And we will get:
push %rbp
mov %rsp,%rbp
sub $0x10,%rsp
mov %rdi,-0x8(%rbp)
mov %rsi,-0x10(%rbp)
mov -0x10(%rbp),%rdx
mov -0x8(%rbp),%rax
mov %rdx,%rsi
mov %rax,%rdi
callq 6a0 <check_collision>
test %eax,%eax
je 6ef <some_func+0x33>
mov $0x0,%eax
callq 67a <doA>
jmp 6f9 <some_func+0x3d>
mov $0x0,%eax
callq 68d <doB>
nop
leaveq
retq
So GCC will leave branches in source order.
I used gcc 7.1.1 for those tests.

I Think That You've Found A "Bug"
The funny thing is that optimization for space and no optimization are the only cases in which the "optimal" instruction code is generated: gcc -S [-O0 | -Os] source.c
some_func:
FB0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
cmpl $0, 8(%ebp)
je L2
call _foo
jmp L3
2:
call _bar
3:
movl $0, %eax
# Or, for -Os:
# xorl %eax, %eax
leave
ret
My point is that ...
some_func:
FB0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
cmpl $0, 8(%ebp)
je L2
call _foo
... up to & through the call to foo everything is "optimal", in the traditional sense, regardless of the exit strategy.
Optimality is ultimately determined by the processor, of course.

Difference in for loops of old and new GCC's generated assembly code

I am reading a chapter about assembly code, which has an example. Here is the C program:
int main()
{
int i;
for(i=0; i < 10; i++)
{
puts("Hello, world!\n");
}
return 0;
}
Here is the assembly code provided in the book:
0x08048384 <main+0>: push ebp
0x08048385 <main+1>: mov ebp,esp
0x08048387 <main+3>: sub esp,0x8
0x0804838a <main+6>: and esp,0xfffffff0
0x0804838d <main+9>: mov eax,0x0
0x08048392 <main+14>: sub esp,eax
0x08048394 <main+16>: mov DWORD PTR [ebp-4],0x0
0x0804839b <main+23>: cmp DWORD PTR [ebp-4],0x9
0x0804839f <main+27>: jle 0x80483a3 <main+31>
0x080483a1 <main+29>: jmp 0x80483b6 <main+50>
0x080483a3 <main+31>: mov DWORD PTR [esp],0x80484d4
0x080483aa <main+38>: call 0x80482a8 <_init+56>
0x080483af <main+43>: lea eax,[ebp-4]
0x080483b2 <main+46>: inc DWORD PTR [eax]
0x080483b4 <main+48>: jmp 0x804839b <main+23>
Here is part of my version:
0x0000000000400538 <+8>: mov DWORD PTR [rbp-0x4],0x0
=> 0x000000000040053f <+15>: jmp 0x40054f <main+31>
0x0000000000400541 <+17>: mov edi,0x4005f0
0x0000000000400546 <+22>: call 0x400410 <puts#plt>
0x000000000040054b <+27>: add DWORD PTR [rbp-0x4],0x1
0x000000000040054f <+31>: cmp DWORD PTR [rbp-0x4],0x9
0x0000000000400553 <+35>: jle 0x400541 <main+17>
My question is, why is in case of the book's version it assigns 0 to the variable(mov DWORD PTR [ebp-4],0x0) and compares just after that with cmp but in my version, it assigns and then it does jmp 0x40054f <main+31> where the cmp is?
It seems more logical to assign and compare without any jump, because it is like that inside for loop.

Why did your compiler do something different than a different compiler that was used in the book? Because it's a different compiler. No two compilers will compile all code the same, even very trivial code can be compiled vastly different by two different compilers or even two versions of the same compiler. And it's quite obvious both were compiled without any optimization, with optimization the results would be even more different.
Let's reason about what the for loop does.
for (i = 0; i < 10; i++) {
code;
}
Let's write it a little bit closer to the assembler that was generated by the first compiler generated.
i = 0;
start: if (i > 9) goto out;
code;
i++;
goto start;
out:
Now the same thing for "my version":
i = 0;
goto cmp;
start: code;
i++;
cmp: if (i < 10) goto start;
The clear difference here is that in "my version" there will only be one jump executed within the loop while the book version has two. It's a quite common way to generate loops in more modern compilers because of how sensitive CPUs are to branches. Many compilers will generate code like this even without any optimizations because it performs better in most cases. Older compilers didn't do this because either they didn't think about it or this trick was performed in an optimization stage which wasn't enabled when compiling the code in the book.
Notice that a compiler with any kind of optimization enabled wouldn't even do that first goto cmp because it would know that it was unnecessary. Try compiling your code with optimization enabled (you say you use gcc, give it the -O2 flag) and see how vastly different it will look after that.

You didn't quote the full assembly-language body of the function from your textbook, but my psychic powers tell me that it looked something like this (also, I've replaced literal addresses with labels, for clarity):
# ... establish stack frame ...
mov DWORD PTR [rbp-4],0x0
cmp DWORD PTR [rbp-4],0x9
jle .L0
.L1:
mov rdi, .Lconst0
call puts
add DWORD PTR [rbp-0x4],0x1
cmp DWORD PTR [rbp-0x4],0x9
jle .L1
.L0:
# ... return from function ...
GCC has noticed that it can eliminate the initial cmp and jle by replacing them with an unconditional jmp down to the cmp at the bottom of the loop, so that is what it did. This is a standard optimization called loop inversion. Apparently it does this even with the optimizer off; with optimization on, it would also have noticed that the initial comparison must be false, hoisted out the address load, placed the loop index in a register, and converted to a count-down loop so it could eliminate the cmp altogether; something like this:
# ... establish stack frame ...
mov ebx, 10
mov r14, .Lconst0
.L1:
mov rdi, r14
call puts
dec ebx
jne .L1
# ... return from function ...
(The above was actually generated by Clang. My version of GCC did something else, equally sensible but harder to explain.)

Does GCC cache loop variables?

When I have a loop like:
for (int i = 0; i < SlowVariable; i++)
{
//
}
I know that in VB6 the SlowVariable is accessed every iteration of the loop, making the following much more efficient:
int cnt = SlowVariable;
for (int i = 0; i < cnt; i++)
{
//
}
Do I need the make the same optimizations in GCC? Or does it evaluate SlowVariable only once?

This is called "hoisting" SlowVariabele out of the loop.
The compiler can do it only if it can prove that the value of SlowVariabele is the same every time, and that evaluating SlowVariabele has no side-effects.
So for example consider the following code (I assume for the sake of example that accessing through a pointer is "slow" for some reason):
void foo1(int *SlowVariabele, int *result) {
for (int i = 0; i < *SlowVariabele; ++i) {
--*result;
}
}
The compiler cannot (in general) hoist, because for all it knows it will be called with result == SlowVariabele, and so the value of *SlowVariabele is changing during the loop.
On the other hand:
void foo2(int *result) {
int val = 12;
int *SlowVariabele = &val;
for (int i = 0; i < *SlowVariabele; ++i) {
--*result;
}
}
Now at least in principle, the compiler can know that val never changes in the loop, and so it can hoist. Whether it actually does so is a matter of how aggressive the optimizer is and how good its analysis of the function is, but I'd expect any serious compiler to be capable of it.
Similarly, if foo1 was called with pointers that the compiler can determine (at the call site) are non-equal, and if the call is inlined, then the compiler could hoist. That's what restrict is for:
void foo3(int *restrict SlowVariabele, int *restrict result) {
for (int i = 0; i < *SlowVariabele; ++i) {
--*result;
}
}
restrict (introduced in C99) means "you must not call this function with result == SlowVariabele", and allows the compiler to hoist.
Similarly:
void foo4(int *SlowVariabele, float *result) {
for (int i = 0; i < *SlowVariabele; ++i) {
--*result;
}
}
The strict aliasing rules mean that SlowVariable and result must not refer to the same location (or the program has undefined behaviour anyway), and so again the compiler can hoist.

Generally, variables can't be slow (or fast) unless they are mapped to some weird kind of memory (you usually want to declare them volatile in this case).
But indeed, using a local variable creates more opportunities for optimization, and the effect may be very visible. The compiler can "cache" a global variable by itself only if it's able to prove that no function called within a loop can read or write that global variable. When you call an external function within a loop, the compiler probably won't be able to prove such a thing.

This depends on how the compiler optimize, for example here:
#include <stdio.h>
int main(int argc, char **argv)
{
unsigned int i;
unsigned int z = 10;
for( i = 0 ; i < z ; i++ )
printf("%d\n", i);
return 0;
}
If you compiled it using gcc example.c -o example, the result code will be:
0x0040138c <+0>: push ebp
0x0040138d <+1>: mov ebp,esp
0x0040138f <+3>: and esp,0xfffffff0
0x00401392 <+6>: sub esp,0x20
0x00401395 <+9>: call 0x4018f4 <__main>
0x0040139a <+14>: mov DWORD PTR [esp+0x18],0xa
0x004013a2 <+22>: mov DWORD PTR [esp+0x1c],0x0
0x004013aa <+30>: jmp 0x4013c4 <main+56>
0x004013ac <+32>: mov eax,DWORD PTR [esp+0x1c]
0x004013b0 <+36>: mov DWORD PTR [esp+0x4],eax
0x004013b4 <+40>: mov DWORD PTR [esp],0x403064
0x004013bb <+47>: call 0x401b2c <printf>
0x004013c0 <+52>: inc DWORD PTR [esp+0x1c]
0x004013c4 <+56>: mov eax,DWORD PTR [esp+0x1c] ; (1)
0x004013c8 <+60>: cmp eax,DWORD PTR [esp+0x18] ; (2)
0x004013cc <+64>: jb 0x4013ac <main+32>
0x004013ce <+66>: mov eax,0x0
0x004013d3 <+71>: leave
0x004013d4 <+72>: ret
0x004013d5 <+73>: nop
0x004013d6 <+74>: nop
0x004013d7 <+75>: nop
The value of i will be movied from the stack into eax.
Then the CPU will compare eax or i, with the value of z, which is in the stack.
All of this happen on every round.
If you optimized the code using gcc -O2 example.c -o example, the result will be:
0x00401b70 <+0>: push ebp
0x00401b71 <+1>: mov ebp,esp
0x00401b73 <+3>: push ebx
0x00401b74 <+4>: and esp,0xfffffff0
0x00401b77 <+7>: sub esp,0x10
0x00401b7a <+10>: call 0x4018a8 <__main>
0x00401b7f <+15>: xor ebx,ebx
0x00401b81 <+17>: lea esi,[esi+0x0]
0x00401b84 <+20>: mov DWORD PTR [esp+0x4],ebx
0x00401b88 <+24>: mov DWORD PTR [esp],0x403064
0x00401b8f <+31>: call 0x401ae0 <printf>
0x00401b94 <+36>: inc ebx
0x00401b95 <+37>: cmp ebx,0xa ; (1)
0x00401b98 <+40>: jne 0x401b84 <main+20>
0x00401b9a <+42>: xor eax,eax
0x00401b9c <+44>: mov ebx,DWORD PTR [ebp-0x4]
0x00401b9f <+47>: leave
0x00401ba0 <+48>: ret
0x00401ba1 <+49>: nop
0x00401ba2 <+50>: nop
0x00401ba3 <+51>: nop
The compiler knows that there is no point of checking the value of z, so it modifies the code to something like for( i = 0 ; i < 10 ; i++ ).
In case the compiler doesn't konw the value of z like in this code:
#include <stdio.h>
void loop(unsigned int z) {
unsigned int i;
for( i = 0 ; i < z ; i++ )
printf("%d\n", i);
}
int main(int argc, char **argv)
{
unsigned int z = 10;
loop(z);
return 0;
}
The result will be:
0x0040138c <+0>: push esi
0x0040138d <+1>: push ebx
0x0040138e <+2>: sub esp,0x14
0x00401391 <+5>: mov esi,DWORD PTR [esp+0x20] ; (1)
0x00401395 <+9>: test esi,esi
0x00401397 <+11>: je 0x4013b1 <loop+37>
0x00401399 <+13>: xor ebx,ebx ; (2)
0x0040139b <+15>: nop
0x0040139c <+16>: mov DWORD PTR [esp+0x4],ebx
0x004013a0 <+20>: mov DWORD PTR [esp],0x403064
0x004013a7 <+27>: call 0x401b0c <printf>
0x004013ac <+32>: inc ebx
0x004013ad <+33>: cmp ebx,esi
0x004013af <+35>: jne 0x40139c <loop+16>
0x004013b1 <+37>: add esp,0x14
0x004013b4 <+40>: pop ebx
0x004013b5 <+41>: pop esi
0x004013b6 <+42>: ret
0x004013b7 <+43>: nop
z will endup in some unused register esi, registers are the fastest storage classed.
There is no local variable i, on the stack, the compiler used ebx to store the value of i, also register.
After all, it depends on the compiler and the optimization options you use, but, in all cases, C still faster, much faster, than VB.

It depends on your compiler, but I believe most of the contemporary compilers will optimize that for you if the value of SlowVariable is constant.

"It" (the language) doesn't say. It must behave as if the variable is evaluated every time, of course.
An optimizing compiler can do a lot of clever things, so it's always best to leave these sorts of micro-optimizations to the compiler.
If you're going down the optimization by hand route, be sure to profile (=measure) and read the generated code.

Actually it depends on "SlowVariable" and on the behavior of your compiler. If your slow variable is e.g. volatile the compiler won't do any effort to cache it as the keyword volatile won't permit it. If it's not "volatile" there is a good chance that the compiler optimizes consecutive accesses to this variable by loading it once into the register.

Buffer overflow example not working on Debian 2.6

I am trying to make the buffer exploitation example (example3.c from http://insecure.org/stf/smashstack.html) work on Debian Lenny 2.6 version. I know the gcc version and the OS version is different than the one used by Aleph One. I have disabled any stack protection mechanisms using -fno-stack-protector and sysctl -w kernel.randomize_va_space=0 arguments. To account for the differences in my setup and Aleph One's I introduced two parameters : offset1 -> Offset from buffer1 variable to the return address and offset2 -> how many bytes to jump to skip a statement. I tried to figure out these parameters by analyzing assembly code but was not successful. So, I wrote a shell script that basically runs the buffer overflow program with simultaneous values of offset1 and offset2 from (1-60). But much to my surprise I am still not able to break this program. It would be great if someone can guide me for the same. I have attached the code and assembly output for consideration. Sorry for the really long post :)
Thanks.
// Modified example3.c from Aleph One paper - Smashing the stack
void function(int a, int b, int c, int offset1, int offset2) {
char buffer1[5];
char buffer2[10];
int *ret;
ret = (int *)buffer1 + offset1;// how far is return address from buffer ?
(*ret) += offset2; // modify the value of return address
}
int main(int argc, char* argv[]) {
int x;
x = 0;
int offset1 = atoi(argv[1]);
int offset2 = atoi(argv[2]);
function(1,2,3, offset1, offset2);
x = 1; // Goal is to skip this statement using buffer overflow
printf("X : %d\n",x);
return 0;
}
-----------------
// Execute the buffer overflow program with varying offsets
#!/bin/bash
for ((i=1; i<=60; i++))
do
for ((j=1; j<=60; j++))
do
echo "`./test $i $j`"
done
done
-- Assembler output
(gdb) disassemble main
Dump of assembler code for function main:
0x080483c2 <main+0>: lea 0x4(%esp),%ecx
0x080483c6 <main+4>: and $0xfffffff0,%esp
0x080483c9 <main+7>: pushl -0x4(%ecx)
0x080483cc <main+10>: push %ebp
0x080483cd <main+11>: mov %esp,%ebp
0x080483cf <main+13>: push %ecx
0x080483d0 <main+14>: sub $0x24,%esp
0x080483d3 <main+17>: movl $0x0,-0x8(%ebp)
0x080483da <main+24>: movl $0x3,0x8(%esp)
0x080483e2 <main+32>: movl $0x2,0x4(%esp)
0x080483ea <main+40>: movl $0x1,(%esp)
0x080483f1 <main+47>: call 0x80483a4 <function>
0x080483f6 <main+52>: movl $0x1,-0x8(%ebp)
0x080483fd <main+59>: mov -0x8(%ebp),%eax
0x08048400 <main+62>: mov %eax,0x4(%esp)
0x08048404 <main+66>: movl $0x80484e0,(%esp)
0x0804840b <main+73>: call 0x80482d8 <printf#plt>
0x08048410 <main+78>: mov $0x0,%eax
0x08048415 <main+83>: add $0x24,%esp
0x08048418 <main+86>: pop %ecx
0x08048419 <main+87>: pop %ebp
0x0804841a <main+88>: lea -0x4(%ecx),%esp
0x0804841d <main+91>: ret
End of assembler dump.
(gdb) disassemble function
Dump of assembler code for function function:
0x080483a4 <function+0>: push %ebp
0x080483a5 <function+1>: mov %esp,%ebp
0x080483a7 <function+3>: sub $0x20,%esp
0x080483aa <function+6>: lea -0x9(%ebp),%eax
0x080483ad <function+9>: add $0x30,%eax
0x080483b0 <function+12>: mov %eax,-0x4(%ebp)
0x080483b3 <function+15>: mov -0x4(%ebp),%eax
0x080483b6 <function+18>: mov (%eax),%eax
0x080483b8 <function+20>: lea 0x7(%eax),%edx
0x080483bb <function+23>: mov -0x4(%ebp),%eax
0x080483be <function+26>: mov %edx,(%eax)
0x080483c0 <function+28>: leave
0x080483c1 <function+29>: ret
End of assembler dump.

The disassembly for function you provided seems to use hardcoded values of offset1 and offset2, contrary to your C code.
The address for ret should be calculated using byte/char offsets: ret = (int *)(buffer1 + offset1), otherwise you'll get hit by pointer math (especially in this case, when your buffer1 is not at a nice aligned offset from the return address).
offset1 should be equal to 0x9 + 0x4 (the offset used in lea + 4 bytes for the push %ebp). However, this can change unpredictably each time you compile - the stack layout might be different, the compiler might create some additional stack alignment, etc.
offset2 should be equal to 7 (the length of the instruction you're trying to skip).
Note that you're getting a little lucky here - the function uses the cdecl calling convention, which means the caller is responsible for removing arguments off the stack after returning from the function, which normally looks like this:
push arg3
push arg2
push arg1
call func
add esp, 0Ch ; remove as many bytes as were used by the pushed arguments
Your compiler chose to combine this correction with the one after printf, but it could also decide to do this after your function call. In this case the add esp, <number> instruction would be present between your return address and the instruction you want to skip - you can probably imagine that this would not end well.

My overflow code does not work

The code below is from the well-known article Smashing The Stack For Fun And Profit.
void function(int a, int b, int c) {
char buffer1[5];
char buffer2[10];
int *ret;
ret = buffer1 + 12;
(*ret)+=8;
}
void main() {
int x;
x=0;
function(1,2,3);
x=1;
printf("%d\n",x);
}
I think I must explain my target of this code.
The stack model is below. The number below the word is the number of bytes of the variable in the stack. So, if I want to rewrite RET to skip the statement I want, I calculate the offset from buffer1 to RET is 8+4=12. Since the architecture is x86 Linux.
buffer2 buffer1 BSP RET a b c
(12) (8) (4) (4) (4) (4) (4)
I want to skip the statement x=1; and let printf() output 0 on the screen.
I compile the code with:
gcc stack2.c -g
and run it in gdb:
gdb ./a.out
gdb gives me the result like this:
Program received signal SIGSEGV, Segmentation fault.
main () at stack2.c:17
17 x = 1;
I think Linux uses some mechanism to protect against stack overflow. Maybe Linux stores the RET address in another place and compares the RET address in the stack before functions return.
And what is the detail about the mechanism? How should I rewrite the code to make the program output 0?
OK,the disassemble code is below.It comes form the output of gdb since I think is more easy to read for you.And anybody can tell me how to paste a long code sequence?Copy and paste one by one makes me too tired...
Dump of assembler code for function main:
0x08048402 <+0>: push %ebp
0x08048403 <+1>: mov %esp,%ebp
0x08048405 <+3>: sub $0x10,%esp
0x08048408 <+6>: movl $0x0,-0x4(%ebp)
0x0804840f <+13>: movl $0x3,0x8(%esp)
0x08048417 <+21>: movl $0x2,0x4(%esp)
0x0804841f <+29>: movl $0x1,(%esp)
0x08048426 <+36>: call 0x80483e4 <function>
0x0804842b <+41>: movl $0x1,-0x4(%ebp)
0x08048432 <+48>: mov $0x8048520,%eax
0x08048437 <+53>: mov -0x4(%ebp),%edx
0x0804843a <+56>: mov %edx,0x4(%esp)
0x0804843e <+60>: mov %eax,(%esp)
0x08048441 <+63>: call 0x804831c <printf#plt>
0x08048446 <+68>: mov $0x0,%eax
0x0804844b <+73>: leave
0x0804844c <+74>: ret
Dump of assembler code for function function:
0x080483e4 <+0>: push %ebp
0x080483e5 <+1>: mov %esp,%ebp
0x080483e7 <+3>: sub $0x14,%esp
0x080483ea <+6>: lea -0x9(%ebp),%eax
0x080483ed <+9>: add $0x3,%eax
0x080483f0 <+12>: mov %eax,-0x4(%ebp)
0x080483f3 <+15>: mov -0x4(%ebp),%eax
0x080483f6 <+18>: mov (%eax),%eax
0x080483f8 <+20>: lea 0x8(%eax),%edx
0x080483fb <+23>: mov -0x4(%ebp),%eax
0x080483fe <+26>: mov %edx,(%eax)
0x08048400 <+28>: leave
0x08048401 <+29>: ret
I check the assemble code and find some mistake about my program,and I have rewrite (*ret)+=8 to (*ret)+=7,since 0x08048432 <+48>minus0x0804842b <+41> is 7.

Because that article is from 1996 and the assumptions are incorrect.
Refer to "Smashing The Modern Stack For Fun And Profit"
http://www.ethicalhacker.net/content/view/122/24/
From the above link:
However, the GNU C Compiler (gcc) has evolved since 1998, and as a result, many people are left wondering why they can't get the examples to work for them, or if they do get the code to work, why they had to make the changes that they did.

The function function overwrites some place of the stack outside of its own, which is this case is the stack of main. What it overwrites I don't know, but it causes the segmentation fault you see. It might be some protection employed by the operating system, but it might as well be the generated code just does something wrong when wrong value is at that position on the stack.
This is a really good example of what may happen when you write outside of your allocated memory. It might crash directly, it might crash somewhere completely different, or if might not crash at all but instead just do some calculation wrong.

Try ret = buffer1 + 3;
Explanation: ret is an integer pointer; incrementing it by 1 adds 4 bytes to the address on 32bit machines.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

How come my array index is faster than pointer - c

Array dereference p[i] is *(p + i). Compilers make use of instructions that do math + dereference in 1 or 2 cycles (e.g. x86 LEA instruction) to optimize for speed. With the pointer loop, it splits the access and offset into to separate parts and the compiler cannot optimize it.

I would suggest running each loop 200 million times, and then run each loop 10 times, and take the fastest measurement. That will factor out effects from OS scheduling and so on. I would then suggest you disassemble the code for each loop.

The times are so close together that if you did them repeatedly, you may not see much of a difference. Both code segments compile to the exact same assembly. By definition, there is no difference.

It looks like the index solution can save a few instructions with the compare in the for loop.

Related

Does GCC generate suboptimal code for static branch prediction?

Difference in for loops of old and new GCC's generated assembly code

Does GCC cache loop variables?

Buffer overflow example not working on Debian 2.6

My overflow code does not work

Categories

Resources