Why would gcc -O3 generate multiple ret instructions? [duplicate]

This question already has an answer here: Why does GCC emit a repeated `ret`?
I was looking at some recursive function from here:
int get_steps_to_zero(int n)
{
    if (n == 0) {
        // Base case: we have reached zero
        return 0;
    } else if (n % 2 == 0) {
        // Recursive case 1: we can divide by 2
        return 1 + get_steps_to_zero(n / 2);
    } else {
        // Recursive case 2: we can subtract 1
        return 1 + get_steps_to_zero(n - 1);
    }
}
I checked the disassembly to see whether gcc managed tail-call optimization/unrolling. Looks like it did, though with x86-64 gcc 12.2 -O3 I get a function like this, ending with two ret instructions:
get_steps_to_zero:
xor eax, eax
test edi, edi
jne .L5
jmp .L6
.L10:
mov edx, edi
shr edx, 31
add edi, edx
sar edi
test edi, edi
je .L9
.L5:
add eax, 1
test dil, 1
je .L10
sub edi, 1
test edi, edi
jne .L5
.L9:
ret
.L6:
ret
Godbolt example.
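For reference, the optimized code corresponds roughly to this iterative C (my own sketch of what the assembly is doing, not compiler output):

// Rough C rendering of the -O3 assembly after tail-call elimination (a sketch).
int get_steps_to_zero_iterative(int n)
{
    int steps = 0;          // xor eax, eax
    while (n != 0) {        // test edi, edi / jne
        steps++;            // add eax, 1
        if (n % 2 == 0)     // test dil, 1
            n /= 2;         // the shr/add/sar sequence: signed division by 2
        else
            n -= 1;         // sub edi, 1
    }
    return steps;
}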
What's the purpose of the multiple returns? Is it a bug?
EDIT
Seems like this appeared starting with gcc 11.x. When compiling with gcc 10.x, the function ends like this:
.L1:
mov eax, r8d
ret
.L6:
xor r8d, r8d
mov eax, r8d
ret
As in: store the result in eax. The 11.x version instead zeroes eax at the beginning of the function and then modifies it in the function body, eliminating the need for the extra mov instruction.

This is a manifestation of a pass-ordering problem. At some point in the optimization pipeline the two basic blocks ending in ret are not equivalent; then some pass makes them equivalent, but no following pass is capable of collapsing the two equivalent blocks into one.
On Compiler Explorer, you can see how compiler optimization pipeline works by inspecting snapshots of internal representation between passes. For GCC, select "Add New > GCC Tree/RTL" in the compiler pane. Here's your example, with a snapshot immediately preceding the problematic transformation pre-selected in the new pane: https://godbolt.org/z/nTazM5zGG
Towards the end of the dump, you can see the two basic blocks:
65: NOTE_INSN_BASIC_BLOCK 8
77: use ax:SI
66: simple_return
and
43: NOTE_INSN_BASIC_BLOCK 9
5: ax:SI=0
38: use ax:SI
74: NOTE_INSN_EPILOGUE_BEG
75: simple_return
Basically the second block is different in that it sets eax to zero before returning. If you look at the next pass (called "jump2"), you see that it lifts the ax:SI=0 instruction from basic block 9 and basic block 3 to basic block 2, making BB 9 equivalent to BB 8.
If you disable this optimization with -fno-crossjumping, the difference will be carried to the end, making the resulting assembly less surprising.
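For intuition, cross-jumping is the pass that merges identical trailing code of different paths into a single block. A minimal sketch of the kind of shape involved (hypothetical helper g, not taken from the question):

// Hypothetical illustration: both branches end with the same tail
// (add 1 to the call's result, then return). The crossjumping/jump2
// passes can share that tail; -fno-crossjumping keeps the copies separate.
int g(int);   // assumed external function

int pick(int c)
{
    if (c)
        return g(c) + 1;   // tail: add eax, 1 ; ret
    return g(0) + 1;       // identical tail
}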

Conclusion first: This is a deliberate optimization choice by GCC.
If you use GCC locally (gcc -O3 -S) instead of on Godbolt, you can see that there are alignment directives between the two ret instructions:
; top part omitted
.L9:
ret
.p2align 4,,10
.p2align 3
.L6:
ret
.cfi_endproc
The object file, when disassembled, includes a NOP in that padding area:
8: 75 13 jne 1d <get_steps_to_zero+0x1d>
a: eb 24 jmp 30 <get_steps_to_zero+0x30>
c: 0f 1f 40 00 nopl 0x0(%rax)
<...>
2b: 75 f0 jne 1d <get_steps_to_zero+0x1d>
2d: c3 ret
2e: 66 90 xchg %ax,%ax
30: c3 ret
The second ret instruction is aligned to a 16-byte boundary, whereas the first one isn't. Aligned jump targets can be fetched faster when reached from a distant branch source. The other return paths, however, are close enough to the first ret instruction that they would not benefit from jumping to an aligned target.
This alignment is even more noticeable on my Zen 2 CPU with -mtune=native, with more padding bytes added:
29: 75 f2 jne 1d <get_steps_to_zero+0x1d>
2b: c3 ret
2c: 0f 1f 40 00 nopl 0x0(%rax)
30: c3 ret

Related

Difficulty understanding logic in disassembled binary bomb phase 3

I have the following assembly program from the binary-bomb lab. The goal is to determine the keyword needed to run the binary without triggering the explode_bomb function. I commented my analysis of the assembly for this program but I am having trouble piecing everything together.
I believe I have all the information I need, but I still am unable to see the actual underlying logic and thus I am stuck. I would greatly appreciate any help!
The following is the disassembled program itself:
0x08048c3c <+0>: push %edi
0x08048c3d <+1>: push %esi
0x08048c3e <+2>: sub $0x14,%esp
0x08048c41 <+5>: movl $0x804a388,(%esp)
0x08048c48 <+12>: call 0x80490ab <string_length>
0x08048c4d <+17>: add $0x1,%eax
0x08048c50 <+20>: mov %eax,(%esp)
0x08048c53 <+23>: call 0x8048800 <malloc#plt>
0x08048c58 <+28>: mov $0x804a388,%esi
0x08048c5d <+33>: mov $0x13,%ecx
0x08048c62 <+38>: mov %eax,%edi
0x08048c64 <+40>: rep movsl %ds:(%esi),%es:(%edi)
0x08048c66 <+42>: movzwl (%esi),%edx
0x08048c69 <+45>: mov %dx,(%edi)
0x08048c6c <+48>: movzbl 0x11(%eax),%edx
0x08048c70 <+52>: mov %dl,0x10(%eax)
0x08048c73 <+55>: mov %eax,0x4(%esp)
0x08048c77 <+59>: mov 0x20(%esp),%eax
0x08048c7b <+63>: mov %eax,(%esp)
0x08048c7e <+66>: call 0x80490ca <strings_not_equal>
0x08048c83 <+71>: test %eax,%eax
0x08048c85 <+73>: je 0x8048c8c <phase_3+80>
0x08048c87 <+75>: call 0x8049363 <explode_bomb>
0x08048c8c <+80>: add $0x14,%esp
0x08048c8f <+83>: pop %esi
0x08048c90 <+84>: pop %edi
0x08048c91 <+85>: ret
The following block contains my analysis
<phase_3>
0x08048c3c <+0>: push %edi // push value in edi to stack
0x08048c3d <+1>: push %esi // push value of esi to stack
0x08048c3e <+2>: sub $0x14,%esp // grow stack by 0x14 (move stack ptr -0x14 bytes)

0x08048c41 <+5>: movl $0x804a388,(%esp) // put 0x804a388 into loc esp points to

0x08048c48 <+12>: call 0x80490ab <string_length> // check string length, store in eax
0x08048c4d <+17>: add $0x1,%eax // increment val in eax by 0x1 (str len + 1)
// at this point, eax = str_len + 1 = 77 + 1 = 78

0x08048c50 <+20>: mov %eax,(%esp) // get val in eax and put in loc on stack
//**** at this point, 0x804a388 should have a value of 78? ****

0x08048c53 <+23>: call 0x8048800 <malloc#plt> // malloc --> base ptr in eax

0x08048c58 <+28>: mov $0x804a388,%esi // 0x804a388 in esi
0x08048c5d <+33>: mov $0x13,%ecx // put 0x13 in ecx (counter register)
0x08048c62 <+38>: mov %eax,%edi // put val in eax into edi
0x08048c64 <+40>: rep movsl %ds:(%esi),%es:(%edi) // repeat 0x13 (19) times
// **** populate malloced memory with first 19 (edit: 76) chars of string at 0x804a388 (this string is 77 characters long)? ****

0x08048c66 <+42>: movzwl (%esi),%edx // put val in loc esi points to into edx
// ***** at this point, edx should contain the string at 0x804a388?

0x08048c69 <+45>: mov %dx,(%edi) // put val in dx to loc edi points to
// ***** not sure what effect this has or what is in edi at this point
0x08048c6c <+48>: movzbl 0x11(%eax),%edx // edx = [eax + 0x11]
0x08048c70 <+52>: mov %dl,0x10(%eax) // [eax + 0x10] = dl
0x08048c73 <+55>: mov %eax,0x4(%esp) // [esp + 0x4] = eax
0x08048c77 <+59>: mov 0x20(%esp),%eax // eax = [esp + 0x20]
0x08048c7b <+63>: mov %eax,(%esp) // put val in eax into loc esp points to
// ***** not sure what effect these movs have

// edi --> first arg
// esi --> second arg
// compare value in esi to edi
0x08048c7e <+66>: call 0x80490ca <strings_not_equal> // store result in eax
0x08048c83 <+71>: test %eax,%eax
0x08048c85 <+73>: je 0x8048c8c <phase_3+80>
0x08048c87 <+75>: call 0x8049363 <explode_bomb>
0x08048c8c <+80>: add $0x14,%esp
0x08048c8f <+83>: pop %esi
0x08048c90 <+84>: pop %edi
0x08048c91 <+85>: ret
Update:
Upon inspecting the registers before strings_not_equal is called, I get the following:
eax 0x804d8aa 134535338
ecx 0x0 0
edx 0x76 118
ebx 0xffffd354 -11436
esp 0xffffd280 0xffffd280
ebp 0xffffd2b8 0xffffd2b8
esi 0x804a3d4 134521812
edi 0x804f744 134543172
eip 0x8048c7b 0x8048c7b <phase_3+63>
eflags 0x282 [ SF IF ]
cs 0x23 35
ss 0x2b 43
ds 0x2b 43
es 0x2b 43
fs 0x0 0
gs 0x63 99
and I get the following disassembled pseudocode using Hopper:
I even tried using both the number found in eax and the string seen earlier as my keyword but neither of them worked.
The function makes a modified copy of a string from static storage, into a malloced buffer.
This looks weird. The malloc size is dependent on strlen+1, but the memcpy size is a compile-time constant? Your decompilation apparently shows that address was a string literal so it seems that's fine.
Probably that missed optimization happened because of a custom string_length() function that was maybe only defined in another .c (and the bomb was compiled without link-time optimization for cross-file inlining). So size_t len = string_length("some string literal"); is not a compile-time constant and the compiler emitted a call to it instead of being able to use the known constant length of the string.
But probably they used strcpy in the source and the compiler did inline that as a rep movs. Since it's apparently copying from a string literal, the length is a compile-time constant and it can optimize away that part of the work that strcpy normally has to do. Normally if you've already calculated the length it's better to use memcpy instead of making strcpy calculate it again on the fly, but in this case it actually helped the compiler make better code for that part than if they'd passed the return value of string_length to a memcpy, again because string_length couldn't inline and optimize away.
<+0>: push %edi // push value in edi to stack
<+1>: push %esi // push value of esi to stack
<+2>: sub $0x14,%esp // grow stack by 0x14 (move stack ptr -0x14 bytes)
Comments like that are redundant; the instruction itself already says that. This is saving two call-preserved registers so the function can use them internally and restore them later.
Your comment on the sub is better; yes, grow the stack is the higher level semantic meaning here. This function reserves some space for locals (and for function args to be stored with mov instead of pushed).
The rep movsd copies 0x13 * 4 bytes, incrementing ESI and EDI to point past the end of the copied region. So another movsd instruction would copy another 4 bytes contiguous with the previous copy.
The code actually copies another 2 bytes, but instead of using movsw, it uses a movzwl word load and a mov word store. That makes 76 + 2 = 78 bytes copied in total (the 77-character string plus its terminating NUL).
...
# at this point EAX = malloc return value which I'll call buf
<+28>: mov $0x804a388,%esi # copy src = a string literal in .rodata?
<+33>: mov $0x13,%ecx
<+38>: mov %eax,%edi # copy dst = buf
<+40>: rep movsl %ds:(%esi),%es:(%edi) # memcpy 76 bytes and advance ESI, EDI
<+42>: movzwl (%esi),%edx
<+45>: mov %dx,(%edi) # copy another 2 bytes (not moving ESI or EDI)
# final effect: 78-byte memcpy
On some (but not all) CPUs it would have been efficient to just use rep movsb or rep movsw with appropriate counts, but that's not what the compiler chose in this case. movzx aka AT&T movz is a good way to do narrow loads without partial-register penalties. That's why compilers do it, so they can write a full register even though they're only going to read the low 8 or 16 bits of that reg with a store instruction.
After that copy of a string literal into buf, we have a byte load/store that copies a character within buf. Remember that at this point EAX is still pointing at buf, the malloc return value. So it's making a modified copy of the string literal.
<+48>: movzbl 0x11(%eax),%edx
<+52>: mov %dl,0x10(%eax) # buf[16] = buf[17]
Perhaps if the source hadn't defeated constant-propagation, with high enough optimization level the compiler might have just put the final string into .rodata where you could find it, trivializing this bomb phase. :P
Then it stores pointers as stack args for string compare.
<+55>: mov %eax,0x4(%esp) # 2nd arg slot = EAX = buf
<+59>: mov 0x20(%esp),%eax # function arg = user input?
<+63>: mov %eax,(%esp) # first arg slot = our incoming stack arg
<+66>: call 0x80490ca <strings_not_equal>
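Putting the pieces together, a hedged C reconstruction of the whole phase might look like this (the literal's actual 77 characters are unknown here, and the helper prototypes are guesses based on the call sites):

#include <stdlib.h>
#include <string.h>

/* Bomb-provided helpers; exact prototypes are assumptions. */
extern int  string_length(const char *s);
extern int  strings_not_equal(const char *a, const char *b);
extern void explode_bomb(void);

/* Placeholder for the real 77-character literal at 0x804a388. */
static const char secret[] = "placeholder-for-the-77-character-secret-string";

void phase_3(const char *input)
{
    char *buf = malloc(string_length(secret) + 1);  /* strlen + 1 bytes */
    strcpy(buf, secret);     /* inlined by the compiler as rep movsl + a 2-byte tail */
    buf[0x10] = buf[0x11];   /* the single-character tweak */
    if (strings_not_equal(input, buf))
        explode_bomb();
}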
How to "cheat": looking at the runtime result with GDB
Some bomb labs only let you run the bomb online, on a test server, which would record explosions. You couldn't run it under GDB, only use static disassembly (like objdump -drwC -Mintel). So the test server could record how many failed attempts you had. e.g. like CS 3330 at cs.virginia.edu that I found with google, where full credit requires less than 20 explosions.
Using GDB to examine memory / registers part way through a function makes this vastly easier than only working from static analysis, in fact trivializing this function where the single input is only checked at the very end. e.g. just look at what other arg is being passed to strings_not_equal. (Especially if you use GDB's jump or set $pc = ... commands to skip past the bomb explosion checks.)
Set a breakpoint or single-step to just before the call to strings_not_equal. Use p (char*)$eax to treat EAX as a char* and show you the (0-terminated) C string starting at that address. At that point EAX holds the address of the buffer, as you can see from the store to the stack.
Copy/paste that string result and you're done.
Other phases with multiple numeric inputs typically aren't this easy to cheese with a debugger and do require at least some math, but linked-list phases that require you to have a sequence of numbers in the right order for list traversal also become trivial if you know how to use a debugger to set registers to make compares succeed as you get to them.
rep movsl copies 32-bit longwords from address %esi to address %edi, incrementing both by 4 each time, a number of times equal to %ecx. Think of it as memcpy(edi, esi, ecx*4).
See https://felixcloutier.com/x86/movs:movsb:movsw:movsd:movsq (it's movsd in Intel notation).
So this is copying 19*4=76 bytes.
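A rough C rendering of the instruction's effect (a sketch assuming the direction flag is clear, which the ABI guarantees on function entry):

#include <stdint.h>

/* Approximate effect of `rep movsl` with DF = 0: copy *ecx 32-bit longwords
   from *esi to *edi, advancing both pointers and leaving ecx at zero. */
static void rep_movsl(uint32_t **edi, uint32_t **esi, uint32_t *ecx)
{
    while (*ecx != 0) {
        **edi = **esi;   /* copy one longword */
        (*edi)++;
        (*esi)++;
        (*ecx)--;
    }
}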

Design a function in C from assembly code

I need to design a function in C to achieve what is written in the machine code. I understand the assembly operations step by step, but my function is said to be implemented incorrectly, and I am confused.
This is the disassembled code of the function.
(Hand transcribed from an image, typos are possible
especially in the machine-code. See revision history for the image)
0000000000000000 <ex3>:
0: b9 00 00 00 00 mov 0x0,%ecx
5: eb 1b jmp L2 // 22 <ex3+0x22>
7: 48 63 c1 L1: movslq %ecx,%rax
a: 4c 8d 04 07 lea (%rdi,%rax,1),%r8
e: 45 0f b6 08 movzbl (%r8),%r9d
12: 48 01 f0 add %rsi,%rax
15: 44 0f b6 10 movzbl (%rax),%r10d
19: 45 88 10 mov %r10b,(%r8)
1c: 44 88 08 mov %r9b,(%rax)
1f: 83 c1 01 add $0x1,%ecx
22: 39 d1 L2: cmp %edx,%ecx
24: 7c e1 jl L1 // 7 <ex3+0x7>
26: f3 c3 repz retq
My code (the signature of the function is not given or settled):
#include <assert.h>
int
ex3(int rdi, int rsi,int edx, int r8,int r9 ) {
int ecx = 0;
int rax;
if(ecx>edx){
rax = ecx;
r8 =rdi+rax;
r9 =r8;
rax =rsi;
int r10=rax;
r8=r10;
rax =r9;
ecx+=1;
}
return rax;
}
Please explain what causes the bugs if you recognize any.
I am pretty sure it does this: swap two areas of memory:
void memswap(unsigned char *rdi, unsigned char *rsi, int edx) {
    int ecx;
    for (ecx = 0; ecx < edx; ecx++) {
        unsigned char r9 = rdi[ecx];
        unsigned char r10 = rsi[ecx];
        rdi[ecx] = r10;
        rsi[ecx] = r9;
    }
}
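A quick way to sanity-check that reading (a hypothetical test harness, not part of the original answer; it assumes the memswap() definition above):

#include <stdio.h>

int main(void)
{
    unsigned char a[6] = "hello", b[6] = "world";
    memswap(a, b, 5);                      /* swap the first 5 bytes of each buffer */
    printf("%s %s\n", (char*)a, (char*)b); /* expected output: "world hello" */
    return 0;
}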
(Editor's note: this is a partial answer that only addresses the loop structure. It doesn't cover the movzbl byte loads, or the fact that some of these variables are pointers, or type widths. There's room for other answers to cover other parts of the question.)
C supports goto, and even though its use is often frowned upon, it is very useful here. Use it to make the C code as similar to the assembly as possible. This allows you to make sure that the code works before you start introducing more proper control-flow mechanisms, like while loops. So I would do something like this:
goto L2;
L1:
rax = ecx;
r8 =rdi+rax;
r9 =r8;
rax =rsi;
int r10=rax;
r8=r10;
rax =r9;
ecx+=1;
L2:
if(edx<ecx)
goto L1;
You can easily transform the above code to:
while(edx<ecx) {
rax = ecx;
r8 =rdi+rax;
r9 =r8;
rax =rsi;
int r10=rax;
r8=r10;
rax =r9;
ecx+=1;
}
Note that I have not checked if the code within the L1-block and then later the while block is correct or not. (Editor's note: it's missing all the memory accesses). But your jumping was wrong and is now corrected.
What you can do from here (again, assuming that this is correct) is to start trying to see patterns. It seems like ecx is used as some kind of index variable. And the variable rax can be replaced in the beginning. We can do a few other similar changes. This gives us:
int i=0;
while(edx<i) {
// rax = ecx;
// r8 =rdi+i; // r8=rdi+i
// r9 = rdi + i; // r9 = r8
// rax =rsi;
int r10 = rsi; // int r10=rax;
r8 = r10;
rax = r9 = rdi+i;
i++;
}
Here it clearly seems like something is a bit iffy. The while condition is edx<i but i is incremented and not decremented each iteration. That's a good indication that something is wrong. I'm not skilled enough in assembly to figure it out, but at least this is a method you can use.
Just take it step by step.
add $0x1,%ecx is AT&T syntax for incrementing ecx by 1. According to this site, in Intel syntax the result is stored in the first operand; in AT&T syntax, that's the last operand.
One interesting thing to notice is that if we removed the goto L2 statement, this would instead be equivalent to
do {
// Your code
} while(edx<ecx);
A while-loop can be compiled to a do-while-loop with an additional goto. (See Why are loops always compiled into "do...while" style (tail jump)?). It's pretty easy to understand.
In assembly, loops are made with gotos that jump backward in the code. You test and then decide if you want to jump back. So in order to test before the first iteration, you need to jump to the test first. (Compilers also sometimes compile while loops with an if()break at the top and a jmp at the bottom. But only with optimization disabled. See While, Do While, For loops in Assembly Language (emu8086))
Forward jumping is often the result of compiling if statements.
I also just realized that I now have three good ways to use goto. The first two is breaking out of nested loops and releasing resources in opposite order of allocation. And now the third is this, when you reverse engineer assembly.
For those that prefer a .S format for GCC, I used:
ex3:
mov $0x0, %ecx
jmp lpe
lps:
movslq %ecx, %rax
lea (%rdi, %rax, 1), %r8
movzbl (%r8), %r9d
add %rsi, %rax
movzbl (%rax), %r10d
mov %r10b, (%r8)
mov %r9b, (%rax)
add $0x1, %ecx
lpe:
cmp %edx, %ecx
jl lps
repz retq
.data
.text
.global _main
_main:
mov $0x111111111111, %rdi
mov $0x222222222222, %rsi
mov $0x5, %rdx
mov $0x333333333333, %r8
mov $0x444444444444, %r9
call ex3
xor %eax, %eax
ret
You can then compile it with gcc main.S -o main and run objdump -x86-asm-syntax=intel -d main to see it in Intel format, or run the resulting main executable in a decompiler.. but meh.. Let's do some manual work..
First I would convert the AT&T syntax to the more commonly known Intel syntax.. so:
ex3:
mov ecx, 0
jmp lpe
lps:
movsxd rax, ecx
lea r8, [rdi + rax]
movzx r9d, byte ptr [r8]
add rax, rsi
movzx r10d, byte ptr [rax]
mov byte ptr [r8], r10b
mov byte ptr [rax], r9b
add ecx, 0x1
lpe:
cmp ecx, edx
jl lps
rep ret
Now I can see clearly that the code from lps (loop start) to lpe (loop end) is a for loop.
How? Because first it sets the counter register (ecx) to 0. Then it checks whether ecx < edx by doing a cmp ecx, edx followed by a jl (jump if less than). If it is, it runs the body and increments ecx by 1 (add ecx, 1); if not, it exits the block.
Thus it looks like: for (int32_t ecx = 0; ecx < edx; ++ecx).. (note that edx is the lower 32 bits of rdx).
So now we translate the rest with the knowledge that:
r10 is a 64-bit register. r10d is its lower 32 bits, and r10b is its lowest 8 bits.
r9 is a 64-bit register. The same logic as r10 applies.
So we can represent a register as I have below:
typedef union Register
{
    uint64_t reg;
    struct
    {
        uint32_t lower32;   // x86 is little-endian: the low half comes first in memory
        uint32_t upper32;
    };
    struct
    {
        uint16_t word0;     // lowest 16 bits
        uint16_t word1;
        uint16_t word2;
        uint16_t word3;
    };
    struct
    {
        uint8_t byte0;      // lowest byte (what r10b/r9b refer to)
        uint8_t byte1;
        uint8_t byte2;
        uint8_t byte3;
        uint8_t byte4;
        uint8_t byte5;
        uint8_t byte6;
        uint8_t byte7;
    };
} Register;
Whichever is better.. you can choose for yourself..
Now we can start looking at the instructions themselves..
movsxd or movslq moves a 32-bit register into a 64-bit register with a sign extension.
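In C terms, that sign extension is just a widening conversion from a signed 32-bit value (a small sketch):

#include <stdint.h>

/* movslq %ecx,%rax corresponds roughly to: */
int64_t sign_extend_32_to_64(int32_t ecx)
{
    return (int64_t)ecx;   /* the sign bit of ecx is replicated into the upper 32 bits */
}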
Now we can write the code:
uint8_t* ex3(uint8_t* rdi, uint64_t rsi, int32_t edx)
{
    uintptr_t rax = 0;
    for (int32_t ecx = 0; ecx < edx; ++ecx)
    {
        rax = (uintptr_t)ecx;                      // movslq %ecx,%rax
        uint8_t* r8 = rdi + rax;                   // lea (%rdi,%rax,1),%r8
        Register r9 = { .reg = *r8 };              // movzbl: zero-extend the byte into the full register
        rax += rsi;                                // add %rsi,%rax
        Register r10 = { .reg = *(uint8_t*)rax };  // movzbl (%rax),%r10d
        *r8 = r10.byte0;                           // mov %r10b,(%r8)
        *(uint8_t*)rax = r9.byte0;                 // mov %r9b,(%rax)
    }
    return (uint8_t*)rax;                          // whatever rax happens to hold; the asm returns nothing meaningful
}
Hopefully I didn't screw anything up..

Compiler using local variables without adjusting RSP

In question Compilers: Understanding assembly code generated from small programs the compiler uses two local variables without adjusting the stack pointer.
Not adjusting RSP for the use of local variables does not seem interrupt-safe, so the compiler seems to rely on the hardware automatically switching to a system stack when interrupts occur. Otherwise, the first interrupt that came along would push the instruction pointer onto the stack and overwrite the local variables.
The code from that question is:
#include <stdio.h>
int main()
{
    for(int i=0;i<10;i++){
        int k=0;
    }
}
The assembly code generated by that compiler is:
00000000004004d6 <main>:
4004d6: 55 push rbp
4004d7: 48 89 e5 mov rbp,rsp
4004da: c7 45 f8 00 00 00 00 mov DWORD PTR [rbp-0x8],0x0
4004e1: eb 0b jmp 4004ee <main+0x18>
4004e3: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0
4004ea: 83 45 f8 01 add DWORD PTR [rbp-0x8],0x1
4004ee: 83 7d f8 09 cmp DWORD PTR [rbp-0x8],0x9
4004f2: 7e ef jle 4004e3 <main+0xd>
4004f4: b8 00 00 00 00 mov eax,0x0
4004f9: 5d pop rbp
4004fa: c3 ret
The local variables are i at [rbp-0x8] and k at [rbp-0x4].
Can anyone shine light on this interrupt problem? Does the hardware indeed switch to a system stack? How? Am I wrong in my understanding?
This is the so-called "red zone" of the x86-64 ABI. A summary from Wikipedia:
In computing, a red zone is a fixed-size area in a function's stack frame beyond the current stack pointer which is not preserved by that function. The callee function may use the red zone for storing local variables without the extra overhead of modifying the stack pointer. This region of memory is not to be modified by interrupt/exception/signal handlers. The x86-64 ABI used by System V mandates a 128-byte red zone which begins directly under the current value of the stack pointer.
In 64-bit Linux user code this is OK, as long as no more than 128 bytes are used. It is an optimization used most prominently by leaf functions, i.e. functions which don't call other functions.
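The leaf restriction matters because a call clobbers memory below the caller's stack pointer; a minimal sketch (with a hypothetical external helper):

/* Hypothetical example: k must survive the call, but the call instruction
   writes the return address at the new RSP and the callee may freely use its
   own red zone below that, so the caller can't leave k in its red zone across
   the call. It must instead keep k in a call-preserved register or on stack
   space it has reserved (sub rsp, ...). */
int helper(int);   /* assumed external function */

int not_a_leaf(int x)
{
    int k = x + 1;
    return helper(k) + k;   /* k is still needed after the call */
}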
If you were to compile the example program as a 64-bit Linux program with GCC (or compatible compiler) using the -mno-red-zone option you'd see code like this generated:
main:
push rbp
mov rbp, rsp
sub rsp, 16; <<============ Observe RSP is now being adjusted.
mov DWORD PTR [rbp-4], 0
.L3:
cmp DWORD PTR [rbp-4], 9
jg .L2
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-4], 1
jmp .L3
.L2:
mov eax, 0
leave
ret
This code generation can be observed at this godbolt.org link.
For a 32-bit Linux user program it would be a bad thing not to adjust the stack pointer. If you were to compile the code in the question as 32-bit code (using -m32 option) main would appear something like the following code:
main:
push ebp
mov ebp, esp
sub esp, 16; <<============ Observe ESP is being adjusted.
mov DWORD PTR [ebp-4], 0
.L3:
cmp DWORD PTR [ebp-4], 9
jg .L2
mov DWORD PTR [ebp-8], 0
add DWORD PTR [ebp-4], 1
jmp .L3
.L2:
mov eax, 0
leave
ret
This code generation can be observed at this godbolt.org link.

Why does GCC produce stack preservation instructions when they're not necessary?

I'm compiling the following simple demonstration function:
int add(int a, int b) {
    return a + b;
}
Naturally this function would be inlined, but let's assume that it's dynamically linked or not inlined for some other reason. With optimization disabled, the compiler produces the expected code:
00000000 <add>:
0: 55 push ebp
1: 89 e5 mov ebp,esp
3: 8b 45 0c mov eax,DWORD PTR [ebp+0xc]
6: 03 45 08 add eax,DWORD PTR [ebp+0x8]
9: 5d pop ebp
a: c3 ret
Since there are no function calls inside this function, the instructions at 0, 1 and 9 seemingly have no purpose. Since optimization is disabled, this is acceptable.
However, when compiling while optimizing for size with -Os -s, the exact same code is produced. It seems rather wasteful to increase the size of the function by 66% with these options.
Why is the code not optimized to the following?
00000000 <add>:
0: 8b 45 0c mov eax,DWORD PTR [esp+0x8]
3: 03 45 08 add eax,DWORD PTR [esp+0x4]
6: c3 ret
Does the compiler just not consider this worth optimizing or is it related to other details like function alignment?
This is done to preserve the ability of the debugger to step through your code.
If you really want to disable this try -fomit-frame-pointer.
Compiling your above code using -Os -fomit-frame-pointer -S -masm=intel gave this:
.file "frame.c"
.intel_syntax noprefix
.text
.globl _add
.def _add; .scl 2; .type 32; .endef
_add:
mov eax, DWORD PTR [esp+8]
add eax, DWORD PTR [esp+4]
ret
.ident "GCC: (rev0, Built by MinGW-builds project) 4.8.0"
The value of EBP is not known when the function enters. Code could use mov eax,dword ptr [esp+8] and not bother with the BP register, but many debugging tools assume that each local variable is at a fixed offset relative to some register. Even if a compiler could keep track of things that were pushed on the stack and adjust indexing offsets appropriately, debuggers would likely be unable to do so.

What's the point of LEA EAX, [EAX]?

LEA EAX, [EAX]
I encountered this instruction in a binary compiled with the Microsoft C compiler. It clearly can't change the value of EAX. Then why is it there?
It is a NOP.
The following are typically used as a NOP. They all do the same thing but result in machine code of different lengths. Depending on the alignment requirement, one of them is chosen:
xchg eax, eax = 90
mov eax, eax = 89 C0
lea eax, [eax + 0x00] = 8D 40 00
From this article:
This trick is used by the MSVC++ compiler to emit NOP instructions of different length (for padding before jump targets). For example, MSVC++ generates the following code if it needs 4-byte and 6-byte padding:
8d6424 00        lea [ebx+00],ebx          ; 4-byte padding
8d9b 00000000    lea [esp+00000000],esp    ; 6-byte padding
The first line is marked as "npad 4" in assembly listings generated by the compiler, and the second as "npad 6". The registers (ebx, esp) can be chosen from the rarely used ones to avoid false dependencies in the code.
So this is just a kind of NOP, appearing right before targets of jmp instructions in order to align them.
Interestingly, you can identify the compiler from the characteristic nature of such instructions.
LEA EAX, [EAX]
Indeed, it doesn't change the value of EAX. As far as I understand, it's identical in function to:
MOV EAX, EAX
Did you see it in optimized code, or unoptimized code?
