Why does GCC produce stack preservation instructions when they're not necessary?

Why does GCC produce stack preservation instructions when they're not necessary? - c

I'm compiling the following simple demonstration function:
int add(int a, int b) {
return a + b;
}
Naturally this function would be inlined, but let's assume that it's dynamically linked or not inlined for some other reason. With optimization disabled, the compiler produces the expected code:
00000000 <add>:
0: 55 push ebp
1: 89 e5 mov ebp,esp
3: 8b 45 0c mov eax,DWORD PTR [ebp+0xc]
6: 03 45 08 add eax,DWORD PTR [ebp+0x8]
9: 5d pop ebp
a: c3 ret
Since there are no function calls inside this function, the instructions at 0, 1 and 9 seemingly have no purpose. Since optimization is disabled, this is acceptable.
However, when compiling while optimizing for size with -Os -s, the exact same code is produced. It seems rather wasteful to increase the size of the function by 66% with these options.
Why is the code not optimized to the following?
00000000 <add>:
0: 8b 45 0c mov eax,DWORD PTR [esp+0x8]
3: 03 45 08 add eax,DWORD PTR [esp+0x4]
6: c3 ret
Does the compiler just not consider this worth optimizing or is it related to other details like function alignment?

This is done to preserve the ability of the debugger to step through your code.
If you really want to disable this try -fomit-frame-pointer.
Compiling your above code using -Os -fomit-frame-pointer -S -masm=intel gave this:
.file "frame.c"
.intel_syntax noprefix
.text
.globl _add
.def _add; .scl 2; .type 32; .endef
_add:
mov eax, DWORD PTR [esp+8]
add eax, DWORD PTR [esp+4]
ret
.ident "GCC: (rev0, Built by MinGW-builds project) 4.8.0"

The value of EBP is not known when the function enters. Code could use mov eax,dword ptr [esp+8] and not bother with the BP register, but many debugging tools assume that each local variable is at a fixed offset relative to some register. Even if a compiler could keep track of things that were pushed on the stack and adjust indexing offsets appropriately, debuggers would likely be unable to do so.

Related

Why would gcc -O3 generate multiple ret instructions? [duplicate]

This question already has an answer here:
Why does GCC emit a repeated `ret`?
(1 answer)
Closed 2 months ago.
I was looking at some recursive function from here:
int get_steps_to_zero(int n)
{
if (n == 0) {
// Base case: we have reached zero
return 0;
} else if (n % 2 == 0) {
// Recursive case 1: we can divide by 2
return 1 + get_steps_to_zero(n / 2);
} else {
// Recursive case 2: we can subtract by 1
return 1 + get_steps_to_zero(n - 1);
}
}
I checked the disassembly in order to check if gcc managed tail-call optimization/unrolling. Looks like it did, though with x86-64 gcc 12.2 -O3 I get a function like this, ending with two ret instructions:
get_steps_to_zero:
xor eax, eax
test edi, edi
jne .L5
jmp .L6
.L10:
mov edx, edi
shr edx, 31
add edi, edx
sar edi
test edi, edi
je .L9
.L5:
add eax, 1
test dil, 1
je .L10
sub edi, 1
test edi, edi
jne .L5
.L9:
ret
.L6:
ret
Godbolt example.
What's the purpose of the multiple returns? Is it a bug?
EDIT
Seems like this appeared from gcc 11.x. When compiling under gcc 10.x, then the function ends like:
.L1:
mov eax, r8d
ret
.L6:
xor r8d, r8d
mov eax, r8d
ret
As in: store result in eax. The 11.x version instead zeroes eax in the beginning of the function then modifies it in the function body, eliminating the need for the extra mov instruction.

This is a manifestation of pass ordering problem. At some point in the optimization pipeline, the two basic blocks ending in ret are not equivalent, then some pass makes them equivalent, but no following pass is capable of collapsing the two equivalent blocks into one.
On Compiler Explorer, you can see how compiler optimization pipeline works by inspecting snapshots of internal representation between passes. For GCC, select "Add New > GCC Tree/RTL" in the compiler pane. Here's your example, with a snapshot immediately preceding the problematic transformation pre-selected in the new pane: https://godbolt.org/z/nTazM5zGG
Towards the end of the dump, you can see the two basic blocks:
65: NOTE_INSN_BASIC_BLOCK 8
77: use ax:SI
66: simple_return
and
43: NOTE_INSN_BASIC_BLOCK 9
5: ax:SI=0
38: use ax:SI
74: NOTE_INSN_EPILOGUE_BEG
75: simple_return
Basically the second block is different in that it sets eax to zero before returning. If you look at the next pass (called "jump2"), you see that it lifts the ax:SI=0 instruction from basic block 9 and basic block 3 to basic block 2, making BB 9 equivalent to BB 8.
If you disable this optimization with -fno-crossjumping, the difference will be carried to the end, making the resulting assembly less surprising.

Conclusion first: This is a deliberate optimization choice by GCC.
If you use GCC locally (gcc -O3 -S) instead of on Godbolt, you can see that there are alignment directives between the two ret instructions:
; top part omitted
.L9:
ret
.p2align 4,,10
.p2align 3
.L6:
ret
.cfi_endproc
The object file, when disassembled, includes an NOP in that padding area:
8: 75 13 jne 1d <get_steps_to_zero+0x1d>
a: eb 24 jmp 30 <get_steps_to_zero+0x30>
c: 0f 1f 40 00 nopl 0x0(%rax)
<...>
2b: 75 f0 jne 1d <get_steps_to_zero+0x1d>
2d: c3 ret
2e: 66 90 xchg %ax,%ax
30: c3 ret
The second ret instruction is aligned to a 16-byte boundary whereas the first one isn't. This allows the processor to load the instruction faster when used as a jump target from a distant source. Subsequent C return statements, however, are close enough to the first ret instruction such that they will not benefit from jumping to aligned targets.
This alignment is even more noticeable on my Zen 2 CPU with -mtune=native, with more padding bytes added:
29: 75 f2 jne 1d <get_steps_to_zero+0x1d>
2b: c3 ret
2c: 0f 1f 40 00 nopl 0x0(%rax)
30: c3 ret

Shellcode Segmentation Fault error when run from exploitable program

BITS 64
section .text
global _start
_start:
jmp short two
one:
pop rbx
xor al,al
xor cx,cx
mov al,8
mov cx,0755
int 0x80
xor al,al
inc al
xor bl,bl
int 0x80
two:
call one
db 'H'`
This is my assembly code.
Then I used two commands. "nasm -f elf64 newdir.s -o newdir.o" and "ld newdir.o -o newdir".I run ./newdir and worked fine but when I extracted op code and tried to test this shellcode using following c program . It is not working(no segmentation fault).I have compiled using cmd gcc newdir -z execstack
#include <stdio.h>
char sh[]="\xeb\x16\x5b\x30\xc0\x66\x31\xc9\xb0\x08\x66\xb9\xf3\x02\xcd\x80\x30\xc0\xfe\xc0\x30\xdb\xcd\x80\xe8\xe5\xff\xff\xff\x48";
void main(int argc, char **argv)
{
int (*func)();
func = (int (*)()) sh;
(int)(*func)();
}
objdump -d newdir
newdir: file format elf64-x86-64
Disassembly of section .text:
0000000000400080 <_start>:
400080: eb 16 jmp 400098 <two>
0000000000400082 <one>:
400082: 5b pop %rbx
400083: 30 c0 xor %al,%al
400085: 66 31 c9 xor %cx,%cx
400088: b0 08 mov $0x8,%al
40008a: 66 b9 f3 02 mov $0x2f3,%cx
40008e: cd 80 int $0x80
400090: 30 c0 xor %al,%al
400092: fe c0 inc %al
400094: 30 db xor %bl,%bl
400096: cd 80 int $0x80
0000000000400098 <two>:
400098: e8 e5 ff ff ff callq 400082 <one>
40009d: 48 rex.W
when I run ./a.out , I am getting something like in photo. I am attaching photo because I cant explain what is happening.image
P.S- My problem is resolved. But I wanted to know where things was going wrong. So I used debugger and the result is below
`
(gdb) list
1 char shellcode[] = "\xeb\x16\x5b\x30\xc0\x66\x31\xc9\xb0\x08\x66\xb9\xf3\x02\xcd\x80\x30\xc0\xfe\xc0\x30\xdb\xcd\x80\xe8\xe5\xff\xff\xff\x48";
2 int main (int argc, char **argv)
3 {
4 int (*ret)();
5 ret = (int(*)())shellcode;
6
7 (int)(*ret)();
8 } (gdb) disassemble main
Dump of assembler code for function main:
0x00000000000005fa <+0>: push %rbp
0x00000000000005fb <+1>: mov %rsp,%rbp
0x00000000000005fe <+4>: sub $0x20,%rsp
0x0000000000000602 <+8>: mov %edi,-0x14(%rbp)
0x0000000000000605 <+11>: mov %rsi,-0x20(%rbp)
0x0000000000000609 <+15>: lea 0x200a20(%rip),%rax # 0x201030 <shellcode>
0x0000000000000610 <+22>: mov %rax,-0x8(%rbp)
0x0000000000000614 <+26>: mov -0x8(%rbp),%rdx
0x0000000000000618 <+30>: mov $0x0,%eax
0x000000000000061d <+35>: callq *%rdx
0x000000000000061f <+37>: mov $0x0,%eax
0x0000000000000624 <+42>: leaveq
0x0000000000000625 <+43>: retq
End of assembler dump.
(gdb) b 7
Breakpoint 1 at 0x614: file test.c, line 7.
(gdb) run
Starting program: /root/Desktop/Progs/shell/a.out
Breakpoint 1, main (argc=1, argv=0x7fffffffe2b8) at test.c:7
7 (int)(*ret)();
(gdb) info registers rip
rip 0x555555554614 0x555555554614 <main+26>
(gdb) x/5i $rip
=> 0x555555554614 <main+26>: mov -0x8(%rbp),%rdx
0x555555554618 <main+30>: mov $0x0,%eax
0x55555555461d <main+35>: callq *%rdx
0x55555555461f <main+37>: mov $0x0,%eax
0x555555554624 <main+42>: leaveq
(gdb) s
(Control got stuck here, so i pressed ctrl+c)
^C
Program received signal SIGINT, Interrupt.
0x0000555555755048 in shellcode ()
(gdb) x/5i 0x0000555555755048
=> 0x555555755048 <shellcode+24>: callq 0x555555755032 <shellcode+2>
0x55555575504d <shellcode+29>: rex.W add %al,(%rax)
0x555555755050: add %al,(%rax)
0x555555755052: add %al,(%rax)
0x555555755054: add %al,(%rax)
Here is the debugging information. I am not able to find where the control goes wrong.If need more info please ask.

Below is a working example using x86-64; which could be further optimized for size. That last 0x00 null is ok for the purpose of executing the shellcode.
assemble & link:
$ nasm -felf64 -g -F dwarf pushpam_001.s -o pushpam_001.o && ld pushpam_001.o -o pushpam_001
Code:
BITS 64
section .text
global _start
_start:
jmp short two
one:
pop rdi ; pathname
xor rax, rax
add al, 85 ; creat syscall 64-bit Linux
xor rsi, rsi
add si, 0755 ; mode - octal
syscall
xor rax, rax
add ax, 60
xor rdi, rdi
syscall
two:
call one
db 'H',0
objdump:
pushpam_001: file format elf64-x86-64
0000000000400080 <_start>:
400080: eb 1c jmp 40009e <two>
0000000000400082 <one>:
400082: 5f pop rdi
400083: 48 31 c0 xor rax,rax
400086: 04 55 add al,0x55
400088: 48 31 f6 xor rsi,rsi
40008b: 66 81 c6 f3 02 add si,0x2f3
400090: 0f 05 syscall
400092: 48 31 c0 xor rax,rax
400095: 66 83 c0 3c add ax,0x3c
400099: 48 31 ff xor rdi,rdi
40009c: 0f 05 syscall
000000000040009e <two>:
40009e: e8 df ff ff ff 48 00
.....H.
encoding extraction: There are many other ways to do this.
$ for i in `objdump -d pushpam_001 | grep "^ " | cut -f2`; do echo -n '\x'$i; done; echo
\xeb\x1c\x5f\x48\x31\xc0\x04\x55\x48\x31\xf6\x66\x81\xc6\xf3\x02\x0f\x05\x48\x31\xc0\x66\x83\xc0\x3c\x48\x31\xff\x0f\x05\xe8\xdf\xff\xff\xff\x48\x00\x.....H.
C shellcode.c - partial
...
unsigned char code[] = \
"\xeb\x1c\x5f\x48\x31\xc0\x04\x55\x48\x31\xf6\x66\x81\xc6\xf3\x02\x0f\x05\x48\x31\xc0\x66\x83\xc0\x3c\x48\x31\xff\x0f\x05\xe8\xdf\xff\xff\xff\x48\x00";
...
final:
./shellcode
--wxrw---t 1 david david 0 Jan 31 12:25 H

If int 0x80 in 64-bit code was the only problem, building your C test with gcc -fno-pie -no-pie would have worked, because then char sh[] would be in the low 32 bits of virtual address space, so system calls that truncate pointers to 32 bits would still work.
Run your program under strace to see what system calls it actually makes. (Except that strace decodes int 0x80 syscalls incorrectly in 64-bit code, decoding as if you'd used the 64-bit syscall ABI. The call numbers and arg registers are different.) But at least you can see the system-call return values (which will be -EFAULT for 32-bit creat with a truncated 64-bit pointer.)
You can also just gdb to single-step and check the system call return values. Having strace decode the system-call inputs is really nice, though, so I'd recommend porting your code to use the 64-bit ABI, and then it would just work.
Also, it would actually be able to exploit 64-bit processes where the buffer overflow is in memory at an address outside the low 32 bits. (e.g. like the stack). So yes, you should really stop using int 0x80 or stick to 32-bit code.
You're also depending on registers being zeroed before your code runs, like they are on process startup, but not when called from anywhere else.
xor al,al before mov al,8 is completely pointless, because xor-zeroing al doesn't clear upper bytes. Writing 32-bit registers clears the upper 32, but not writing 8 or 16 bit registers. And if it did, you wouldn't need the xor-zeroing before using mov which is also write-only.
If you want to set RAX=8 without any zero bytes in the machine code, you can
push 8 / pop rax (3 bytes)
xor eax,eax / mov al,8 (4 bytes)
Or given a zeroed rcx register, lea eax, [rcx+8] (3 bytes)
Setting CX to 0755 isn't so simple, because the constant doesn't fit in an imm8. Your 16-bit mov is a good choice (or would have been if you'd zeroed rcx first.
xor ecx,ecx
lea eax, [rcx+8] ; SYS_creat = 8 from unistd_32.h
mov cx, 0755 ; mode
int 0x80 ; invoke 32-bit ABI
xor ebx,ebx
lea eax, [rbx+1] ; SYS_exit = 1
int 0x80

Compiler using local variables without adjusting RSP

In question Compilers: Understanding assembly code generated from small programs the compiler uses two local variables without adjusting the stack pointer.
Not adjusting RSP for the use of local variables seems not interrupt safe and so the compiler seems to rely on the hardware automatically switching to a system stack when interrupts occur. Otherwise, the first interrupt that came along would push the instruction pointer onto the stack and would overwrite the local variable.
The code from that question is:
#include <stdio.h>
int main()
{
for(int i=0;i<10;i++){
int k=0;
}
}
The assembly code generated by that compiler is:
00000000004004d6 <main>:
4004d6: 55 push rbp
4004d7: 48 89 e5 mov rbp,rsp
4004da: c7 45 f8 00 00 00 00 mov DWORD PTR [rbp-0x8],0x0
4004e1: eb 0b jmp 4004ee <main+0x18>
4004e3: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0
4004ea: 83 45 f8 01 add DWORD PTR [rbp-0x8],0x1
4004ee: 83 7d f8 09 cmp DWORD PTR [rbp-0x8],0x9
4004f2: 7e ef jle 4004e3 <main+0xd>
4004f4: b8 00 00 00 00 mov eax,0x0
4004f9: 5d pop rbp
4004fa: c3 ret
The local variables are i at [rbp-0x8] and k at [rbp-0x4].
Can anyone shine light on this interrupt problem? Does the hardware indeed switch to a system stack? How? Am I wrong in my understanding?

This is the so called "red zone" of the x86-64 ABI. A summary from wikipedia:
In computing, a red zone is a fixed-size area in a function's stack frame beyond the current stack pointer which is not preserved by that function. The callee function may use the red zone for storing local variables without the extra overhead of modifying the stack pointer. This region of memory is not to be modified by interrupt/exception/signal handlers. The x86-64 ABI used by System V mandates a 128-byte red zone which begins directly under the current value of the stack pointer.
In 64-bit Linux user code it is OK, as long as no more than 128 bytes are used. It is an optimization used most prominently by leaf-functions, i.e. functions which don't call other functions,
If you were to compile the example program as a 64-bit Linux program with GCC (or compatible compiler) using the -mno-red-zone option you'd see code like this generated:
main:
push rbp
mov rbp, rsp
sub rsp, 16; <<============ Observe RSP is now being adjusted.
mov DWORD PTR [rbp-4], 0
.L3:
cmp DWORD PTR [rbp-4], 9
jg .L2
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-4], 1
jmp .L3
.L2:
mov eax, 0
leave
ret
This code generation can be observed at this godbolt.org link.
For a 32-bit Linux user program it would be a bad thing not to adjust the stack pointer. If you were to compile the code in the question as 32-bit code (using -m32 option) main would appear something like the following code:
main:
push ebp
mov ebp, esp
sub esp, 16; <<============ Observe ESP is being adjusted.
mov DWORD PTR [ebp-4], 0
.L3:
cmp DWORD PTR [ebp-4], 9
jg .L2
mov DWORD PTR [ebp-8], 0
add DWORD PTR [ebp-4], 1
jmp .L3
.L2:
mov eax, 0
leave
ret
This code generation can be observed at this gotbolt.org link.

does gcc preserve callee save registers

As you might have guessed the question is does gcc automaticly saves callee-save registers or should I do it by myself? I thought that gcc would do that for me but when I wrote this code
void foo(void) {
__asm__ volatile ("mov $123, %rbx");
}
void main(void) {
foo();
}
after gcc a.c && objdump -d a.out I saw this
00000000004004f6 <foo>:
4004f6: 55 push %rbp
4004f7: 48 89 e5 mov %rsp,%rbp
4004fa: 48 c7 c3 7b 00 00 00 mov $0x7b,%rbx
400501: 90 nop
400502: 5d pop %rbp
400503: c3 retq
0000000000400504 <main>:
400504: 55 push %rbp
400505: 48 89 e5 mov %rsp,%rbp
400508: e8 e9 ff ff ff callq 4004f6 <foo>
40050d: 90 nop
40050e: 5d pop %rbp
40050f: c3 retq
According to x86-64 ABI %rbx is a callee-save register but in this code gcc did not save it in foo before modifying. Is it just because I don't use %rbx in main function after calling foo() or it's bacause gcc does not provide such garanties and I have to save it by myself in foo before modifying?

Gcc will automatically save and restore all callee-save registers THAT IT KNOWS ARE USED. It knows about registers it uses itself, but it will only know about registers used in inline assembly if you tell it. That's what the 'clobbers' list is for:
void foo(void) {
__asm__ volatile ("mov $123, %%rbx" : : : "rbx");
}
Now the compiler knows that you are using/modifying rbx, so it will save it if it needs to.
Note that you really want to do it this way rather than trying to save it yourself, as this way it will only be saved once if gcc also wants to use the register for something in this function.

It would be pretty boggy code that saves and restores every register by rote. The compiler saves registers within the C code it has compiled, but you are on your own here, gcc has no idea what your intentions are.
The assembler allows you to get under the hood, but it won't replace the spark plugs for you.

The addressing of stack variables

I'm analyzing the disassembly of the following (very simple) C program in GDB on X86_64.
int main()
{
int a = 5;
int b = a + 6;
return 0;
}
I understand that in X86_64 the stack grows down. That is the top of the stack has a lower address than the bottom of the stack. The assembler from the above program is as follows:
Dump of assembler code for function main:
0x0000000000400474 <+0>: push %rbp
0x0000000000400475 <+1>: mov %rsp,%rbp
0x0000000000400478 <+4>: movl $0x5,-0x8(%rbp)
0x000000000040047f <+11>: mov -0x8(%rbp),%eax
0x0000000000400482 <+14>: add $0x6,%eax
0x0000000000400485 <+17>: mov %eax,-0x4(%rbp)
0x0000000000400488 <+20>: mov $0x0,%eax
0x000000000040048d <+25>: leaveq
0x000000000040048e <+26>: retq
End of assembler dump.
I understand that:
We push the base pointer on the stack.
We then copy the value of the stack pointer to the base pointer.
We then copy the value 5 into the address -0x8(%rbp). Since in an int is 4 bytes shouldn't this be at next address in the stack which is -0x4(%rbp) rather than -0x8(%rbp)?.
We then copy the value at the variable a into %eax, add 6 and then copy the value into the address at -0x4(%rbp).
Using the this graphic for reference:
(source: thegreenplace.net)
it looks like the stack has the following contents:
|--------------|
| rbp | <-- %rbp
| 11 | <-- -0x4(%rbp)
| 5 | <-- -0x8(%rbp)
when I was expecting this:
|--------------|
| rbp | <-- %rbp
| 5 | <-- -0x4(%rbp)
| 11 | <-- -0x8(%rbp)
which seems to be the case in 7-understanding-c-by-learning-assembly where they show the assembly:
(gdb) disassemble
Dump of assembler code for function main:
0x0000000100000f50 <main+0>: push %rbp
0x0000000100000f51 <main+1>: mov %rsp,%rbp
0x0000000100000f54 <main+4>: mov $0x0,%eax
0x0000000100000f59 <main+9>: movl $0x0,-0x4(%rbp)
0x0000000100000f60 <main+16>: movl $0x5,-0x8(%rbp)
0x0000000100000f67 <main+23>: mov -0x8(%rbp),%ecx
0x0000000100000f6a <main+26>: add $0x6,%ecx
0x0000000100000f70 <main+32>: mov %ecx,-0xc(%rbp)
0x0000000100000f73 <main+35>: pop %rbp
0x0000000100000f74 <main+36>: retq
End of assembler dump.
Why is the value of b is being put into a higher memory address in the stack than a when a is clearly declared and initialized first?

The value of b is put on the stack wherever the compiler feels like it. You have no influence over it. And you shouldn't. It's possible that the order will change between minor versions of the compiler because some internal data structure was changed or some code rearranged. Some compilers will even randomize the layout of the stack on different compilations on purpose because it can make certain bugs harder to exploit.
In fact, the compiler might not use the stack at all. There's no need to. Here's the disassembly of the same program compiled with some optimizations enabled:
$ cat > foo.c
int main()
{
int a = 5;
int b = a + 6;
return 0;
}
$ cc -O -c foo.c
$ objdump -S foo.o
foo.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 31 c0 xor %eax,%eax
2: c3 retq
$
With some simple optimizations the compiler figured out that you don't use the variable 'b', so there's no need to calculate it. And because of that you don't use the variable 'a' either, so there's no need to assign it. Only a compilation with no optimizations (or a very bad compiler) will put anything on the stack here. And even if you use the values basic optimizations will put them into registers because touching the stack is expensive.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight