Allocated space size on the stack for an array in C

I have a simple program called demo.c which allocates space for a char array with a length of 8 on the stack:
#include <stdio.h>

int main(void)
{
    char buffer[8];
    return 0;
}
I thought that 8 bytes would be allocated from the stack for the eight chars, but if I check this in gdb, there are 10 bytes subtracted from the stack.
I compile the program with this command on my 32-bit Ubuntu machine:
$ gcc -ggdb -o demo demo.c
Then I analyze the program with:
$ gdb demo
(gdb) disassemble main
Dump of assembler code for function main:
0x08048404 <+0>: push %ebp
0x08048405 <+1>: mov %esp,%ebp
0x08048407 <+3>: and $0xfffffff0,%esp
0x0804840a <+6>: sub $0x10,%esp
0x0804840d <+9>: mov %gs:0x14,%eax
0x08048413 <+15>: mov %eax,0xc(%esp)
0x08048417 <+19>: xor %eax,%eax
0x08048419 <+21>: mov $0x0,%eax
0x0804841e <+26>: mov 0xc(%esp),%edx
0x08048422 <+30>: xor %gs:0x14,%edx
0x08048429 <+37>: je 0x8048430 <main+44>
0x0804842b <+39>: call 0x8048340 <__stack_chk_fail@plt>
0x08048430 <+44>: leave
0x08048431 <+45>: ret
End of assembler dump.
The line 0x0804840a <+6>: sub $0x10,%esp says that there are 10 bytes allocated from the stack, right?
Why are there 10 bytes allocated and not 8?

No, 0x10 means it's hexadecimal, i.e. 10 in base 16, which is 16 bytes in decimal.
Probably due to alignment requirements for the stack.
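If you want to double-check the arithmetic, a trivial one-liner does it:

#include <stdio.h>

int main(void)
{
    printf("%d\n", 0x10);   /* 0x10 = 1*16 + 0, so this prints 16 */
    return 0;
}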

Please note that the constant $0x10 is hexadecimal; it is equal to 16 bytes.
Take a look at the machine code:
0x08048404 <+0>: push %ebp
0x08048405 <+1>: mov %esp,%ebp
0x08048407 <+3>: and $0xfffffff0,%esp
0x0804840a <+6>: sub $0x10,%esp
...
0x08048430 <+44>: leave
0x08048431 <+45>: ret
As you can see, before we subtract 16 from esp we first make sure esp points to a 16-byte-aligned address (take a look at the and $0xfffffff0,%esp instruction).
I guess the compiler tries to respect that alignment, so it simply reserves 16 bytes as well. It does not matter anyway, because 8 bytes fit into 16 bytes just fine.

sub $0x10, %esp says that 16 bytes are reserved on the stack, not 10, since 0x denotes hexadecimal notation.
The amount of stack space is completely dependent on the compiler. In this case it's most likely an alignment issue: the alignment is 16 bytes and you've requested 8, so it gets increased to 16.
If you requested 17 bytes, it would most likely have been sub $0x20, %esp, i.e. 32 bytes instead of 17.
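You can verify the rounding yourself with a slightly bigger array; a sketch (the file name and the expected 0x20 are assumptions based on the 16-byte alignment described above, not guaranteed output):

/* demo17.c - request 17 bytes and watch the adjustment round up */
int main(void)
{
    char buffer[17];
    buffer[0] = 'A';   /* touch the array so it isn't dropped entirely */
    return 0;
}

/* $ gcc -m32 -O0 -o demo17 demo17.c
 * $ gdb -batch -ex 'disassemble main' demo17
 * Expect something like "sub $0x20,%esp": a multiple of 16, not 17. */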

(I skipped over some things the other answers explain in more detail).
You compiled with -O0, so gcc is operating in a super-simple way that tells you something about compiler internals, but little about how to make good code from C.
gcc is keeping the stack 16B-aligned at all times. The 32bit SysV ABI only guarantees 4B stack alignment, but GNU/Linux systems actually assume and maintain gcc's default -mpreferred-stack-boundary=4 (16B-aligned).
Your version of gcc also defaults to using -fstack-protector, so it checks for stack-smashing in functions with local char arrays of 4 or more elements:
-fstack-protector
Emit extra code to check for buffer overflows, such as stack smashing attacks. This is done by adding a guard variable to functions with vulnerable objects. This includes functions that call "alloca", and functions with buffers larger than 8 bytes. The guards are initialized when a function is entered and then checked when the function exits. If a guard check fails, an error message is printed and the program exits.
For some reason, this is actually kicking in with char arrays >= 4B, but not with integer arrays. (At least, not when they're unused!). char pointers can alias anything, which may have something to do with it.
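One way to probe this yourself is to compile a pair of small functions and grep for the canary check; a sketch (the function names are mine, and the exact threshold depends on the gcc build and its ssp-buffer-size parameter):

/* canary.c - which locals trigger -fstack-protector? */
void with_char_array(void) { volatile char c[4]; c[0] = 0; }
void with_int_array(void)  { volatile int  i[4]; i[0] = 0; }

/* $ gcc -m32 -S canary.c
 * $ grep __stack_chk_fail canary.s
 * Functions that received a canary call __stack_chk_fail in their
 * epilogue; compare which of the two shows the check. */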
See the code on godbolt, with asm output. Note how main is special: it uses andl $-16, %esp to align the stack on entry to main, but other functions assume the stack was 16B-aligned before the call instruction that called them. So they'll typically sub $24, %esp, after pushing %ebp. (%ebp and the return address are 8B total, so the stack is 8B away from being 16B-aligned). This leaves room for the stack-protector canary.
The 32bit SysV ABI only requires arrays to be aligned to the natural alignment of their elements, so this 16B alignment for the char array is just what the compiler decided to do in this case, not something you can count on.
The 64-bit ABI is different:
An array uses the same alignment as its elements, except that a local or global array variable of length at least 16 bytes or a C99 variable-length array variable always has alignment of at least 16 bytes.
(links from the x86 tag wiki)
So you can count on char buf[1024] being 16B-aligned on SysV, allowing you to use SSE aligned loads/stores on it.
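For example, a minimal sketch assuming a 64-bit SysV target with SSE2: the aligned store below is only safe because the ABI promises 16-byte alignment for a local array this large.

#include <emmintrin.h>   /* SSE2 intrinsics */

void zero_head(void)
{
    char buf[1024];      /* >= 16 bytes, so 16B-aligned per the 64-bit ABI */

    /* movdqa: an aligned store that would fault on a misaligned address */
    _mm_store_si128((__m128i *)buf, _mm_setzero_si128());
}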

Related

Difference in x86-32 and x64 Assembly stack allocation for a fixed-size buffer with unoptimized C (GCC)

I'm doing some basic disassembly and have noticed that the buffer is being given additional space for some reason, although the tutorial I am following uses the same code and its buffer gets only the requested 500 chars. Why is this?
My code:
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    char buffer[500];
    strcpy(buffer, argv[1]);
    return 0;
}
Compiled with GCC, the disassembled code is:
0x0000000000001139 <+0>: push %rbp
0x000000000000113a <+1>: mov %rsp,%rbp
0x000000000000113d <+4>: sub $0x210,%rsp
0x0000000000001144 <+11>: mov %edi,-0x204(%rbp)
0x000000000000114a <+17>: mov %rsi,-0x210(%rbp)
0x0000000000001151 <+24>: mov -0x210(%rbp),%rax
0x0000000000001158 <+31>: add $0x8,%rax
0x000000000000115c <+35>: mov (%rax),%rdx
0x000000000000115f <+38>: lea -0x200(%rbp),%rax
0x0000000000001166 <+45>: mov %rdx,%rsi
0x0000000000001169 <+48>: mov %rax,%rdi
0x000000000000116c <+51>: call 0x1030 <strcpy@plt>
0x0000000000001171 <+56>: mov $0x0,%eax
0x0000000000001176 <+61>: leave
0x0000000000001177 <+62>: ret
However, the video at https://www.youtube.com/watch?v=1S0aBV-Waeo clearly shows only 500 bytes being allocated.
Why is this the case? The only difference I can see is that the video's target is 32-bit and mine is x86-64.
500 is not a multiple of 16.
The x86-64 ABI (application binary interface) requires the stack pointer to be a multiple of 16 whenever a call instruction is about to happen. (Since call pushes an 8-byte return address, this means the stack pointer is always congruent to 8, mod 16, when control reaches the first instruction of a called function.) For the code shown, it is convenient for the compiler to achieve this requirement by increasing the value it uses in the sub instruction, making it be a multiple of 16.
The x86-32 ABI did not make this requirement, so there was no reason for the compiler used in the video to increase the size of the stack frame.
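Here is the arithmetic for the frame you got, based on my reading of the offsets in your dump (treat the breakdown as an interpretation):

#include <stdio.h>

int main(void)
{
    /* buffer sits at -0x200(%rbp): 500 rounded up to 512.
     * The argc/argv spill slots at -0x204 and -0x210 add 16 more. */
    unsigned requested = 500;
    unsigned rounded   = (requested + 15) & ~15u;   /* round up to 16 */
    printf("%u -> %u, frame = %#x\n", requested, rounded, rounded + 16);
    /* prints: 500 -> 512, frame = 0x210 */
    return 0;
}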
Note that you appear to have compiled your code without optimization. I get this at -O2:
0x0000000000000000 <+0>: sub $0x208,%rsp
0x0000000000000007 <+7>: mov 0x8(%rsi),%rsi
0x000000000000000b <+11>: mov %rsp,%rdi
0x000000000000000e <+14>: call <strcpy@PLT>
0x0000000000000013 <+19>: xor %eax,%eax
0x0000000000000015 <+21>: add $0x208,%rsp
0x000000000000001c <+28>: ret
The stack adjustment is still somewhat larger than the size of the array, but not as big as what you had, and no longer a multiple of 16; the difference is that with optimization on, the frame pointer is eliminated, so %rbp does not need to be saved and restored, and so the stack pointer is not a multiple of 16 at the point of the sub instruction.
(Incidentally, there is no requirement anywhere for a stack frame to be as small as possible. "Quality of implementation" dictates that it should be, but for various reasons compilers quite often miss that target. In my optimized dump, I don't see any reason why the immediate operand to sub and add couldn't have been 0x1f8 (504).)
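To check that 0x208 still keeps the call site aligned (the point above about the missing frame pointer), here is the arithmetic as a quick sanity check; the 8-byte offset is the return address pushed by call:

#include <assert.h>

int main(void)
{
    /* On entry, the call that reached the function pushed an 8-byte
     * return address, so %rsp % 16 == 8.  After sub $0x208:
     * (8 + 0x208) % 16 == 0, i.e. aligned again for the strcpy call. */
    assert((8 + 0x208) % 16 == 0);
    return 0;
}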

Why does the compiler generate such code when initializing a volatile array?

I have the following program that enables the alignment check (AC) bit in the x86 processor flags register in order to trap unaligned memory accesses. Then the program declares two volatile variables:
#include <assert.h>

int main(void)
{
#ifndef NOASM
    __asm__(
        "pushf\n"
        "orl $(1<<18),(%esp)\n"
        "popf\n"
    );
#endif
    volatile unsigned char foo[] = { 1, 2, 3, 4, 5, 6 };
    volatile unsigned int bar = 0xaa;
    return 0;
}
If I compile this, the generated code initially does the obvious things, like setting up the stack and creating the array of chars by moving the values 1, 2, 3, 4, 5, 6 onto the stack:
/tmp ➤ gcc test3.c -m32
/tmp ➤ gdb ./a.out
(gdb) disassemble main
0x0804843d <+0>: push %ebp
0x0804843e <+1>: mov %esp,%ebp
0x08048440 <+3>: and $0xfffffff0,%esp
0x08048443 <+6>: sub $0x20,%esp
0x08048446 <+9>: mov %gs:0x14,%eax
0x0804844c <+15>: mov %eax,0x1c(%esp)
0x08048450 <+19>: xor %eax,%eax
0x08048452 <+21>: pushf
0x08048453 <+22>: orl $0x40000,(%esp)
0x0804845a <+29>: popf
0x0804845b <+30>: movb $0x1,0x16(%esp)
0x08048460 <+35>: movb $0x2,0x17(%esp)
0x08048465 <+40>: movb $0x3,0x18(%esp)
0x0804846a <+45>: movb $0x4,0x19(%esp)
0x0804846f <+50>: movb $0x5,0x1a(%esp)
0x08048474 <+55>: movb $0x6,0x1b(%esp)
0x08048479 <+60>: mov 0x16(%esp),%eax
0x0804847d <+64>: mov %eax,0x10(%esp)
0x08048481 <+68>: movzwl 0x1a(%esp),%eax
0x08048486 <+73>: mov %ax,0x14(%esp)
0x0804848b <+78>: movl $0xaa,0xc(%esp)
0x08048493 <+86>: mov $0x0,%eax
0x08048498 <+91>: mov 0x1c(%esp),%edx
0x0804849c <+95>: xor %gs:0x14,%edx
0x080484a3 <+102>: je 0x80484aa <main+109>
0x080484a5 <+104>: call 0x8048310 <__stack_chk_fail@plt>
0x080484aa <+109>: leave
0x080484ab <+110>: ret
However, at main+60 it does something strange: it copies the 6-byte array to another part of the stack, moving the data through registers a few bytes at a time. But the bytes start at offset 0x16, which is not word-aligned, so the program will crash when attempting to perform the 4-byte mov.
So I've two questions:
Why is the compiler emitting code to copy the array to another part of the stack? I assumed volatile would skip every optimization and always perform memory accesses. Maybe volatile vars are required to always be accessed as whole words, and so the compiler would always use temporary registers to read/write whole words?
Why does the compiler not put the char array at an aligned address if it later intends to perform these mov instructions? I understand that x86 normally copes with unaligned accesses, and on modern processors they do not even carry a performance penalty; however, in all other instances I see the compiler trying to avoid generating unaligned accesses, since they are, AFAIK, undefined behavior in C. My guess is that, since it later provides a properly aligned pointer for the copied array on the stack, the compiler just does not care about the alignment of data used only for initialization, in a way that is invisible to the C program?
If my hypotheses above are right, it means that I cannot expect an x86 compiler to always generate aligned accesses, even if the compiled code never attempts to perform unaligned accesses itself, and so setting the AC flag is not a practical way to detect parts of the code where unaligned accesses are performed.
EDIT: After further research I can answer most of this myself. In an attempt to make progress, I added an option in Redis to set the AC flag and otherwise run normally. I found that this approach is not viable: the process immediately crashes inside libc: __mempcpy_sse2 () at ../sysdeps/x86_64/memcpy.S:83. I assume that the whole x86 software stack simply does not really care about misalignment since it is handled very well by this architecture. Thus it is not practical to run with the AC flag set.
So the answer to question 2 above is that, like the rest of the software stack, the compiler is free to do as it pleases, and relocate things on the stack without caring about alignment, so long as the behavior is correct from the perspective of the C program.
The only question left to answer is why, when volatile is used, a copy is made in a different part of the stack. My best guess is that the compiler is attempting to access whole words in variables declared volatile even during initialization (imagine if this address were mapped to an I/O port), but I'm not sure.
You're compiling without optimization, so the compiler is generating straight-forward code without worrying about how inefficient it is. So it first creates the initializer { 1, 2, 3, 4, 5, 6 } in temp space on the stack, and it then copies that into the space it allocated for foo.
The compiler populates the array in a working storage area one byte at a time, which is not atomic. It then moves the whole array to its final resting place using word-sized mov instructions (on x86 such a load or store is naturally atomic when the target address is naturally aligned).
The write has to be atomic because the compiler must assume (due to the volatile keyword) that the array can be accessed at any time by anyone else.
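If AC-flag trapping is impractical (see the edit in the question above), a manual check is still possible; a minimal sketch, where check_aligned is a hypothetical helper and not a standard function:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: assert that p is aligned well enough to be
 * accessed as an object with the given alignment requirement. */
static inline void check_aligned(const volatile void *p, size_t align)
{
    assert(((uintptr_t)p % align) == 0);
}

/* usage, e.g. before a word-sized access:
 *   check_aligned(&bar, _Alignof(unsigned int)); */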

Causes and benefits of this difference between gcc versions >= 4.9.0 and < 4.9?

I have recently been exploiting a vulnerable program and found something interesting about the difference between gcc versions on the x86-64 architecture.
Note:
Wrongful usage of gets is not the issue here.
If we replace gets with any other functions, the problem doesn't change.
This is the source code I use:
#include <stdio.h>

int main()
{
    char buf[16];
    gets(buf);
    return 0;
}
I use gcc.godbolt.org to disassemble the program with the flags -m32 -fno-stack-protector -z execstack -g.
In the disassembled code, gcc with version >= 4.9.0 gives:
lea ecx, [esp+4] # begin of main
and esp, -16
push DWORD PTR [ecx-4] # push esp
push ebp
mov ebp, esp
/* between these comment is not related to the question
push ecx
sub esp, 20
sub esp, 12
lea eax, [ebp-24]
push eax
call gets
add esp, 16
mov eax, 0
*/
mov ebp, esp
mov ecx, DWORD PTR [ebp-4] # ecx = saved esp
leave
lea esp, [ecx-4]
ret # end of main
But gcc with version < 4.9.0 gives just:
push ebp # begin of main
mov ebp, esp
/* between these comment is not related to the question
and esp, -16
sub esp, 32
lea eax, [esp+16]
mov DWORD PTR [esp], eax
call gets
mov eax, 0
*/
leave
ret # end of main
My question is: what causes this difference in the disassembled code, and what are its benefits? Is there a name for this technique?
I can't say for sure without the actual values in:
and esp, 0xXX # XX is a number
but this looks a lot like extra code to align the stack to a larger value than the ABI requires.
Edit: The value is -16, which is 32-bit 0xFFFFFFF0 or 64-bit 0xFFFFFFFFFFFFFFF0 so this is indeed stack alignment to 16 bytes, likely meant for use of SSE instructions. As mentioned in comments, there is more code in the >= 4.9.0 version because it also aligns the frame pointer instead of only the stack pointer.
The i386 ABI, used for 32-bit programs, requires that a process, immediately after being loaded, has its stack aligned on 32-bit values:

%esp: Performing its usual job, the stack pointer holds the address of the bottom of the stack, which is guaranteed to be word aligned.
Compare this with the x86_64 ABI1 used for 64-bit programs:

%rsp: The stack pointer holds the address of the byte with lowest address which is part of the stack. It is guaranteed to be 16-byte aligned at process entry.
The opportunity offered by AMD's new 64-bit technology to rewrite the old i386 ABI allowed a number of optimizations that backward compatibility had previously ruled out, among them a bigger (stricter) stack alignment.
I won't dwell on the benefits of stack alignment; suffice it to say that if a 4-byte alignment is good, a 16-byte one is even better, so much so that it is worth spending some instructions to align the stack.
That's what GCC 4.9.0+ does: it aligns the stack to 16 bytes.
That explains the and esp, -16 but not the other instructions.
Aligning the stack with and esp, -16 is the fastest way to do it when the compiler only knows that the stack is 4-byte aligned (since esp MOD 16 can be 0, 4, 8 or 12).
However it is a destructive method, the compiler loses the original esp value.
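In C terms, the masking that and esp, -16 performs looks like this (just to illustrate the arithmetic; the address is made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uintptr_t esp     = 0xffffd01c;            /* hypothetical value */
    uintptr_t aligned = esp & (uintptr_t)-16;  /* clear the low 4 bits:
                                                  drops 0, 4, 8 or 12 */
    printf("%#lx -> %#lx\n", (unsigned long)esp, (unsigned long)aligned);
    /* prints: 0xffffd01c -> 0xffffd010 */
    return 0;
}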
But now comes a chicken-and-egg problem: if we save the original esp on the stack before aligning the stack, we lose it, because we don't know how far the alignment lowers the stack pointer. And if we try to save it after the alignment, well, we can't: it was lost in the alignment.
So the only possible solution is to save it in a register, align the stack and then save said register on the stack.
;Save the stack pointer in ECX (actually ESP+4, but that still works)
lea ecx, [esp+4] #ECX = ESP+4
;Align the stack
and esp, -16 #This lowers ESP by 0, 4, 8 or 12
;IGNORE THIS FOR NOW
push DWORD PTR [ecx-4]
;Usual prolog
push ebp
mov ebp, esp
;Save the original ESP (before alignment), actually is ESP+4 but OK
push ecx
GCC saves esp+4 in ecx; I don't know why2, but this value still does the trick.
The only mystery left is the push DWORD PTR [ecx-4].
But it turns out to be a simple mystery: for debugging purposes, GCC pushes a copy of the return address just before the old frame pointer (before push ebp), because this is where 32-bit tools expect to find it.
Since ecx=esp_o+4, where esp_o is the original stack pointer pre-alignment, [ecx-4] = [esp_o] = return address.
Note that the stack is now at 12 bytes modulo 16, so the local variable area must have size 16*k+4 for the stack to be 16-byte aligned again.
In your example k is 1 and the area is 20 bytes in size.
The subsequent sub esp, 12 is to align the stack for the gets function (the requirement is to have the stack aligned at the function call).
Finally, the code
mov ebp, esp
mov ecx, DWORD PTR [ebp-4] # ecx = saved esp
leave
lea esp, [ecx-4]
ret
The first instruction is a copy-paste error.
One could check it, or simply reason that if it were really executed there, [ebp-4] would be below the stack pointer (and there is no red zone in the i386 ABI).
The rest just undoes what was done in the prologue:
;Get the original stack pointer
mov ecx, DWORD PTR [ebp-4] ;ecx = esp_o+4
;Standard epilog
leave ;mov esp, ebp / pop ebp
;The stack pointer points to the copied return address
;Restore the original stack pointer
lea esp, [ecx-4] ;esp = esp_o
ret
GCC first has to fetch the original stack pointer (+4) saved on the stack, then restore the old frame pointer (ebp), and finally restore the original stack pointer.
The return address is on top of the stack when lea esp, [ecx-4] is executed, so in theory GCC could just return; but it has to restore the original esp, because main is not the first function executed in a C program, and it cannot leave the stack unbalanced.
1 This is not the latest version but the text quoted went unchanged in the successive editions.
2 This has been discussed here on SO but I can't remember if in some comment or in an answer.
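If you want to experiment, a sketch: telling gcc that 4-byte stack alignment is enough should make the whole lea ecx / and esp / push dance disappear (flag behavior varies across gcc versions, so treat the expected output as an assumption):

/* align_demo.c */
#include <stdio.h>

int main(void)
{
    char buf[16];
    fgets(buf, sizeof buf, stdin);   /* stand-in for gets */
    return 0;
}

/* $ gcc -m32 -fno-stack-protector -S align_demo.c
 *     -> prologue realigns: lea ecx,[esp+4] / and esp,-16 / ...
 * $ gcc -m32 -fno-stack-protector -mpreferred-stack-boundary=2 -S align_demo.c
 *     -> plain push ebp / mov ebp,esp prologue, no realignment */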

Smashing the stack, example 3, à la Aleph One

I've reproduced Example 3 from Smashing the Stack for Fun and Profit on Linux x86-64. However, I'm having trouble understanding the correct number of bytes that should be added to the return address in order to skip past the instruction:
0x0000000000400595 <+35>: movl $0x1,-0x4(%rbp)
which is where I think the x = 1 instruction is. I've written the following:
#include <stdio.h>

void fn(int a, int b, int c) {
    char buf1[5];
    char buf2[10];
    int *ret;

    ret = (int *)(buf1 + 24);  /* cast: buf1 + 24 is a char *, not an int * */
    (*ret) += 7;
}

int main() {
    int x;

    x = 0;
    fn(1, 2, 3);
    x = 1;
    printf("%d\n", x);
}
and disassembled it in gdb. I have disabled address randomization and compiled the program with the -fno-stack-protector option.
Question 1
I can see from the disassembler output below that I want to skip past the instruction at address 0x0000000000400595: both the return address from callq <fn> and the address of the movl instruction. Therefore, if the return address is 0x0000000000400595, and the next instruction is 0x000000000040059c, I should add 7 bytes to the return address?
0x0000000000400572 <+0>: push %rbp
0x0000000000400573 <+1>: mov %rsp,%rbp
0x0000000000400576 <+4>: sub $0x10,%rsp
0x000000000040057a <+8>: movl $0x0,-0x4(%rbp)
0x0000000000400581 <+15>: mov $0x3,%edx
0x0000000000400586 <+20>: mov $0x2,%esi
0x000000000040058b <+25>: mov $0x1,%edi
0x0000000000400590 <+30>: callq 0x40052d <fn>
0x0000000000400595 <+35>: movl $0x1,-0x4(%rbp)
0x000000000040059c <+42>: mov -0x4(%rbp),%eax
0x000000000040059f <+45>: mov %eax,%esi
0x00000000004005a1 <+47>: mov $0x40064a,%edi
0x00000000004005a6 <+52>: mov $0x0,%eax
0x00000000004005ab <+57>: callq 0x400410 <printf@plt>
0x00000000004005b0 <+62>: leaveq
0x00000000004005b1 <+63>: retq
Question 2
I notice that I can add 5 bytes to the return address instead of 7 and achieve the same result. When I do so, am I not jumping into the middle of the instruction movl $0x1,-0x4(%rbp) at <+35>? In which case, why does this not crash the program, unlike when I add 6 bytes instead of 5 or 7?
Question 3
Just before buffer1[] on the stack is SFP, and before it, the return address. That is 4 bytes past the end of buffer1[]. But remember that buffer1[] is really 2 words, so it's 8 bytes long. So the return address is 12 bytes from the start of buffer1[].
In the example by Aleph 1, he/she calculates the offset of the return address as 12 bytes from the start of buffer1[]. Since I am on x86_64, and not x86_32, I need to recalculate the offset to the return address. When on x86_64, is it the case that buffer1[] is still 2 words, which is 16 bytes; and the SFP and return address are 8 bytes each (as we're on 64 bit) and therefore the return address is at: buf1 + (8 * 2) + 8 which is equivalent to buf1 + 24?
The first, and very important, thing to note: all numbers and offsets are very compiler-dependent. Different compilers, and even the same compiler with different settings, can produce drastically different assembly. For example, many compilers can (and will) remove buf2 because it's not used. They can also remove x = 0, as its effect is unused and later overwritten. They can also remove x = 1 and replace all occurrences of x with a constant 1, etc.
That said, you absolutely need to work out the numbers for the specific assembly you're getting from your specific compiler and its settings.
Question 1
Since you provided the assembly for main(), I can confirm that you need to add 7 bytes to the return address, which would normally be 0x0000000000400595, to skip over x=1 and land on 0x000000000040059c, which loads x into a register for later use. 0x000000000040059c - 0x0000000000400595 = 7.
Question 2
Adding just 5 bytes instead of 7 will indeed jump into the middle of an instruction. However, the 2-byte tail of that instruction happens (by pure chance) to be another valid instruction encoding. This is why it doesn't crash.
Question 3
This is again very compiler- and settings-dependent. Pretty much anything can happen there. Since you didn't provide a disassembly, I can only guess: buf1 and buf2 are each rounded up to the next stack unit boundary (8 bytes on x86-64). buf1 becomes 8 bytes, and buf2 becomes 16 bytes. Frame pointers are not saved to the stack on x64, so there is no "SFP". That's 24 bytes total.
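Rather than guessing, you can have the program print its own layout; a sketch (__builtin_frame_address is a GCC extension, and the +8 return-address offset assumes x86-64 with a frame pointer in use):

#include <stdio.h>

void fn(int a, int b, int c)
{
    char buf1[5];
    char buf2[10];

    /* With a frame pointer, the saved %rbp lives at
     * __builtin_frame_address(0) and the return address 8 bytes above. */
    char *frame = __builtin_frame_address(0);
    printf("buf1=%p buf2=%p ret slot at buf1+%td\n",
           (void *)buf1, (void *)buf2, (frame + 8) - buf1);
}

int main(void)
{
    fn(1, 2, 3);
    return 0;
}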

Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?

That's what I understood from reading some memory segmentation documents: when a function is called, a few instructions (called the function prologue) save the frame pointer on the stack, copy the value of the stack pointer into the base pointer, and reserve some memory for local variables.
Here's a trivial code I am trying to debug using GDB:
void test_function(int a, int b, int c, int d) {
    int flag;
    char buffer[10];

    flag = 31337;
    buffer[0] = 'A';
}

int main() {
    test_function(1, 2, 3, 4);
}
The purpose of debugging this code was to understand what happens on the stack when a function is called, so I examined the memory at various steps of the program's execution (before calling the function and during its execution). Although I managed to see things like the return address and the saved frame pointer by examining the base pointer, I really can't understand what I describe below, after the disassembled code.
Disassembling:
(gdb) disassemble main
Dump of assembler code for function main:
0x0000000000400509 <+0>: push rbp
0x000000000040050a <+1>: mov rbp,rsp
0x000000000040050d <+4>: mov ecx,0x4
0x0000000000400512 <+9>: mov edx,0x3
0x0000000000400517 <+14>: mov esi,0x2
0x000000000040051c <+19>: mov edi,0x1
0x0000000000400521 <+24>: call 0x4004ec <test_function>
0x0000000000400526 <+29>: pop rbp
0x0000000000400527 <+30>: ret
End of assembler dump.
(gdb) disassemble test_function
Dump of assembler code for function test_function:
0x00000000004004ec <+0>: push rbp
0x00000000004004ed <+1>: mov rbp,rsp
0x00000000004004f0 <+4>: mov DWORD PTR [rbp-0x14],edi
0x00000000004004f3 <+7>: mov DWORD PTR [rbp-0x18],esi
0x00000000004004f6 <+10>: mov DWORD PTR [rbp-0x1c],edx
0x00000000004004f9 <+13>: mov DWORD PTR [rbp-0x20],ecx
0x00000000004004fc <+16>: mov DWORD PTR [rbp-0x4],0x7a69
0x0000000000400503 <+23>: mov BYTE PTR [rbp-0x10],0x41
0x0000000000400507 <+27>: pop rbp
0x0000000000400508 <+28>: ret
End of assembler dump.
I understand that "saving the frame pointer on the stack" is done by push rbp, and "copying the value of the stack pointer into the base pointer" is done by mov rbp,rsp, but what is getting me confused is the lack of a "sub rsp, $n_bytes" instruction for "reserving some memory for local variables". I've seen that instruction in a lot of examples (even in some topics here on Stack Overflow).
I also read that arguments should have a positive offset from the base pointer (after it's loaded with the stack pointer value): since they are stored by the caller, and the stack grows toward lower addresses, it makes perfect sense that when the base pointer is updated with the stack pointer value, the arguments are found back up the stack at positive offsets. But my code seems to store them at a negative offset, just like local variables. I also can't understand why they are put in those registers (in main); shouldn't they be saved directly at an offset from rsp?
Maybe these differences are due to the fact that I'm using a 64 bit system, but my researches didn't lead me to anything that would explain what I am facing.
The System V ABI for x86-64 specifies a red zone of 128 bytes below %rsp. These 128 bytes belong to the function as long as it doesn't call any other function (it is a leaf function).
Signal handlers (and functions called by a debugger) need to respect the red zone, since they are effectively involuntary function calls. All of the local variables of your test_function, which is a leaf function, fit into the red zone, thus no adjustment of %rsp is needed. (Also, the function has no visible side-effects and would be optimized out on any reasonable optimization setting).
You can compile with -mno-red-zone to stop the compiler from using space below the stack pointer. Kernel code has to do this because hardware interrupts don't implement a red-zone.
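A quick way to watch the red zone in action (a sketch; exact assembly depends on the gcc version):

/* redzone.c */
int leaf(void)
{
    volatile int local = 42;   /* volatile: force a real store */
    return local;
}

/* $ gcc -O1 -S redzone.c
 *     -> local stored at -4(%rsp) with no sub: the red zone in use
 * $ gcc -O1 -mno-red-zone -S redzone.c
 *     -> the compiler adjusts %rsp before touching memory below it */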
But my code seems to store them in a negative offset, just like local variables
The first x86-64 arguments are passed in registers, not on the stack. So when rbp is set to rsp, they are not on the stack, and cannot be at a positive offset.
They are stored on the stack only to:

- save register state for a second function call; in this case, that is not required, since this is a leaf function.
- make register allocation easier; but an optimizing allocator could do a better job without a memory spill here.
The situation would be different if you had (see the sketch after this list):

- an x86-64 function with lots of arguments: those that don't fit in registers go on the stack.
- IA-32, where every argument goes on the stack.
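A sketch of the first case, assuming the SysV x86-64 calling convention (the function name is mine): with more than six integer arguments, the extra ones arrive on the stack and show up at positive offsets from rbp.

/* many_args.c - a..f arrive in rdi, rsi, rdx, rcx, r8 and r9;
 * g and h are passed on the stack, so with a frame pointer they
 * are read from 16(%rbp) and 24(%rbp): positive offsets. */
int sum8(int a, int b, int c, int d, int e, int f, int g, int h)
{
    return a + b + c + d + e + f + g + h;
}

/* $ gcc -O0 -S many_args.c    # look for 16(%rbp) and 24(%rbp) */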
the lack of a "sub rsp $n_bytes" for "saving some memory for local variables".
The missing sub rsp on red zone of leaf function part of the question had already been asked at: Why does the x86-64 GCC function prologue allocate less stack than the local variables?
