Related
the case is
int func(void){
int A = 10;
int B = 20;
return A+B
}
which is being called by the main function
int main(void){
int retVal = func();
return 0;
}
in the function func() two local variables will be stored onto the stack for the scope of func() but where does the result of A+B stored?
and over the call by reference, how reliable this method is?
what is the difference between following function bodies
int func(void){
int A = 20;
return A;
}
and
int* func(void){
int A = 20;
return &A;
}
why returning the values does not throw the error of the segmentation fault but returning the address do?
where does the result of A+B stored?
This strongly depends on specific architecture and specific calling convension - every architecture is different. Let's inspect the most common one - x86-64 on Linux (see https://en.wikipedia.org/wiki/X86-64 , https://en.wikipedia.org/wiki/X86_calling_conventions#cdecl , Where is the x86-64 System V ABI documented? , https://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture ).
The function you presented is compiled by gcc to godbolt link:
func:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 10
mov DWORD PTR [rbp-8], 20
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
On x86-64 the return value is stored inside eax register. The add eax, edx puts the result of the addition inside eax register. After setting eax, the function than returns, and main can read the content of eax register if he wants to get the return value.
Given that you tagged this with the "C" keyword, it is worth saying that the intent in the early days of C was that the return value, as an integer or a pointer, should fit in a processor register, so no memory is allocated to storing the value.
The calling function may need to declare a variable to store the result into, and it is responsible for that allocation. Immediately on return from the function the caller will stash that agreed processor register value into the memory it has reserved. Of course, that may not be necessary if the value is used immediately for some other calculation.
When returning a pointer, what the pointer points to is a problem for the programmer: you. As you have found, if you try to access a value you only declared as a local variable in the called function and returned using a pointer, the local variable space - the function call stack - is heavily reused, and your value will quickly be junked.
Of course, you can return floating point values and structs in modern C and C++, which often needs different handling. Usually the caller function has to reserve space for the called function to store these larger objects into.
Note that the compilers are often able to inline and optimise code to use available registers, rather than repeatedly using the agreed registers, and sometimes even replace small structures with a set of registers.
Tools like godbolt can let you easily see what the compiler has done to your code.
in the function func() two local variables will be stored onto the stack for the scope of func() but where does the result of A+B stored?
Depends on the specific calling convention for the target architecture, usually in a register (such eax on x86).
what is the difference between following function bodies
int func(void){
int A = 20;
return A;
}
and
int* func(void){
int A = 20;
return &A;
}
In the first case you are returning the result of the expression A, which is simply the integer value 20; IOW, the value 20 is written to some register or other memory location, which is read by the calling function.
In the second case you are returning the result of the expression &A, which is the address of the variable A in func. The problem with this is that once func exits A ceases to exist and that memory location becomes available for something else to use; the pointer value is no longer valid, and the behavior on dereferencing an invalid pointer is undefined.
I'm trying to get a deeper understanding of how the low level operations of programming languages work and especially how they interact with the OS/CPU. I've probably read every answer in every stack/heap related thread here on Stack Overflow, and they are all brilliant. But there is still one thing that I didn't fully understand yet.
Consider this function in pseudo code which tends to be valid Rust code ;-)
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(a, b);
doAnotherThing(c, d);
}
This is how I assume the stack to look like on line X:
Stack
a +-------------+
| 1 |
b +-------------+
| 2 |
c +-------------+
| 3 |
d +-------------+
| 4 |
+-------------+
Now, everything I've read about how the stack works is that it strictly obeys LIFO rules (last in, first out). Just like a stack datatype in .NET, Java or any other programming language.
But if that's the case, then what happens after line X? Because obviously, the next thing we need is to work with a and b, but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot, because it needs c and d in the next line.
So, I wonder what exactly happens behind the scenes?
Another related question. Consider we pass a reference to one of the other functions like this:
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(&a, &b);
doAnotherThing(c, d);
}
From how I understand things, this would mean that the parameters in doSomething are essentially pointing to the same memory address like a and b in foo. But then again this means that there is no pop up the stack until we get to a and b happening.
Those two cases make me think that I haven't fully grasped how exactly the stack works and how it strictly follows the LIFO rules.
The call stack could also be called a frame stack.
The things that are stacked after the LIFO principle are not the local variables but the entire stack frames ("calls") of the functions being called. The local variables are pushed and popped together with those frames in the so-called function prologue and epilogue, respectively.
Inside the frame the order of the variables is completely unspecified; Compilers "reorder" the positions of local variables inside a frame appropriately to optimize their alignment so the processor can fetch them as quickly as possible. The crucial fact is that the offset of the variables relative to some fixed address is constant throughout the lifetime of the frame - so it suffices to take an anchor address, say, the address of the frame itself, and work with offsets of that address to the variables. Such an anchor address is actually contained in the so-called base or frame pointer which is stored in the EBP register. The offsets, on the other hand, are clearly known at compile time and are therefore hardcoded into the machine code.
This graphic from Wikipedia shows what the typical call stack is structured like1:
Add the offset of a variable we want to access to the address contained in the frame pointer and we get the address of our variable. So shortly said, the code just accesses them directly via constant compile-time offsets from the base pointer; It's simple pointer arithmetic.
Example
#include <iostream>
int main()
{
char c = std::cin.get();
std::cout << c;
}
gcc.godbolt.org gives us
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >::get()
movb %al, -1(%rbp)
movsbl -1(%rbp), %eax
movl %eax, %esi
movl std::cout, %edi
call [... the insertion operator for char, long thing... ]
movl $0, %eax
leave
ret
.. for main. I divided the code into three subsections.
The function prologue consists of the first three operations:
Base pointer is pushed onto the stack.
The stack pointer is saved in the base pointer
The stack pointer is subtracted to make room for local variables.
Then cin is moved into the EDI register2 and get is called; The return value is in EAX.
So far so good. Now the interesting thing happens:
The low-order byte of EAX, designated by the 8-bit register AL, is taken and stored in the byte right after the base pointer: That is -1(%rbp), the offset of the base pointer is -1. This byte is our variable c. The offset is negative because the stack grows downwards on x86. The next operation stores c in EAX: EAX is moved to ESI, cout is moved to EDI and then the insertion operator is called with cout and c being the arguments.
Finally,
The return value of main is stored in EAX: 0. That is because of the implicit return statement.
You might also see xorl rax rax instead of movl.
leave and return to the call site. leave is abbreviating this epilogue and implicitly
Replaces the stack pointer with the base pointer and
Pops the base pointer.
After this operation and ret have been performed, the frame has effectively been popped, although the caller still has to clean up the arguments as we're using the cdecl calling convention. Other conventions, e.g. stdcall, require the callee to tidy up, e.g. by passing the amount of bytes to ret.
Frame Pointer Omission
It is also possible not to use offsets from the base/frame pointer but from the stack pointer (ESB) instead. This makes the EBP-register that would otherwise contain the frame pointer value available for arbitrary use - but it can make debugging impossible on some machines, and will be implicitly turned off for some functions. It is particularly useful when compiling for processors with only few registers, including x86.
This optimization is known as FPO (frame pointer omission) and set by -fomit-frame-pointer in GCC and -Oy in Clang; note that it is implicitly triggered by every optimization level > 0 if and only if debugging is still possible, since it doesn't have any costs apart from that.
For further information see here and here.
1 As pointed out in the comments, the frame pointer is presumably meant to point to the address after the return address.
2 Note that the registers that start with R are the 64-bit counterparts of the ones that start with E. EAX designates the four low-order bytes of RAX. I used the names of the 32-bit registers for clarity.
Because obviously, the next thing we need is to work with a and b but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot because it needs c and d in the next line.
In short:
There is no need to pop the arguments. The arguments passed by caller foo to function doSomething and the local variables in doSomething can all be referenced as an offset from the base pointer.
So,
When a function call is made, function's arguments are PUSHed on stack. These arguments are further referenced by base pointer.
When the function returns to its caller, the arguments of the returning function are POPed from the stack using LIFO method.
In detail:
The rule is that each function call results in a creation of a stack frame (with the minimum being the address to return to). So, if funcA calls funcB and funcB calls funcC, three stack frames are set up one on top of the another. When a function returns, its frame becomes invalid. A well-behaved function acts only on its own stack frame and does not trespass on another's. In another words the POPing is performed to the stack frame on the top (when returning from the function).
The stack in your question is setup by caller foo. When doSomething and doAnotherThing are called, then they setup their own stack. The figure may help you to understand this:
Note that, to access the arguments, the function body will have to traverse down (higher addresses) from the location where the return address is stored, and to access the local variables, the function body will have to traverse up the stack (lower addresses) relative to the location where the return address is stored. In fact, typical compiler generated code for the function will do exactly this. The compiler dedicates a register called EBP for this (Base Pointer). Another name for the same is frame pointer. The compiler typically, as the first thing for the function body, pushes the current EBP value on to the stack and sets the EBP to the current ESP. This means, once this is done, in any part of the function code, argument 1 is EBP+8 away (4 bytes for each of caller's EBP and the return address), argument 2 is EBP+12(decimal) away, local variables are EBP-4n away.
.
.
.
[ebp - 4] (1st local variable)
[ebp] (old ebp value)
[ebp + 4] (return address)
[ebp + 8] (1st argument)
[ebp + 12] (2nd argument)
[ebp + 16] (3rd function argument)
Take a look at the following C code for the formation of stack frame of the function:
void MyFunction(int x, int y, int z)
{
int a, int b, int c;
...
}
When caller call it
MyFunction(10, 5, 2);
the following code will be generated
^
| call _MyFunction ; Equivalent to:
| ; push eip + 2
| ; jmp _MyFunction
| push 2 ; Push first argument
| push 5 ; Push second argument
| push 10 ; Push third argument
and the assembly code for the function will be (set-up by callee before returning)
^
| _MyFunction:
| sub esp, 12 ; sizeof(a) + sizeof(b) + sizeof(c)
| ;x = [ebp + 8], y = [ebp + 12], z = [ebp + 16]
| ;a = [ebp - 4] = [esp + 8], b = [ebp - 8] = [esp + 4], c = [ebp - 12] = [esp]
| mov ebp, esp
| push ebp
References:
Function Call Conventions and the Stack.
Frame Pointer and Local Variables.
x86 Disassembly/Functions and Stack Frames.
Like others noted, there is no need to pop parameters, until they go out of scope.
I will paste some example from "Pointers and Memory" by Nick Parlante.
I think the situation is a bit more simple than you envisioned.
Here is code:
void X()
{
int a = 1;
int b = 2;
// T1
Y(a);
// T3
Y(b);
// T5
}
void Y(int p)
{
int q;
q = p + 2;
// T2 (first time through), T4 (second time through)
}
The points in time T1, T2, etc. are marked in
the code and the state of memory at that time is shown in the drawing:
Different processors and languages use a few different stack designs. Two traditional patterns on both the 8x86 and 68000 are called the Pascal calling convention and the C calling convention; each convention is handled the same way in both processors, except for the names of the registers. Each uses two registers to manage the stack and associated variables, called the stack pointer (SP or A7) and the frame pointer (BP or A6).
When calling subroutine using either convention, any parameters are be pushed on the stack before calling the routine. The routine's code then pushes the current value of the frame pointer onto the stack, copies the current value of the stack pointer to the frame pointer, and subtracts from the stack pointer the number of bytes used by local variables [if any]. Once that is done, even if additional data are pushed onto the stack, all local variables will be stored at variables with a constant negative displacement from the stack pointer, and all parameters that were pushed on the stack by the caller may be accessed at a constant positive displacement from the frame pointer.
The difference between the two conventions lies in the way they handle an exit from subroutine. In the C convention, the returning function copies the frame pointer to the stack pointer [restoring it to the value it had just after the old frame pointer was pushed], pops the old frame pointer value, and performs a return. Any parameters the caller had pushed on the stack before the call will remain there. In the Pascal convention, after popping the old frame pointer, the processor pops the function return address, adds to the stack pointer the number of bytes of parameters pushed by the caller, and then goes to the popped return address. On the original 68000 it was necessary to use a 3-instruction sequence to remove the caller's parameters; the 8x86 and all 680x0 processors after the original included a "ret N" [or 680x0 equivalent] instruction which would add N to the stack pointer when performing a return.
The Pascal convention has the advantage of saving a little bit of code on the caller side, since the caller doesn't have to update the stack pointer after a function call. It requires, however, that the called function know exactly how many bytes worth of parameters the caller is going to put on the stack. Failing to push the proper number of parameters onto the stack before calling a function which uses the Pascal convention is almost guaranteed to cause a crash. This is offset, however, by the fact that a little extra code within each called method will save code at the places where the method is called. For that reason, most of the original Macintosh toolbox routines used the Pascal calling convention.
The C calling convention has the advantage of allowing routines to accept a variable number of parameters, and being robust even if a routine doesn't use all the parameters that are passed (the caller will know how many bytes worth of parameters it pushed, and will thus be able to clean them up). Further, it isn't necessary to perform stack cleanup after every function call. If a routine calls four functions in sequence, each of which used four bytes worth of parameters, it may--instead of using an ADD SP,4 after each call, use one ADD SP,16 after the last call to cleanup the parameters from all four calls.
Nowadays the described calling conventions are considered somewhat antiquated. Since compilers have gotten more efficient at register usage, it is common to have methods accept a few parameters in registers rather than requiring that all parameters be pushed on the stack; if a method can use registers to hold all the parameters and local variables, there's no need to use a frame pointer, and thus no need to save and restore the old one. Still, it's sometimes necessary to use the older calling conventions when calling libraries that was linked to use them.
There are already some really good answers here. However, if you are still concerned about the LIFO behavior of the stack, think of it as a stack of frames, rather than a stack of variables. What I mean to suggest is that, although a function may access variables that are not on the top of the stack, it is still only operating on the item at the top of the stack: a single stack frame.
Of course, there are exceptions to this. The local variables of the entire call chain are still allocated and available. But they won't be accessed directly. Instead, they are passed by reference (or by pointer, which is really only different semantically). In this case a local variable of a stack frame much further down can be accessed. But even in this case, the currently executing function is still only operating on its own local data. It is accessing a reference stored in its own stack frame, which may be a reference to something on the heap, in static memory, or further down the stack.
This is the part of the stack abstraction that makes functions callable in any order, and allows recursion. The top stack frame is the only object that is directly accessed by the code. Anything else is accessed indirectly (through a pointer that lives in the top stack frame).
It might be instructive to look at the assembly of your little program, especially if you compile without optimization. I think you will see that all of the memory access in your function happens through an offset from the stack frame pointer, which is the how the code for the function will be written by the compiler. In the case of a pass by reference, you would see indirect memory access instructions through a pointer that is stored at some offset from the stack frame pointer.
The call stack is not actually a stack data structure. Behind the scenes, the computers we use are implementations of the random access machine architecture. So, a and b can be directly accessed.
Behind the scenes, the machine does:
get "a" equals reading the value of the fourth element below stack top.
get "b" equals reading the value of the third element below stack top.
http://en.wikipedia.org/wiki/Random-access_machine
Here is a diagram I created for a call stack for a C++ program on Windows that uses the Windows x64 calling convention. It's more accurate and contemporary than the google image versions:
And corresponding to the exact structure of the above diagram, here is a debug of notepad.exe x64 on windows 7, where the first instruction of a function, 'current function' (because I forgot what function it is), is about to execute.
The low addresses and high addresses are swapped so the stack is climbing upwards in this diagram (it is a vertical flip of the first diagram, also note that the data is formatted to show quadwords and not bytes, so the little endianism cannot be seen). Black is the home space; blue is the return address, which is an offset into the caller function or label in the caller function to the instruction after the call; orange is the alignment; and pink is where rsp is pointing after the prologue of the function, or rather, before the call is made if you are using alloca. The homespace_for_the_next_function+return_address value is the smallest allowed frame on windows, and because the 16 byte rsp alignment right at the start of the called function must be maintained, it includes an 8 byte alignment as well, such that rsp pointing to the first byte after the return address will be aligned to 16 bytes (because rsp was guaranteed to be aligned to 16 bytes when the function was called and homespace+return_address = 40, which is not divisible by 16 so you need an extra 8 bytes to ensure the rsp will be aligned after the function makes a call). Because these functions do not require any stack locals (because they can be optimised into registers) or stack parameters/return values (as they fit in registers) and do not use any of the other fields, the stack frames in green are all alignment+homespace+return_address in size.
The red function lines outline what the callee function logically 'owns' + reads / modifies by value in the calling convention without needing a reference to it (it can modify a parameter passed on the stack that was too big to pass in a register on -Ofast), and is the classic conception of a stack frame. The green frames demarcate what results from the call and the allocation the called function makes: The first green frame shows what the RtlUserThreadStart actually allocates in the duration of the function call (from immediately before the call to executing the next call instruction) and goes from the first byte before the return address to the final byte allocated by the function prologue (or more if using alloca). RtlUserThreadStart allocates the return address itself as null, so you see a sub rsp, 48h and not sub rsp, 40h in the prologue, because there is no call to RtlUserThreadStart, it just begins execution at that rip at the base of the stack.
Stack space that is needed by the function is assigned in the function prologue by decrementing the stack pointer.
For example, take the following C++, and the MASM it compiles to (-O0).
typedef struct _struc {int a;} struc, pstruc;
int func(){return 1;}
int square(_struc num) {
int a=1;
int b=2;
int c=3;
return func();
}
_DATA SEGMENT
_DATA ENDS
int func(void) PROC ; func
mov eax, 1
ret 0
int func(void) ENDP ; func
a$ = 32 //4 bytes from rsp+32 to rsp+35
b$ = 36
c$ = 40
num$ = 64
//masm shows stack locals and params relative to the address of rsp; the rsp address
//is the rsp in the main body of the function after the prolog and before the epilog
int square(_struc) PROC ; square
$LN3:
mov DWORD PTR [rsp+8], ecx
sub rsp, 56 ; 00000038H
mov DWORD PTR a$[rsp], 1
mov DWORD PTR b$[rsp], 2
mov DWORD PTR c$[rsp], 3
call int func(void) ; func
add rsp, 56 ; 00000038H
ret 0
int square(_struc) ENDP ; square
As can be seen, 56 bytes are reserved, and the green stack frame will be 64 bytes in size when the call instruction allocates the 8 byte return address as well.
The 56 bytes consist of 12 bytes of locals, 32 bytes of home space, and 12 bytes of alignment.
All callee register saving and storing register parameters in the home space happens in the prologue before the prologue reserves (using sub rsp, x instruction) stack space needed by the main body of the function. The alignment is at the highest address of the space reserved by the sub rsp, x instruction, and the final local variable in the function is assigned at the next lower address after that (and within the assignment for that primitive data type itself it starts at the lowest address of that assignment and works towards the higher addresses, bytewise, because it is little endian), such that the first primitive type (array cell, variable etc.) in the function is at the top of the stack, although the locals can be allocated in any order. This is shown in the following diagram for a different random example code to the above, that does not call any functions (still using x64 Windows cc):
If you remove the call to func(), it only reserves 24 bytes, i.e. 12 bytes of of locals and 12 bytes of alignment. The alignment is at the start of the frame. When a function pushes something to the stack or reserves space on the stack by decrementing the rsp, rsp needs to be aligned, regardless of whether it is going to call another function or not. If the allocation of stack space can be optimised out and no homespace+return_addreess is required because the function does not make a call, then there will be no alignment requirement as rsp does not change. It also does not need to align if the stack will be aligned by 16 with just the locals (+ homespace+return_address if it makes a call) that it needs to allocate, essentially it rounds up the space it needs to allocate to a 16 byte boundary.
rbp is not used on the x64 Windows calling convention unless alloca is used.
On gcc 32 bit cdecl and 64 bit system V calling conventions, rbp is used, and the new rbp points to the first byte after the old rbp (only if compiling using -O0, because it is saved to the stack on -O0, otherwise, rbp will point to the first byte after the return address). On these calling conventions, if compiling using -O0, it will, after callee saved registers, store register parameters to the stack, and this will be relative to rbp and part of the stack reservation done by the rsp decrement. Data within the stack reservation done by the rsp decrement is accessed relative rbp rather than rsp, unlike Windows x64 cc. On the Windows x64 calling convention, it stores parameters that were passed to it in registers to the homespace that was assigned for it if it is a varargs function or compiling using -O0. If it is not a varargs function then on -O1, it will not write them to the homespace but the homespace will still be provided to it by the calling function, this means that it actually accesses those variables from the register rather from the homespace location on the stack after it stores it there, unlike O0 (which saves them to the homespace and then accesses them through the stack and not the registers).
If a function call is placed in the function represented by the previous diagram, the stack will now look like this before the callee function's prologue starts (Windows x64 cc):
Orange indicates the part that the callee can freely arrange (arrays and structs remain contiguous of course, and work their way towards higher addresses, each element being little endian), so it can put the variables and the return value allocation in any order, and it passes a pointer for the return value allocation in rcx for the callee to write to when the return type of the function it is calling cannot be passed in rax. On -O0, if the return value cannot be passed in rax, there is also an anonymous variable created (as well as the return value space and as well as any variable it is assigned to, so there can be 3 copies of the struct). -Ofast cant optimise out the return value space because it is return by value, but it optimises out the anonymous return variable if the return value is not used, or assigns it straight to the variable the return value is being assigned to without creating an anonymous variable, so -Ofast has 2 / 1 copies and -O0 has 3 / 2 copies (return value assigned to a variable / return value not assigned to a variable). Blue indicates the part the callee must provide in exact order for the calling convention of the callee (the parameters must be in that order, such that the first stack parameter from left to right in the function signature is at the top of the stack, which is the same as how cdecl (which is a 32 bit cc) orders its stack parameters. The alignment for the callee can however be in any location, although I've only ever seen it to be between the locals and callee pushed registers.
If the function calls multiple functions, the call is in the same place on the stack for all the different possible callsites in the function, this is because the prologue caters for the whole function, including all calls it makes, and the parameters and homespace for any called function is always at the end of the allocation made in the prologue.
It turns out that C/C++ Microsoft calling convention only passes a struct in the registers if it fits into one register, otherwise it copies the local / anonymous variable and passes a pointer to it in the first available register. On gcc C/C++, if the struct does not fit in the first 2 parameter registers then it's passed on the stack and a pointer to it is not passed because the callee knows where it is due to the calling convention.
Arrays are passed by reference regardless of their size. So if you need to use rcx as the pointer to the return value allocation then if the first parameter is an array, the pointer will be passed in rdx, which will be a pointer to the local variable that is being passed. In this case, it does not need to copy it to the stack as a parameter because it's not passed by value. The pointer however is passed on the stack when passing by reference if there are no registers available to pass the pointer in.
I feel kind of dumb, but I'm struggling with dereferencing a pointer (+ adding an offset) in C.
What I want to recreate in C is this behavior:
movabs rax, 0xdeadbeef
add rax, 0xa
mov rax, QWORD PTR [rax]
So at the end rax should be: *(0xdeadbeef+0xa)
Especially the equivalent to mov rax, QWORD PTR [rax] would be improtant, as I need to use the calculated value and retrieve the data (=a different address) that is being stored at that point.
I tried so many things, but here is my current stage:
void *ptr = (void*)0xdeadbeef;
void *ptr2 = *(void*)(ptr+0xa);
Which translates to sth like this:
0x7ffff7fe6050: mov QWORD PTR [rbp-0x38],rax
0x7ffff7fe6054: mov rax,QWORD PTR [rbp-0x38]
0x7ffff7fe6058: add rax,0xa
EDIT: It does not actually compile, I made a mistake with the provided C code here and can't figure out which code actually compiled to this. It's not that important anyways as the main target was the translation of ASM to C and the problem is solved now. Thanks everyone for participating.
So the first 2 lines are basically useless and just the value is added to my address and nothing more. I need it to be interpreted as an address and retrieve the value at that point though.
The data stored at those places doesn't matter at this point. Essentially what I want to do is find a specific value in memory and I know a way of adding offsets and dereferencing pointers to get to my goal. The final step will just be a typecast from my address to the actual datatype at that point.
I know this may seem trivial to some of you, but I'm not super familiar with C, so I'm struggling here...
You can simplify your asm to a single instruction, with the math done at assemble time. movabs rax, [0xdeadbeef + 0xa] can use the AL/AX/EAX/RAX-only form of mov that loads from a 64-bit absolute address (https://felixcloutier.com/x86/MOV.html). (It won't fit in a 32-bit sign-extended disp32, because the high bit of the low 32 is set, unlike normal static addresses in position-dependent code). Regular mov with a 32-bit address-size override would work, too, in about 7 bytes, because your address does fit in a zero-extended 32-bit integer.
In C you can also do the whole thing with a single statement. No need to overcomplicate things: your address is a pointer to a pointer, so you need to cast your integer to a x ** type.
void *ptr = *(const void**)(0xdeadbeefUL + 0xa);
In asm pointers are just integers, so it makes sense to do your math using integers instead of char*. Making it unsigned guarantees it zero-extends to pointer-width instead of sign-extending.
(Numeric literals in C have a type wide enough to represent the value, though, so 0xdeadbeef on an x86-64 compiler would be an int64_t (long long). You wouldn't actually get 0xdeadbeef being a negative 32-bit int that sign-extended to 0xffffffffdeadbeef.)
Since void doesn't have a size, you can't add / subtract integers to a void*. And pointer-math on void ** would be in chunks of sizeof(void*).
To avoid undefined behaviour from dereferencing a void** that's not aligned by 8 = alignof(void*) (in both mainstream x86-64 ABIs), you'd want to use memcpy. But I assume your example address is just a fake example. The mainstream x86 compilers like gcc don't do anything weird with unaligned addresses to punish programmers for UB, so the compiler output will contain unaligned loads which work fine on x86. But when auto-vectorizing you can run into problems from this kind of UB. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
But if you did for some reason want to break things up into multiple asm statements, you could transliterate it into multiple C statements like this:
uintptr_t wheres_the_beef = 0xdeadbeef; // mov eax, 0xdeadbeef
wheres_the_beef += 0xa; // add eax, 0xa
void **address = (void**)wheres_the_beef; // purely a cast, no asm instructions;
void *ptr = *address; // mov rax, [rax]
You could mess around with char* if you wanted to add byte offsets to pointers, but there's really no point here.
Again, this still has undefined behaviour on most C implementations, where alignof(void*) is greater than 1 so void **address = (void**)wheres_the_beef creates a misaligned pointer.
(Fun fact: even creating misaligned pointers is UB in ISO C. But all x86 compilers that support Intel's intrinsics must support creating of misaligned pointers for passing them to intrinsics like _mm_loadu_ps(), so only actually dereferencing them is a potential problem on x86 compilers.)
Example Showing the gcc Optimization and User Code that May Fault
The function 'foo' in the snippet below will load only one of the struct members A or B; well at least that is the intention of the unoptimized code.
typedef struct {
int A;
int B;
} Pair;
int foo(const Pair *P, int c) {
int x;
if (c)
x = P->A;
else
x = P->B;
return c/102 + x;
}
Here is what gcc -O3 gives:
mov eax, esi
mov edx, -1600085855
test esi, esi
mov ecx, DWORD PTR [rdi+4] <-- ***load P->B**
cmovne ecx, DWORD PTR [rdi] <-- ***load P->A***
imul edx
lea eax, [rdx+rsi]
sar esi, 31
sar eax, 6
sub eax, esi
add eax, ecx
ret
So it appears that gcc is allowed to speculatively load both struct members in order to eliminate branching. But then, is the following code considered undefined behavior or is the gcc optimization above illegal?
#include <stdlib.h>
int naughty_caller(int c) {
Pair *P = (Pair*)malloc(sizeof(Pair)-1); // *** Allocation is enough for A but not for B ***
if (!P) return -1;
P->A = 0x42; // *** Initializing allocation only where it is guaranteed to be allocated ***
int res = foo(P, 1); // *** Passing c=1 to foo should ensure only P->A is accessed? ***
free(P);
return res;
}
If the load speculation will happen in the above scenario there is a chance that loading P->B will cause an exception because the last byte of P->B may lie in unallocated memory. This exception will not happen if the optimization is turned off.
The Question
Is the gcc optimization shown above of load speculation legal? Where does the spec say or imply that it's ok?
If the optimization is legal, how is the code in 'naughtly_caller' turn out to be undefined behavior?
Reading a variable (that was not declared as volatile) is not considered to be a "side effect" as specified by the C standard. So the program is free to read a location and then discard the result, as far as the C standard is concerned.
This is very common. Suppose you request 1 byte of data from a 4 byte integer. The compiler may then read the whole 32 bits if that's faster (aligned read), and then discard everything but the requested byte. Your example is similar to this but the compiler decided to read the whole struct.
Formally this is found in the behavior of "the abstract machine", C11 chapter 5.1.2.3. Given that the compiler follows the rules specified there, it is free to do as it pleases. And the only rules listed are regarding volatile objects and sequencing of instructions. Reading a different struct member in a volatile struct would not be ok.
As for the case of allocating too little memory for the whole struct, that's undefined behavior. Because the memory layout of the struct is usually not for the programmer to decide - for example the compiler is allowed to add padding at the end. If there's not enough memory allocated, you might end up accessing forbidden memory even though your code only works with the first member of the struct.
No, if *P is allocated correctly P->B will never be in unallocated memory. It might not be initialized, that is all.
The compiler has every right to do what they do. The only thing that is not allowed is to oops about the access of P->B with the excuse that it is not initialized. But what and how they do all of this is under the discretion of the implementation and not your concern.
If you cast a pointer to a block returned by malloc to Pair* that is not guaranteed to be wide enough to hold a Pair the behavior of your program is undefined.
This is perfectly legal because reading some memory location isn't considered an observable behavior in the general case (volatile would change this).
Your example code is indeed undefined behavior, but I can't find any passage in the standard docs that explicitly states this. But I think it's enough to have a look at the rules for effective types ... from N1570, ยง6.5 p6:
If a value is stored into an object having no declared type through an
lvalue having a type that is not a character type, then the type of the lvalue becomes the
effective type of the object for that access and for subsequent accesses that do not modify
the stored value.
So, your write access to *P actually gives that object the type Pair -- therefore it just extends into memory you didn't allocate, the result is an out of bounds access.
A postfix expression followed by the -> operator and an identifier designates a member of a structure or union object. The value is that of the named member of the object to which the first expression points
If invoking the expression P->A is well-defined, then P must actually point to an object of type struct Pair, and consequently P->B is well-defined as well.
A -> operator on a Pair * implies that there's a whole Pair object fully allocated. (#Hurkyl quotes the standard.)
x86 (like any normal architecture) doesn't have side-effects for accessing normal allocated memory, so x86 memory semantics are compatible with the C abstract machine's semantics for non-volatile memory. Compilers can speculatively load if/when they think that will be a performance win on target microarchitecture they're tuning for in any given situation.
Note that on x86 memory protection operates with page granularity. The compiler could unroll a loop or vectorize with SIMD in a way that reads outside an object, as long as all pages touched contain some bytes of the object. Is it safe to read past the end of a buffer within the same page on x86 and x64?. libc strlen() implementations hand-written in assembly do this, but AFAIK gcc doesn't, instead using scalar loops for the leftover elements at the end of an auto-vectorized loop even where it already aligned the pointers with a (fully unrolled) startup loop. (Perhaps because it would make runtime bounds-checking with valgrind difficult.)
To get the behaviour you were expecting, use a const int * arg.
An array is a single object, but pointers are different from arrays. (Even with inlining into a context where both array elements are known to be accessible, I wasn't able to get gcc to emit code like it does for the struct, so if it's struct code is a win, it's a missed optimization not to do it on arrays when it's also safe.).
In C, you're allowed to pass this function a pointer to a single int, as long as c is non-zero. When compiling for x86, gcc has to assume that it could be pointing to the last int in a page, with the following page unmapped.
Source + gcc and clang output for this and other variations on the Godbolt compiler explorer
// exactly equivalent to const int p[2]
int load_pointer(const int *p, int c) {
int x;
if (c)
x = p[0];
else
x = p[1]; // gcc missed optimization: still does an add with c known to be zero
return c + x;
}
load_pointer: # gcc7.2 -O3
test esi, esi
jne .L9
mov eax, DWORD PTR [rdi+4]
add eax, esi # missed optimization: esi=0 here so this is a no-op
ret
.L9:
mov eax, DWORD PTR [rdi]
add eax, esi
ret
In C, you can pass sort of pass an array object (by reference) to a function, guaranteeing to the function that it's allowed to touch all the memory even if the C abstract machine doesn't. The syntax is int p[static 2]
int load_array(const int p[static 2], int c) {
... // same body
}
But gcc doesn't take advantage, and emits identical code to load_pointer.
Off topic: clang compiles all versions (struct and array) the same way, using a cmov to branchlessly compute a load address.
lea rax, [rdi + 4]
test esi, esi
cmovne rax, rdi
add esi, dword ptr [rax]
mov eax, esi # missed optimization: mov on the critical path
ret
This isn't necessarily good: it has higher latency than gcc's struct code, because the load address is dependent on a couple extra ALU uops. It is pretty good if both addresses aren't safe to read and a branch would predict poorly.
We can get better code for the same strategy from gcc and clang, using setcc (1 uop with 1c latency on all CPUs except some really ancient ones), instead of cmovcc (2 uops on Intel before Skylake). xor-zeroing is always cheaper than an LEA, too.
int load_pointer_v3(const int *p, int c) {
int offset = (c==0);
int x = p[offset];
return c + x;
}
xor eax, eax
test esi, esi
sete al
add esi, dword ptr [rdi + 4*rax]
mov eax, esi
ret
gcc and clang both put the final mov on the critical path. And on Intel Sandybridge-family, the indexed addressing mode doesn't stay micro-fused with the add. So this would be better, like what it does in the branching version:
xor eax, eax
test esi, esi
sete al
mov eax, dword ptr [rdi + 4*rax]
add eax, esi
ret
Simple addressing modes like [rdi] or [rdi+4] have 1c lower latency than others on Intel SnB-family CPUs, so this might actually be worse latency on Skylake (where cmov is cheap). The test and lea can run in parallel.
After inlining, that final mov probably wouldn't exist, and it could just add into esi.
This is always allowed under the "as-if" rule if no conforming program can tell the difference. For example, an implementation could guarantee that after each block allocated with malloc, there are at least eight bytes that can be accessed without side effects. In that situation, the compiler can generate code that would be undefined behaviour if you wrote it in your code. So it would be legal for the compiler to read P[1] whenever P[0] is correctly allocated, even if that would be undefined behaviour in your own code.
But in your case, if you don't allocate enough memory for a struct, then reading any member is undefined behaviour. So here the compiler is allowed to do this, even if reading P->B crashes.
I'm trying to get a deeper understanding of how the low level operations of programming languages work and especially how they interact with the OS/CPU. I've probably read every answer in every stack/heap related thread here on Stack Overflow, and they are all brilliant. But there is still one thing that I didn't fully understand yet.
Consider this function in pseudo code which tends to be valid Rust code ;-)
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(a, b);
doAnotherThing(c, d);
}
This is how I assume the stack to look like on line X:
Stack
a +-------------+
| 1 |
b +-------------+
| 2 |
c +-------------+
| 3 |
d +-------------+
| 4 |
+-------------+
Now, everything I've read about how the stack works is that it strictly obeys LIFO rules (last in, first out). Just like a stack datatype in .NET, Java or any other programming language.
But if that's the case, then what happens after line X? Because obviously, the next thing we need is to work with a and b, but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot, because it needs c and d in the next line.
So, I wonder what exactly happens behind the scenes?
Another related question. Consider we pass a reference to one of the other functions like this:
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(&a, &b);
doAnotherThing(c, d);
}
From how I understand things, this would mean that the parameters in doSomething are essentially pointing to the same memory address like a and b in foo. But then again this means that there is no pop up the stack until we get to a and b happening.
Those two cases make me think that I haven't fully grasped how exactly the stack works and how it strictly follows the LIFO rules.
The call stack could also be called a frame stack.
The things that are stacked after the LIFO principle are not the local variables but the entire stack frames ("calls") of the functions being called. The local variables are pushed and popped together with those frames in the so-called function prologue and epilogue, respectively.
Inside the frame the order of the variables is completely unspecified; Compilers "reorder" the positions of local variables inside a frame appropriately to optimize their alignment so the processor can fetch them as quickly as possible. The crucial fact is that the offset of the variables relative to some fixed address is constant throughout the lifetime of the frame - so it suffices to take an anchor address, say, the address of the frame itself, and work with offsets of that address to the variables. Such an anchor address is actually contained in the so-called base or frame pointer which is stored in the EBP register. The offsets, on the other hand, are clearly known at compile time and are therefore hardcoded into the machine code.
This graphic from Wikipedia shows what the typical call stack is structured like1:
Add the offset of a variable we want to access to the address contained in the frame pointer and we get the address of our variable. So shortly said, the code just accesses them directly via constant compile-time offsets from the base pointer; It's simple pointer arithmetic.
Example
#include <iostream>
int main()
{
char c = std::cin.get();
std::cout << c;
}
gcc.godbolt.org gives us
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >::get()
movb %al, -1(%rbp)
movsbl -1(%rbp), %eax
movl %eax, %esi
movl std::cout, %edi
call [... the insertion operator for char, long thing... ]
movl $0, %eax
leave
ret
.. for main. I divided the code into three subsections.
The function prologue consists of the first three operations:
Base pointer is pushed onto the stack.
The stack pointer is saved in the base pointer
The stack pointer is subtracted to make room for local variables.
Then cin is moved into the EDI register2 and get is called; The return value is in EAX.
So far so good. Now the interesting thing happens:
The low-order byte of EAX, designated by the 8-bit register AL, is taken and stored in the byte right after the base pointer: That is -1(%rbp), the offset of the base pointer is -1. This byte is our variable c. The offset is negative because the stack grows downwards on x86. The next operation stores c in EAX: EAX is moved to ESI, cout is moved to EDI and then the insertion operator is called with cout and c being the arguments.
Finally,
The return value of main is stored in EAX: 0. That is because of the implicit return statement.
You might also see xorl rax rax instead of movl.
leave and return to the call site. leave is abbreviating this epilogue and implicitly
Replaces the stack pointer with the base pointer and
Pops the base pointer.
After this operation and ret have been performed, the frame has effectively been popped, although the caller still has to clean up the arguments as we're using the cdecl calling convention. Other conventions, e.g. stdcall, require the callee to tidy up, e.g. by passing the amount of bytes to ret.
Frame Pointer Omission
It is also possible not to use offsets from the base/frame pointer but from the stack pointer (ESB) instead. This makes the EBP-register that would otherwise contain the frame pointer value available for arbitrary use - but it can make debugging impossible on some machines, and will be implicitly turned off for some functions. It is particularly useful when compiling for processors with only few registers, including x86.
This optimization is known as FPO (frame pointer omission) and set by -fomit-frame-pointer in GCC and -Oy in Clang; note that it is implicitly triggered by every optimization level > 0 if and only if debugging is still possible, since it doesn't have any costs apart from that.
For further information see here and here.
1 As pointed out in the comments, the frame pointer is presumably meant to point to the address after the return address.
2 Note that the registers that start with R are the 64-bit counterparts of the ones that start with E. EAX designates the four low-order bytes of RAX. I used the names of the 32-bit registers for clarity.
Because obviously, the next thing we need is to work with a and b but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot because it needs c and d in the next line.
In short:
There is no need to pop the arguments. The arguments passed by caller foo to function doSomething and the local variables in doSomething can all be referenced as an offset from the base pointer.
So,
When a function call is made, function's arguments are PUSHed on stack. These arguments are further referenced by base pointer.
When the function returns to its caller, the arguments of the returning function are POPed from the stack using LIFO method.
In detail:
The rule is that each function call results in a creation of a stack frame (with the minimum being the address to return to). So, if funcA calls funcB and funcB calls funcC, three stack frames are set up one on top of the another. When a function returns, its frame becomes invalid. A well-behaved function acts only on its own stack frame and does not trespass on another's. In another words the POPing is performed to the stack frame on the top (when returning from the function).
The stack in your question is setup by caller foo. When doSomething and doAnotherThing are called, then they setup their own stack. The figure may help you to understand this:
Note that, to access the arguments, the function body will have to traverse down (higher addresses) from the location where the return address is stored, and to access the local variables, the function body will have to traverse up the stack (lower addresses) relative to the location where the return address is stored. In fact, typical compiler generated code for the function will do exactly this. The compiler dedicates a register called EBP for this (Base Pointer). Another name for the same is frame pointer. The compiler typically, as the first thing for the function body, pushes the current EBP value on to the stack and sets the EBP to the current ESP. This means, once this is done, in any part of the function code, argument 1 is EBP+8 away (4 bytes for each of caller's EBP and the return address), argument 2 is EBP+12(decimal) away, local variables are EBP-4n away.
.
.
.
[ebp - 4] (1st local variable)
[ebp] (old ebp value)
[ebp + 4] (return address)
[ebp + 8] (1st argument)
[ebp + 12] (2nd argument)
[ebp + 16] (3rd function argument)
Take a look at the following C code for the formation of stack frame of the function:
void MyFunction(int x, int y, int z)
{
int a, int b, int c;
...
}
When caller call it
MyFunction(10, 5, 2);
the following code will be generated
^
| call _MyFunction ; Equivalent to:
| ; push eip + 2
| ; jmp _MyFunction
| push 2 ; Push first argument
| push 5 ; Push second argument
| push 10 ; Push third argument
and the assembly code for the function will be (set-up by callee before returning)
^
| _MyFunction:
| sub esp, 12 ; sizeof(a) + sizeof(b) + sizeof(c)
| ;x = [ebp + 8], y = [ebp + 12], z = [ebp + 16]
| ;a = [ebp - 4] = [esp + 8], b = [ebp - 8] = [esp + 4], c = [ebp - 12] = [esp]
| mov ebp, esp
| push ebp
References:
Function Call Conventions and the Stack.
Frame Pointer and Local Variables.
x86 Disassembly/Functions and Stack Frames.
Like others noted, there is no need to pop parameters, until they go out of scope.
I will paste some example from "Pointers and Memory" by Nick Parlante.
I think the situation is a bit more simple than you envisioned.
Here is code:
void X()
{
int a = 1;
int b = 2;
// T1
Y(a);
// T3
Y(b);
// T5
}
void Y(int p)
{
int q;
q = p + 2;
// T2 (first time through), T4 (second time through)
}
The points in time T1, T2, etc. are marked in
the code and the state of memory at that time is shown in the drawing:
Different processors and languages use a few different stack designs. Two traditional patterns on both the 8x86 and 68000 are called the Pascal calling convention and the C calling convention; each convention is handled the same way in both processors, except for the names of the registers. Each uses two registers to manage the stack and associated variables, called the stack pointer (SP or A7) and the frame pointer (BP or A6).
When calling subroutine using either convention, any parameters are be pushed on the stack before calling the routine. The routine's code then pushes the current value of the frame pointer onto the stack, copies the current value of the stack pointer to the frame pointer, and subtracts from the stack pointer the number of bytes used by local variables [if any]. Once that is done, even if additional data are pushed onto the stack, all local variables will be stored at variables with a constant negative displacement from the stack pointer, and all parameters that were pushed on the stack by the caller may be accessed at a constant positive displacement from the frame pointer.
The difference between the two conventions lies in the way they handle an exit from subroutine. In the C convention, the returning function copies the frame pointer to the stack pointer [restoring it to the value it had just after the old frame pointer was pushed], pops the old frame pointer value, and performs a return. Any parameters the caller had pushed on the stack before the call will remain there. In the Pascal convention, after popping the old frame pointer, the processor pops the function return address, adds to the stack pointer the number of bytes of parameters pushed by the caller, and then goes to the popped return address. On the original 68000 it was necessary to use a 3-instruction sequence to remove the caller's parameters; the 8x86 and all 680x0 processors after the original included a "ret N" [or 680x0 equivalent] instruction which would add N to the stack pointer when performing a return.
The Pascal convention has the advantage of saving a little bit of code on the caller side, since the caller doesn't have to update the stack pointer after a function call. It requires, however, that the called function know exactly how many bytes worth of parameters the caller is going to put on the stack. Failing to push the proper number of parameters onto the stack before calling a function which uses the Pascal convention is almost guaranteed to cause a crash. This is offset, however, by the fact that a little extra code within each called method will save code at the places where the method is called. For that reason, most of the original Macintosh toolbox routines used the Pascal calling convention.
The C calling convention has the advantage of allowing routines to accept a variable number of parameters, and being robust even if a routine doesn't use all the parameters that are passed (the caller will know how many bytes worth of parameters it pushed, and will thus be able to clean them up). Further, it isn't necessary to perform stack cleanup after every function call. If a routine calls four functions in sequence, each of which used four bytes worth of parameters, it may--instead of using an ADD SP,4 after each call, use one ADD SP,16 after the last call to cleanup the parameters from all four calls.
Nowadays the described calling conventions are considered somewhat antiquated. Since compilers have gotten more efficient at register usage, it is common to have methods accept a few parameters in registers rather than requiring that all parameters be pushed on the stack; if a method can use registers to hold all the parameters and local variables, there's no need to use a frame pointer, and thus no need to save and restore the old one. Still, it's sometimes necessary to use the older calling conventions when calling libraries that was linked to use them.
There are already some really good answers here. However, if you are still concerned about the LIFO behavior of the stack, think of it as a stack of frames, rather than a stack of variables. What I mean to suggest is that, although a function may access variables that are not on the top of the stack, it is still only operating on the item at the top of the stack: a single stack frame.
Of course, there are exceptions to this. The local variables of the entire call chain are still allocated and available. But they won't be accessed directly. Instead, they are passed by reference (or by pointer, which is really only different semantically). In this case a local variable of a stack frame much further down can be accessed. But even in this case, the currently executing function is still only operating on its own local data. It is accessing a reference stored in its own stack frame, which may be a reference to something on the heap, in static memory, or further down the stack.
This is the part of the stack abstraction that makes functions callable in any order, and allows recursion. The top stack frame is the only object that is directly accessed by the code. Anything else is accessed indirectly (through a pointer that lives in the top stack frame).
It might be instructive to look at the assembly of your little program, especially if you compile without optimization. I think you will see that all of the memory access in your function happens through an offset from the stack frame pointer, which is the how the code for the function will be written by the compiler. In the case of a pass by reference, you would see indirect memory access instructions through a pointer that is stored at some offset from the stack frame pointer.
The call stack is not actually a stack data structure. Behind the scenes, the computers we use are implementations of the random access machine architecture. So, a and b can be directly accessed.
Behind the scenes, the machine does:
get "a" equals reading the value of the fourth element below stack top.
get "b" equals reading the value of the third element below stack top.
http://en.wikipedia.org/wiki/Random-access_machine
Here is a diagram I created for a call stack for a C++ program on Windows that uses the Windows x64 calling convention. It's more accurate and contemporary than the google image versions:
And corresponding to the exact structure of the above diagram, here is a debug of notepad.exe x64 on windows 7, where the first instruction of a function, 'current function' (because I forgot what function it is), is about to execute.
The low addresses and high addresses are swapped so the stack is climbing upwards in this diagram (it is a vertical flip of the first diagram, also note that the data is formatted to show quadwords and not bytes, so the little endianism cannot be seen). Black is the home space; blue is the return address, which is an offset into the caller function or label in the caller function to the instruction after the call; orange is the alignment; and pink is where rsp is pointing after the prologue of the function, or rather, before the call is made if you are using alloca. The homespace_for_the_next_function+return_address value is the smallest allowed frame on windows, and because the 16 byte rsp alignment right at the start of the called function must be maintained, it includes an 8 byte alignment as well, such that rsp pointing to the first byte after the return address will be aligned to 16 bytes (because rsp was guaranteed to be aligned to 16 bytes when the function was called and homespace+return_address = 40, which is not divisible by 16 so you need an extra 8 bytes to ensure the rsp will be aligned after the function makes a call). Because these functions do not require any stack locals (because they can be optimised into registers) or stack parameters/return values (as they fit in registers) and do not use any of the other fields, the stack frames in green are all alignment+homespace+return_address in size.
The red function lines outline what the callee function logically 'owns' + reads / modifies by value in the calling convention without needing a reference to it (it can modify a parameter passed on the stack that was too big to pass in a register on -Ofast), and is the classic conception of a stack frame. The green frames demarcate what results from the call and the allocation the called function makes: The first green frame shows what the RtlUserThreadStart actually allocates in the duration of the function call (from immediately before the call to executing the next call instruction) and goes from the first byte before the return address to the final byte allocated by the function prologue (or more if using alloca). RtlUserThreadStart allocates the return address itself as null, so you see a sub rsp, 48h and not sub rsp, 40h in the prologue, because there is no call to RtlUserThreadStart, it just begins execution at that rip at the base of the stack.
Stack space that is needed by the function is assigned in the function prologue by decrementing the stack pointer.
For example, take the following C++, and the MASM it compiles to (-O0).
typedef struct _struc {int a;} struc, pstruc;
int func(){return 1;}
int square(_struc num) {
int a=1;
int b=2;
int c=3;
return func();
}
_DATA SEGMENT
_DATA ENDS
int func(void) PROC ; func
mov eax, 1
ret 0
int func(void) ENDP ; func
a$ = 32 //4 bytes from rsp+32 to rsp+35
b$ = 36
c$ = 40
num$ = 64
//masm shows stack locals and params relative to the address of rsp; the rsp address
//is the rsp in the main body of the function after the prolog and before the epilog
int square(_struc) PROC ; square
$LN3:
mov DWORD PTR [rsp+8], ecx
sub rsp, 56 ; 00000038H
mov DWORD PTR a$[rsp], 1
mov DWORD PTR b$[rsp], 2
mov DWORD PTR c$[rsp], 3
call int func(void) ; func
add rsp, 56 ; 00000038H
ret 0
int square(_struc) ENDP ; square
As can be seen, 56 bytes are reserved, and the green stack frame will be 64 bytes in size when the call instruction allocates the 8 byte return address as well.
The 56 bytes consist of 12 bytes of locals, 32 bytes of home space, and 12 bytes of alignment.
All callee register saving and storing register parameters in the home space happens in the prologue before the prologue reserves (using sub rsp, x instruction) stack space needed by the main body of the function. The alignment is at the highest address of the space reserved by the sub rsp, x instruction, and the final local variable in the function is assigned at the next lower address after that (and within the assignment for that primitive data type itself it starts at the lowest address of that assignment and works towards the higher addresses, bytewise, because it is little endian), such that the first primitive type (array cell, variable etc.) in the function is at the top of the stack, although the locals can be allocated in any order. This is shown in the following diagram for a different random example code to the above, that does not call any functions (still using x64 Windows cc):
If you remove the call to func(), it only reserves 24 bytes, i.e. 12 bytes of of locals and 12 bytes of alignment. The alignment is at the start of the frame. When a function pushes something to the stack or reserves space on the stack by decrementing the rsp, rsp needs to be aligned, regardless of whether it is going to call another function or not. If the allocation of stack space can be optimised out and no homespace+return_addreess is required because the function does not make a call, then there will be no alignment requirement as rsp does not change. It also does not need to align if the stack will be aligned by 16 with just the locals (+ homespace+return_address if it makes a call) that it needs to allocate, essentially it rounds up the space it needs to allocate to a 16 byte boundary.
rbp is not used on the x64 Windows calling convention unless alloca is used.
On gcc 32 bit cdecl and 64 bit system V calling conventions, rbp is used, and the new rbp points to the first byte after the old rbp (only if compiling using -O0, because it is saved to the stack on -O0, otherwise, rbp will point to the first byte after the return address). On these calling conventions, if compiling using -O0, it will, after callee saved registers, store register parameters to the stack, and this will be relative to rbp and part of the stack reservation done by the rsp decrement. Data within the stack reservation done by the rsp decrement is accessed relative rbp rather than rsp, unlike Windows x64 cc. On the Windows x64 calling convention, it stores parameters that were passed to it in registers to the homespace that was assigned for it if it is a varargs function or compiling using -O0. If it is not a varargs function then on -O1, it will not write them to the homespace but the homespace will still be provided to it by the calling function, this means that it actually accesses those variables from the register rather from the homespace location on the stack after it stores it there, unlike O0 (which saves them to the homespace and then accesses them through the stack and not the registers).
If a function call is placed in the function represented by the previous diagram, the stack will now look like this before the callee function's prologue starts (Windows x64 cc):
Orange indicates the part that the callee can freely arrange (arrays and structs remain contiguous of course, and work their way towards higher addresses, each element being little endian), so it can put the variables and the return value allocation in any order, and it passes a pointer for the return value allocation in rcx for the callee to write to when the return type of the function it is calling cannot be passed in rax. On -O0, if the return value cannot be passed in rax, there is also an anonymous variable created (as well as the return value space and as well as any variable it is assigned to, so there can be 3 copies of the struct). -Ofast cant optimise out the return value space because it is return by value, but it optimises out the anonymous return variable if the return value is not used, or assigns it straight to the variable the return value is being assigned to without creating an anonymous variable, so -Ofast has 2 / 1 copies and -O0 has 3 / 2 copies (return value assigned to a variable / return value not assigned to a variable). Blue indicates the part the callee must provide in exact order for the calling convention of the callee (the parameters must be in that order, such that the first stack parameter from left to right in the function signature is at the top of the stack, which is the same as how cdecl (which is a 32 bit cc) orders its stack parameters. The alignment for the callee can however be in any location, although I've only ever seen it to be between the locals and callee pushed registers.
If the function calls multiple functions, the call is in the same place on the stack for all the different possible callsites in the function, this is because the prologue caters for the whole function, including all calls it makes, and the parameters and homespace for any called function is always at the end of the allocation made in the prologue.
It turns out that C/C++ Microsoft calling convention only passes a struct in the registers if it fits into one register, otherwise it copies the local / anonymous variable and passes a pointer to it in the first available register. On gcc C/C++, if the struct does not fit in the first 2 parameter registers then it's passed on the stack and a pointer to it is not passed because the callee knows where it is due to the calling convention.
Arrays are passed by reference regardless of their size. So if you need to use rcx as the pointer to the return value allocation then if the first parameter is an array, the pointer will be passed in rdx, which will be a pointer to the local variable that is being passed. In this case, it does not need to copy it to the stack as a parameter because it's not passed by value. The pointer however is passed on the stack when passing by reference if there are no registers available to pass the pointer in.