While coding in C, I came across the following situation.
int function ()
{
if (!somecondition) return false;
internalStructure *str1;
internalStructure *str2;
char *dataPointer;
float xyz;
/* do something here with the above local variables */
}
Considering the if statement in the above code can return from the function, I can declare the variables in two places.
Before the if statement.
After the if statement.
As a programmer, I would think to keep the variable declarations after the if statement.
Does the declaration place cost something? Or is there some other reason to prefer one way over the other?
In C99 and later (or with the common conforming extension to C89), you are free to mix statements and declarations.
Just as in earlier versions (only more so as compilers have grown smarter and more aggressive), the compiler decides how to allocate registers and stack space, and may perform any number of other optimizations under the as-if rule.
That means that, performance-wise, there is no reason to expect any difference.
Anyway, performance was not the reason this was allowed:
It was for restricting scope, and thus reducing the context a human must keep in mind when interpreting and verifying your code.
Do whatever makes sense, but current coding style recommends putting variable declarations as close to their usage as possible.
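For illustration, here is a minimal sketch (the names and values are invented) of what restricting scope buys you; the temporary is visible only in the block where it is needed:
#include <stdbool.h>
#include <stdio.h>

bool function(bool somecondition)
{
    if (!somecondition)
        return false;                /* early out: nothing below is ever touched */

    {
        int partial = 42;            /* declared right where it is needed */
        printf("partial = %d\n", partial);
    }                                /* 'partial' goes out of scope here */

    return true;                     /* a reader no longer has to track 'partial' */
}

int main(void)
{
    return function(true) ? 0 : 1;
}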
In reality, variable declarations after the first one are free on virtually every compiler. This is because virtually all processors manage their stack with a stack pointer (and possibly a frame pointer). For example, consider two functions:
int foo() {
int x;
return 5; // aren't we a silly little function now
}
int bar() {
int x;
int y;
return 5; // still wasting our time...
}
If I were to compile these on a modern compiler (and tell it not to be smart and optimize out my unused local variables), I'd see this (x86 assembly example; others are similar):
foo:
push ebp
mov ebp, esp
sub esp, 8 ; 1. this is the first line which is different between the two
mov eax, 5 ; this is how we return the value
add esp, 8 ; 2. this is the second line which is different between the two
ret
bar:
push ebp
mov ebp, esp
sub esp, 16 ; 1. this is the first line which is different between the two
mov eax, 5 ; this is how we return the value
add esp, 16 ; 2. this is the second line which is different between the two
ret
Note: both functions have the same number of opcodes!
This is because virtually all compilers will allocate all of the space they need up front (barring fancy things like alloca which are handled separately). In fact, on x64, it is mandatory that they do so in this efficient manner.
(Edit: As Forss pointed out, the compiler may optimize some of the local variables into registers. More technically, I should be arguing that the first variable to "spill over" into the stack costs 2 opcodes, and the rest are free.)
For the same reasons, compilers will collect all of the local variable declarations and allocate space for them right up front. C89 requires all declarations to be up front because the language was designed to be compilable in a single pass. For a one-pass C89 compiler to know how much space to allocate, it needed to know all of the variables before emitting the rest of the code. In modern language revisions, like C99, and in C++, compilers are expected to be much smarter than they were back in 1972, so this restriction was relaxed for developer convenience.
Modern coding practices suggest putting the variables close to their usage
This has nothing to do with compilers (which obviously could not care one way or another). It has been found that most human programmers read code better if the variables are put close to where they are used. This is just a style guide, so feel free to disagree with it, but there is a remarkable consensus amongst developers that this is the "right way."
Now for a few corner cases:
If you are using C++ with constructors, the compiler will allocate the space up front (since it's faster to do it that way, and doesn't hurt). However, the variable will not be constructed in that space until the correct location in the flow of the code. In some cases, this means putting the variables close to their use can even be faster than putting them up front... flow control might direct us around the variable declaration, in which case the constructor doesn't even need to be called.
alloca is handled on a layer above this. For those who are curious, alloca implementations tend to have the effect of moving the stack pointer down some arbitrary amount. Functions using alloca are required to keep track of this space in one way or another, and make sure the stack pointer gets re-adjusted upwards before leaving.
There may be a case where you usually need 16 bytes of stack space, but on one condition you need to allocate a local array of 50 kB. No matter where you put your variables in the code, virtually all compilers will allocate 50kB+16B of stack space every time the function gets called. This rarely matters, but in obsessively recursive code this could overflow the stack. You either have to move the code working with the 50 kB array into its own function, or use alloca (see the sketch after these notes).
Some platforms (ex: Windows) need a special function call in the prologue if you allocate more than a page worth of stack space. This should not change analysis very much at all (in implementation, it is a very fast leaf function that just pokes 1 word per page).
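To make the large-array corner case above concrete, here is a minimal sketch (names invented) of the "move it into its own function" approach; the 50 kB buffer only occupies stack space while the helper is executing:
#include <string.h>

/* hypothetical helper: the 50 kB array lives only in this function's frame */
static int process_big_case(void)
{
    char big_buffer[50 * 1024];              /* reserved only for this call */
    memset(big_buffer, 0, sizeof big_buffer);
    /* ... work with big_buffer ... */
    return big_buffer[0];
}

static int process(int rare_condition)
{
    int small_state = 16;                    /* the usual, small stack usage */
    if (rare_condition)
        return process_big_case();           /* the big frame exists only during this call */
    return small_state;
}

int main(void)
{
    return process(0);
}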
In C, I believe all variable declarations are treated as if they were at the top of the function; if you declare them in a block, I think it's just a scoping thing (I don't think it's the same in C++). The compiler will perform all optimizations on the variables, and some may even effectively disappear from the machine code at higher optimization levels. The compiler will then decide how much space the variables need, and later, during execution, the variables live in a region of memory known as the stack.
When a function is called, all of the variables that are used by your function are put on the stack, along with information about the function that is called (i.e. the return address, parameters, etc.). It doesn't matter where the variable was declared, just that it was declared - and it will be allocated onto the stack, regardless.
Declaring variables isn't "expensive," per se; if the compiler can see that a variable isn't really needed, it will probably remove it entirely.
Check this out:
Wikipedia on call stacks, Some other place on the stack
Of course, all of this is implementation-dependent and system-dependent.
Yes, it can cost clarity. If there is a case where the function must do nothing at all on some condition, (as when finding the global false, in your case), then placing the check at the top, where you show it above, is surely easier to understand - something that is essential while debugging and/or documenting.
It ultimately depends on the compiler but usually all locals are allocated at the beginning of the function.
However, the cost of allocating local variables is very small as they are put on the stack (or are put in a register after optimization).
Keep the declaration as close to where it's used as possible. Ideally inside nested blocks. So in this case it would make no sense to declare the variables above the if statement.
The best practice is to adopt a lazy approach, i.e., declare them only when you really need them ;) (and not before). It results in the following benefit:
Code is more readable if those variables are declared as near to the place of usage as possible.
If you have this
int function ()
{
{
sometype foo;
bool somecondition;
/* do something with foo and compute somecondition */
if (!somecondition) return false;
}
internalStructure *str1;
internalStructure *str2;
char *dataPointer;
float xyz;
/* do something here with the above local variables */
}
then the stack space reserved for foo and somecondition can obviously be reused for str1 etc., so by declaring after the if, you may save stack space. Depending on the optimization capabilities of the compiler, the saving of stack space may also take place if you flatten the function by removing the inner pair of braces or if you do declare str1 etc. before the if; however, this requires the compiler/optimizer to notice that the scopes do not "really" overlap. By positioning the declarations after the if you facilitate this behaviour even without optimization - not to mention the improved code readability.
Whenever you allocate local variables in a C scope (such as a function), they have no default initialization code (such as C++ constructors). And since they're not dynamically allocated (they're just uninitialized memory), no additional (and potentially expensive) functions need to be invoked (e.g. malloc) in order to prepare/allocate them.
Due to the way the stack works, allocating a stack variable simply means decrementing the stack pointer (i.e. increasing the stack size, because on most architectures, it grows downwards) in order to make room for it. From the CPU's perspective, this means executing a simple SUB instruction: SUB rsp, 4 (in case your variable is 4 bytes large--such as a regular 32-bit integer).
Moreover, when you declare multiple variables, your compiler is smart enough to actually group them together into one large SUB rsp, XX instruction, where XX is the total size of a scope's local variables. In theory. In practice, something a little different happens.
In situations like these, I find GCC explorer to be an invaluable tool when it comes to finding out (with tremendous ease) what happens "under the hood" of the compiler.
So let's take a look at what happens when you actually write a function like this: GCC explorer link.
C code
int function(int a, int b) {
int x, y, z, t;
if(a == 2) { return 15; }
x = 1;
y = 2;
z = 3;
t = 4;
return x + y + z + t + a + b;
}
Resulting assembly
function(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
cmp DWORD PTR [rbp-20], 2
jne .L2
mov eax, 15
jmp .L3
.L2:
-- snip --
.L3:
pop rbp
ret
As it turns out, GCC is even smarter than that. It doesn't even perform the SUB instruction at all to allocate the local variables. It just (internally) assumes that the space is "occupied", but doesn't add any instructions to update the stack pointer (e.g. SUB rsp, XX). This means that the stack pointer is not kept up to date but, since in this case no more PUSH instructions are performed (and no rsp-relative lookups) after the stack space is used, there's no issue (on x86-64 Linux this is made safe by the ABI's 128-byte "red zone" below rsp, which leaf functions may use without adjusting the stack pointer).
Here's an example where no additional variables are declared: http://goo.gl/3TV4hE
C code
int function(int a, int b) {
if(a == 2) { return 15; }
return a + b;
}
Resulting assembly
function(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
cmp DWORD PTR [rbp-4], 2
jne .L2
mov eax, 15
jmp .L3
.L2:
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
.L3:
pop rbp
ret
If you take a look at the code before the premature return (jmp .L3, which jumps to the cleanup and return code), no additional instructions are invoked to "prepare" the stack variables. The only difference is that the function parameters a and b, which arrive in the edi and esi registers, are stored to the stack at a higher address than in the first example ([rbp-4] and [rbp-8]). This is because no additional space has been "allocated" for the local variables as in the first example. So, as you can see, the only "overhead" for adding those local variables is a change in a subtraction term (i.e. not even adding an additional subtraction operation).
So, in your case, there is virtually no cost for simply declaring stack variables.
I prefer keeping the "early out" condition at the top of the function, in addition to documenting why we are doing it. If we put it after a bunch of variable declarations, someone not familiar with the code could easily miss it, unless they know they have to look for it.
Documenting the "early out" condition alone is not always sufficient; it is better to make it clear in the code as well. Putting the early-out condition at the top also makes it easier to keep the documentation in sync with the code, for instance if we later decide to remove the early-out condition, or to add more such conditions.
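As a small sketch (the flag name is invented) of the shape being recommended, the guard and the comment explaining it sit together at the very top of the function:
#include <stdbool.h>

static bool featureEnabled = false;   /* hypothetical flag controlling the early out */

int do_work(void)
{
    /* Early out: the whole function is a no-op when the feature is disabled.
       Keeping the check and this comment at the very top makes them hard to
       miss and easy to keep in sync if the condition later changes. */
    if (!featureEnabled)
        return false;

    /* ... the real work, with its locals declared below ... */
    return true;
}

int main(void)
{
    return do_work();
}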
If it actually mattered, the only way to avoid allocating the variables would likely be:
int function_unchecked();
int function ()
{
if (!someGlobalValue) return false;
return function_unchecked();
}
int function_unchecked() {
internalStructure *str1;
internalStructure *str2;
char *dataPointer;
float xyz;
/* do something here with the above local variables */
}
But in practice I think you'll find no performance benefit. If anything a minuscule overhead.
Of course if you were coding C++ and some of those local variables had non-trivial constructors you would probably need to place them after the check. But even then I don't think it would help to split the function.
If you declare the variables after the if statement and the function returns immediately, the compiler does not need to commit stack memory for them.
Related
I'm trying to get a deeper understanding of how the low level operations of programming languages work and especially how they interact with the OS/CPU. I've probably read every answer in every stack/heap related thread here on Stack Overflow, and they are all brilliant. But there is still one thing that I didn't fully understand yet.
Consider this function in pseudo code which tends to be valid Rust code ;-)
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(a, b);
doAnotherThing(c, d);
}
This is how I assume the stack to look like on line X:
Stack
a +-------------+
| 1 |
b +-------------+
| 2 |
c +-------------+
| 3 |
d +-------------+
| 4 |
+-------------+
Now, everything I've read about how the stack works is that it strictly obeys LIFO rules (last in, first out). Just like a stack datatype in .NET, Java or any other programming language.
But if that's the case, then what happens after line X? Because obviously, the next thing we need is to work with a and b, but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot, because it needs c and d in the next line.
So, I wonder what exactly happens behind the scenes?
Another related question. Consider we pass a reference to one of the other functions like this:
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(&a, &b);
doAnotherThing(c, d);
}
From how I understand things, this would mean that the parameters in doSomething essentially point to the same memory addresses as a and b in foo. But then again, this means that there is no popping of the stack down to a and b happening.
Those two cases make me think that I haven't fully grasped how exactly the stack works and how it strictly follows the LIFO rules.
The call stack could also be called a frame stack.
The things that are stacked after the LIFO principle are not the local variables but the entire stack frames ("calls") of the functions being called. The local variables are pushed and popped together with those frames in the so-called function prologue and epilogue, respectively.
Inside the frame the order of the variables is completely unspecified; Compilers "reorder" the positions of local variables inside a frame appropriately to optimize their alignment so the processor can fetch them as quickly as possible. The crucial fact is that the offset of the variables relative to some fixed address is constant throughout the lifetime of the frame - so it suffices to take an anchor address, say, the address of the frame itself, and work with offsets of that address to the variables. Such an anchor address is actually contained in the so-called base or frame pointer which is stored in the EBP register. The offsets, on the other hand, are clearly known at compile time and are therefore hardcoded into the machine code.
This graphic from Wikipedia shows how a typical call stack is structured1:
Add the offset of a variable we want to access to the address contained in the frame pointer and we get the address of our variable. So shortly said, the code just accesses them directly via constant compile-time offsets from the base pointer; It's simple pointer arithmetic.
Example
#include <iostream>
int main()
{
char c = std::cin.get();
std::cout << c;
}
gcc.godbolt.org gives us
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >::get()
movb %al, -1(%rbp)
movsbl -1(%rbp), %eax
movl %eax, %esi
movl std::cout, %edi
call [... the insertion operator for char, long thing... ]
movl $0, %eax
leave
ret
.. for main. I divided the code into three subsections.
The function prologue consists of the first three operations:
Base pointer is pushed onto the stack.
The stack pointer is saved in the base pointer.
A value is subtracted from the stack pointer to make room for local variables.
Then cin is moved into the EDI register2 and get is called; The return value is in EAX.
So far so good. Now the interesting thing happens:
The low-order byte of EAX, designated by the 8-bit register AL, is taken and stored in the byte right after the base pointer: That is -1(%rbp), the offset of the base pointer is -1. This byte is our variable c. The offset is negative because the stack grows downwards on x86. The next operation stores c in EAX: EAX is moved to ESI, cout is moved to EDI and then the insertion operator is called with cout and c being the arguments.
Finally,
The return value of main is stored in EAX: 0. That is because of the implicit return statement.
You might also see xorl %eax, %eax instead of the movl.
leave and return to the call site. leave abbreviates this epilogue and implicitly
Replaces the stack pointer with the base pointer and
Pops the base pointer.
After this operation and ret have been performed, the frame has effectively been popped, although the caller still has to clean up the arguments as we're using the cdecl calling convention. Other conventions, e.g. stdcall, require the callee to tidy up, e.g. by passing the amount of bytes to ret.
Frame Pointer Omission
It is also possible not to use offsets from the base/frame pointer but from the stack pointer (ESP) instead. This makes the EBP register, which would otherwise contain the frame pointer value, available for arbitrary use - but it can make debugging impossible on some machines, and will be implicitly turned off for some functions. It is particularly useful when compiling for processors with only a few registers, including x86.
This optimization is known as FPO (frame pointer omission) and is enabled by -fomit-frame-pointer in GCC and Clang, and by /Oy in MSVC; note that it is implicitly triggered by every optimization level > 0 if and only if debugging is still possible, since it doesn't have any costs apart from that.
For further information see here and here.
1 As pointed out in the comments, the frame pointer is presumably meant to point to the address after the return address.
2 Note that the registers that start with R are the 64-bit counterparts of the ones that start with E. EAX designates the four low-order bytes of RAX. I used the names of the 32-bit registers for clarity.
Because obviously, the next thing we need is to work with a and b but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot because it needs c and d in the next line.
In short:
There is no need to pop the arguments. The arguments passed by caller foo to function doSomething and the local variables in doSomething can all be referenced as an offset from the base pointer.
So,
When a function call is made, the function's arguments are PUSHed onto the stack. These arguments are then referenced relative to the base pointer.
When the function returns to its caller, the arguments of the returning function are POPped from the stack, in LIFO order.
In detail:
The rule is that each function call results in the creation of a stack frame (with the minimum being the address to return to). So, if funcA calls funcB and funcB calls funcC, three stack frames are set up one on top of another. When a function returns, its frame becomes invalid. A well-behaved function acts only on its own stack frame and does not trespass on another's. In other words, POPping is performed only on the stack frame at the top (when returning from the function).
The stack in your question is set up by the caller foo. When doSomething and doAnotherThing are called, they set up their own frames. The figure may help you to understand this:
Note that, to access the arguments, the function body will have to traverse down (higher addresses) from the location where the return address is stored, and to access the local variables, the function body will have to traverse up the stack (lower addresses) relative to the location where the return address is stored. In fact, typical compiler generated code for the function will do exactly this. The compiler dedicates a register called EBP for this (Base Pointer). Another name for the same is frame pointer. The compiler typically, as the first thing for the function body, pushes the current EBP value on to the stack and sets the EBP to the current ESP. This means, once this is done, in any part of the function code, argument 1 is EBP+8 away (4 bytes for each of caller's EBP and the return address), argument 2 is EBP+12(decimal) away, local variables are EBP-4n away.
.
.
.
[ebp - 4] (1st local variable)
[ebp] (old ebp value)
[ebp + 4] (return address)
[ebp + 8] (1st argument)
[ebp + 12] (2nd argument)
[ebp + 16] (3rd function argument)
Take a look at the following C code for the formation of the stack frame of a function:
void MyFunction(int x, int y, int z)
{
int a, b, c;
...
}
When the caller calls it
MyFunction(10, 5, 2);
the following code will be generated
^
| call _MyFunction ; Equivalent to:
| ; push eip + 2
| ; jmp _MyFunction
| push 2 ; Push first argument
| push 5 ; Push second argument
| push 10 ; Push third argument
and the assembly code for the function will be (set up by the callee before returning)
^
| _MyFunction:
| sub esp, 12 ; sizeof(a) + sizeof(b) + sizeof(c)
| ;x = [ebp + 8], y = [ebp + 12], z = [ebp + 16]
| ;a = [ebp - 4] = [esp + 8], b = [ebp - 8] = [esp + 4], c = [ebp - 12] = [esp]
| mov ebp, esp
| push ebp
References:
Function Call Conventions and the Stack.
Frame Pointer and Local Variables.
x86 Disassembly/Functions and Stack Frames.
Like others noted, there is no need to pop parameters, until they go out of scope.
I will paste some example from "Pointers and Memory" by Nick Parlante.
I think the situation is a bit more simple than you envisioned.
Here is code:
void X()
{
int a = 1;
int b = 2;
// T1
Y(a);
// T3
Y(b);
// T5
}
void Y(int p)
{
int q;
q = p + 2;
// T2 (first time through), T4 (second time through)
}
The points in time T1, T2, etc. are marked in
the code and the state of memory at that time is shown in the drawing:
Different processors and languages use a few different stack designs. Two traditional patterns on both the 8x86 and 68000 are called the Pascal calling convention and the C calling convention; each convention is handled the same way in both processors, except for the names of the registers. Each uses two registers to manage the stack and associated variables, called the stack pointer (SP or A7) and the frame pointer (BP or A6).
When calling a subroutine using either convention, any parameters are pushed onto the stack before calling the routine. The routine's code then pushes the current value of the frame pointer onto the stack, copies the current value of the stack pointer into the frame pointer, and subtracts from the stack pointer the number of bytes used by local variables [if any]. Once that is done, even if additional data are pushed onto the stack, all local variables will be stored at a constant negative displacement from the frame pointer, and all parameters that were pushed on the stack by the caller may be accessed at a constant positive displacement from the frame pointer.
The difference between the two conventions lies in the way they handle an exit from subroutine. In the C convention, the returning function copies the frame pointer to the stack pointer [restoring it to the value it had just after the old frame pointer was pushed], pops the old frame pointer value, and performs a return. Any parameters the caller had pushed on the stack before the call will remain there. In the Pascal convention, after popping the old frame pointer, the processor pops the function return address, adds to the stack pointer the number of bytes of parameters pushed by the caller, and then goes to the popped return address. On the original 68000 it was necessary to use a 3-instruction sequence to remove the caller's parameters; the 8x86 and all 680x0 processors after the original included a "ret N" [or 680x0 equivalent] instruction which would add N to the stack pointer when performing a return.
The Pascal convention has the advantage of saving a little bit of code on the caller side, since the caller doesn't have to update the stack pointer after a function call. It requires, however, that the called function know exactly how many bytes worth of parameters the caller is going to put on the stack. Failing to push the proper number of parameters onto the stack before calling a function which uses the Pascal convention is almost guaranteed to cause a crash. This is offset, however, by the fact that a little extra code within each called method will save code at the places where the method is called. For that reason, most of the original Macintosh toolbox routines used the Pascal calling convention.
The C calling convention has the advantage of allowing routines to accept a variable number of parameters, and being robust even if a routine doesn't use all the parameters that are passed (the caller will know how many bytes worth of parameters it pushed, and will thus be able to clean them up). Further, it isn't necessary to perform stack cleanup after every function call. If a routine calls four functions in sequence, each of which used four bytes worth of parameters, it may--instead of using an ADD SP,4 after each call, use one ADD SP,16 after the last call to cleanup the parameters from all four calls.
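To make the variable-parameter point concrete, here is a small C sketch (the function name is invented); it relies on the caller-cleans-up property described above, via the standard <stdarg.h> machinery:
#include <stdarg.h>
#include <stdio.h>

/* sums 'count' ints passed as variadic arguments */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);          /* fetch the next argument */
    va_end(ap);
    return total;
}

int main(void)
{
    printf("%d\n", sum_ints(3, 10, 20, 12));   /* prints 42 */
    printf("%d\n", sum_ints(2, 1, 2));         /* prints 3 */
    return 0;
}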
Nowadays the described calling conventions are considered somewhat antiquated. Since compilers have gotten more efficient at register usage, it is common to have methods accept a few parameters in registers rather than requiring that all parameters be pushed on the stack; if a method can use registers to hold all the parameters and local variables, there's no need to use a frame pointer, and thus no need to save and restore the old one. Still, it's sometimes necessary to use the older calling conventions when calling libraries that were linked to use them.
There are already some really good answers here. However, if you are still concerned about the LIFO behavior of the stack, think of it as a stack of frames, rather than a stack of variables. What I mean to suggest is that, although a function may access variables that are not on the top of the stack, it is still only operating on the item at the top of the stack: a single stack frame.
Of course, there are exceptions to this. The local variables of the entire call chain are still allocated and available. But they won't be accessed directly. Instead, they are passed by reference (or by pointer, which is really only different semantically). In this case a local variable of a stack frame much further down can be accessed. But even in this case, the currently executing function is still only operating on its own local data. It is accessing a reference stored in its own stack frame, which may be a reference to something on the heap, in static memory, or further down the stack.
This is the part of the stack abstraction that makes functions callable in any order, and allows recursion. The top stack frame is the only object that is directly accessed by the code. Anything else is accessed indirectly (through a pointer that lives in the top stack frame).
It might be instructive to look at the assembly of your little program, especially if you compile without optimization. I think you will see that all of the memory access in your function happens through an offset from the stack frame pointer, which is how the compiler writes the code for the function. In the case of a pass by reference, you would see indirect memory access instructions through a pointer that is stored at some offset from the stack frame pointer.
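A tiny C sketch (names invented) of the indirection just described: the innermost function touches a local of main's frame, but only through a pointer stored in its own frame:
#include <stdio.h>

static void increment(int *p)
{
    *p += 1;                 /* indirect access through a pointer in this frame */
}

static void middle(int *p)
{
    increment(p);            /* just forwards the pointer one frame further down */
}

int main(void)
{
    int counter = 0;         /* lives in main's stack frame */
    middle(&counter);        /* deeper frames reach it only via the pointer */
    printf("counter = %d\n", counter);   /* prints 1 */
    return 0;
}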
The call stack is not actually a stack data structure. Behind the scenes, the computers we use are implementations of the random access machine architecture. So, a and b can be directly accessed.
Behind the scenes, the machine simply does the following:
getting "a" amounts to reading the value of the fourth element below the stack top;
getting "b" amounts to reading the value of the third element below the stack top.
http://en.wikipedia.org/wiki/Random-access_machine
Here is a diagram I created for a call stack for a C++ program on Windows that uses the Windows x64 calling convention. It's more accurate and contemporary than the google image versions:
And corresponding to the exact structure of the above diagram, here is a debug session of notepad.exe x64 on Windows 7, where the first instruction of a function, 'current function' (because I forgot which function it is), is about to execute.
The low addresses and high addresses are swapped so the stack is climbing upwards in this diagram (it is a vertical flip of the first diagram, also note that the data is formatted to show quadwords and not bytes, so the little endianism cannot be seen). Black is the home space; blue is the return address, which is an offset into the caller function or label in the caller function to the instruction after the call; orange is the alignment; and pink is where rsp is pointing after the prologue of the function, or rather, before the call is made if you are using alloca. The homespace_for_the_next_function+return_address value is the smallest allowed frame on windows, and because the 16 byte rsp alignment right at the start of the called function must be maintained, it includes an 8 byte alignment as well, such that rsp pointing to the first byte after the return address will be aligned to 16 bytes (because rsp was guaranteed to be aligned to 16 bytes when the function was called and homespace+return_address = 40, which is not divisible by 16 so you need an extra 8 bytes to ensure the rsp will be aligned after the function makes a call). Because these functions do not require any stack locals (because they can be optimised into registers) or stack parameters/return values (as they fit in registers) and do not use any of the other fields, the stack frames in green are all alignment+homespace+return_address in size.
The red function lines outline what the callee function logically 'owns' + reads / modifies by value in the calling convention without needing a reference to it (it can modify a parameter passed on the stack that was too big to pass in a register on -Ofast), and is the classic conception of a stack frame. The green frames demarcate what results from the call and the allocation the called function makes: The first green frame shows what the RtlUserThreadStart actually allocates in the duration of the function call (from immediately before the call to executing the next call instruction) and goes from the first byte before the return address to the final byte allocated by the function prologue (or more if using alloca). RtlUserThreadStart allocates the return address itself as null, so you see a sub rsp, 48h and not sub rsp, 40h in the prologue, because there is no call to RtlUserThreadStart, it just begins execution at that rip at the base of the stack.
Stack space that is needed by the function is assigned in the function prologue by decrementing the stack pointer.
For example, take the following C++, and the MASM it compiles to (-O0).
typedef struct _struc {int a;} struc, pstruc;
int func(){return 1;}
int square(_struc num) {
int a=1;
int b=2;
int c=3;
return func();
}
_DATA SEGMENT
_DATA ENDS
int func(void) PROC ; func
mov eax, 1
ret 0
int func(void) ENDP ; func
a$ = 32 //4 bytes from rsp+32 to rsp+35
b$ = 36
c$ = 40
num$ = 64
//masm shows stack locals and params relative to the address of rsp; the rsp address
//is the rsp in the main body of the function after the prolog and before the epilog
int square(_struc) PROC ; square
$LN3:
mov DWORD PTR [rsp+8], ecx
sub rsp, 56 ; 00000038H
mov DWORD PTR a$[rsp], 1
mov DWORD PTR b$[rsp], 2
mov DWORD PTR c$[rsp], 3
call int func(void) ; func
add rsp, 56 ; 00000038H
ret 0
int square(_struc) ENDP ; square
As can be seen, 56 bytes are reserved, and the green stack frame will be 64 bytes in size when the call instruction allocates the 8 byte return address as well.
The 56 bytes consist of 12 bytes of locals, 32 bytes of home space, and 12 bytes of alignment.
All callee register saving and storing register parameters in the home space happens in the prologue before the prologue reserves (using sub rsp, x instruction) stack space needed by the main body of the function. The alignment is at the highest address of the space reserved by the sub rsp, x instruction, and the final local variable in the function is assigned at the next lower address after that (and within the assignment for that primitive data type itself it starts at the lowest address of that assignment and works towards the higher addresses, bytewise, because it is little endian), such that the first primitive type (array cell, variable etc.) in the function is at the top of the stack, although the locals can be allocated in any order. This is shown in the following diagram for a different random example code to the above, that does not call any functions (still using x64 Windows cc):
If you remove the call to func(), it only reserves 24 bytes, i.e. 12 bytes of locals and 12 bytes of alignment. The alignment is at the start of the frame. When a function pushes something to the stack or reserves space on the stack by decrementing the rsp, rsp needs to be aligned, regardless of whether it is going to call another function or not. If the allocation of stack space can be optimised out and no homespace+return_address is required because the function does not make a call, then there will be no alignment requirement as rsp does not change. It also does not need to align if the stack will be aligned by 16 with just the locals (+ homespace+return_address if it makes a call) that it needs to allocate; essentially, it rounds up the space it needs to allocate to a 16 byte boundary.
rbp is not used on the x64 Windows calling convention unless alloca is used.
On the gcc 32-bit cdecl and 64-bit System V calling conventions, rbp is used, and the new rbp points to the first byte after the old rbp (only if compiling using -O0, because it is saved to the stack on -O0; otherwise, rbp will point to the first byte after the return address). On these calling conventions, if compiling using -O0, it will, after callee-saved registers, store register parameters to the stack, and this will be relative to rbp and part of the stack reservation done by the rsp decrement. Data within the stack reservation done by the rsp decrement is accessed relative to rbp rather than rsp, unlike the Windows x64 cc. On the Windows x64 calling convention, it stores parameters that were passed to it in registers to the homespace that was assigned for it if it is a varargs function or compiling using -O0. If it is not a varargs function then on -O1, it will not write them to the homespace but the homespace will still be provided to it by the calling function; this means that it actually accesses those variables from the register rather than from the homespace location on the stack after it stores it there, unlike -O0 (which saves them to the homespace and then accesses them through the stack and not the registers).
If a function call is placed in the function represented by the previous diagram, the stack will now look like this before the callee function's prologue starts (Windows x64 cc):
Orange indicates the part that the callee can freely arrange (arrays and structs remain contiguous of course, and work their way towards higher addresses, each element being little endian), so it can put the variables and the return value allocation in any order, and it passes a pointer for the return value allocation in rcx for the callee to write to when the return type of the function it is calling cannot be passed in rax. On -O0, if the return value cannot be passed in rax, there is also an anonymous variable created (as well as the return value space and as well as any variable it is assigned to, so there can be 3 copies of the struct). -Ofast can't optimise out the return value space because it is returned by value, but it optimises out the anonymous return variable if the return value is not used, or assigns it straight to the variable the return value is being assigned to without creating an anonymous variable, so -Ofast has 2 / 1 copies and -O0 has 3 / 2 copies (return value assigned to a variable / return value not assigned to a variable). Blue indicates the part the callee must provide in exact order for the calling convention of the callee (the parameters must be in that order, such that the first stack parameter from left to right in the function signature is at the top of the stack, which is the same as how cdecl (which is a 32 bit cc) orders its stack parameters). The alignment for the callee can however be in any location, although I've only ever seen it to be between the locals and callee pushed registers.
If the function calls multiple functions, the call is in the same place on the stack for all the different possible callsites in the function, this is because the prologue caters for the whole function, including all calls it makes, and the parameters and homespace for any called function is always at the end of the allocation made in the prologue.
It turns out that the C/C++ Microsoft calling convention only passes a struct in the registers if it fits into one register; otherwise it copies the local / anonymous variable and passes a pointer to it in the first available register. On gcc C/C++, if the struct does not fit in the first 2 parameter registers then it's passed on the stack and a pointer to it is not passed because the callee knows where it is due to the calling convention.
Arrays are passed by reference regardless of their size. So if you need to use rcx as the pointer to the return value allocation then if the first parameter is an array, the pointer will be passed in rdx, which will be a pointer to the local variable that is being passed. In this case, it does not need to copy it to the stack as a parameter because it's not passed by value. The pointer however is passed on the stack when passing by reference if there are no registers available to pass the pointer in.
I was wondering if there would be a convenient way to copy the current stack frame, move it somewhere else, and then 'return' from the function, from the new location?
I have been playing around with setjmp and longjmp while allocating large arrays on the stack to force the stack pointer away. I am familiar with the calling conventions and where arguments to functions end up etc, but I am not extremely experienced with pointer arithmetic.
To describe the end goal in general terms; The ambition is to be able to allocate stack frames and to jump to another stack frame when I call a function (we can call this function switch). Before I jump to the new stack frame, however, I'd like to be able to grab the return address from switch so when I've (presumably) longjmpd to the new frame, I'd be able to return to the position that initiated the context switch.
I've already gotten some inspiration on how to imitate coroutines using longjmp and setjmp from this post.
If this is possible, it would be a component of my current research, where I am trying to implement a (very rough) proof of concept extension in a compiler. I'd appreciate answers and comments that address the question posed in my first paragraph, only.
Update
To try and make my intention clearer, I wrote up this example in C. It needs to be compiled with -fno-stack-protector. What I want is for the local variables a and b in main to not be next to each other on the stack (1), but rather be separated by a distance specified by the buffer in call. Furthermore, currently this code will return to main twice, while I only want it to do so once (2). I suggest you read the procedures in this order: main, call and change.
If anyone could answer any of the two question posed in the paragraph above, I would be immensely grateful. It does not have to be pretty or portable.
Again, I'd prefer answers to my questions rather than suggestions of better ways to go about things.
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
jmp_buf* buf;
long* retaddr;
int change(void) {
// local variable to use when computing offsets
long a[0];
for(int i = 0; i < 5; i++) a[i]; // same as below, not sure why I need to read this
// save this context
if(setjmp(*buf) == 0) {
return 1;
}
// the following code runs when longjmp was called with *buf
// overwrite this contexts return address with the one used by call
a[2] = *retaddr;
// return, hopefully now to main
return 1;
}
static void* retain;
int call() {
buf = (jmp_buf*)malloc(sizeof(jmp_buf));
retaddr = (long*) malloc(sizeof(long));
long a[0];
for(int i = 0; i < 5; i++) a[i]; // not sure why I need to do this. a[2] reads (nil) otherwise
// store return address
*retaddr = a[2];
// allocate local variables to move the stackpointer
char n[1024];
retain = n; // maybe cheat the optimiser?
// get a jmp_buf from another context
change();
// jump there
longjmp(*buf, 1);
}
// It returns to main twice, I am not sure why
int main(void) {
char a;
call(); // this function should move stackpointer (in this case, 1024 bytes)
char b;
printf("address of a: %p\n", &a);
printf("address of b: %p\n", &b);
return 1;
}
This is possible, it is what multi-tasking schedulers do, e.g. in embedded environments.
It is, however, extremely environment-specific: you would have to dig into the specifics of the processor it is running on.
Basically, the possible steps are:
Determine the registers which contain the needed information. Pick them by what you need, they are probably different from what the compiler uses on the stack for implementing function calls.
Find out how their content can be stored (most likely specific assembler instructions for each register).
Use them to store all contents contiguously.
The place to do so is probably allocated already, inside the object describing and administrating the current task.
Consider not using a return address. Instead, when done with the "inserted" task, decide among the multiple task datasets which describe potential tasks to return to. That is the core of scheduling. If the return address is known in advance, then it is very similar to normal function calling. I.e. the idea is to potentially return to a different task than the last one left. That is also the reason why tasks need their own stack in many cases.
By the way, I don't think that pointer arithmetic is the most relevant tool here.
The contents of the registers which make up the stack frame are in registers, not anywhere in memory that a pointer can point to. (At least in most current systems, C64 staying out of this....)
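As a rough, platform-dependent illustration of the "separate stack per task" idea described above (not an answer to the setjmp/longjmp specifics asked about), the POSIX <ucontext.h> functions package exactly that register save/restore plus stack switch; a minimal sketch, assuming a system that still ships these functions:
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];        /* the separate stack for the task */

static void task(void)
{
    puts("running on the task's own stack");
    /* returning from task() resumes uc_link, i.e. main_ctx */
}

int main(void)
{
    getcontext(&task_ctx);                 /* initialise the context structure */
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;          /* where to continue when task() returns */
    makecontext(&task_ctx, task, 0);

    swapcontext(&main_ctx, &task_ctx);     /* save registers here, jump to the task */
    puts("back on the original stack");
    return 0;
}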
tl;dr - no.
(On every compiler worth considering): The compiler knows the address of local variables by their offset from either the sp, or a designated saved stack pointer, the frame or base pointer. a might have an address of (sp+1), and b might have an address of (sp+0). If you manage to successfully return to main with the stack pointer lowered by 1024, these will still be known as (sp+1), (sp+0); although they are technically now (sp+1-1024), (sp+0-1024), which means they are no longer a & b.
You could design a language which fixed the local allocation in the way you consider, and that might have some interesting expressiveness, but it isn't C. I doubt any existing compiler could come up with a consistent handling of this. To do so, when it encountered:
char a;
it would have to make an alias of this address at the point it encountered it; say:
add %sp, $0, %r1
sub %sp, $1, %sp
and when it encountered
char b;
add %sp, $0, %r2
sub %sp, $1, %sp
and so on, but once it runs out of free registers, it needs to spill them on the stack; and because it considers the stack to change without notice, it would have to allocate a pointer to this spill area, and keep that stored in a register.
Btw, this is not far removed from the concept of a split stack (golang uses these), but generally the granularity is at a function or method boundary, not between two variable definitions.
Interesting idea though.
When allocating an int as well as a large array on the stack in C, the program executes without error. If, however, I initialize the variable on the stack beforehand, it crashes with a segfault (probably because the stack size was exceeded by the large array). If the variable were initialized after declaring the array, this would make sense to me.
What causes this behavior, memory wise?
I was under the impression, that by simply declaring a variable on the stack, the needed space would be allocated, leading to an immediate crash when allocating very large datatypes.
My suspicion is that it has something to do with the compiler optimizing it away, but it does not make sense, considering I am not changing foo in the second example either.
I am using gcc 7.2.0 to compile, without any flags set. Executed on Ubuntu 17.10.
This runs without errors:
int main(){
int i;
unsigned char foo [1024*1024*1024];
return 0;
}
while this crashes immediately:
int main(){
int i = 0;
unsigned char foo [1024*1024*1024];
return 0;
}
Can somebody give me some insight what is happening here?
Note: What follows are implementation details. The C standard does not cover this.
The crash is not caused by allocating space. The crash is caused by writing to pages which are not writable, or reading from pages which are not readable.
You can see that a declaration doesn't necessarily need to read or write any memory:
int i;
But if it is initialized, you have to write the value:
int i = 0;
This triggers the crash. Note that the exact behavior will depend on the compiler you use and the optimization settings you have. Different compilers will allocate variables in different ways, and an optimizing compiler will normally remove both i and foo from the function entirely, since they aren't needed. Some compilers will also initialize variables to garbage values under certain configurations, to aid with debugging.
Allocating stack space just involves changing the stack pointer, which is a register. If you allocate too much stack space, the stack pointer will point to an invalid region of memory, and the program will segfault when it tries to read or write to those addresses. Most operating systems have “guard pages” so valid memory will not be placed next to the stack, ensuring that the program successfully crashes in most scenarios.
Here is some output from Godbolt:
main:
push rbp
mov rbp, rsp
sub rsp, 1073741720 ; allocate space for locals
mov DWORD PTR [rbp-4], 0 ; initialize i = 0
mov eax, 0 ; return value = 0
leave
ret
Note that this version does not crash, because i is placed at the top of the stack (which grows downwards). If i is placed at the bottom of the stack, this will likely crash. The compiler is free to put the variables on the stack in any order, so whether it actually crashes will depend heavily on the specific compiler you are using.
You can also see more clearly why the allocation won't crash:
; Just an integer subtraction. Why would it crash?
sub rsp 1073741720
I'm trying to get a deeper understanding of how the low level operations of programming languages work and especially how they interact with the OS/CPU. I've probably read every answer in every stack/heap related thread here on Stack Overflow, and they are all brilliant. But there is still one thing that I didn't fully understand yet.
Consider this function in pseudo code which tends to be valid Rust code ;-)
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(a, b);
doAnotherThing(c, d);
}
This is how I assume the stack to look like on line X:
Stack
a +-------------+
| 1 |
b +-------------+
| 2 |
c +-------------+
| 3 |
d +-------------+
| 4 |
+-------------+
Now, everything I've read about how the stack works is that it strictly obeys LIFO rules (last in, first out). Just like a stack datatype in .NET, Java or any other programming language.
But if that's the case, then what happens after line X? Because obviously, the next thing we need is to work with a and b, but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot, because it needs c and d in the next line.
So, I wonder what exactly happens behind the scenes?
Another related question. Consider we pass a reference to one of the other functions like this:
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(&a, &b);
doAnotherThing(c, d);
}
From how I understand things, this would mean that the parameters in doSomething are essentially pointing to the same memory address like a and b in foo. But then again this means that there is no pop up the stack until we get to a and b happening.
Those two cases make me think that I haven't fully grasped how exactly the stack works and how it strictly follows the LIFO rules.
The call stack could also be called a frame stack.
The things that are stacked after the LIFO principle are not the local variables but the entire stack frames ("calls") of the functions being called. The local variables are pushed and popped together with those frames in the so-called function prologue and epilogue, respectively.
Inside the frame the order of the variables is completely unspecified; Compilers "reorder" the positions of local variables inside a frame appropriately to optimize their alignment so the processor can fetch them as quickly as possible. The crucial fact is that the offset of the variables relative to some fixed address is constant throughout the lifetime of the frame - so it suffices to take an anchor address, say, the address of the frame itself, and work with offsets of that address to the variables. Such an anchor address is actually contained in the so-called base or frame pointer which is stored in the EBP register. The offsets, on the other hand, are clearly known at compile time and are therefore hardcoded into the machine code.
This graphic from Wikipedia shows what the typical call stack is structured like1:
Add the offset of a variable we want to access to the address contained in the frame pointer and we get the address of our variable. So shortly said, the code just accesses them directly via constant compile-time offsets from the base pointer; It's simple pointer arithmetic.
Example
#include <iostream>
int main()
{
char c = std::cin.get();
std::cout << c;
}
gcc.godbolt.org gives us
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >::get()
movb %al, -1(%rbp)
movsbl -1(%rbp), %eax
movl %eax, %esi
movl std::cout, %edi
call [... the insertion operator for char, long thing... ]
movl $0, %eax
leave
ret
.. for main. I divided the code into three subsections.
The function prologue consists of the first three operations:
Base pointer is pushed onto the stack.
The stack pointer is saved in the base pointer
The stack pointer is subtracted to make room for local variables.
Then cin is moved into the EDI register2 and get is called; The return value is in EAX.
So far so good. Now the interesting thing happens:
The low-order byte of EAX, designated by the 8-bit register AL, is taken and stored in the byte right after the base pointer: That is -1(%rbp), the offset of the base pointer is -1. This byte is our variable c. The offset is negative because the stack grows downwards on x86. The next operation stores c in EAX: EAX is moved to ESI, cout is moved to EDI and then the insertion operator is called with cout and c being the arguments.
Finally,
The return value of main is stored in EAX: 0. That is because of the implicit return statement.
You might also see xorl rax rax instead of movl.
leave and return to the call site. leave is abbreviating this epilogue and implicitly
Replaces the stack pointer with the base pointer and
Pops the base pointer.
After this operation and ret have been performed, the frame has effectively been popped, although the caller still has to clean up the arguments as we're using the cdecl calling convention. Other conventions, e.g. stdcall, require the callee to tidy up, e.g. by passing the amount of bytes to ret.
Frame Pointer Omission
It is also possible not to use offsets from the base/frame pointer but from the stack pointer (ESB) instead. This makes the EBP-register that would otherwise contain the frame pointer value available for arbitrary use - but it can make debugging impossible on some machines, and will be implicitly turned off for some functions. It is particularly useful when compiling for processors with only few registers, including x86.
This optimization is known as FPO (frame pointer omission) and set by -fomit-frame-pointer in GCC and -Oy in Clang; note that it is implicitly triggered by every optimization level > 0 if and only if debugging is still possible, since it doesn't have any costs apart from that.
For further information see here and here.
1 As pointed out in the comments, the frame pointer is presumably meant to point to the address after the return address.
2 Note that the registers that start with R are the 64-bit counterparts of the ones that start with E. EAX designates the four low-order bytes of RAX. I used the names of the 32-bit registers for clarity.
Because obviously, the next thing we need is to work with a and b but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot because it needs c and d in the next line.
In short:
There is no need to pop the arguments. The arguments passed by caller foo to function doSomething and the local variables in doSomething can all be referenced as an offset from the base pointer.
So,
When a function call is made, function's arguments are PUSHed on stack. These arguments are further referenced by base pointer.
When the function returns to its caller, the arguments of the returning function are POPed from the stack using LIFO method.
In detail:
The rule is that each function call results in a creation of a stack frame (with the minimum being the address to return to). So, if funcA calls funcB and funcB calls funcC, three stack frames are set up one on top of the another. When a function returns, its frame becomes invalid. A well-behaved function acts only on its own stack frame and does not trespass on another's. In another words the POPing is performed to the stack frame on the top (when returning from the function).
The stack in your question is setup by caller foo. When doSomething and doAnotherThing are called, then they setup their own stack. The figure may help you to understand this:
Note that, to access the arguments, the function body will have to traverse down (higher addresses) from the location where the return address is stored, and to access the local variables, the function body will have to traverse up the stack (lower addresses) relative to the location where the return address is stored. In fact, typical compiler generated code for the function will do exactly this. The compiler dedicates a register called EBP for this (Base Pointer). Another name for the same is frame pointer. The compiler typically, as the first thing for the function body, pushes the current EBP value on to the stack and sets the EBP to the current ESP. This means, once this is done, in any part of the function code, argument 1 is EBP+8 away (4 bytes for each of caller's EBP and the return address), argument 2 is EBP+12(decimal) away, local variables are EBP-4n away.
.
.
.
[ebp - 4] (1st local variable)
[ebp] (old ebp value)
[ebp + 4] (return address)
[ebp + 8] (1st argument)
[ebp + 12] (2nd argument)
[ebp + 16] (3rd function argument)
Take a look at the following C code and the stack frame it produces for the function:
void MyFunction(int x, int y, int z)
{
int a, b, c;
...
}
When the caller calls it
MyFunction(10, 5, 2);
the following code will be generated
push 2            ; Push the last argument (z) first
push 5            ; Push the second argument (y)
push 10           ; Push the first argument (x) last
call _MyFunction  ; Equivalent to:
                  ; push eip + 2
                  ; jmp _MyFunction
and the assembly code for the function (the prologue, set up by the callee on entry) will be
_MyFunction:
push ebp
mov ebp, esp
sub esp, 12 ; sizeof(a) + sizeof(b) + sizeof(c)
;x = [ebp + 8], y = [ebp + 12], z = [ebp + 16]
;a = [ebp - 4] = [esp + 8], b = [ebp - 8] = [esp + 4], c = [ebp - 12] = [esp]
References:
Function Call Conventions and the Stack.
Frame Pointer and Local Variables.
x86 Disassembly/Functions and Stack Frames.
As others have noted, there is no need to pop parameters until they go out of scope.
I will paste an example from "Pointers and Memory" by Nick Parlante.
I think the situation is a bit more simple than you envisioned.
Here is code:
void Y(int p);

void X()
{
int a = 1;
int b = 2;
// T1
Y(a);
// T3
Y(b);
// T5
}
void Y(int p)
{
int q;
q = p + 2;
// T2 (first time through), T4 (second time through)
}
The points in time T1, T2, etc. are marked in
the code and the state of memory at that time is shown in the drawing:
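As a rough run-time companion to that drawing, here is a small sketch of my own (adapted from the code above, not from the book); it prints the address of Y's local q at T2 and T4. On typical implementations both calls report the same address, because Y's frame is torn down after the first call and rebuilt in the same place for the second.

#include <stdio.h>

void Y(int p)
{
    int q = p + 2;
    /* Printing the address shows implementation behaviour, not a C guarantee. */
    printf("p = %d, q = %d, &q = %p\n", p, q, (void *)&q);
}

int main(void)
{
    int a = 1;
    int b = 2;
    Y(a);   /* T2 */
    Y(b);   /* T4 */
    return 0;
}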
Different processors and languages use a few different stack designs. Two traditional patterns on both the 8x86 and 68000 are called the Pascal calling convention and the C calling convention; each convention is handled the same way in both processors, except for the names of the registers. Each uses two registers to manage the stack and associated variables, called the stack pointer (SP or A7) and the frame pointer (BP or A6).
When calling a subroutine using either convention, any parameters are pushed on the stack before calling the routine. The routine's code then pushes the current value of the frame pointer onto the stack, copies the current value of the stack pointer to the frame pointer, and subtracts from the stack pointer the number of bytes used by local variables [if any]. Once that is done, even if additional data are pushed onto the stack, all local variables will be stored at addresses with a constant negative displacement from the frame pointer, and all parameters that were pushed on the stack by the caller may be accessed at a constant positive displacement from the frame pointer.
The difference between the two conventions lies in the way they handle an exit from a subroutine. In the C convention, the returning function copies the frame pointer to the stack pointer [restoring it to the value it had just after the old frame pointer was pushed], pops the old frame pointer value, and performs a return. Any parameters the caller had pushed on the stack before the call will remain there. In the Pascal convention, after popping the old frame pointer, the processor pops the function return address, adds to the stack pointer the number of bytes of parameters pushed by the caller, and then goes to the popped return address. On the original 68000 it was necessary to use a 3-instruction sequence to remove the caller's parameters; the 8x86 and all 680x0 processors after the original included a "ret N" [or 680x0 equivalent] instruction which would add N to the stack pointer when performing a return.
The Pascal convention has the advantage of saving a little bit of code on the caller side, since the caller doesn't have to update the stack pointer after a function call. It requires, however, that the called function know exactly how many bytes worth of parameters the caller is going to put on the stack. Failing to push the proper number of parameters onto the stack before calling a function which uses the Pascal convention is almost guaranteed to cause a crash. This is offset, however, by the fact that a little extra code within each called method will save code at the places where the method is called. For that reason, most of the original Macintosh toolbox routines used the Pascal calling convention.
The C calling convention has the advantage of allowing routines to accept a variable number of parameters, and of being robust even if a routine doesn't use all the parameters that are passed (the caller knows how many bytes worth of parameters it pushed, and will thus be able to clean them up). Further, it isn't necessary to perform stack cleanup after every function call: if a routine calls four functions in sequence, each of which uses four bytes worth of parameters, it may, instead of using an ADD SP,4 after each call, use one ADD SP,16 after the last call to clean up the parameters from all four calls.
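To make the variable-argument point concrete, here is a minimal sketch of my own (not from the original answer): the callee cannot know at compile time how many arguments will arrive, so with the C convention the caller, which does know, is the one that cleans them up.

#include <stdarg.h>
#include <stdio.h>

/* Sums 'count' ints passed after the first parameter. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

int main(void)
{
    printf("%d\n", sum_ints(3, 1, 2, 3));   /* the caller knows it passed 4 values */
    return 0;
}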
Nowadays the described calling conventions are considered somewhat antiquated. Since compilers have gotten more efficient at register usage, it is common to have methods accept a few parameters in registers rather than requiring that all parameters be pushed on the stack; if a method can use registers to hold all the parameters and local variables, there's no need for a frame pointer, and thus no need to save and restore the old one. Still, it's sometimes necessary to use the older calling conventions when calling libraries that were linked to use them.
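As a small illustration of that shift (a hedged sketch of my own; the exact output varies by compiler, version, and target), a two-parameter function compiled at -O2 for x86-64 System V typically receives both arguments in registers, pushes nothing, and sets up no frame pointer at all:

/* With gcc -O2 on x86-64 System V this usually compiles to something like
 *   lea eax, [rdi+rsi]
 *   ret
 * -- both parameters arrive in registers and no stack frame is created. */
int add2(int a, int b)
{
    return a + b;
}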
There are already some really good answers here. However, if you are still concerned about the LIFO behavior of the stack, think of it as a stack of frames, rather than a stack of variables. What I mean to suggest is that, although a function may access variables that are not on the top of the stack, it is still only operating on the item at the top of the stack: a single stack frame.
Of course, there are exceptions to this. The local variables of the entire call chain are still allocated and available. But they won't be accessed directly. Instead, they are passed by reference (or by pointer, which is really only different semantically). In this case a local variable of a stack frame much further down can be accessed. But even in this case, the currently executing function is still only operating on its own local data. It is accessing a reference stored in its own stack frame, which may be a reference to something on the heap, in static memory, or further down the stack.
This is the part of the stack abstraction that makes functions callable in any order, and allows recursion. The top stack frame is the only object that is directly accessed by the code. Anything else is accessed indirectly (through a pointer that lives in the top stack frame).
It might be instructive to look at the assembly of your little program, especially if you compile without optimization. I think you will see that all of the memory access in your function happens through an offset from the stack frame pointer, which is how the compiler generates the code for the function. In the case of a pass by reference, you would see indirect memory access instructions through a pointer that is stored at some offset from the stack frame pointer.
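For example, a sketch along these lines (my own; the exact instructions depend on your compiler and target) can be compiled with gcc -O0 -S to see both the frame-pointer-relative accesses and the indirect write through the passed pointer:

/* At -O0 you should typically see x addressed as an offset from rbp/ebp in
 * main, and the store through p in callee compiled as an indirect write. */
void callee(int *p)
{
    *p = 42;        /* indirect store into the caller's frame */
}

int main(void)
{
    int x = 0;      /* lives in main's frame */
    callee(&x);     /* hand a pointer to it down the stack */
    return x;
}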
The call stack is not actually a stack data structure. Behind the scenes, the computers we use are implementations of the random access machine architecture. So, a and b can be directly accessed.
Behind the scenes, the machine does:
getting "a" means reading the value of the fourth element below the top of the stack;
getting "b" means reading the value of the third element below the top of the stack.
http://en.wikipedia.org/wiki/Random-access_machine
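Relating this back to the a, b, c, d example from the question, here is a small sketch of my own (names and offsets are illustrative):

/* At -O0 on x86-64, a and b are typically read straight from their stack
 * slots (something like  mov eax, DWORD PTR [rbp-4] ), without popping
 * c and d first; nothing is physically removed until the function returns. */
int use_a_and_b(void)
{
    int a = 1, b = 2, c = 3, d = 4;
    int r = a + b;      /* direct, offset-based reads of a and b */
    return r + c + d;   /* c and d are still there and still usable */
}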
Here is a diagram I created for a call stack for a C++ program on Windows that uses the Windows x64 calling convention. It's more accurate and contemporary than the google image versions:
And corresponding to the exact structure of the above diagram, here is a debug of notepad.exe x64 on windows 7, where the first instruction of a function, 'current function' (because I forgot what function it is), is about to execute.
The low addresses and high addresses are swapped, so the stack climbs upwards in this diagram (it is a vertical flip of the first diagram; also note that the data is formatted to show quadwords and not bytes, so the little-endianness cannot be seen).

Black is the home space; blue is the return address, which is an offset into the caller function (or a label in the caller function) pointing to the instruction after the call; orange is the alignment; and pink is where rsp is pointing after the prologue of the function, or rather, before the call is made if you are using alloca.

The homespace_for_the_next_function+return_address value is the smallest allowed frame on Windows, and because the 16-byte rsp alignment right at the start of the called function must be maintained, it includes an 8-byte alignment as well, such that rsp pointing to the first byte after the return address is aligned to 16 bytes (rsp was guaranteed to be aligned to 16 bytes when the function was called, and homespace+return_address = 40, which is not divisible by 16, so you need an extra 8 bytes to ensure rsp will be aligned after the function makes a call).

Because these functions do not require any stack locals (they can be optimised into registers) or stack parameters/return values (they fit in registers), and do not use any of the other fields, the stack frames in green are all alignment+homespace+return_address in size.
The red function lines outline what the callee function logically 'owns' + reads / modifies by value in the calling convention without needing a reference to it (it can modify a parameter passed on the stack that was too big to pass in a register on -Ofast), and is the classic conception of a stack frame.

The green frames demarcate what results from the call and the allocation the called function makes: the first green frame shows what RtlUserThreadStart actually allocates for the duration of the function call (from immediately before the call to executing the next call instruction) and goes from the first byte before the return address to the final byte allocated by the function prologue (or more if using alloca).

RtlUserThreadStart allocates the return address itself as null, so you see a sub rsp, 48h and not a sub rsp, 40h in the prologue, because there is no call to RtlUserThreadStart; it just begins execution at that rip at the base of the stack.
Stack space that is needed by the function is assigned in the function prologue by decrementing the stack pointer.
For example, take the following C++, and the MASM it compiles to (-O0).
typedef struct _struc {int a;} struc, pstruc;
int func(){return 1;}
int square(_struc num) {
int a=1;
int b=2;
int c=3;
return func();
}
_DATA SEGMENT
_DATA ENDS
int func(void) PROC ; func
mov eax, 1
ret 0
int func(void) ENDP ; func
a$ = 32 //4 bytes from rsp+32 to rsp+35
b$ = 36
c$ = 40
num$ = 64
//masm shows stack locals and params relative to the address of rsp; the rsp address
//is the rsp in the main body of the function after the prolog and before the epilog
int square(_struc) PROC ; square
$LN3:
mov DWORD PTR [rsp+8], ecx
sub rsp, 56 ; 00000038H
mov DWORD PTR a$[rsp], 1
mov DWORD PTR b$[rsp], 2
mov DWORD PTR c$[rsp], 3
call int func(void) ; func
add rsp, 56 ; 00000038H
ret 0
int square(_struc) ENDP ; square
As can be seen, 56 bytes are reserved, and the green stack frame will be 64 bytes in size when the call instruction allocates the 8 byte return address as well.
The 56 bytes consist of 12 bytes of locals, 32 bytes of home space, and 12 bytes of alignment.
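To spell out that arithmetic (my own sketch, not part of the original answer): the call instruction has already pushed an 8-byte return address, so the prologue must reserve an amount that brings rsp back to a 16-byte boundary before the next call; the smallest such amount covering the 44 bytes of locals plus home space is 56, which leaves 12 bytes of padding.

#include <assert.h>

#define LOCALS     12u   /* a, b, c */
#define HOME_SPACE 32u   /* four 8-byte register-parameter slots for func() */
#define RET_ADDR    8u   /* pushed by the call instruction */

/* Round the reservation up so that RET_ADDR + reservation is a multiple of 16. */
#define FRAME_RESERVE(n) ((((n) + RET_ADDR + 15u) & ~15u) - RET_ADDR)

int main(void)
{
    assert(FRAME_RESERVE(LOCALS + HOME_SPACE) == 56u);   /* matches 'sub rsp, 56' */
    return 0;
}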
All callee register saving, and the storing of register parameters in the home space, happens in the prologue, before the prologue reserves (using a sub rsp, x instruction) the stack space needed by the main body of the function. The alignment is at the highest address of the space reserved by the sub rsp, x instruction, and the final local variable in the function is assigned at the next lower address after that (within the assignment for that primitive data type itself, it starts at the lowest address of the assignment and works towards the higher addresses, bytewise, because it is little endian), such that the first primitive type (array cell, variable, etc.) in the function is at the top of the stack, although the locals can be allocated in any order. This is shown in the following diagram for a different example (unrelated to the code above) that does not call any functions (still using the x64 Windows cc):
If you remove the call to func(), it only reserves 24 bytes, i.e. 12 bytes of locals and 12 bytes of alignment. The alignment is at the start of the frame. When a function pushes something to the stack or reserves space on the stack by decrementing rsp, rsp needs to be aligned, regardless of whether it is going to call another function or not. If the allocation of stack space can be optimised out, and no homespace+return_address is required because the function does not make a call, then there is no alignment requirement, as rsp does not change. It also does not need to align if the stack will already be aligned to 16 with just the locals (+ homespace+return_address if it makes a call) that it needs to allocate; essentially, it rounds the space it needs to allocate up to a 16-byte boundary.
rbp is not used on the x64 Windows calling convention unless alloca is used.
On the gcc 32-bit cdecl and 64-bit System V calling conventions, rbp is used, and the new rbp points to the first byte after the old rbp (only if compiling using -O0, because the old rbp is saved to the stack on -O0; otherwise, rbp will point to the first byte after the return address). On these calling conventions, if compiling using -O0, it will, after the callee-saved registers, store register parameters to the stack, and this will be relative to rbp and part of the stack reservation done by the rsp decrement. Data within the stack reservation done by the rsp decrement is accessed relative to rbp rather than rsp, unlike the Windows x64 cc. On the Windows x64 calling convention, it stores the parameters that were passed to it in registers to the home space that was assigned for it if it is a varargs function or is compiled using -O0. If it is not a varargs function, then on -O1 it will not write them to the home space, but the home space will still be provided for it by the calling function; this means that it actually accesses those variables from the registers rather than from the home space location on the stack, unlike -O0 (which saves them to the home space and then accesses them through the stack and not the registers).
If a function call is placed in the function represented by the previous diagram, the stack will now look like this before the callee function's prologue starts (Windows x64 cc):
Orange indicates the part that the caller can freely arrange (arrays and structs remain contiguous of course, and work their way towards higher addresses, each element being little endian), so it can put the variables and the return value allocation in any order, and it passes a pointer for the return value allocation in rcx for the callee to write to when the return type of the function it is calling cannot be passed in rax. On -O0, if the return value cannot be passed in rax, there is also an anonymous variable created (as well as the return value space and as well as any variable it is assigned to, so there can be 3 copies of the struct). -Ofast can't optimise out the return value space because it is return by value, but it optimises out the anonymous return variable if the return value is not used, or assigns it straight to the variable the return value is being assigned to without creating an anonymous variable, so -Ofast has 2 / 1 copies and -O0 has 3 / 2 copies (return value assigned to a variable / return value not assigned to a variable). Blue indicates the part the caller must provide in exact order for the calling convention of the callee (the parameters must be in that order, such that the first stack parameter from left to right in the function signature is at the top of the stack, which is the same as how cdecl (which is a 32-bit cc) orders its stack parameters). The alignment for the callee can, however, be in any location, although I've only ever seen it between the locals and the callee-pushed registers.
If the function calls multiple functions, the call is in the same place on the stack for all the different possible call sites in the function; this is because the prologue caters for the whole function, including all the calls it makes, and the parameters and home space for any called function are always at the end of the allocation made in the prologue.
It turns out that the Microsoft C/C++ calling convention only passes a struct in a register if it fits into one register; otherwise it copies the local / anonymous variable and passes a pointer to that copy in the first available register. On gcc C/C++, if the struct does not fit in the first 2 parameter registers, then it is passed on the stack, and a pointer to it is not passed, because the callee knows where it is due to the calling convention.
Arrays are passed by reference regardless of their size. So, if rcx is needed as the pointer to the return value allocation and the first parameter is an array, the pointer will be passed in rdx instead, and it will be a pointer to the local variable that is being passed. In this case it does not need to be copied to the stack as a parameter, because it is not passed by value. The pointer is, however, passed on the stack when passing by reference if there are no registers available to pass the pointer in.
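A minimal sketch of that distinction under the Microsoft x64 convention described above (my own example, not from the original answer): small_t fits in a single register and is passed by value in one, while large_t does not fit, so the caller makes a copy and passes a pointer to that copy.

typedef struct { int x; }    small_t;   /* 4 bytes: fits in one register            */
typedef struct { int x[8]; } large_t;   /* 32 bytes: passed via a pointer to a copy */

int take_small(small_t s) { return s.x; }
int take_large(large_t l) { return l.x[0]; }

int main(void)
{
    small_t s = { 1 };
    large_t l = { { 2 } };
    return take_small(s) + take_large(l);
}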
I know C++11 has move semantics, which mean you can directly return a struct from a function and not worry about it being copied (assuming a simple struct), as opposed to writing the struct through an output parameter.
Does C11 have anything like this? Or do returned structs still get copied every time? Are output parameters still the "best practice" here?
I think there is some confusion here which should be clarified. The semantics and the implementation of C++ are different.
"Move" versus "copy" in C++ is just a question of which constructor (or operator=) you are invoking.
The question of whether the representation of the structure's members is copied is an entirely separate question. In other words, "does the processor have to move these bytes around?" is not part of the language semantics.
Semantics
MyClass func() {
MyClass x;
x.method(...);
...
return x;
}
This returns using move semantics if available, but even prior to C++11, return value optimization was available.
The reason why we prefer to use move semantics is because moving an object doesn't cause a deep copy, e.g., if you move a std::vector<T> you don't have to copy all of the T. However, you are still copying data! So, std::move(x) is semantically speaking a move operation (think of it as using linear rather than classical logic) but it is still implemented by copying data in memory.
Unless your ABI lets you avoid the copy. Which brings us to...
Implementation
When you call a function that returns a large structure (the term "large" is relative, it might only be a few words), most ABIs will call for that structure to be passed by reference to the function. So when you write something like this:
MyClass func() { ... }
Once you look at it in assembly, it might look something more like this:
void func(MyClass *ptr) { ... }
Of course, this is a simplification! The pointer is usually implicit. But the important point is that we are already avoiding a copy, sometimes.
Case study
Here is a simple example:
struct big {
int x[100];
};
struct big func1(void);
int func2() {
struct big x = func1();
struct big y = func1();
return x.x[0] + y.x[0];
}
When I compile this with gcc -O2 on x64, I get the following assembly output:
subq $808, %rsp
movq %rsp, %rdi
call func1
leaq 400(%rsp), %rdi
call func1
movl 400(%rsp), %eax
addl (%rsp), %eax
addq $808, %rsp
ret
You can see that nowhere is struct big copied. The result from func1() simply gets placed directly in func2()'s stack frame, and then the destination pointer is moved so the next result gets placed elsewhere.
However
In common ABIs, large function results won't get threaded through multiple function stacks. Returning x or y from func2() above will result in the structure being copied.
But remember! This has nothing to do with move semantics versus copy semantics, because the structure data getting copied is just an implementation detail rather than language semantics. In C++, using std::move() may still result in the structure getting copied, it just won't invoke the copy constructor.
The conclusion: returning a large structure from either C or C++ may result in copying the structure, depending on the particulars of the ABI, how the function is optimized, and the code in question. However, I wouldn't worry about it if the structure is only a few words long.
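For completeness, here is a hedged sketch of the output-parameter style the question asks about (the name func1_into is mine, for illustration): the caller decides where the result lives, so no copy is needed even if the result must later travel up several levels.

struct big { int x[100]; };

/* Fill the result in place through a pointer; no struct is copied. */
void func1_into(struct big *out)
{
    out->x[0] = 42;
}

int func2(void)
{
    struct big x, y;
    func1_into(&x);
    func1_into(&y);
    return x.x[0] + y.x[0];
}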