In C, we have data, and we have pointers. Pointers by themselves are just binary data, so at the hardware level there is no difference between data and pointers; pointers must be some kind of implementation detail. If we have a variable, that variable has two properties: the data and the address, which is itself data. In a sense this creates an infinite pointer loop kind of thing: since every pointer is data, it must have a pointer that points to it, and the pointer that points to the pointer must have a pointer, and so on.
This type of implementation would only make sense if pointers were created on demand. Let's say we create a variable called a: does C immediately assign a pointer to this variable right after it is declared? Or is it only when I explicitly try to pull the pointer by doing &a that C creates a pointer according to some internal algorithm?
in a sense this creates an infinite pointer loop kind of thing. Every pointer must have a parent pointer
This is logically incorrect. It's like saying: "A pointer is a street address. Every person has a street address. Therefore a street address is a pointer."
Pointers are variables containing addresses, but that doesn't make addresses pointers... something can have an address without a pointer used to access that address. Just like your machine code can contain integer values without them being stored in int variables.
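To make that concrete, here is a minimal sketch (the variable names are just for illustration): an object has an address whether or not any pointer variable ever stores it, and a pointer to a pointer only exists if you declare one yourself.

#include <stdio.h>

int main(void)
{
    int a = 5;      /* 'a' lives at some address, but no pointer object exists yet */
    int *p = &a;    /* now there is a pointer object; 'p' itself also has an       */
                    /* address, but no pointer to 'p' exists unless we declare     */
                    /* one explicitly, e.g. int **pp = &p;                         */

    printf("address of a: %p\n", (void *)&a);
    printf("value of p:   %p\n", (void *)p);
    printf("address of p: %p\n", (void *)&p);
    return 0;
}

Note that &a is just a value computed where it is needed; nothing is stored anywhere unless you assign it to a pointer variable like p.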
Or is it when I explicitly try to pull the pointer by doing &a that C creates a pointer according to some internal algorithm?
Yes, kind of. Here the address is actually used by the program so the &a address must be stored somewhere. You can think of it as a temporary pointer variable if it helps. In practice, if we disassemble this code:
int a;
printf("%p\n", &a);
Then on gcc x86 this just results in a "load effective address" instruction. That is, the compiler stores the address in a register which is then passed on to printf as per that function's calling convention.
Your description is all wrong. It seems you are confusing C pointers with memory addresses at execution time. They are completely different things.
Pointers in C are no different from other data types. You get a pointer to type T only when you define one, i.e. like T* p; Just like you only get an int object when you define one (e.g. int a;).
At machine level it's completely different. In order to access objects (aka variables) stored in memory, the CPU needs a way to calculate the address of that object. The C standard does not care how it's done. It's an implementation detail that may differ from system to system.
Many implementations use a "stack pointer" (stored in a CPU register) as a reference for other variables. Instead of knowing the exact address of an object, the address is found as "value of stack pointer" plus an offset (i.e. SP+offset). This offset is then hardcoded into the executable binary, i.e. the instruction set of the CPU can have an instruction that does something like: "Read the memory at address SP+fixed_offset and store it in register X".
Take a look at this simple (and rather stupid) function:
#include <stdio.h>

unsigned long foo(unsigned long x)
{
    unsigned long y = x;
    putchar('a');
    return y + x;
}
This defines two "unsigned long" objects and returns their sum. Using godbolt.org, gcc 11.2, and the flag -fomit-frame-pointer (no optimization, to keep things simple), I get this machine code (with my comments added):
foo:
sub rsp, 40 // Change stack pointer to reserve 40 bytes
mov QWORD PTR [rsp+8], rdi // Save the passed value (i.e. register rdi)
// in object x at memory address rsp+8
mov rax, QWORD PTR [rsp+8] // Read object x into register rax
mov QWORD PTR [rsp+24], rax // Save rax in object y at memory address rsp+24
// So this is really y = x
mov edi, 97 // These are just
call putchar // putchar('a');
mov rdx, QWORD PTR [rsp+24] // Read object y into register rdx
mov rax, QWORD PTR [rsp+8] // Read object x into register rax
add rax, rdx // rax = rax + rdx, i.e. rax = y + x
add rsp, 40 // Restore stack pointer, i.e. release
// the 40 bytes
ret // Return. The returned value is in register rax
So on this specific system the memory address of x is found as "stack pointer + 8" and the memory address of y is found as "stack pointer + 24". From the machine code we can't tell the actual memory address of the variables as it depends on the value of the stack pointer (rsp) when the function is called.
The lesson is that - yes, at machine level there is a way to get the memory address of x and y but there is no such thing as an automatically created and stored pointer to any of them.
Now for fun - the same code compiled with -O2 gives:
foo:
push rbx
mov rsi, QWORD PTR stdout[rip]
mov rbx, rdi
mov edi, 97
call putc
lea rax, [rbx+rbx]
pop rbx
ret
Take a look at the code and see if you can find x and y ;-)
(but - to repeat - all this is not described by the C standard, it's just how it's done on this specific system).
BTW, also be aware that objects/variables defined by the C code may exist only in CPU registers, i.e. they are never written to memory and consequently don't even have an address.
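As a rough illustration of that last point (what actually happens depends entirely on the compiler and the optimization level), taking an object's address is typically what forces it to be given a memory location at all:

unsigned long no_address_needed(unsigned long x)
{
    unsigned long y = x + 1;   /* with optimization, 'y' may live only in a register */
    return y * 2;              /* and then never has a memory address at all         */
}

unsigned long address_needed(unsigned long x)
{
    unsigned long y = x + 1;
    unsigned long *p = &y;     /* taking &y typically forces 'y' into a stack slot,  */
    return *p * 2;             /* unless the compiler can prove the pointer away     */
}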
Every variable is stored in memory. The address (pointer) is not stored separately; it is simply the location of the variable.
(This is actually more complex than that: Some variables might normally stay in processor registers instead of RAM, but if you take their address, the compiler might store them in memory so that the address can be taken. Or if you just print the address, it might invent a fake address for them.)
If you take the address of a variable and store it in a pointer, you create a new variable and assign to it the address of the other variable. But the address of this new variable is not stored separately either; it is simply the location of the variable.
There is no "infinite pointer loop", unless your code makes infinite number of variables and your computer has infinite memory.
What determines the address then? How does the program know where the variable is stored?
This is controlled by the operating system, which gives your program blocks of memory based on the allocations you make. Your program determines (at compile time) which variable sits at which address relative to those memory blocks.
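One hedged way to see this for yourself (the exact layout is unspecified and varies between compilers and runs):

#include <stdio.h>

int main(void)
{
    int a = 1;
    int b = 2;
    int c = 3;

    /* The absolute addresses differ from run to run (the OS decides where the
       stack block lives), but the variables typically sit at small, fixed
       offsets from one another, chosen by the compiler at compile time.      */
    printf("&a = %p\n", (void *)&a);
    printf("&b = %p\n", (void *)&b);
    printf("&c = %p\n", (void *)&c);
    return 0;
}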
I would not say "we have data, and we have pointers". Yes, pointers are different, but saying it this way doesn't really capture what's different about them.
Any variable has a location (also called an address), and a value (perhaps also called "contents"). If you say
int i = 5;
the value is 5, and we're not precisely sure where the location is (because we usually don't care), although the identifier i helps us keep track of it, whatever it is. But there are definitely two things, the location or address, and the contents or value.
If you say
int *ip = &i;
once again you have a variable named ip, and a value, which in this case is a pointer, or an address, and what it's the address of is that other variable i. And then there's also a value in the pointed-to location. So you now have three things:
the variable ip, and
its value, which is "pointer to i", and
the pointed-to value, which is 5.
Almost nothing gets "created on demand". The variable i got created because you requested it. The variable ip got created because you requested it.
You can draw a picture like this to help you keep track of things:
+---------+
i: | 5 |
+---------+
^
|
+-------|-------+
ip:| * |
+---------------+
The only thing that happens automatically, behind your back, that you can't necessarily see, is the assignment of some actual, numeric addresses for your variables. You can see those if you print them out using %p:
printf("address of i: %p\n", &i);
printf("address of ip: %p\n", &ip);
and you will notice that ip holds i's address:
printf("value of ip: %p\n", ip);
or you can look at everything — i's two things, and ip's three things — like this:
printf("i: loc %p, value %d\n", &i, i);
printf("ip: loc %p, value %p, pointed-to value %d\n", &ip, ip, *ip);
To see what's going on a little more explicitly, let's write this as an actual program. For the moment, I'm going to have the variables i and ip be "global" variables, although this is unusual, because normally they'd be "local" variables, declared inside main. Also I'm going to declare a third variable j.
#include <stdio.h>
int i = 5;
int j = 66;
int *ip = &i;
int main()
{
printf(" i: loc %p, value %d\n", &i, i);
printf(" j: loc %p, value %d\n", &j, j);
printf("ip: loc %p, value %p, pointed-to value %d\n", &ip, ip, *ip);
}
the output I get is
i: loc 0x10837b018, value 5
j: loc 0x10837b01c, value 66
ip: loc 0x10837b020, value 0x10837b018, pointed-to value 5
You'd get different addresses on your computer, but basically a similar pattern.
If you're on a Unix-like system, you can run the nm command to see your program's "namelist", or symbol table, which is a list of all the identifiers in your program, and their addresses. When I run it on the program above, I get something like this:
$ nm a.out
000000010837b018 D _i
000000010837b01c D _j
000000010837b020 D _ip
0000000108370ec0 T _main
This shows me that my program has three things in the "data" segment, which are my variables, and one thing in the "text" segment, which is my main() function. Lo and behold, the locations listed by the nm program for my two variables exactly match what was printed when the program ran. (Although there's a complication here; see below.)
Now that we know where the variables actually are, we could draw a slightly different picture:
+---------+ +---------+
10837b018: | 5 | 10837b01c: | 66 |
+---------+ +---------+
^
|
+-------|-------+
10837b020:| * |
+---------------+
This shows us that, basically, the names or identifiers we use for things in our programs — like i, j, ip, and main — are like labels or shorthands for the addresses in memory where these things are stored. (In fact, these identifiers are therefore kind of like pointers in their own right, although I hesitate to say this, because it might confuse the issue.)
Another way to think of it is this. Imagine you live in a large apartment building. In the lobby is a large row of mailboxes, one for each apartment. Each mailbox, naturally, has a label on or next to it giving the apartment number. So the apartment numbers are like addresses, and the contents of the mailbox are like values.
Finally, two footnotes. On a modern system, the nm command probably won't print out addresses that are the same as your program did, after all, due to something called "address space randomization".
When you print pointers, strictly speaking you should cast them to void *, like this:
printf(" i: loc %p, value %d\n", (void *)&i, i);
printf("ip: loc %p, value %p, pointed-to value %d\n", (void *)&ip, (void *)ip, *ip);
C describes an abstract machine, where a pointer is a pointer and a value is a value (distinct from a pointer).
When it comes to a concrete machine, a pointer or a value may be kept in a register (which eliminates both the need for and the possibility of a pointer to it). So you don't have an infinite pointer loop; ultimately the chain ends in a CPU register.
When you compile your program, there may be no one-to-one correspondence between C code and machine code, especially if optimizations are enabled, so &a may actually create no pointer (the indirection is optimized out), whereas b may create a pointer (indirection added to pass a large structure).
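As a sketch of the first case (whether this actually happens depends on the compiler and optimization level):

int f(void)
{
    int a = 42;
    return *(&a);   /* with optimization this typically compiles to the same code
                       as 'return 42;' - no pointer, and often no memory for 'a' */
}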
You are worrying about things that the compiler handles.
When you declare a variable a, the compiler assigns it a specific memory location or register - the address. The processor knows this address, and using it, it can access the value of the variable. It doesn't need to keep a separate variable to remember the address.
There are multiple things a processor can do with the address, depending on the operations it supports, but the most common and most relevant are direct addressing (access the value at the address number given to you in the operation), indirect addressing (access the value at the address number given to you in the operation AND THEN treat that value like an address and access ANOTHER value) and immediate addressing (treat the number given to you as a regular number). Respectively, they match a, *a and &a in C (this is way oversimplified though).
Conclusion - no, creating a pointer won't create an infinite chain of pointers just to keep track of the pointers. The processor treats everything the same - addresses are just values, and values can be used as addresses.
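To tie those three addressing modes back to C syntax, here is a rough sketch (a loose analogy only; which machine instructions a compiler actually emits will vary):

#include <stdio.h>

int main(void)
{
    int a = 7;
    int *p = &a;        /* &a : the address itself used as a plain value ("immediate"-ish) */

    int direct = a;     /* read the value stored at a's address ("direct")                 */
    int indirect = *p;  /* read an address from p, then read the value stored there        */
                        /* ("indirect")                                                    */

    printf("%d %d %p\n", direct, indirect, (void *)p);
    return 0;
}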
This is an answer to your comment on S.Ptr's answer.
The processor doesn't "know" anything.
The only thing a processor can do is execute instructions: things like "add two numbers together", "put this variable in RAM"... Before modern compiled programming languages like C, all programs were created by listing instructions we wanted the processor to execute. A simple addition could have looked like this:
store 10 in memory at address 1
store 5 in memory at address 5
load the value at address 1 into register A
load the value at address 5 into register B
add B to A and store the result in B
store the result of B in memory at address 2
The programmer had to remember where they put everything because the CPU doesn't "understand" what you want it to do, it just follows your commands blindly.
As an example, let's say you made a mistake and you wrote this:
store 10 in memory at address 1
store 5 in memory at address 5
load the value at address 1 into register A
load the value at address 50 into register B
add B to A and store the result in B
store the result of B in memory at address 2
You may hope the CPU will correct you and use the right address but since it doesn't even check for this, the computer will happily run the code and add whatever was at address 50 to 10. When there's only 3 variables it's pretty easy to keep track but as you can guess, the more variables you add, the more difficult it is to remember what memory address corresponds to what data.
To help with this problem, we could use something like "address labels", basically allowing us to write the above code in this way:
please replace all instances of "number1" with "address 1"
please replace all instances of "number2" with "address 5"
please replace all instances of "result" with "address 2"
store 10 in memory at number1
store 5 in memory at number2
load the value at number1 into register A
load the value at number2 into register B
add B to A and store the result in B
store the result of B in memory at result
This way, there's no risk of using the wrong address! "number1", "number2" and "result" are there to help the programmer so they don't have to remember where variables are, but since the computer only understands addresses, we need a special piece of software to convert this easier-to-understand code into instructions the machine can actually run.
As time went by, we started creating more and more tools to let humans write better code faster; the C programming language is one of them. Exactly like the "please replace" example, C code helps you avoid mistakes and simplifies programming a lot, at the cost of not being understood by the computer at all.
That's why you need a special piece of software called a compiler: it takes your code and compiles it into the different instructions your dumb computer will execute blindly.
Through the answer I only used pseudo-code so just to get a glimpse at how things work in the real world, let's use some real C code and real CPU assembler. Here's a really simple C code to add two numbers:
int main()
{
    int a = 10;
    int b = 5;
    int result = a + b;
    return result;
}
and here's the x86_64 ASM code my compiler created from it:
pushq %rbp
movq %rsp, %rbp
movl $10, -12(%rbp)
movl $5, -8(%rbp)
movl -12(%rbp), %edx
movl -8(%rbp), %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %eax
popq %rbp
ret
Note that the ASM listing is still not "raw": it may be a lot harder for us to understand than C, but it's still too abstract for the CPU.
This minimal program
int r = 100;
return r;
generates just one instruction in main, before the ret:
mov eax,0x64
There is only an immediate and a register. No address. Good, nobody asked for one.
With return (long)&r it is (after some stack-checking):
lea eax,[rsp+0x4]
Here there is no value, only the address of where one could (and would) be stored. It's a real living address, but not one containing "100".
Because C's main() returns int, the eax (not rax) is used, and the aligned half of the current stack position (+0x4). The version:
long r = 100;
return &r;
compiles to simply:
mov eax,esp
(i.e. no calculation with lea needed)
To get rid of the different warnings: return (int)(long)&r. This probably shows that addresses are not meant to be returned outside. Here it is only done to force -O3 to do something at all.
So pointers can be created on demand. But in a real program the compiler already has that address stored somewhere / in use.
An infinite pointer "loop" is prevented by:
error: lvalue required as unary '&' operand
6 | return &(&r);
You'd have to put &r into a fresh variable first, before you can take its address. A fresh variable means a new name (or array index) for the programmer and a new memory address ("lvalue") for the compiler.
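For example, each extra level of indirection exists only because you declare a fresh object for it; nothing is created behind your back:

int main(void)
{
    int r = 100;
    int *p = &r;     /* a new variable holding r's address               */
    int **pp = &p;   /* another new variable holding p's address         */
    /* &(&r) would be an error: &r is just a value, not an object,       */
    /* so it has no address of its own until you store it somewhere.     */
    return **pp;     /* returns 100 */
}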
I finally resolved my confusion (thanks to all the answers), and it basically boils down to this.
In real hardware memory, every byte of data has a permanent address. At the software level, the operating system creates virtual addresses that map to those real addresses. It's a one-to-one mapping: for one hardware address, there is only one virtual address at a time.
In C, the compiler creates a hashmap-like data structure with variable names as keys and virtual addresses as values. For every new variable, a new entry is added to the hashmap. Whenever we want the virtual address of a variable, it's like asking the compiler, "hey, what is the virtual address associated with this variable name?", and the compiler looks through the hashmap and returns the address corresponding to that variable name.
This is just a simplification, as the details are beyond my knowledge, but it nonetheless relieved my confusion for now.
Related
I'm actually a beginner in assembly (Nios II) and I know that a function's parameters are stored in the registers (r4 -> r7).
But I wonder if these registers contain the actual value of the parameter or its address?
For example, the C function:
int add (int x, int y) {}
Does r4 contain 'x' or '&x' ?
Here's the ABI for Nios II:
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/nios2/n2cpu_nii51016.pdf
From the table, we can tell that arguments are indeed passed in registers r4-r7, and each of them holds 32 bits. From the same document we learn that int is 4 bytes. That means that x will be passed in r4. &x is not passed here, as this is call-by-value. If you want to access the address of x, a good compiler will first try to see whether the address is ever really needed, and only after giving up will it allocate memory for x in the stack frame.
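A sketch of what that means in C (whether the compiler actually spills x to the stack depends on optimization; the register names refer to the Nios II ABI mentioned above):

int add(int x, int y)
{
    return x + y;       /* x and y can stay in r4/r5; no stack slot is needed */
}

int add_with_address(int x, int y)
{
    int *px = &x;       /* taking &x typically forces the compiler to give x  */
    return *px + y;     /* a slot in the stack frame, unless it can optimize  */
}                       /* the pointer away again                             */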
I'm trying to get a deeper understanding of how the low level operations of programming languages work and especially how they interact with the OS/CPU. I've probably read every answer in every stack/heap related thread here on Stack Overflow, and they are all brilliant. But there is still one thing that I didn't fully understand yet.
Consider this function in pseudo code which tends to be valid Rust code ;-)
fn foo() {
    let a = 1;
    let b = 2;
    let c = 3;
    let d = 4;
    // line X
    doSomething(a, b);
    doAnotherThing(c, d);
}
This is how I assume the stack to look like on line X:
Stack
a +-------------+
| 1 |
b +-------------+
| 2 |
c +-------------+
| 3 |
d +-------------+
| 4 |
+-------------+
Now, everything I've read about how the stack works is that it strictly obeys LIFO rules (last in, first out). Just like a stack datatype in .NET, Java or any other programming language.
But if that's the case, then what happens after line X? Because obviously, the next thing we need is to work with a and b, but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot, because it needs c and d in the next line.
So, I wonder what exactly happens behind the scenes?
Another related question. Consider we pass a reference to one of the other functions like this:
fn foo() {
    let a = 1;
    let b = 2;
    let c = 3;
    let d = 4;
    // line X
    doSomething(&a, &b);
    doAnotherThing(c, d);
}
From how I understand things, this would mean that the parameters in doSomething essentially point to the same memory addresses as a and b in foo. But then again, this means that no popping of the stack down to a and b happens.
Those two cases make me think that I haven't fully grasped how exactly the stack works and how it strictly follows the LIFO rules.
The call stack could also be called a frame stack.
The things that are stacked after the LIFO principle are not the local variables but the entire stack frames ("calls") of the functions being called. The local variables are pushed and popped together with those frames in the so-called function prologue and epilogue, respectively.
Inside the frame the order of the variables is completely unspecified; Compilers "reorder" the positions of local variables inside a frame appropriately to optimize their alignment so the processor can fetch them as quickly as possible. The crucial fact is that the offset of the variables relative to some fixed address is constant throughout the lifetime of the frame - so it suffices to take an anchor address, say, the address of the frame itself, and work with offsets of that address to the variables. Such an anchor address is actually contained in the so-called base or frame pointer which is stored in the EBP register. The offsets, on the other hand, are clearly known at compile time and are therefore hardcoded into the machine code.
This graphic from Wikipedia shows how a typical call stack is structured1:
Add the offset of a variable we want to access to the address contained in the frame pointer and we get the address of our variable. So, in short, the code just accesses them directly via constant compile-time offsets from the base pointer; it's simple pointer arithmetic.
Example
#include <iostream>
int main()
{
char c = std::cin.get();
std::cout << c;
}
gcc.godbolt.org gives us
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >::get()
movb %al, -1(%rbp)
movsbl -1(%rbp), %eax
movl %eax, %esi
movl std::cout, %edi
call [... the insertion operator for char, long thing... ]
movl $0, %eax
leave
ret
.. for main. I divided the code into three subsections.
The function prologue consists of the first three operations:
Base pointer is pushed onto the stack.
The stack pointer is saved in the base pointer
The stack pointer is subtracted to make room for local variables.
Then cin is moved into the EDI register2 and get is called; The return value is in EAX.
So far so good. Now the interesting thing happens:
The low-order byte of EAX, designated by the 8-bit register AL, is taken and stored in the byte right after the base pointer: That is -1(%rbp), the offset of the base pointer is -1. This byte is our variable c. The offset is negative because the stack grows downwards on x86. The next operation stores c in EAX: EAX is moved to ESI, cout is moved to EDI and then the insertion operator is called with cout and c being the arguments.
Finally,
The return value of main is stored in EAX: 0. That is because of the implicit return statement.
You might also see xorl %eax, %eax instead of the movl.
leave and return to the call site. leave is abbreviating this epilogue and implicitly
Replaces the stack pointer with the base pointer and
Pops the base pointer.
After this operation and ret have been performed, the frame has effectively been popped, although the caller still has to clean up the arguments as we're using the cdecl calling convention. Other conventions, e.g. stdcall, require the callee to tidy up, e.g. by passing the amount of bytes to ret.
Frame Pointer Omission
It is also possible not to use offsets from the base/frame pointer but from the stack pointer (ESP) instead. This makes the EBP register that would otherwise contain the frame pointer value available for arbitrary use - but it can make debugging impossible on some machines, and will be implicitly turned off for some functions. It is particularly useful when compiling for processors with only a few registers, including x86.
This optimization is known as FPO (frame pointer omission) and is set by -fomit-frame-pointer in GCC and Clang and /Oy in MSVC; note that it is implicitly triggered by every optimization level > 0 if and only if debugging is still possible, since it doesn't have any costs apart from that.
For further information see here and here.
1 As pointed out in the comments, the frame pointer is presumably meant to point to the address after the return address.
2 Note that the registers that start with R are the 64-bit counterparts of the ones that start with E. EAX designates the four low-order bytes of RAX. I used the names of the 32-bit registers for clarity.
Because obviously, the next thing we need is to work with a and b but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot because it needs c and d in the next line.
In short:
There is no need to pop the arguments. The arguments passed by caller foo to function doSomething and the local variables in doSomething can all be referenced as an offset from the base pointer.
So,
When a function call is made, the function's arguments are PUSHed onto the stack. These arguments are then referenced relative to the base pointer.
When the function returns to its caller, the arguments of the returning function are POPped from the stack in LIFO fashion.
In detail:
The rule is that each function call results in the creation of a stack frame (with the minimum being the address to return to). So, if funcA calls funcB and funcB calls funcC, three stack frames are set up one on top of another. When a function returns, its frame becomes invalid. A well-behaved function acts only on its own stack frame and does not trespass on another's. In other words, the POPping is performed on the stack frame at the top (when returning from the function).
The stack in your question is set up by the caller foo. When doSomething and doAnotherThing are called, they set up their own stack frames. The figure may help you understand this:
Note that, to access the arguments, the function body will have to traverse down (higher addresses) from the location where the return address is stored, and to access the local variables, the function body will have to traverse up the stack (lower addresses) relative to the location where the return address is stored. In fact, typical compiler generated code for the function will do exactly this. The compiler dedicates a register called EBP for this (Base Pointer). Another name for the same is frame pointer. The compiler typically, as the first thing for the function body, pushes the current EBP value on to the stack and sets the EBP to the current ESP. This means, once this is done, in any part of the function code, argument 1 is EBP+8 away (4 bytes for each of caller's EBP and the return address), argument 2 is EBP+12(decimal) away, local variables are EBP-4n away.
.
.
.
[ebp - 4] (1st local variable)
[ebp] (old ebp value)
[ebp + 4] (return address)
[ebp + 8] (1st argument)
[ebp + 12] (2nd argument)
[ebp + 16] (3rd function argument)
Take a look at the following C code for the formation of stack frame of the function:
void MyFunction(int x, int y, int z)
{
    int a, b, c;
    ...
}
When the caller calls it
MyFunction(10, 5, 2);
the following code will be generated
^
| call _MyFunction ; Equivalent to:
| ; push eip + 2
| ; jmp _MyFunction
| push 2 ; Push first argument
| push 5 ; Push second argument
| push 10 ; Push third argument
and the assembly code for the function will be (set-up by callee before returning)
^
| _MyFunction:
| sub esp, 12 ; sizeof(a) + sizeof(b) + sizeof(c)
| ;x = [ebp + 8], y = [ebp + 12], z = [ebp + 16]
| ;a = [ebp - 4] = [esp + 8], b = [ebp - 8] = [esp + 4], c = [ebp - 12] = [esp]
| mov ebp, esp
| push ebp
References:
Function Call Conventions and the Stack.
Frame Pointer and Local Variables.
x86 Disassembly/Functions and Stack Frames.
Like others noted, there is no need to pop parameters, until they go out of scope.
I will paste some example from "Pointers and Memory" by Nick Parlante.
I think the situation is a bit simpler than you envisioned.
Here is code:
void X()
{
    int a = 1;
    int b = 2;
    // T1
    Y(a);
    // T3
    Y(b);
    // T5
}

void Y(int p)
{
    int q;
    q = p + 2;
    // T2 (first time through), T4 (second time through)
}
The points in time T1, T2, etc. are marked in
the code and the state of memory at that time is shown in the drawing:
Different processors and languages use a few different stack designs. Two traditional patterns on both the 8x86 and 68000 are called the Pascal calling convention and the C calling convention; each convention is handled the same way in both processors, except for the names of the registers. Each uses two registers to manage the stack and associated variables, called the stack pointer (SP or A7) and the frame pointer (BP or A6).
When calling a subroutine using either convention, any parameters are pushed on the stack before calling the routine. The routine's code then pushes the current value of the frame pointer onto the stack, copies the current value of the stack pointer to the frame pointer, and subtracts from the stack pointer the number of bytes used by local variables [if any]. Once that is done, even if additional data are pushed onto the stack, all local variables will be stored at a constant negative displacement from the frame pointer, and all parameters that were pushed on the stack by the caller may be accessed at a constant positive displacement from the frame pointer.
The difference between the two conventions lies in the way they handle an exit from a subroutine. In the C convention, the returning function copies the frame pointer to the stack pointer [restoring it to the value it had just after the old frame pointer was pushed], pops the old frame pointer value, and performs a return. Any parameters the caller had pushed on the stack before the call will remain there. In the Pascal convention, after popping the old frame pointer, the processor pops the function return address, adds to the stack pointer the number of bytes of parameters pushed by the caller, and then goes to the popped return address. On the original 68000 it was necessary to use a 3-instruction sequence to remove the caller's parameters; the 8x86 and all 680x0 processors after the original included a "ret N" [or 680x0 equivalent] instruction which would add N to the stack pointer when performing a return.
The Pascal convention has the advantage of saving a little bit of code on the caller side, since the caller doesn't have to update the stack pointer after a function call. It requires, however, that the called function know exactly how many bytes worth of parameters the caller is going to put on the stack. Failing to push the proper number of parameters onto the stack before calling a function which uses the Pascal convention is almost guaranteed to cause a crash. This is offset, however, by the fact that a little extra code within each called method will save code at the places where the method is called. For that reason, most of the original Macintosh toolbox routines used the Pascal calling convention.
The C calling convention has the advantage of allowing routines to accept a variable number of parameters, and being robust even if a routine doesn't use all the parameters that are passed (the caller will know how many bytes worth of parameters it pushed, and will thus be able to clean them up). Further, it isn't necessary to perform stack cleanup after every function call. If a routine calls four functions in sequence, each of which used four bytes worth of parameters, it may--instead of using an ADD SP,4 after each call, use one ADD SP,16 after the last call to cleanup the parameters from all four calls.
Nowadays the described calling conventions are considered somewhat antiquated. Since compilers have gotten more efficient at register usage, it is common to have methods accept a few parameters in registers rather than requiring that all parameters be pushed on the stack; if a method can use registers to hold all the parameters and local variables, there's no need to use a frame pointer, and thus no need to save and restore the old one. Still, it's sometimes necessary to use the older calling conventions when calling libraries that were linked to use them.
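This is also why a C variadic function like the sketch below can work at all: only the caller knows how many arguments it actually passed, so the caller has to be the one responsible for cleaning them up (on modern register-based ABIs the details differ, but the principle is the same):

#include <stdarg.h>

/* Sum of 'count' int arguments. The callee has no idea how many bytes the
   caller passed beyond what 'count' tells it, so it could never clean up
   the stack itself the way a Pascal-style convention requires.            */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);

    return total;
}

/* usage: sum_ints(3, 10, 20, 30) == 60 */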
There are already some really good answers here. However, if you are still concerned about the LIFO behavior of the stack, think of it as a stack of frames, rather than a stack of variables. What I mean to suggest is that, although a function may access variables that are not on the top of the stack, it is still only operating on the item at the top of the stack: a single stack frame.
Of course, there are exceptions to this. The local variables of the entire call chain are still allocated and available. But they won't be accessed directly. Instead, they are passed by reference (or by pointer, which is really only different semantically). In this case a local variable of a stack frame much further down can be accessed. But even in this case, the currently executing function is still only operating on its own local data. It is accessing a reference stored in its own stack frame, which may be a reference to something on the heap, in static memory, or further down the stack.
This is the part of the stack abstraction that makes functions callable in any order, and allows recursion. The top stack frame is the only object that is directly accessed by the code. Anything else is accessed indirectly (through a pointer that lives in the top stack frame).
It might be instructive to look at the assembly of your little program, especially if you compile without optimization. I think you will see that all of the memory access in your function happens through an offset from the stack frame pointer, which is how the compiler will write the code for the function. In the case of pass by reference, you would see indirect memory access instructions through a pointer that is stored at some offset from the stack frame pointer.
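A small sketch of the pass-by-reference case described above (the function names are just for illustration):

void increment(int *p)       /* 'p' lives in the callee's own frame (or a register) */
{
    *p += 1;                 /* indirect access: follow the pointer to the caller's */
}                            /* frame                                               */

void caller(void)
{
    int a = 1;               /* lives in the caller's frame                         */
    increment(&a);           /* the callee reaches it only through the pointer it   */
}                            /* was given, never by popping the caller's frame      */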
The call stack is not actually a strict stack data structure. Behind the scenes, the computers we use are implementations of the random-access machine architecture, so a and b can be accessed directly.
Concretely, the machine does the following:
getting "a" amounts to reading the value of the fourth element below the stack top;
getting "b" amounts to reading the value of the third element below the stack top.
http://en.wikipedia.org/wiki/Random-access_machine
Here is a diagram I created for a call stack for a C++ program on Windows that uses the Windows x64 calling convention. It's more accurate and contemporary than the google image versions:
And corresponding to the exact structure of the above diagram, here is a debug of notepad.exe x64 on windows 7, where the first instruction of a function, 'current function' (because I forgot what function it is), is about to execute.
The low addresses and high addresses are swapped so the stack is climbing upwards in this diagram (it is a vertical flip of the first diagram, also note that the data is formatted to show quadwords and not bytes, so the little endianism cannot be seen). Black is the home space; blue is the return address, which is an offset into the caller function or label in the caller function to the instruction after the call; orange is the alignment; and pink is where rsp is pointing after the prologue of the function, or rather, before the call is made if you are using alloca. The homespace_for_the_next_function+return_address value is the smallest allowed frame on windows, and because the 16 byte rsp alignment right at the start of the called function must be maintained, it includes an 8 byte alignment as well, such that rsp pointing to the first byte after the return address will be aligned to 16 bytes (because rsp was guaranteed to be aligned to 16 bytes when the function was called and homespace+return_address = 40, which is not divisible by 16 so you need an extra 8 bytes to ensure the rsp will be aligned after the function makes a call). Because these functions do not require any stack locals (because they can be optimised into registers) or stack parameters/return values (as they fit in registers) and do not use any of the other fields, the stack frames in green are all alignment+homespace+return_address in size.
The red function lines outline what the callee function logically 'owns' + reads / modifies by value in the calling convention without needing a reference to it (it can modify a parameter passed on the stack that was too big to pass in a register on -Ofast), and is the classic conception of a stack frame. The green frames demarcate what results from the call and the allocation the called function makes: The first green frame shows what the RtlUserThreadStart actually allocates in the duration of the function call (from immediately before the call to executing the next call instruction) and goes from the first byte before the return address to the final byte allocated by the function prologue (or more if using alloca). RtlUserThreadStart allocates the return address itself as null, so you see a sub rsp, 48h and not sub rsp, 40h in the prologue, because there is no call to RtlUserThreadStart, it just begins execution at that rip at the base of the stack.
Stack space that is needed by the function is assigned in the function prologue by decrementing the stack pointer.
For example, take the following C++, and the MASM it compiles to (-O0).
typedef struct _struc {int a;} struc, pstruc;
int func(){return 1;}
int square(_struc num) {
    int a=1;
    int b=2;
    int c=3;
    return func();
}
_DATA SEGMENT
_DATA ENDS
int func(void) PROC ; func
mov eax, 1
ret 0
int func(void) ENDP ; func
a$ = 32 //4 bytes from rsp+32 to rsp+35
b$ = 36
c$ = 40
num$ = 64
//masm shows stack locals and params relative to the address of rsp; the rsp address
//is the rsp in the main body of the function after the prolog and before the epilog
int square(_struc) PROC ; square
$LN3:
mov DWORD PTR [rsp+8], ecx
sub rsp, 56 ; 00000038H
mov DWORD PTR a$[rsp], 1
mov DWORD PTR b$[rsp], 2
mov DWORD PTR c$[rsp], 3
call int func(void) ; func
add rsp, 56 ; 00000038H
ret 0
int square(_struc) ENDP ; square
As can be seen, 56 bytes are reserved, and the green stack frame will be 64 bytes in size when the call instruction allocates the 8 byte return address as well.
The 56 bytes consist of 12 bytes of locals, 32 bytes of home space, and 12 bytes of alignment.
All callee register saving and storing register parameters in the home space happens in the prologue before the prologue reserves (using sub rsp, x instruction) stack space needed by the main body of the function. The alignment is at the highest address of the space reserved by the sub rsp, x instruction, and the final local variable in the function is assigned at the next lower address after that (and within the assignment for that primitive data type itself it starts at the lowest address of that assignment and works towards the higher addresses, bytewise, because it is little endian), such that the first primitive type (array cell, variable etc.) in the function is at the top of the stack, although the locals can be allocated in any order. This is shown in the following diagram for a different random example code to the above, that does not call any functions (still using x64 Windows cc):
If you remove the call to func(), it only reserves 24 bytes, i.e. 12 bytes of locals and 12 bytes of alignment. The alignment is at the start of the frame. When a function pushes something to the stack or reserves space on the stack by decrementing the rsp, rsp needs to be aligned, regardless of whether it is going to call another function or not. If the allocation of stack space can be optimised out and no homespace+return_address is required because the function does not make a call, then there will be no alignment requirement as rsp does not change. It also does not need to align if the stack will be aligned by 16 with just the locals (+ homespace+return_address if it makes a call) that it needs to allocate; essentially it rounds up the space it needs to allocate to a 16 byte boundary. This is shown in the following diagram for a different random example code to the above, that does not call any functions (still using x64 Windows cc):
rbp is not used on the x64 Windows calling convention unless alloca is used.
On gcc 32 bit cdecl and 64 bit system V calling conventions, rbp is used, and the new rbp points to the first byte after the old rbp (only if compiling using -O0, because it is saved to the stack on -O0, otherwise, rbp will point to the first byte after the return address). On these calling conventions, if compiling using -O0, it will, after callee saved registers, store register parameters to the stack, and this will be relative to rbp and part of the stack reservation done by the rsp decrement. Data within the stack reservation done by the rsp decrement is accessed relative to rbp rather than rsp, unlike Windows x64 cc. On the Windows x64 calling convention, it stores parameters that were passed to it in registers to the homespace that was assigned for it if it is a varargs function or compiling using -O0. If it is not a varargs function then on -O1, it will not write them to the homespace but the homespace will still be provided to it by the calling function; this means that it actually accesses those variables from the register rather than from the homespace location on the stack after it stores it there, unlike O0 (which saves them to the homespace and then accesses them through the stack and not the registers).
If a function call is placed in the function represented by the previous diagram, the stack will now look like this before the callee function's prologue starts (Windows x64 cc):
Orange indicates the part that the callee can freely arrange (arrays and structs remain contiguous of course, and work their way towards higher addresses, each element being little endian), so it can put the variables and the return value allocation in any order, and it passes a pointer for the return value allocation in rcx for the callee to write to when the return type of the function it is calling cannot be passed in rax. On -O0, if the return value cannot be passed in rax, there is also an anonymous variable created (as well as the return value space and as well as any variable it is assigned to, so there can be 3 copies of the struct). -Ofast can't optimise out the return value space because it is return by value, but it optimises out the anonymous return variable if the return value is not used, or assigns it straight to the variable the return value is being assigned to without creating an anonymous variable, so -Ofast has 2 / 1 copies and -O0 has 3 / 2 copies (return value assigned to a variable / return value not assigned to a variable). Blue indicates the part the callee must provide in exact order for the calling convention of the callee (the parameters must be in that order, such that the first stack parameter from left to right in the function signature is at the top of the stack, which is the same as how cdecl (which is a 32 bit cc) orders its stack parameters). The alignment for the callee can however be in any location, although I've only ever seen it be between the locals and callee pushed registers.
If the function calls multiple functions, the call is in the same place on the stack for all the different possible callsites in the function, this is because the prologue caters for the whole function, including all calls it makes, and the parameters and homespace for any called function is always at the end of the allocation made in the prologue.
It turns out that the C/C++ Microsoft calling convention only passes a struct in registers if it fits into one register; otherwise it copies the local / anonymous variable and passes a pointer to it in the first available register. On gcc C/C++, if the struct does not fit in the first 2 parameter registers then it's passed on the stack and a pointer to it is not passed because the callee knows where it is due to the calling convention.
Arrays are passed by reference regardless of their size. So if you need to use rcx as the pointer to the return value allocation then if the first parameter is an array, the pointer will be passed in rdx, which will be a pointer to the local variable that is being passed. In this case, it does not need to copy it to the stack as a parameter because it's not passed by value. The pointer however is passed on the stack when passing by reference if there are no registers available to pass the pointer in.
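A hedged illustration of those last two points (the exact passing mechanism is ABI-specific; the C code itself just shows the two cases):

struct big { long a, b, c, d; };    /* too large to fit in a single register        */

long take_struct(struct big s)      /* passed "by value": the ABI may copy it and   */
{                                   /* hand the callee a pointer to the copy, but   */
    return s.a + s.d;               /* 's' still behaves as the callee's own copy   */
}

long take_array(const long arr[4])  /* arrays decay to pointers, so the callee      */
{                                   /* always receives an address, never a copy     */
    return arr[0] + arr[3];
}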
I'm trying to get a deeper understanding of how the low level operations of programming languages work and especially how they interact with the OS/CPU. I've probably read every answer in every stack/heap related thread here on Stack Overflow, and they are all brilliant. But there is still one thing that I didn't fully understand yet.
Consider this function in pseudo code which tends to be valid Rust code ;-)
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(a, b);
doAnotherThing(c, d);
}
This is how I assume the stack to look like on line X:
Stack
a +-------------+
| 1 |
b +-------------+
| 2 |
c +-------------+
| 3 |
d +-------------+
| 4 |
+-------------+
Now, everything I've read about how the stack works is that it strictly obeys LIFO rules (last in, first out). Just like a stack datatype in .NET, Java or any other programming language.
But if that's the case, then what happens after line X? Because obviously, the next thing we need is to work with a and b, but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot, because it needs c and d in the next line.
So, I wonder what exactly happens behind the scenes?
Another related question. Consider we pass a reference to one of the other functions like this:
fn foo() {
let a = 1;
let b = 2;
let c = 3;
let d = 4;
// line X
doSomething(&a, &b);
doAnotherThing(c, d);
}
From how I understand things, this would mean that the parameters in doSomething are essentially pointing to the same memory address like a and b in foo. But then again this means that there is no pop up the stack until we get to a and b happening.
Those two cases make me think that I haven't fully grasped how exactly the stack works and how it strictly follows the LIFO rules.
The call stack could also be called a frame stack.
The things that are stacked after the LIFO principle are not the local variables but the entire stack frames ("calls") of the functions being called. The local variables are pushed and popped together with those frames in the so-called function prologue and epilogue, respectively.
Inside the frame the order of the variables is completely unspecified; Compilers "reorder" the positions of local variables inside a frame appropriately to optimize their alignment so the processor can fetch them as quickly as possible. The crucial fact is that the offset of the variables relative to some fixed address is constant throughout the lifetime of the frame - so it suffices to take an anchor address, say, the address of the frame itself, and work with offsets of that address to the variables. Such an anchor address is actually contained in the so-called base or frame pointer which is stored in the EBP register. The offsets, on the other hand, are clearly known at compile time and are therefore hardcoded into the machine code.
This graphic from Wikipedia shows what the typical call stack is structured like1:
Add the offset of a variable we want to access to the address contained in the frame pointer and we get the address of our variable. So shortly said, the code just accesses them directly via constant compile-time offsets from the base pointer; It's simple pointer arithmetic.
Example
#include <iostream>
int main()
{
char c = std::cin.get();
std::cout << c;
}
gcc.godbolt.org gives us
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >::get()
movb %al, -1(%rbp)
movsbl -1(%rbp), %eax
movl %eax, %esi
movl std::cout, %edi
call [... the insertion operator for char, long thing... ]
movl $0, %eax
leave
ret
.. for main. I divided the code into three subsections.
The function prologue consists of the first three operations:
Base pointer is pushed onto the stack.
The stack pointer is saved in the base pointer
The stack pointer is subtracted to make room for local variables.
Then cin is moved into the EDI register2 and get is called; The return value is in EAX.
So far so good. Now the interesting thing happens:
The low-order byte of EAX, designated by the 8-bit register AL, is taken and stored in the byte right after the base pointer: That is -1(%rbp), the offset of the base pointer is -1. This byte is our variable c. The offset is negative because the stack grows downwards on x86. The next operation stores c in EAX: EAX is moved to ESI, cout is moved to EDI and then the insertion operator is called with cout and c being the arguments.
Finally,
The return value of main is stored in EAX: 0. That is because of the implicit return statement.
You might also see xorl rax rax instead of movl.
leave and return to the call site. leave is abbreviating this epilogue and implicitly
Replaces the stack pointer with the base pointer and
Pops the base pointer.
After this operation and ret have been performed, the frame has effectively been popped, although the caller still has to clean up the arguments as we're using the cdecl calling convention. Other conventions, e.g. stdcall, require the callee to tidy up, e.g. by passing the amount of bytes to ret.
Frame Pointer Omission
It is also possible not to use offsets from the base/frame pointer but from the stack pointer (ESB) instead. This makes the EBP-register that would otherwise contain the frame pointer value available for arbitrary use - but it can make debugging impossible on some machines, and will be implicitly turned off for some functions. It is particularly useful when compiling for processors with only few registers, including x86.
This optimization is known as FPO (frame pointer omission) and set by -fomit-frame-pointer in GCC and -Oy in Clang; note that it is implicitly triggered by every optimization level > 0 if and only if debugging is still possible, since it doesn't have any costs apart from that.
For further information see here and here.
1 As pointed out in the comments, the frame pointer is presumably meant to point to the address after the return address.
2 Note that the registers that start with R are the 64-bit counterparts of the ones that start with E. EAX designates the four low-order bytes of RAX. I used the names of the 32-bit registers for clarity.
Because obviously, the next thing we need is to work with a and b but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot because it needs c and d in the next line.
In short:
There is no need to pop the arguments. The arguments passed by caller foo to function doSomething and the local variables in doSomething can all be referenced as an offset from the base pointer.
So,
When a function call is made, function's arguments are PUSHed on stack. These arguments are further referenced by base pointer.
When the function returns to its caller, the arguments of the returning function are POPed from the stack using LIFO method.
In detail:
The rule is that each function call results in a creation of a stack frame (with the minimum being the address to return to). So, if funcA calls funcB and funcB calls funcC, three stack frames are set up one on top of the another. When a function returns, its frame becomes invalid. A well-behaved function acts only on its own stack frame and does not trespass on another's. In another words the POPing is performed to the stack frame on the top (when returning from the function).
The stack in your question is setup by caller foo. When doSomething and doAnotherThing are called, then they setup their own stack. The figure may help you to understand this:
Note that, to access the arguments, the function body will have to traverse down (higher addresses) from the location where the return address is stored, and to access the local variables, the function body will have to traverse up the stack (lower addresses) relative to the location where the return address is stored. In fact, typical compiler generated code for the function will do exactly this. The compiler dedicates a register called EBP for this (Base Pointer). Another name for the same is frame pointer. The compiler typically, as the first thing for the function body, pushes the current EBP value on to the stack and sets the EBP to the current ESP. This means, once this is done, in any part of the function code, argument 1 is EBP+8 away (4 bytes for each of caller's EBP and the return address), argument 2 is EBP+12(decimal) away, local variables are EBP-4n away.
.
.
.
[ebp - 4] (1st local variable)
[ebp] (old ebp value)
[ebp + 4] (return address)
[ebp + 8] (1st argument)
[ebp + 12] (2nd argument)
[ebp + 16] (3rd function argument)
Take a look at the following C code for the formation of stack frame of the function:
void MyFunction(int x, int y, int z)
{
int a, int b, int c;
...
}
When caller call it
MyFunction(10, 5, 2);
the following code will be generated
^
| call _MyFunction ; Equivalent to:
| ; push eip + 2
| ; jmp _MyFunction
| push 2 ; Push first argument
| push 5 ; Push second argument
| push 10 ; Push third argument
and the assembly code for the function will be (set-up by callee before returning)
^
| _MyFunction:
| sub esp, 12 ; sizeof(a) + sizeof(b) + sizeof(c)
| ;x = [ebp + 8], y = [ebp + 12], z = [ebp + 16]
| ;a = [ebp - 4] = [esp + 8], b = [ebp - 8] = [esp + 4], c = [ebp - 12] = [esp]
| mov ebp, esp
| push ebp
References:
Function Call Conventions and the Stack.
Frame Pointer and Local Variables.
x86 Disassembly/Functions and Stack Frames.
Like others noted, there is no need to pop parameters, until they go out of scope.
I will paste some example from "Pointers and Memory" by Nick Parlante.
I think the situation is a bit more simple than you envisioned.
Here is code:
void X()
{
int a = 1;
int b = 2;
// T1
Y(a);
// T3
Y(b);
// T5
}
void Y(int p)
{
int q;
q = p + 2;
// T2 (first time through), T4 (second time through)
}
The points in time T1, T2, etc. are marked in
the code and the state of memory at that time is shown in the drawing:
Different processors and languages use a few different stack designs. Two traditional patterns on both the 8x86 and 68000 are called the Pascal calling convention and the C calling convention; each convention is handled the same way in both processors, except for the names of the registers. Each uses two registers to manage the stack and associated variables, called the stack pointer (SP or A7) and the frame pointer (BP or A6).
When calling subroutine using either convention, any parameters are be pushed on the stack before calling the routine. The routine's code then pushes the current value of the frame pointer onto the stack, copies the current value of the stack pointer to the frame pointer, and subtracts from the stack pointer the number of bytes used by local variables [if any]. Once that is done, even if additional data are pushed onto the stack, all local variables will be stored at variables with a constant negative displacement from the stack pointer, and all parameters that were pushed on the stack by the caller may be accessed at a constant positive displacement from the frame pointer.
The difference between the two conventions lies in the way they handle an exit from subroutine. In the C convention, the returning function copies the frame pointer to the stack pointer [restoring it to the value it had just after the old frame pointer was pushed], pops the old frame pointer value, and performs a return. Any parameters the caller had pushed on the stack before the call will remain there. In the Pascal convention, after popping the old frame pointer, the processor pops the function return address, adds to the stack pointer the number of bytes of parameters pushed by the caller, and then goes to the popped return address. On the original 68000 it was necessary to use a 3-instruction sequence to remove the caller's parameters; the 8x86 and all 680x0 processors after the original included a "ret N" [or 680x0 equivalent] instruction which would add N to the stack pointer when performing a return.
The Pascal convention has the advantage of saving a little bit of code on the caller side, since the caller doesn't have to update the stack pointer after a function call. It requires, however, that the called function know exactly how many bytes worth of parameters the caller is going to put on the stack; failing to push the proper number of parameters before calling a function which uses the Pascal convention is almost guaranteed to cause a crash. On the other hand, a little extra code within each called method saves code at every place where the method is called. For that reason, most of the original Macintosh toolbox routines used the Pascal calling convention.
The C calling convention has the advantage of allowing routines to accept a variable number of parameters, and of being robust even if a routine doesn't use all the parameters that are passed (the caller knows how many bytes worth of parameters it pushed, and will thus be able to clean them up). Further, it isn't necessary to perform stack cleanup after every function call: if a routine calls four functions in sequence, each of which uses four bytes worth of parameters, it may, instead of using an ADD SP,4 after each call, use one ADD SP,16 after the last call to clean up the parameters from all four calls.
Nowadays the described calling conventions are considered somewhat antiquated. Since compilers have gotten more efficient at register usage, it is common to have methods accept a few parameters in registers rather than requiring that all parameters be pushed on the stack; if a method can use registers to hold all the parameters and local variables, there's no need to use a frame pointer, and thus no need to save and restore the old one. Still, it's sometimes necessary to use the older calling conventions when calling libraries that were linked to use them.
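As a side note, variadic functions are a concrete example of why caller clean-up matters in the traditional stack-based C convention: the callee cannot know at compile time how many bytes of arguments were pushed, so only the caller can remove them. A minimal sketch in standard C (the stdarg mechanism itself works on any ABI; the stack clean-up argument applies to the stack-based convention described above):
#include <stdarg.h>

/* Adds up 'count' ints passed after the first argument. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

/* Usage: sum_ints(3, 10, 20, 30) returns 60; only the caller knows
   how many arguments it actually passed. */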
There are already some really good answers here. However, if you are still concerned about the LIFO behavior of the stack, think of it as a stack of frames, rather than a stack of variables. What I mean to suggest is that, although a function may access variables that are not on the top of the stack, it is still only operating on the item at the top of the stack: a single stack frame.
Of course, there are exceptions to this. The local variables of the entire call chain are still allocated and available. But they won't be accessed directly. Instead, they are passed by reference (or by pointer, which is really only different semantically). In this case a local variable of a stack frame much further down can be accessed. But even in this case, the currently executing function is still only operating on its own local data. It is accessing a reference stored in its own stack frame, which may be a reference to something on the heap, in static memory, or further down the stack.
This is the part of the stack abstraction that makes functions callable in any order, and allows recursion. The top stack frame is the only object that is directly accessed by the code. Anything else is accessed indirectly (through a pointer that lives in the top stack frame).
It might be instructive to look at the assembly of your little program, especially if you compile without optimization. I think you will see that all of the memory access in your function happens through an offset from the stack frame pointer, which is how the compiler writes the code for the function. In the case of a pass by reference, you would see indirect memory access instructions through a pointer that is stored at some offset from the stack frame pointer.
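A minimal sketch of that pass-by-reference case (hypothetical function names): compiling it with something like gcc -O0 -S should show the caller addressing local relative to its frame pointer, and the callee doing an indirect load and store through the pointer parameter held in its own frame:
static void bump(int *p)   /* p may point into the caller's frame */
{
    *p += 1;               /* indirect access: load the pointer from this
                              frame, then dereference it */
}

void demo(void)
{
    int local = 41;        /* lives in demo's own stack frame */
    bump(&local);          /* hand its address down the call chain */
}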
The call stack is not restricted the way an abstract LIFO stack data structure is. Behind the scenes, the computers we use are implementations of the random access machine architecture, so a and b can be accessed directly.
Behind the scenes, the machine does:
get "a": read the value of the fourth element below the top of the stack.
get "b": read the value of the third element below the top of the stack.
http://en.wikipedia.org/wiki/Random-access_machine
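In x86-like assembly that direct access might look something like this (a sketch with hypothetical offsets, assuming 4-byte stack slots):
mov eax, [esp+12]   ; read "a" directly, without popping anything above it
mov ebx, [esp+8]    ; read "b" the same way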
Here is a diagram I created for a call stack for a C++ program on Windows that uses the Windows x64 calling convention. It's more accurate and contemporary than the google image versions:
And corresponding to the exact structure of the above diagram, here is a debug session of notepad.exe x64 on Windows 7, where the first instruction of a function, 'current function' (because I forgot what function it is), is about to execute.
The low addresses and high addresses are swapped so the stack is climbing upwards in this diagram (it is a vertical flip of the first diagram, also note that the data is formatted to show quadwords and not bytes, so the little endianism cannot be seen). Black is the home space; blue is the return address, which is an offset into the caller function or label in the caller function to the instruction after the call; orange is the alignment; and pink is where rsp is pointing after the prologue of the function, or rather, before the call is made if you are using alloca. The homespace_for_the_next_function+return_address value is the smallest allowed frame on windows, and because the 16 byte rsp alignment right at the start of the called function must be maintained, it includes an 8 byte alignment as well, such that rsp pointing to the first byte after the return address will be aligned to 16 bytes (because rsp was guaranteed to be aligned to 16 bytes when the function was called and homespace+return_address = 40, which is not divisible by 16 so you need an extra 8 bytes to ensure the rsp will be aligned after the function makes a call). Because these functions do not require any stack locals (because they can be optimised into registers) or stack parameters/return values (as they fit in registers) and do not use any of the other fields, the stack frames in green are all alignment+homespace+return_address in size.
The red function lines outline what the callee function logically 'owns' + reads / modifies by value in the calling convention without needing a reference to it (it can modify a parameter passed on the stack that was too big to pass in a register on -Ofast), and is the classic conception of a stack frame. The green frames demarcate what results from the call and the allocation the called function makes: The first green frame shows what the RtlUserThreadStart actually allocates in the duration of the function call (from immediately before the call to executing the next call instruction) and goes from the first byte before the return address to the final byte allocated by the function prologue (or more if using alloca). RtlUserThreadStart allocates the return address itself as null, so you see a sub rsp, 48h and not sub rsp, 40h in the prologue, because there is no call to RtlUserThreadStart, it just begins execution at that rip at the base of the stack.
Stack space that is needed by the function is assigned in the function prologue by decrementing the stack pointer.
For example, take the following C++, and the MASM it compiles to (-O0).
typedef struct _struc {int a;} struc, pstruc;
int func(){return 1;}
int square(_struc num) {
int a=1;
int b=2;
int c=3;
return func();
}
_DATA SEGMENT
_DATA ENDS
int func(void) PROC ; func
mov eax, 1
ret 0
int func(void) ENDP ; func
a$ = 32 //4 bytes from rsp+32 to rsp+35
b$ = 36
c$ = 40
num$ = 64
//masm shows stack locals and params relative to the address of rsp; the rsp address
//is the rsp in the main body of the function after the prolog and before the epilog
int square(_struc) PROC ; square
$LN3:
mov DWORD PTR [rsp+8], ecx
sub rsp, 56 ; 00000038H
mov DWORD PTR a$[rsp], 1
mov DWORD PTR b$[rsp], 2
mov DWORD PTR c$[rsp], 3
call int func(void) ; func
add rsp, 56 ; 00000038H
ret 0
int square(_struc) ENDP ; square
As can be seen, 56 bytes are reserved, and the green stack frame will be 64 bytes in size when the call instruction allocates the 8 byte return address as well.
The 56 bytes consist of 12 bytes of locals, 32 bytes of home space, and 12 bytes of alignment.
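A quick way to check those numbers, assuming the rule that rsp must be 16-byte aligned at every call site:
    return address pushed by the caller's call :  8 bytes
    locals a, b, c                             : 12 bytes
    home space for the call to func()          : 32 bytes
8 + 12 + 32 = 52, and the next multiple of 16 is 64, so the prologue reserves 64 - 8 = 56 bytes (sub rsp, 56), leaving 56 - 44 = 12 bytes of alignment padding.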
All callee register saving and storing register parameters in the home space happens in the prologue before the prologue reserves (using sub rsp, x instruction) stack space needed by the main body of the function. The alignment is at the highest address of the space reserved by the sub rsp, x instruction, and the final local variable in the function is assigned at the next lower address after that (and within the assignment for that primitive data type itself it starts at the lowest address of that assignment and works towards the higher addresses, bytewise, because it is little endian), such that the first primitive type (array cell, variable etc.) in the function is at the top of the stack, although the locals can be allocated in any order. This is shown in the following diagram for a different random example code to the above, that does not call any functions (still using x64 Windows cc):
If you remove the call to func(), it only reserves 24 bytes, i.e. 12 bytes of locals and 12 bytes of alignment. The alignment is at the start of the frame. When a function pushes something to the stack or reserves space on the stack by decrementing the rsp, rsp needs to be aligned, regardless of whether it is going to call another function or not. If the allocation of stack space can be optimised out and no homespace+return_address is required because the function does not make a call, then there will be no alignment requirement, as rsp does not change. It also does not need to align if the stack will be aligned by 16 with just the locals (+ homespace+return_address if it makes a call) that it needs to allocate; essentially it rounds up the space it needs to allocate to a 16 byte boundary.
rbp is not used on the x64 Windows calling convention unless alloca is used.
On the gcc 32-bit cdecl and 64-bit System V calling conventions, rbp is used, and the new rbp points to the first byte after the old rbp (only if compiling using -O0, because it is saved to the stack on -O0; otherwise, rbp will point to the first byte after the return address). On these calling conventions, if compiling using -O0, it will, after the callee-saved registers, store register parameters on the stack, and this will be relative to rbp and part of the stack reservation done by the rsp decrement. Data within the stack reservation done by the rsp decrement is accessed relative to rbp rather than rsp, unlike the Windows x64 cc. On the Windows x64 calling convention, it stores parameters that were passed to it in registers to the homespace that was assigned for it if it is a varargs function or if compiling using -O0. If it is not a varargs function, then on -O1 it will not write them to the homespace, but the homespace will still be provided to it by the calling function; this means that it actually accesses those variables from the registers rather than storing them to the homespace location on the stack and reading them back, unlike -O0 (which saves them to the homespace and then accesses them through the stack and not the registers).
If a function call is placed in the function represented by the previous diagram, the stack will now look like this before the callee function's prologue starts (Windows x64 cc):
Orange indicates the part that the caller can freely arrange (arrays and structs remain contiguous of course, and work their way towards higher addresses, each element being little endian), so it can put the variables and the return value allocation in any order, and it passes a pointer for the return value allocation in rcx for the callee to write to when the return type of the function it is calling cannot be passed in rax. On -O0, if the return value cannot be passed in rax, there is also an anonymous variable created (as well as the return value space and as well as any variable it is assigned to, so there can be 3 copies of the struct). -Ofast can't optimise out the return value space because it is returned by value, but it optimises out the anonymous return variable if the return value is not used, or assigns it straight to the variable the return value is being assigned to without creating an anonymous variable, so -Ofast has 2 / 1 copies and -O0 has 3 / 2 copies (return value assigned to a variable / return value not assigned to a variable). Blue indicates the part the caller must provide in the exact order required by the callee's calling convention (the parameters must be in that order, such that the first stack parameter from left to right in the function signature is at the top of the stack, which is the same as how cdecl (a 32-bit cc) orders its stack parameters). The alignment for the callee can, however, be in any location, although I've only ever seen it between the locals and the callee-pushed registers.
If the function calls multiple functions, the call area is in the same place on the stack for all the different possible call sites in the function. This is because the prologue caters for the whole function, including all calls it makes, and the parameters and homespace for any called function are always at the end of the allocation made in the prologue.
It turns out that the Microsoft C/C++ calling convention only passes a struct in a register if it fits into one register; otherwise it copies the local / anonymous variable and passes a pointer to it in the first available register. On gcc C/C++, if the struct does not fit in the first 2 parameter registers then it's passed on the stack, and a pointer to it is not passed, because the callee knows where it is due to the calling convention.
Arrays are passed by reference regardless of their size. So if rcx is needed as the pointer to the return value allocation and the first parameter is an array, its pointer will be passed in rdx, and it will point to the local variable that is being passed. In this case it does not need to be copied to the stack as a parameter because it is not passed by value. The pointer is, however, passed on the stack when passing by reference if there are no registers available to pass the pointer in.
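A sketch of the struct cases just described, using hypothetical type and function names; comparing the assembly MSVC and gcc generate for calls to these (e.g. on godbolt.org) shows the different mechanisms, though the exact size cutoffs are ABI-specific:
typedef struct { int a; } Small;            /* fits in a single register          */
typedef struct { int a, b, c, d, e; } Big;  /* too big: MSVC makes a hidden copy
                                               and passes a pointer to it         */

int take_small(Small s) { return s.a; }
int take_big(Big b)     { return b.a + b.e; }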
Is there any performance difference when we access a memory location by using a pointer versus a double pointer?
If so, which one is faster?
There is no simple answer to it, as the answer might depend on the actual machine. If I remember correctly, some legacy machines (such as the PDP11) offered 'double pointer' access in a single instruction.
However, this is not the situation today. Accessing memory is not as simple as it looks and requires a lot of work, due to virtual memory. For this reason my guess is that double dereference should in fact be slower on most modern machines, since more work has to be done to translate two addresses from virtual to physical addresses and retrieve them, but that's just an educated guess.
Note, however, that the compiler might already optimize 'redundant' accesses for you.
To the best of my knowledge, however, there is no machine that has a faster 'double access' than 'single access', so we can say that single access is not worse than double access.
As a side note, I believe that in real-life programs the difference is negligible (compared to anything else done in the program), so unless it is in a very performance-sensitive loop, just do whatever is more readable. Also, the compiler might already optimize it for you if it can.
Assuming you are talking about something like
int a = 10;
int *aptr = &a;
int **aptrptr = &aptr;
Then the cost of
*aptr = 20;
Is one dereference. The address stored in aptr must first be retrieved, and then the value can be stored to that address.
The cost of
**aptrptr = 30;
Is two dereferences. The address stored in aptrptr must first be retrieved. Then the address stored at that address must be retrieved. Then the value can be stored to that address.
Is this what you were asking?
Therefore, to conclude: using a single pointer is faster, if that suits your needs.
Note, that if you access a pointer or double pointer in a loop, for example,
while(some condition)
*aptr = something;
or
while(some condition)
**aptrptr = something;
The compiler will likely optimize this so that the dereferencing is only done once, at the start of the loop, so the cost is only one extra address fetch rather than N, where N is the number of times the loop executes.
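In other words, the generated code ends up behaving roughly like this hand-hoisted version (a hypothetical function, and only valid if nothing inside the loop can change *aptrptr):
void store_repeatedly(int **aptrptr, int n, int value)
{
    int *p = *aptrptr;      /* the extra fetch happens once, before the loop */
    while (n-- > 0)
        *p = value;         /* a single dereference per iteration            */
}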
EDIT:
(1) As Amit correctly points out the "how" of pointer access is not explicitly a C thing... it does depend on the underlying architecture. If your machine supports a double dereference as a single instruction then there might not be a big difference. He is using the index deferred addressing mode of the PDP11 as an example. You might find out that such an instruction still chews up more cycles... consult the hardware documentation and look at the optimization that your C compiler is able to apply for your specific architecture.
The PDP11 architecture dates from the 1970s. As far as I know (if someone knows a modern architecture that can do this, please post!), most RISC architectures don't have such a double dereference and will probably need to do two fetches.
Therefore, to conclude: using a single pointer is probably faster in general, but with the caveat that specific architectures may handle this better than others, and compiler optimizations, as I discussed, could make the difference negligible... to be sure you just have to profile your code and read up about your architecture :)
Let's see it in this way:
int var = 5;
int *ptr_to_var = &var;
int **ptr_to_ptr = &ptr_to_var;
When the variable var is accessed, you need to:
1. Get the address of the variable.
2. Fetch its value from that address.
In the case of the pointer ptr_to_var, you need to:
1. Get the address of the pointer variable.
2. Fetch its value from that address (i.e., the address of the variable var).
3. Fetch the value at the address pointed to.
In the third case, the pointer to pointer to int variable ptr_to_ptr, you need to:
1. Get the address of the pointer-to-pointer variable.
2. Fetch its value from that address (i.e., the address of the pointer variable ptr_to_var).
3. Fetch its value again from the address obtained in the second step (i.e., the address of the variable var).
4. Fetch the value at the address pointed to.
So we can say that access through a pointer-to-pointer variable is slower than access through a pointer variable, which in turn is slower than normal variable access.
I got curious and set up the following scenario:
int v = 0;
int *pv = &v;
int **ppv = &pv;
I tried dereferencing the pointers and took a look at the disassembly, which showed the following:
int x;
x = *pv;
00B33C5B mov eax,dword ptr [pv]
00B33C5E mov ecx,dword ptr [eax]
00B33C60 mov dword ptr [x],ecx
x = **ppv;
00B33C63 mov eax,dword ptr [ppv]
00B33C66 mov ecx,dword ptr [eax]
00B33C68 mov edx,dword ptr [ecx]
00B33C6A mov dword ptr [x],edx
You can see that there is one additional mov instruction for the extra dereference, so my best guess is: double dereferencing is inevitably slower.
If the program counter points to the address of the next instruction to be executed, what do frame pointers do?
It's like a more stable version of the stack pointer
Storage for some local variables and parameters are generally allocated in stack frames that are automatically freed simply by popping the stack pointer back to its original level after a function call.
However, the stack pointer is frequently being adjusted in order to push arguments on to the stack for new call levels and at least once on entry to a method in order to allocate its own local variables. There are other more obscure reasons to adjust the stack pointer.
All of this adjusting complicates the use of offsets to get to the parameters, locals, and in some languages, intermediate lexical scopes. It is perhaps not too hard for the compiler to keep track but if the program is being debugged, then a debugger (human or program) must also keep track of the changing offset.
It is simpler, if technically an unnecessary overhead, to just allocate a register to point to the current frame. On x86 this is %ebp. On entry to a function it may have a fixed relationship to the stack pointer.
Besides debugging, this simplifies exception management and may even pay for itself by eliminating or optimizing some adjustments to the stack pointer.
You mentioned the program counter, so it's worth noting that generally the frame pointer is an entirely software construct, and not something that the hardware implements except to the extent that virtually every machine can do a register + offset addressing mode. Some machines like x86 do provide some hardware support in the form of addressing modes and macro instructions for creating and restoring frames. However, sometimes it is found that the core instructions are faster and the macro ops end up deprecated.
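One way to see the frame pointer's role is to compile a tiny function with and without gcc's -fomit-frame-pointer flag and diff the assembly. A sketch; the exact output varies by target and compiler version:
/*  gcc -O0 -S frame.c                        locals addressed relative to the
                                              frame pointer, e.g. [rbp-4]
    gcc -O0 -fomit-frame-pointer -S frame.c   locals addressed relative to the
                                              stack pointer, e.g. [rsp+...]   */
int add3(int a, int b, int c)
{
    int t = a + b;
    return t + c;
}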
This isn't really a C question since it's totally dependent on the compiler.
However, stack frames are a useful way to think about the current function and its parent function. Typically a frame pointer points to a specific location on the stack (for the given stack depth) from which you can locate parameters that were passed in as well as local variables.
Here's an example, let's say you call a function which takes one argument and returns the sum of all numbers between 1 and that argument. The C code would be something like:
unsigned int x = sumOf (7);
: :
unsigned int sumOf (unsigned int n) {
unsigned int total = 0;
while (n > 0) {
total += n;
n--;
}
return total;
}
In order to call this function, the caller will push 7 onto the stack then call the subroutine. The function itself sets up the frame pointer and allocates space for local variables, so you may see the code:
mov r1,7 ; fixed value
push r1 ; push it for subroutine
call sumOf ; then call
retLoc: mov [x],r1 ; move return value to variable
: :
sumOf: mov fp,sp ; Set frame pointer to known location
sub sp,4 ; Allocate space for total.
: :
At that point (following the sub sp,4), you have the following stack area:
+--------+
| n(7) |
+--------+
| retLoc |
+--------+
fp -> | total |
+--------+
sp -> | |
+--------+
and you can see that you can find passed-in parameters by using addresses 'above' the frame pointer and local variables 'below' the frame pointer.
The function can access the passed in value (7) by using [fp+8], the contents of memory at fp+8 (each of those cells is four bytes in this example). It can also access its own local variable (total) with [fp-0], the contents of memory at fp-0. I've used the fp-0 nomenclature even though subtracting zero has no effect since other locals will have corresponding lower addresses like fp-4, fp-8 and so on.
As you move up and down the stack, the frame pointer also moves and it's typical that the previous frame pointer is pushed onto the stack before calling a function, to give easy recovery when leaving that function. But, whereas the stack pointer may move wildly while within a function, the frame pointer typically stays constant so you can always find your relevant variables.
Good discussion here, with examples and all.
In short: the FP points to a fixed spot within the function's frame on the stack (and does not change during function execution), so all passed-arguments and the function's local ("auto") variables can be accessed by offsets from the FP (while the SP can change during a function's execution, and the PC definitely does;-).
Usually the return address (but sometimes just past the last argument, for example). The point is that the frame pointer is fixed during the life of a method, while the stack pointer can move during execution.
This is very implementation dependent (and more a machine concept, not really a language concept).
Lifted from a comment you provided to another answer:
Woh... Stack Pointer?... is that synonymous to Program Counter?
Read about the call stack. Basically the call stack stores data local to a current method (local variables, parameters to the method and return address to the caller). The stack pointer points to the top of that structure which is where new space is allocated (by moving the stack pointer "higher").
The frame pointer points to an area of memory in the current frame (current local function), typically it points to the return address of the current local function.
Since no one has responded to this yet, I'll give it a try. A frame pointer (if memory serves) is used, along with the stack pointer, to manage the stack. The stack is comprised of stack frames (sometimes called activation records). The stack pointer points to the top of the stack, while the frame pointer typically points to some fixed point in a frame's structure, such as the location of the return address. There's a more detailed description along with a picture on Wikipedia.