Different ways how gcc allocate memory for stack [duplicate] - c

This question already has answers here:
Why does the compiler reserve a little stack space but not the whole array size?
(2 answers)
Stack allocation, padding, and alignment
(6 answers)
Closed 2 years ago.
How does gcc decides how much memory allocate for stack and why does it not decrement %rsp anymore when I remove printf() (or any function call) from my main?
1. I noticed when I played around with a code sample: https://godbolt.org/z/fQqkNE that the 6th line in gcc assembly viewer subq $48, %rsp gets removed if I remove printf() from my C code on line 22. It looks like when I don’t make any function calls from within my main, then the %rsp does not get decremented, but data still gets allocated based on %rbp and offsets. I thought %rsp changes only when stack grows. My theory is that since it won’t make any other function calls, it knows that it won’t need to keep stack for other nonexistent functions. But shouldn’t %rsp still grow as data is getting saved?
2. When adding variables to my rect struct, I also noticed that it sometimes allocates memory in steps greater than what the added data type size was. What is the convention it follows when deciding how much memory to allocate to stack?
3. Is there an online tool that would take assembly code as input, and then draw an image of stack and tell me state of every register at any point of execution? Godbolt.org is a very good tool, I just wish it had these 2 extra features.
I'll paste the code below in case the link to godbolt stops working in the future:
#include <stdio.h>
#include <stdint.h>
struct rect {
int a;
int b;
int* c;
int d[2];
uint8_t f;
};
int main() {
int arr[2] = {2, 3};
struct rect Rect;
Rect.a = 10;
Rect.b = 20;
Rect.c = arr;
Rect.d[0] = Rect.a;
Rect.d[1] = Rect.b;
Rect.f =255;
printf("%d and %d", Rect.a, Rect.b);
return 0;
}
.LC0:
.string "%d and %d"
main:
pushq %rbp
movq %rsp, %rbp
subq $48, %rsp
movl $2, -8(%rbp)
movl $3, -4(%rbp)
movl $10, -48(%rbp)
movl $20, -44(%rbp)
leaq -8(%rbp), %rax
movq %rax, -40(%rbp)
movl -48(%rbp), %eax
movl %eax, -32(%rbp)
movl -44(%rbp), %eax
movl %eax, -28(%rbp)
movb $-1, -24(%rbp)
movl -44(%rbp), %edx
movl -48(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
P.S.: The book I follow uses AT&T syntax for teaching x86. Which is weird because it makes finding online tutorials much harder.

Related

Why does GCC allocate more stack memory than needed?

I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:
long call_proc() {
long x1 = 1;
int x2 = 2;
short x3 = 3;
char x4 = 4;
proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
return (x1+x2)*(x3-x4);
}
The book gives the assembly code generated by GCC:
long call_proc()
call_proc:
; Set up arguments to proc
subq $32, %rsp ; Allocate 32-byte stack frame
movq $1, 24(%rsp) ; Store 1 in &x1
movl $2, 20(%rsp) ; Store 2 in &x2
movw $3, 18(%rsp) ; Store 3 in &x3
movb $4, 17(%rsp) ; Store 4 in &x4
leaq 17(%rsp), %rax ; Create &x4
movq %rax, 8(%rsp) ; Store &x4 as argument 8
movl $4, (%rsp) ; Store 4 as argument 7
leaq 18(%rsp), %r9 ; Pass &x3 as argument 6
movl $3, %r8d ; Pass 3 as argument 5
leaq 20(%rsp), %rcx ; Pass &x2 as argument 4
movl $2, %edx ; Pass 2 as argument 3
leaq 24(%rsp), %rsi ; Pass &x1 as argument 2
movl $1, %edi ; Pass 1 as argument 1
; Call proc
call proc
; Retrieve changes to memory
movslq 20(%rsp), %rdx ; Get x2 and convert to long
addq 24(%rsp), %rdx ; Compute x1+x2
movswl 18(%rsp), %eax ; Get x3 and convert to int
movsbl 17(%rsp), %ecx ; Get x4 and convert to int
subl %ecx, %eax ; Compute x3-x4
cltq ; Convert to long
imulq %rdx, %rax ; Compute (x1+x2) * (x3-x4)
addq $32, %rsp ; Deallocate stack frame
ret ; Return
I can understand this code: the compiler allocates 32 bytes of space on the stack, of which the first 16 bytes hold the arguments passed to proc and the last 16 bytes hold 4 local variables.
Then I tested this code on GCC 11.2, using the optimization flag -Og, and got this assembly code:
call_proc():
subq $24, %rsp
movq $1, 8(%rsp)
movl $2, 4(%rsp)
movw $3, 2(%rsp)
movb $4, 1(%rsp)
leaq 1(%rsp), %rax
pushq %rax
pushq $4
leaq 18(%rsp), %r9
movl $3, %r8d
leaq 20(%rsp), %rcx
movl $2, %edx
leaq 24(%rsp), %rsi
movl $1, %edi
call proc(long, long*, int, int*, short, short*, char, char*)
movslq 20(%rsp), %rax
addq 24(%rsp), %rax
movswl 18(%rsp), %edx
movsbl 17(%rsp), %ecx
subl %ecx, %edx
movslq %edx, %rdx
imulq %rdx, %rax
addq $40, %rsp
ret
I noticed that gcc first allocated 24 bytes for 4 local variables. Then it uses pushq to add 2 arguments to the stack, so the final code uses addq $40, %rsp to free stack space.
Compared to the code in the book, GCC allocates 8 more bytes of space here, and it doesn't seem to use the extra space. Why does it need the extra space?
(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)
GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc: the stack pointer must be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8). Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?
What's strange is that the code in the book doesn't do that; the code as shown would violate the ABI and, if proc actually relies on proper stack alignment (e.g. using aligned SSE2 instructions), it may crash.
So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.
Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args, the former of which changes the ABI to maintain only 2^3 byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop so that option isn't the default anymore; push for stack args saves a bit of code size.
One difference from the book's asm is a movl $0, %eax before the call, because there's no prototype so the caller has to assume it might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order as whatever older GCC version the book used, except for choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (sign-extend with RAX).
CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the author's mistake / choice to use actual compiler output with weird options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.
Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.

Why asm generated by gcc mov twice?

Suppose I have the following C code:
#include
int main()
{
int x = 11;
int y = x + 3;
printf("%d\n", x);
return 0;
}
Then I compile it into asm using gcc, I get this(with some flag removed):
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $11, -4(%rbp)
movl -4(%rbp), %eax
addl $3, %eax
movl %eax, -8(%rbp)
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
My problem is why it is movl -4(%rbp), %eax followed by movl %eax, %esi, rather than a simple movl -4(%rbp), %esi(which works well according to my experiment)?
You probably did not enable optimizations.
Without optimization the compiler will produce code like this. For one it does not allocate data to registers, but on the stack. This means that when you operate on variables they will first be transferred to a register and then operated on.
So given that x lives is allocated in -4(%rbp) and this is what the code appears as if you translate it directly without optimization. First you move 11 to the storage of x. This means:
movl $11, -4(%rbp)
done with the first statement. The next statement is to evaluate x+3 and place in the storage of y (which is -8(%rbp), this is done without regard of the previous generated code:
movl -4(%rbp), %eax
addl $3, %eax
movl %eax, -8(%rbp)
done with the second statement. By the way that is divided into two parts: evaluation of x+3 and the storage of the result. Then the compiler continues to generate code for the printf statement, again without taking earlier statements into account.
If you on the other hand would enable optimization the compiler does a number of smart and to humans obvious things. One thing is that it allows variables to be allocated to registers, or at least keep track on where one can find the value of the variable. In this case the compiler would for example know in the second statement that x is not only stored at -4(%ebp) it will also know that it is stored in $11 (yes it nows it's actual value). It can then use this to add 3 to it which means it knows the result to be 14 (but it's smarter that that - it has also seen that you didn't use that variable so it skips that statement entirely). Next statement is the printf statement and here it can use the fact that it knows x to be 11 and pass that directly to printf. By the way it also realizes that it doesn't get to use the storage of x at -4(%ebp). Finally it may know what printf does (since you included stdio.h) so can analyze the format string and do the conversion at compile time to replace the printf statement to a call that directly writes 14 to standard out.

Can't set stack boundary gcc

My c code:
#include <stdio.h>
foo()
{
char buffer[8];
}
main()
{
foo();
return 0;
}
I compile it using gcc -ggdb -mpreferred-stack-boundary=2 -o bar bar.c
When I load it using GDB ./bar I see that inside the foo function the code is:
sub $0x0c,$esp
Why is this happening?
I want to buffer to take 8 bytes in the stack so it should be sub $0x8,$esp!
Why can't I set stack boundary to 4 bytes?
Help!
I can't reproduce exactly what you are seeing, but on my 4.8.2 version of gcc, the option does affect the amount of stack used with this code (make sure "buffer" is used to avoid it being optimised away, and fix the warnings for no return type/argument types):
#include <stdio.h>
void foo(void)
{
char buffer[8];
buffer[0] = 'a';
buffer[1] = '\n';
buffer[2] = 0;
printf("my first program! %s\n", buffer);
}
int main()
{
foo();
return 0;
}
Compiled with -mpreferred-stack-boundary=2 and -mpreferred-stack-boundary=4, and the difference between the generated assembler is notable:
$ diff -u stb-2.s stb-4.s
--- stb-2.s 2014-04-10 09:00:39.546038191 +0100
+++ stb-4.s 2014-04-10 09:00:58.895108979 +0100
## -15,11 +15,11 ##
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
- subl $16, %esp
- movb $97, -8(%ebp)
- movb $10, -7(%ebp)
- movb $0, -6(%ebp)
- leal -8(%ebp), %eax
+ subl $40, %esp
+ movb $97, -16(%ebp)
+ movb $10, -15(%ebp)
+ movb $0, -14(%ebp)
+ leal -16(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC0, (%esp)
.LEHB0:
## -67,9 +67,10 ##
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
+ andl $-16, %esp
call _Z3foov
movl $0, %eax
- popl %ebp
+ leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
So, at least in gcc 4.8.2. for x86-32, the option has an effect.
Of course, the default according to the docs is -mpreferred-stack-boundary=2, so maybe that's why you can't see any difference from "without" (Although in my experiments, it seems that it's -mpreferred-stack-boundary=4). [Moment passes] Ah, the default has been changed over time, so the 4.4.2 docs online says 2, my info gcc for 4.8.2 says 4, which explains the difference.
As to why your code is allocating twelve bytes of stack-space - look at how printf is called:
movl $.LC0, (%esp)
call printf
If the compiler can, it will pre-allocate argument space for function calls at the start of the function, rather than use push $.LC0 as it would be in this case. It's not much difference, but it saves at least one instruction for cleanup at the other side of printf (and it makes it MUCH easier to deal with stack-relative offsets within the produced code, since the compiler doesn't have to keep track of where the current stack-pointer is - it's always at a constant place after the prologue code at the beginning of the function, all the way to the end of the function). Since the space is ultimately required anyway, there's no point in "saving 4 bytes".

Why doesn't the compiler allocate and deallocate local var with "sub*" and "add*" on the stack?

According to some textbooks, the compiler will use sub* to allocate memory for local variables.
For example, I write a Hello World program:
int main()
{
puts("hello world");
return 0;
}
I guess this will be compiled to some assembly code on the 64 bit OS:
subq $8, %rsp
movq $.LC0, (%rsp)
calq puts
addq $8, %rsp
The subq allocates 8 byte memory (size of a point) for the argument and the addq deallocates it.
But when I input gcc -S hello.c (I use the llvm-gcc on Mac OS X 10.8), I get some assembly code.
.section __TEXT,__text,regular,pure_instructions
.globl _main
.align 4, 0x90
_main:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $16, %rsp
Ltmp2:
xorb %al, %al
leaq L_.str(%rip), %rcx
movq %rcx, %rdi
callq _puts
movl $0, -8(%rbp)
movl -8(%rbp), %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %eax
addq $16, %rsp
popq %rbp
ret
.......
L_.str:
.asciz "hello world!"
Around this callq without any addq and subq. Why? And what is the function of addq $16, %rsp?
Thanks for any input.
You don't have any local variables in your main(). All you may have in it is a pseudo-variable for the parameter passed to puts(), the address of the "hello world" string.
According to your last disassembly, the calling conventions appear to be such that the first parameter to puts() is passed in the rdi register and not on the stack, which is why there isn't any stack space allocated for this parameter.
However, since you're compiling your program with optimization disabled, you may encounter some unnecessary stack space allocations and reads and writes to and from that space.
This code illustrates it:
subq $16, %rsp ; allocate some space
...
movl $0, -8(%rbp) ; write to it
movl -8(%rbp), %eax ; read back from it
movl %eax, -4(%rbp) ; write to it
movl -4(%rbp), %eax ; read back from it
addq $16, %rsp
Those four mov instructions are equivalent to just one simple movl $0, %eax, no memory is needed to do that.
If you add an optimization switch like -O2 in your compile command, you'll see more meaningful code in the disassembly.
Also note that some space allocations may be needed solely for the purpose of keeping the stack pointer aligned, which improves performance or avoids issues with misaligned memory accesses (you could get the #AC exception on misaligned accesses if it's enabled).
The above code shows it too. See, those four mov instructions only use 8 bytes of memory, while the add and sub instructions grow and shrink the stack by 16.

Decoding equivalent assembly code of C code

Wanting to see the output of the compiler (in assembly) for some C code, I wrote a simple program in C and generated its assembly file using gcc.
The code is this:
#include <stdio.h>
int main()
{
int i = 0;
if ( i == 0 )
{
printf("testing\n");
}
return 0;
}
The generated assembly for it is here (only the main function):
_main:
pushl %ebpz
movl %esp, %ebp
subl $24, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
movl $0, %eax
leave
ret
I am at an absolute loss to correlate the C code and assembly code. All that the code has to do is store 0 in a register and compare it with a constant 0 and take suitable action. But what is going on in the assembly?
Since main is special you can often get better results by doing this type of thing in another function (preferably in it's own file with no main). For example:
void foo(int x) {
if (x == 0) {
printf("testing\n");
}
}
would probably be much more clear as assembly. Doing this would also allow you to compile with optimizations and still observe the conditional behavior. If you were to compile your original program with any optimization level above 0 it would probably do away with the comparison since the compiler could go ahead and calculate the result of that. With this code part of the comparison is hidden from the compiler (in the parameter x) so the compiler can't do this optimization.
What the extra stuff actually is
_main:
pushl %ebpz
movl %esp, %ebp
subl $24, %esp
andl $-16, %esp
This is setting up a stack frame for the current function. In x86 a stack frame is the area between the stack pointer's value (SP, ESP, or RSP for 16, 32, or 64 bit) and the base pointer's value (BP, EBP, or RBP). This is supposedly where local variables live, but not really, and explicit stack frames are optional in most cases. The use of alloca and/or variable length arrays would require their use, though.
This particular stack frame construction is different than for non-main functions because it also makes sure that the stack is 16 byte aligned. The subtraction from ESP increases the stack size by more than enough to hold local variables and the andl effectively subtracts from 0 to 15 from it, making it 16 byte aligned. This alignment seems excessive except that it would force the stack to also start out cache aligned as well as word aligned.
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
I don't know what all this does. alloca increases the stack frame size by altering the value of the stack pointer.
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
movl $0, %eax
I think you know what this does. If not, the movl just befrore the call is moving the address of your string into the top location of the stack so that it may be retrived by printf. It must be passed on the stack so that printf can use it's address to infer the addresses of printf's other arguments (if any, which there aren't in this case).
leave
This instruction removes the stack frame talked about earlier. It is essentially movl %ebp, %esp followed by popl %ebp. There is also an enter instruction which can be used to construct stack frames, but gcc didn't use it. When stack frames aren't explicitly used, EBP may be used as a general puropose register and instead of leave the compiler would just add the stack frame size to the stack pointer, which would decrease the stack size by the frame size.
ret
I don't need to explain this.
When you compile with optimizations
I'm sure you will recompile all fo this with different optimization levels, so I will point out something that may happen that you will probably find odd. I have observed gcc replacing printf and fprintf with puts and fputs, respectively, when the format string did not contain any % and there were no additional parameters passed. This is because (for many reasons) it is much cheaper to call puts and fputs and in the end you still get what you wanted printed.
Don't worry about the preamble/postamble - the part you're interested in is:
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
It should be pretty self-evident as to how this correlates with the original C code.
The first part is some initialization code, which does not make any sense in the case of your simple example. This code would be removed with an optimization flag.
The last part can be mapped to C code:
movl $0, -4(%ebp) // put 0 into variable i (located at -4(%ebp))
cmpl $0, -4(%ebp) // compare variable i with value 0
jne L2 // if they are not equal, skip to after the printf call
movl $LC0, (%esp) // put the address of "testing\n" at the top of the stack
call _printf // do call printf
L2:
movl $0, %eax // return 0 (calling convention: %eax has the return code)
Well, much of it is the overhead associated with the function. main() is just a function like any other, so it has to store the return address on the stack at the start, set up the return value at the end, etc.
I would recommend using GCC to generate mixed source code and assembler which will show you the assembler generated for each sourc eline.
If you want to see the C code together with the assembly it was converted to, use a command line like this:
gcc -c -g -Wa,-a,-ad [other GCC options] foo.c > foo.lst
See http://www.delorie.com/djgpp/v2faq/faq8_20.html
On linux, just use gcc. On Windows down load Cygwin http://www.cygwin.com/
Edit - see also this question Using GCC to produce readable assembly?
and http://oprofile.sourceforge.net/doc/opannotate.html
You need some knowledge about Assembly Language to understand assembly garneted by C compiler.
This tutorial might be helpful
See here more information. You can generate the assembly code with C comments for better understanding.
gcc -g -Wa,-adhls your_c_file.c > you_asm_file.s
This should help you a little.

Resources