My c code:
#include <stdio.h>
foo()
{
char buffer[8];
}
main()
{
foo();
return 0;
}
I compile it using gcc -ggdb -mpreferred-stack-boundary=2 -o bar bar.c
When I load it using GDB ./bar I see that inside the foo function the code is:
sub $0x0c,$esp
Why is this happening?
I want to buffer to take 8 bytes in the stack so it should be sub $0x8,$esp!
Why can't I set stack boundary to 4 bytes?
Help!
I can't reproduce exactly what you are seeing, but on my 4.8.2 version of gcc, the option does affect the amount of stack used with this code (make sure "buffer" is used to avoid it being optimised away, and fix the warnings for no return type/argument types):
#include <stdio.h>
void foo(void)
{
char buffer[8];
buffer[0] = 'a';
buffer[1] = '\n';
buffer[2] = 0;
printf("my first program! %s\n", buffer);
}
int main()
{
foo();
return 0;
}
Compiled with -mpreferred-stack-boundary=2 and -mpreferred-stack-boundary=4, and the difference between the generated assembler is notable:
$ diff -u stb-2.s stb-4.s
--- stb-2.s 2014-04-10 09:00:39.546038191 +0100
+++ stb-4.s 2014-04-10 09:00:58.895108979 +0100
## -15,11 +15,11 ##
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
- subl $16, %esp
- movb $97, -8(%ebp)
- movb $10, -7(%ebp)
- movb $0, -6(%ebp)
- leal -8(%ebp), %eax
+ subl $40, %esp
+ movb $97, -16(%ebp)
+ movb $10, -15(%ebp)
+ movb $0, -14(%ebp)
+ leal -16(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC0, (%esp)
.LEHB0:
## -67,9 +67,10 ##
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
+ andl $-16, %esp
call _Z3foov
movl $0, %eax
- popl %ebp
+ leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
So, at least in gcc 4.8.2. for x86-32, the option has an effect.
Of course, the default according to the docs is -mpreferred-stack-boundary=2, so maybe that's why you can't see any difference from "without" (Although in my experiments, it seems that it's -mpreferred-stack-boundary=4). [Moment passes] Ah, the default has been changed over time, so the 4.4.2 docs online says 2, my info gcc for 4.8.2 says 4, which explains the difference.
As to why your code is allocating twelve bytes of stack-space - look at how printf is called:
movl $.LC0, (%esp)
call printf
If the compiler can, it will pre-allocate argument space for function calls at the start of the function, rather than use push $.LC0 as it would be in this case. It's not much difference, but it saves at least one instruction for cleanup at the other side of printf (and it makes it MUCH easier to deal with stack-relative offsets within the produced code, since the compiler doesn't have to keep track of where the current stack-pointer is - it's always at a constant place after the prologue code at the beginning of the function, all the way to the end of the function). Since the space is ultimately required anyway, there's no point in "saving 4 bytes".
Related
This question already has answers here:
Why does the compiler reserve a little stack space but not the whole array size?
(2 answers)
Stack allocation, padding, and alignment
(6 answers)
Closed 2 years ago.
How does gcc decides how much memory allocate for stack and why does it not decrement %rsp anymore when I remove printf() (or any function call) from my main?
1. I noticed when I played around with a code sample: https://godbolt.org/z/fQqkNE that the 6th line in gcc assembly viewer subq $48, %rsp gets removed if I remove printf() from my C code on line 22. It looks like when I don’t make any function calls from within my main, then the %rsp does not get decremented, but data still gets allocated based on %rbp and offsets. I thought %rsp changes only when stack grows. My theory is that since it won’t make any other function calls, it knows that it won’t need to keep stack for other nonexistent functions. But shouldn’t %rsp still grow as data is getting saved?
2. When adding variables to my rect struct, I also noticed that it sometimes allocates memory in steps greater than what the added data type size was. What is the convention it follows when deciding how much memory to allocate to stack?
3. Is there an online tool that would take assembly code as input, and then draw an image of stack and tell me state of every register at any point of execution? Godbolt.org is a very good tool, I just wish it had these 2 extra features.
I'll paste the code below in case the link to godbolt stops working in the future:
#include <stdio.h>
#include <stdint.h>
struct rect {
int a;
int b;
int* c;
int d[2];
uint8_t f;
};
int main() {
int arr[2] = {2, 3};
struct rect Rect;
Rect.a = 10;
Rect.b = 20;
Rect.c = arr;
Rect.d[0] = Rect.a;
Rect.d[1] = Rect.b;
Rect.f =255;
printf("%d and %d", Rect.a, Rect.b);
return 0;
}
.LC0:
.string "%d and %d"
main:
pushq %rbp
movq %rsp, %rbp
subq $48, %rsp
movl $2, -8(%rbp)
movl $3, -4(%rbp)
movl $10, -48(%rbp)
movl $20, -44(%rbp)
leaq -8(%rbp), %rax
movq %rax, -40(%rbp)
movl -48(%rbp), %eax
movl %eax, -32(%rbp)
movl -44(%rbp), %eax
movl %eax, -28(%rbp)
movb $-1, -24(%rbp)
movl -44(%rbp), %edx
movl -48(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
P.S.: The book I follow uses AT&T syntax for teaching x86. Which is weird because it makes finding online tutorials much harder.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I know that 'static' is about scope, but I've got a question: what function/variable will be faster to access: a 'static' one or not?
Which code will be faster:
#include <stdio.h>
int main(){
int count;
for (count=0;count<1000;++count)
printf("%d\n",count);
return 0;
}
or
#include <stdio.h>
int main(){
static int count;
for (count=0;count<1000;++count)
printf("%d\n",count);
return 0;
}
In my code I'm working with VERY big numbers (with unsigned long long) and I'm accessing and increasing them about 4.000.000 times a second. This code is not the one I'm working on, it's just an example.
As a sign of good will, I have made up a program that we can actually reason about.
#include <stdint.h>
#include <stdio.h>
int
main()
{
static const uint64_t a = 1664525UL;
static const uint64_t c = 1013904223UL;
static const uint64_t m = (1UL << 31);
static uint32_t x = 1;
register unsigned i;
for (i = 0; i < 1000000000U; ++i)
x = (a * x + c) % m;
printf("%d\n", x);
return 0;
}
It will simply compute the one billionth element of a pseudo random sequence returned by a simple linear congruential generator. We have to do something more difficult than simply increment a counter or the compiler will optimize the entire loop out of existence.
Here is how I have compiled (GCC 4.9.1 on x86_64 GNU/Linux):
$ gcc -o non-static -Dstatic= -Wall -O3 main.c
$ gcc -o static -Wall -O3 main.c
To get the version without static, we simply #define it away on the compiler command line.
Running both programs took 2.36 seconds meaning there is no measurable performance difference.
To find out why, I like to look at the assembly code.
$ gcc -S -o non-static.s -Dstatic= -Wall -O3 main.c
$ gcc -S -o static.s -Wall -O3 main.c
We find that GCC generated identical machine code for the inner loop and moved the special treatment for the static variables out of the loop, which is what we should have expected from a good compiler.
Relevant code with static:
main:
.LFB11:
.cfi_startproc
movl x.2266(%rip), %esi
movl $1000000000, %eax
.p2align 4,,10
.p2align 3
.L2: # BEGIN LOOP
imull $1664525, %esi, %esi
addl $1013904223, %esi
andl $2147483647, %esi
subl $1, %eax
jne .L2 # END LOOP
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi
xorl %eax, %eax
movl %esi, x.2266(%rip)
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
and without:
main:
.LFB11:
.cfi_startproc
movl $1000000000, %eax
movl $1, %esi
.p2align 4,,10
.p2align 3
.L2: # BEGIN LOOP
imull $1664525, %esi, %esi
addl $1013904223, %esi
andl $2147483647, %esi
subl $1, %eax
jne .L2 # END LOOP
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
This just re-emphasizes what many have tried to express in their comments: We need actual code to reason about performance and we should really benchmark it while doing so.
Also, you shouldn't worry too much about such things and trust your compiler most of the time. Focus on writing readable and maintainable code and only fiddle with the dirty details if you have evidence that it is necessary to achieve the required performance. In your particular example, I cannot see any valid reason to declare the local variables static. It disturbs me as a reader and should not be done.
C code:
#include <stdio.h>
main() {
int i;
for (i = 0; i < 10; i++) {
printf("%s\n", "hello");
}
}
ASM:
.file "simple_loop.c"
.section .rodata
.LC0:
.string "hello"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushl %ebp # push ebp onto stack
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp # setup base pointer or stack ?
.cfi_def_cfa_register 5
andl $-16, %esp # ?
subl $32, %esp # ?
movl $0, 28(%esp) # i = 0
jmp .L2
.L3:
movl $.LC0, (%esp) # point stack pointer to "hello" ?
call puts # print "hello"
addl $1, 28(%esp) # i++
.L2:
cmpl $9, 28(%esp) # if i < 9
jle .L3 # goto l3
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
So I am trying to improve my understanding of x86 assembly code. For the above code, I marked off what I believe I understand. As for the question marked content, could someone share some light? Also, if any of my comments are off, please let me know.
andl $-16, %esp # ?
subl $32, %esp # ?
This reserves some space on the stack. First, the andl instruction rounds the %esp register down to the next lowest multiple of 16 bytes (exercise: find out what the binary value of -16 is to see why). Then, the subl instruction moves the stack pointer down a bit further (32 bytes), reserving some more space (which it will use next). I suspect this rounding is done so that access through the %esp register is slightly more efficient (but you'd have to inspect your processor data sheets to figure out why).
movl $.LC0, (%esp) # point stack pointer to "hello" ?
This places the address of the string "hello" onto the stack (this instruction doesn't change the value of the %esp register itself). Apparently your compiler considers it more efficient to move data onto the stack directly, rather than to use the push instruction.
Let's say I have following code:
int f() {
int foo = 0;
int bar = 0;
foo++;
bar++;
// many more repeated operations in actual code
foo++;
bar++;
return foo+bar;
}
Abstracting repeated code into a separate functions, we get
static void change_locals(int *foo_p, int *bar_p) {
*foo_p++;
*bar_p++;
}
int f() {
int foo = 0;
int bar = 0;
change_locals(&foo, &bar);
change_locals(&foo, &bar);
return foo+bar;
}
I'd expect the compiler to inline the change_locals function, and optimize things like *(&foo)++ in the resulting code to foo++.
If I remember correctly, taking address of a local variable usually prevents some optimizations (e.g. it can't be stored in registers), but does this apply when no pointer arithmetic is done on the address and it doesn't escape from the function? With a larger change_locals, would it make a difference if it was declared inline (__inline in MSVC)?
I am particularly interested in behavior of GCC and MSVC compilers.
inline (and all its cousins _inline, __inline...) are ignored by gcc. It might inline anything it decides is an advantage, except at lower optimization levels.
The code procedure by gcc -O3 for x86 is:
.text
.p2align 4,,15
.globl f
.type f, #function
f:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
popl %ebp
ret
.ident "GCC: (GNU) 4.4.4 20100630 (Red Hat 4.4.4-10)"
It returns zero because *ptr++ doesn't do what you think. Correcting the increments to:
(*foo_p)++;
(*bar_p)++;
results in
.text
.p2align 4,,15
.globl f
.type f, #function
f:
pushl %ebp
movl $4, %eax
movl %esp, %ebp
popl %ebp
ret
So it directly returns 4. Not only did it inline them, but it optimized the calculations away.
Vc++ from vs 2005 provides similar code, but it also created unreachable code for change_locals(). I used the command line
/O2 /FD /EHsc /MD /FA /c /TP
If I remember correctly, taking
address of a local variable usually
prevents some optimizations (e.g. it
can't be stored in registers), but
does this apply when no pointer
arithmetic is done on the address and
it doesn't escape from the function?
The general answer is that if the compiler can ensure that no one else will change a value behind its back, it can safely be placed in a register.
Think of this as though the compiler first performs inlining, then transforms all those *&foo (which results from the inlining) to simply foo before deciding if they should be placed in registers on in memory on the stack.
With a larger change_locals, would it
make a difference if it was declared
inline (__inline in MSVC)?
Again, generally speaking, whether or not a compiler decides to inline something is done using heuristics. If you explicitly specify that you want something to be inlines, the compiler will probably weight this into its decision process.
I've tested gcc 4.5, MSC and IntelC using this:
#include <stdio.h>
void change_locals(int *foo_p, int *bar_p) {
(*foo_p)++;
(*bar_p)++;
}
int main() {
int foo = printf("");
int bar = printf("");
change_locals(&foo, &bar);
change_locals(&foo, &bar);
printf( "%i\n", foo+bar );
}
And all of them did inline/optimize the foo+bar value, but also did
generate the code for change_locals() (but didn't use it).
Unfortunately, there's still no guarantee that they'd do the same for
any kind of such a "local function".
gcc:
__Z13change_localsPiS_:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
incl (%edx)
incl (%eax)
leave
ret
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
pushl %ebx
subl $28, %esp
call ___main
movl $LC0, (%esp)
call _printf
movl %eax, %ebx
movl $LC0, (%esp)
call _printf
leal 4(%ebx,%eax), %eax
movl %eax, 4(%esp)
movl $LC1, (%esp)
call _printf
xorl %eax, %eax
addl $28, %esp
popl %ebx
leave
ret
Is there any substantial optimization when omitting the frame pointer?
If I have understood correctly by reading this page, -fomit-frame-pointer is used when we want to avoid saving, setting up and restoring frame pointers.
Is this done only for each function call and if so, is it really worth to avoid a few instructions for every function?
Isn't it trivial for an optimization.
What are the actual implications of using this option apart from the debugging limitations?
I compiled the following C code with and without this option
int main(void)
{
int i;
i = myf(1, 2);
}
int myf(int a, int b)
{
return a + b;
}
,
# gcc -S -fomit-frame-pointer code.c -o withoutfp.s
# gcc -S code.c -o withfp.s
.
diff -u 'ing the two files revealed the following assembly code:
--- withfp.s 2009-12-22 00:03:59.000000000 +0000
+++ withoutfp.s 2009-12-22 00:04:17.000000000 +0000
## -7,17 +7,14 ##
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
- pushl %ebp
- movl %esp, %ebp
pushl %ecx
- subl $36, %esp
+ subl $24, %esp
movl $2, 4(%esp)
movl $1, (%esp)
call myf
- movl %eax, -8(%ebp)
- addl $36, %esp
+ movl %eax, 20(%esp)
+ addl $24, %esp
popl %ecx
- popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
## -25,11 +22,8 ##
.globl myf
.type myf, #function
myf:
- pushl %ebp
- movl %esp, %ebp
- movl 12(%ebp), %eax
- addl 8(%ebp), %eax
- popl %ebp
+ movl 8(%esp), %eax
+ addl 4(%esp), %eax
ret
.size myf, .-myf
.ident "GCC: (GNU) 4.2.1 20070719
Could someone please shed light on the key points of the above code where -fomit-frame-pointer did actually make the difference?
Edit: objdump's output replaced with gcc -S's
-fomit-frame-pointer allows one extra register to be available for general-purpose use. I would assume this is really only a big deal on 32-bit x86, which is a bit starved for registers.*
One would expect to see EBP no longer saved and adjusted on every function call, and probably some additional use of EBP in normal code, and fewer stack operations on occasions where EBP gets used as a general-purpose register.
Your code is far too simple to see any benefit from this sort of optimization-- you're not using enough registers. Also, you haven't turned on the optimizer, which might be necessary to see some of these effects.
* ISA registers, not micro-architecture registers.
The only downside of omitting it is that debugging is much more difficult.
The major upside is that there is one extra general purpose register which can make a big difference on performance. Obviously this extra register is used only when needed (probably in your very simple function it isn't); in some functions it makes more difference than in others.
You can often get more meaningful assembly code from GCC by using the -S argument to output the assembly:
$ gcc code.c -S -o withfp.s
$ gcc code.c -S -o withoutfp.s -fomit-frame-pointer
$ diff -u withfp.s withoutfp.s
GCC doesn't care about the address, so we can compare the actual instructions generated directly. For your leaf function, this gives:
myf:
- pushl %ebp
- movl %esp, %ebp
- movl 12(%ebp), %eax
- addl 8(%ebp), %eax
- popl %ebp
+ movl 8(%esp), %eax
+ addl 4(%esp), %eax
ret
GCC doesn't generate the code to push the frame pointer onto the stack, and this changes the relative address of the arguments passed to the function on the stack.
Profile your program to see if there is a significant difference.
Next, profile your development process. Is debugging easier or more difficult? Do you spend more time developing or less?
Optimizations without profiling are a waste of time and money.