Is there any substantial optimization when omitting the frame pointer?
If I have understood correctly by reading this page, -fomit-frame-pointer is used when we want to avoid saving, setting up and restoring frame pointers.
Is this done only for each function call and if so, is it really worth to avoid a few instructions for every function?
Isn't it trivial for an optimization.
What are the actual implications of using this option apart from the debugging limitations?
I compiled the following C code with and without this option
int main(void)
{
int i;
i = myf(1, 2);
}
int myf(int a, int b)
{
return a + b;
}
,
# gcc -S -fomit-frame-pointer code.c -o withoutfp.s
# gcc -S code.c -o withfp.s
.
diff -u 'ing the two files revealed the following assembly code:
--- withfp.s 2009-12-22 00:03:59.000000000 +0000
+++ withoutfp.s 2009-12-22 00:04:17.000000000 +0000
## -7,17 +7,14 ##
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
- pushl %ebp
- movl %esp, %ebp
pushl %ecx
- subl $36, %esp
+ subl $24, %esp
movl $2, 4(%esp)
movl $1, (%esp)
call myf
- movl %eax, -8(%ebp)
- addl $36, %esp
+ movl %eax, 20(%esp)
+ addl $24, %esp
popl %ecx
- popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
## -25,11 +22,8 ##
.globl myf
.type myf, #function
myf:
- pushl %ebp
- movl %esp, %ebp
- movl 12(%ebp), %eax
- addl 8(%ebp), %eax
- popl %ebp
+ movl 8(%esp), %eax
+ addl 4(%esp), %eax
ret
.size myf, .-myf
.ident "GCC: (GNU) 4.2.1 20070719
Could someone please shed light on the key points of the above code where -fomit-frame-pointer did actually make the difference?
Edit: objdump's output replaced with gcc -S's
-fomit-frame-pointer allows one extra register to be available for general-purpose use. I would assume this is really only a big deal on 32-bit x86, which is a bit starved for registers.*
One would expect to see EBP no longer saved and adjusted on every function call, and probably some additional use of EBP in normal code, and fewer stack operations on occasions where EBP gets used as a general-purpose register.
Your code is far too simple to see any benefit from this sort of optimization-- you're not using enough registers. Also, you haven't turned on the optimizer, which might be necessary to see some of these effects.
* ISA registers, not micro-architecture registers.
The only downside of omitting it is that debugging is much more difficult.
The major upside is that there is one extra general purpose register which can make a big difference on performance. Obviously this extra register is used only when needed (probably in your very simple function it isn't); in some functions it makes more difference than in others.
You can often get more meaningful assembly code from GCC by using the -S argument to output the assembly:
$ gcc code.c -S -o withfp.s
$ gcc code.c -S -o withoutfp.s -fomit-frame-pointer
$ diff -u withfp.s withoutfp.s
GCC doesn't care about the address, so we can compare the actual instructions generated directly. For your leaf function, this gives:
myf:
- pushl %ebp
- movl %esp, %ebp
- movl 12(%ebp), %eax
- addl 8(%ebp), %eax
- popl %ebp
+ movl 8(%esp), %eax
+ addl 4(%esp), %eax
ret
GCC doesn't generate the code to push the frame pointer onto the stack, and this changes the relative address of the arguments passed to the function on the stack.
Profile your program to see if there is a significant difference.
Next, profile your development process. Is debugging easier or more difficult? Do you spend more time developing or less?
Optimizations without profiling are a waste of time and money.
Related
I tried to compile and convert a very simple C program to assembly language.
I am using Ubuntu and the OS type is 64 bit.
This is the C Program.
void add();
int main() {
add();
return 0;
}
if i use gcc -S -m32 -fno-asynchronous-unwind-tables -o simple.S simple.c this is how my assembly source code File should look like:
.file "main1.c"
.text
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
call add
movl $0, %eax
movl %ebp, %esp
popl %ebp
ret
.size main, .-main
.ident "GCC: (Debian 4.4.5-8) 4.4.5" // this part should say Ubuntu instead of Debian
.section .note.GNU-stack,"",#progbits
but instead it looks like this:
.file "main0.c"
.text
.globl main
.type main, #function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %ecx
call __x86.get_pc_thunk.ax
addl $_GLOBAL_OFFSET_TABLE_, %eax
movl %eax, %ebx
call add#PLT
movl $0, %eax
popl %ecx
popl %ebx
popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
.section
.text.__x86.get_pc_thunk.ax,"axG",#progbits,__x86.get_pc_thunk.ax,comdat
.globl __x86.get_pc_thunk.ax
.hidden __x86.get_pc_thunk.ax
.type __x86.get_pc_thunk.ax, #function
__x86.get_pc_thunk.ax:
movl (%esp), %eax
ret
.ident "GCC: (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406"
.section .note.GNU-stack,"",#progbits
At my University they told me to use the Flag -m32 if I am using a 64 bit Linux version. Can somebody tell me what I am doing wrong?
Am I even using the correct Flag?
edit after -fno-pie
.file "main0.c"
.text
.globl main
.type main, #function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $4, %esp
call add
movl $0, %eax
addl $4, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
.ident "GCC: (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406"
.section .note.GNU-stack,"",#progbits
it looks better but it's not exactly the same.
for example what does leal mean?
As a general rule, you cannot expect two different compilers to generate the same assembly code for the same input, even if they have the same version number; they could have any number of extra "patches" to their code generation. As long as the observable behavior is the same, anything goes.
You should also know that GCC, in its default -O0 mode, generates intentionally bad code. It's tuned for ease of debugging and speed of compilation, not for either clarity or efficiency of the generated code. It is often easier to understand the code generated by gcc -O1 than the code generated by gcc -O0.
You should also know that the main function often needs to do extra setup and teardown that other functions do not need to do. The instruction leal 4(%esp),%ecx is part of that extra setup. If you only want to understand the machine code corresponding to the code you wrote, and not the nitty details of the ABI, name your test function something other than main.
(As pointed out in the comments, that setup code is not as tightly tuned as it could be, but it doesn't normally matter, because it's only executed once in the lifetime of the program.)
Now, to answer the question that was literally asked, the reason for the appearance of
call __x86.get_pc_thunk.ax
is because your compiler defaults to generating "position-independent" executables. Position-independent means the operating system can load the program's machine code at any address in (virtual) memory and it'll still work. This allows things like address space layout randomization, but to make it work, you have to take special steps to set up a "global pointer" at the beginning of every function that accesses global variables or calls another function (with some exceptions). It's actually easier to explain the code that's generated if you turn optimization on:
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %ecx
This is all just setting up main's stack frame and saving registers that need to be saved. You can ignore it.
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
The special function __x86.get_pc_thunk.bx loads its return address -- which is the address of the addl instruction that immediately follows -- into the EBX register. Then we add to that address the value of the magic constant _GLOBAL_OFFSET_TABLE_, which, in position-independent code, is the difference between the address of the instruction that uses _GLOBAL_OFFSET_TABLE_ and the address of the global offset table. Thus, EBX now points to the global offset table.
call add#PLT
Now we call add#PLT, which means call add, but jump through the "procedure linkage table" to do it. The PLT takes care of the possibility that add is defined in a shared library rather than the main executable. The code in the PLT uses the global offset table and assumes that you have already set EBX to point to it, before calling an #PLT symbol. That's why main has to set up EBX even though nothing appears to use it. If you had instead written something like
extern int number;
int main(void) { return number; }
then you would see a direct use of the GOT, something like
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
movl number#GOT(%ebx), %eax
movl (%eax), %eax
We load up EBX with the address of the GOT, then we can load the address of the global variable number from the GOT, and then we actually dereference the address to get the value of number.
If you compile 64-bit code instead, you'll see something different and much simpler:
movl number(%rip), %eax
Instead of all this mucking around with the GOT, we can just load number from a fixed offset from the program counter. PC-relative addressing was added along with the 64-bit extensions to the x86 architecture. Similarly, your original program, in 64-bit position-independent mode, will just say
call add#PLT
without setting up EBX first. The call still has to go through the PLT, but the PLT uses PC-relative addressing itself and doesn't need any help from its caller.
The only difference between __x86.get_pc_thunk.bx and __x86.get_pc_thunk.ax is which register they store their return address in: EBX for .bx, EAX for .ax. I have also seen GCC generate .cx and .dx variants. It's just a matter of which register it wants to use for the global pointer -- it must be EBX if there are going to be calls through the PLT, but if there aren't any then it can use any register, so it tries to pick one that isn't needed for anything else.
Why does it call a function to get the return address? Older compilers would do this instead:
call 1f
1: pop %ebx
but that screws up return-address prediction, so nowadays the compiler goes to a little extra trouble to make sure every call is paired with a ret.
The extra junk you're seeing is due to your version of GCC special-casing main to compensate for possibly-broken entry point code starting it with a misaligned stack. I'm not sure how to disable this or if it's even possible, but renaming the function to something other than main will suppress it for the sake of your reading.
After renaming to xmain I get:
xmain:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
call add
movl $0, %eax
leave
ret
When I compile
#include <stdio.h>
int
main () {
return 0;
}
to x86 assembly the result is plain and expected:
$> cc -m32 -S main.c -o -|sed -r "/\s*\./d"
main:
pushl %ebp
movl %esp, %ebp
movl $0, %eax
popl %ebp
ret
But when studying different disassembled binaries, the function prologue is never that simple. Indeed, changing the C source above into
#include <stdio.h>
int
main () {
printf("Hi");
return 0;
}
the result is
$> cc -m32 -S main.c -o -|sed -r "/\s*\./d"
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $4, %esp
subl $12, %esp
call printf
addl $16, %esp
movl $0, %eax
movl -4(%ebp), %ecx
leave
leal -4(%ecx), %esp
ret
In particular, I don't get why these instructions
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
are generated -- specifically why not directly storing %esp into %ecx, instead of into%esp+4?
If main isn't a leaf function, it needs to align the stack for the benefit of any functions it calls. Functions that aren't called main just maintain the stack's alignment.
lea 4(%esp), %ecx # ecx = esp+4
andl $-16, %esp
pushl -4(%ecx) # load from ecx-4 and push that
It's pushing a copy of the return address, so it will be in the right place after aligning the stack. You're right, a different sequence would be more sensible:
mov (%esp), %ecx ; or maybe even pop %ecx
andl $-16, %esp
push %ecx ; push (mem) is slower than push reg
As Youka says in comments, don't expect code from -O0 to be optimized at all. Use -Og for optimizations that don't interfere with debugability. The gcc manual recommends that for compile/debug/edit cycles. -O0 output is harder to read / understand / learn from than optimized code. It's easier to map back to the source, but it's terrible code.
I have an assembly function like so
rfact:
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $4, %esp
movl 8(%ebp), %ebx
movl $1, %eax
cmpl $1, %ebx
jle .L53
leal -1(%ebx), %eax
movl %eax, (%esp)
call rfact
imull %ebx, %eax
.L53:
addl $4, %esp
popl %ebx
popl %ebp
ret
I understand I can't just save this as rfact.s and compile it. There has to be certain items (such as .text) appended to the top of the assembly. What are these for a linux system? And I'd like to call this function from a main function written in normal c file called rfactmain.c
Here's a 'minimal' prefix of directives - for ELF/SysV i386, and GNU as:
.text
.p2align 4
.globl rfact
.type rfact, #function
I'd also recommend appending a function size directive at the end:
.size rfact, .-rfact
The easiest way to compile is with: gcc [-m32] -c -P rfact.S
With the -P option, you can use C-style comments and not have to worry about line number output, etc. This results in an object file you can link with. The -m32 flag is required if gcc targets x86-64 by default.
Let's say I have following code:
int f() {
int foo = 0;
int bar = 0;
foo++;
bar++;
// many more repeated operations in actual code
foo++;
bar++;
return foo+bar;
}
Abstracting repeated code into a separate functions, we get
static void change_locals(int *foo_p, int *bar_p) {
*foo_p++;
*bar_p++;
}
int f() {
int foo = 0;
int bar = 0;
change_locals(&foo, &bar);
change_locals(&foo, &bar);
return foo+bar;
}
I'd expect the compiler to inline the change_locals function, and optimize things like *(&foo)++ in the resulting code to foo++.
If I remember correctly, taking address of a local variable usually prevents some optimizations (e.g. it can't be stored in registers), but does this apply when no pointer arithmetic is done on the address and it doesn't escape from the function? With a larger change_locals, would it make a difference if it was declared inline (__inline in MSVC)?
I am particularly interested in behavior of GCC and MSVC compilers.
inline (and all its cousins _inline, __inline...) are ignored by gcc. It might inline anything it decides is an advantage, except at lower optimization levels.
The code procedure by gcc -O3 for x86 is:
.text
.p2align 4,,15
.globl f
.type f, #function
f:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
popl %ebp
ret
.ident "GCC: (GNU) 4.4.4 20100630 (Red Hat 4.4.4-10)"
It returns zero because *ptr++ doesn't do what you think. Correcting the increments to:
(*foo_p)++;
(*bar_p)++;
results in
.text
.p2align 4,,15
.globl f
.type f, #function
f:
pushl %ebp
movl $4, %eax
movl %esp, %ebp
popl %ebp
ret
So it directly returns 4. Not only did it inline them, but it optimized the calculations away.
Vc++ from vs 2005 provides similar code, but it also created unreachable code for change_locals(). I used the command line
/O2 /FD /EHsc /MD /FA /c /TP
If I remember correctly, taking
address of a local variable usually
prevents some optimizations (e.g. it
can't be stored in registers), but
does this apply when no pointer
arithmetic is done on the address and
it doesn't escape from the function?
The general answer is that if the compiler can ensure that no one else will change a value behind its back, it can safely be placed in a register.
Think of this as though the compiler first performs inlining, then transforms all those *&foo (which results from the inlining) to simply foo before deciding if they should be placed in registers on in memory on the stack.
With a larger change_locals, would it
make a difference if it was declared
inline (__inline in MSVC)?
Again, generally speaking, whether or not a compiler decides to inline something is done using heuristics. If you explicitly specify that you want something to be inlines, the compiler will probably weight this into its decision process.
I've tested gcc 4.5, MSC and IntelC using this:
#include <stdio.h>
void change_locals(int *foo_p, int *bar_p) {
(*foo_p)++;
(*bar_p)++;
}
int main() {
int foo = printf("");
int bar = printf("");
change_locals(&foo, &bar);
change_locals(&foo, &bar);
printf( "%i\n", foo+bar );
}
And all of them did inline/optimize the foo+bar value, but also did
generate the code for change_locals() (but didn't use it).
Unfortunately, there's still no guarantee that they'd do the same for
any kind of such a "local function".
gcc:
__Z13change_localsPiS_:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
incl (%edx)
incl (%eax)
leave
ret
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
pushl %ebx
subl $28, %esp
call ___main
movl $LC0, (%esp)
call _printf
movl %eax, %ebx
movl $LC0, (%esp)
call _printf
leal 4(%ebx,%eax), %eax
movl %eax, 4(%esp)
movl $LC1, (%esp)
call _printf
xorl %eax, %eax
addl $28, %esp
popl %ebx
leave
ret
The return value of a function is usually stored on the stack or in a register. But for a large structure, it has to be on the stack. How much copying has to happen in a real compiler for this code? Or is it optimized away?
For example:
struct Data {
unsigned values[256];
};
Data createData()
{
Data data;
// initialize data values...
return data;
}
(Assuming the function cannot be inlined..)
None; no copies are done.
The address of the caller's Data return value is actually passed as a hidden argument to the function, and the createData function simply writes into the caller's stack frame.
This is known as the named return value optimisation. Also see the c++ faq on this topic.
commercial-grade C++ compilers implement return-by-value in a way that lets them eliminate the overhead, at least in simple cases
...
When yourCode() calls rbv(), the compiler secretly passes a pointer to the location where rbv() is supposed to construct the "returned" object.
You can demonstrate that this has been done by adding a destructor with a printf to your struct. The destructor should only be called once if this return-by-value optimisation is in operation, otherwise twice.
Also you can check the assembly to see that this happens:
Data createData()
{
Data data;
// initialize data values...
data.values[5] = 6;
return data;
}
here's the assembly:
__Z10createDatav:
LFB2:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
subl $1032, %esp
LCFI2:
movl 8(%ebp), %eax
movl $6, 20(%eax)
leave
ret $4
LFE2:
Curiously, it allocated enough space on the stack for the data item subl $1032, %esp, but note that it takes the first argument on the stack 8(%ebp) as the base address of the object, and then initialises element 6 of that item. Since we didn't specify any arguments to createData, this is curious until you realise this is the secret hidden pointer to the parent's version of Data.
But for a large structure, it has to be on the heap stack.
Indeed so! A large structure declared as a local variable is allocated on the stack. Glad to have that cleared up.
As for avoiding copying, as others have noted:
Most calling conventions deal with "function returning struct" by passing an additional parameter that points the location in the caller's stack frame in which the struct should be placed. This is definitely a matter for the calling convention and not the language.
With this calling convention, it becomes possible for even a relatively simple compiler to notice when a code path is definitely going to return a struct, and for it to fix assignments to that struct's members so that they go directly into the caller's frame and don't have to be copied. The key is for the compiler to notice that all terminating code paths through the function return the same struct variable. If that's the case, the compiler can safely use the space in the caller's frame, eliminating the need for a copy at the point of return.
There are many examples given, but basically
This question does not have any definite answer. it will depend on the compiler.
C does not specify how large structs are returned from a function.
Here's some tests for one particular compiler, gcc 4.1.2 on x86 RHEL 5.4
gcc trivial case, no copying
[00:05:21 1 ~] $ gcc -O2 -S -c t.c
[00:05:23 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
movl $1, 24(%eax)
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc more realistic case , allocate on stack, memcpy to caller
#include <stdlib.h>
struct Data {
unsigned values[256];
};
struct Data createData()
{
struct Data data;
int i;
for(i = 0; i < 256 ; i++)
data.values[i] = rand();
return data;
}
[00:06:08 1 ~] $ gcc -O2 -S -c t.c
[00:06:10 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc 4.4.2### has grown a lot, and does not copy for the above non-trivial case.
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
In addition, VS2008 (compiled the above as C) will reserve struct Data on the stack of createData() and do a rep movsd loop to copy it back to the caller in Debug mode, in Release mode it will move the return value of rand() (%eax) directly back to the caller
typedef struct {
unsigned value[256];
} Data;
Data createData(void) {
Data r;
calcualte(&r);
return r;
}
Data d = createData();
msvc(6,8,9) and gcc mingw(3.4.5,4.4.0) will generate code like the following pseudocode
void createData(Data* r) {
calculate(&r)
}
Data d;
createData(&d);
gcc on linux will issue a memcpy() to copy the struct back on the stack of the caller. If the function has internal linkage, more optimizations become available though.