I asked Google for the meaning of the gcc option -fomit-frame-pointer, which led me to the statement below.
-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.
As far as I know, for each function call an activation record is created on the process stack to hold the local variables and some other information. I assume the frame pointer is the address of a function's activation record.
In that case, which kinds of functions don't need the frame pointer kept in a register? If I knew this, I would try to design new functions accordingly (where possible), because if the frame pointer is not kept in a register, some instructions are omitted from the binary. That should noticeably improve performance in an application with many functions.
Most smaller functions don't need a frame pointer - larger functions MAY need one.
It's really about how well the compiler manages to track how the stack is used, and where things are on the stack (local variables, arguments passed to the current function and arguments being prepared for a function about to be called). I don't think it's easy to characterize the functions that need or don't need a frame pointer (technically, NO function HAS to have a frame pointer - it's more a case of "if the compiler deems it necessary to reduce the complexity of other code").
I don't think you should "attempt to make functions not have a frame pointer" as part of your strategy for coding - like I said, simple functions don't need them, so use -fomit-frame-pointer, and you'll get one more register available for the register allocator, and save 1-3 instructions on entry/exit to functions. If your function needs a frame pointer, it's because the compiler decides that's a better option than not using a frame pointer. It's not a goal to have functions without a frame pointer, it's a goal to have code that works both correctly and fast.
Note that "not having a frame pointer" should give better performance, but it's not some magic bullet that gives enormous improvements - particularly not on x86-64, which already has 16 registers to start with. 32-bit x86 has only 8 registers, one of which is the stack pointer, so taking up another as the frame pointer means 25% of the register space is taken. Changing that to 12.5% is quite an improvement. Of course, compiling for 64-bit will help quite a lot too.
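If you want to see the effect for yourself rather than take it on faith, a minimal sketch is below. The file name sum.c and the function sum_array are made up for illustration, and -m32 assumes a toolchain with 32-bit support installed.

```c
/*
 * Hypothetical file name: sum.c. To compare the generated entry/exit code
 * (32-bit x86 shown; drop -m32 for x86-64):
 *   gcc -m32 -O2 -fno-omit-frame-pointer -S -o with_fp.s    sum.c
 *   gcc -m32 -O2 -fomit-frame-pointer    -S -o without_fp.s sum.c
 *   diff with_fp.s without_fp.s
 * To see what a given optimization level enables on your GCC:
 *   gcc -O2 -Q --help=optimizers | grep omit-frame-pointer
 */

/* A small leaf function: it calls nothing and needs only a few registers. */
int sum_array(const int *values, int count)
{
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}
```

The C source is identical in both builds; only the assembly differs, which is why this is a compiler setting rather than something to design functions around.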
This is all about the BP/EBP/RBP register on Intel platforms. This register defaults to the stack segment (it doesn't need a segment-override prefix to access the stack segment).
The EBP is the best choice of register for accessing data structures, variables and dynamically allocated work space within the stack. EBP is often used to access elements on the stack relative to a fixed point on the stack rather than relative to the current TOS. It typically identifies the base address of the current stack frame established for the current procedure. When EBP is used as the base register in an offset calculation, the offset is calculated automatically in the current stack segment (i.e., the segment currently selected by SS). Because SS does not have to be explicitly specified, instruction encoding in such cases is more efficient. EBP can also be used to index into segments addressable via other segment registers.
( source - http://css.csail.mit.edu/6.858/2017/readings/i386/s02_03.htm )
Since on most 32-bit platforms the data segment and the stack segment are the same, this association of EBP/RBP with the stack segment is no longer an issue. The same holds on 64-bit platforms: the x86-64 architecture, introduced by AMD in 2003, has largely dropped support for segmentation in 64-bit mode, where four of the segment registers (CS, SS, DS, and ES) are forced to base address 0. Together, these circumstances mean that on 32-bit and 64-bit x86 platforms the EBP/RBP register can be used, without any prefix, in instructions that access memory.
So the compiler option you wrote about allows BP/EBP/RBP to be used for other purposes, e.g., to hold a local variable.
The phrase "This avoids the instructions to save, set up and restore frame pointers" refers to avoiding the following code on entry to each function:
push ebp
mov ebp, esp
or the enter instruction, which was very useful on Intel 80286 and 80386 processors.
Also, before the function return, the following code is used:
mov esp, ebp
pop ebp
or the leave instruction.
Debugging tools may scan the stack data and use these pushed EBP register data while locating call sites, i.e., to display names of the function and the arguments in the order they have been called hierarchically.
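To look at these frame addresses from C, GCC and Clang provide the builtin __builtin_frame_address. The sketch below is only an illustration: record_frames is a made-up helper, and it assumes a GCC-compatible compiler. On x86 builds that keep the frame pointer, the recorded value should be the EBP/RBP of each call, i.e. the location where the caller's saved frame pointer lives.

```c
#include <stddef.h>

/*
 * Hypothetical helper: store the frame address seen at each recursion level
 * in out[0..max_depth-1] and return how many levels were recorded.
 * noinline keeps each level a real call; an optimizer may still turn the
 * recursion into a loop, in which case the recorded addresses can coincide.
 */
__attribute__((noinline))
size_t record_frames(void **out, size_t depth, size_t max_depth)
{
    if (depth == max_depth) {
        return depth;
    }
    out[depth] = __builtin_frame_address(0);  /* frame of this invocation */
    return record_frames(out, depth + 1, max_depth);
}
```

Printing the recorded pointers from a build with -O0 -fno-omit-frame-pointer gives a rough picture of the chain a debugger would follow.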
Programmers may have questions about stack frames not in the broad sense (a single entity on the stack that serves just one function call and keeps the return address, arguments and local variables) but in a narrow sense, when the term is used in the context of compiler options. From the compiler's perspective, a stack frame is just the entry and exit code for the routine, which pushes an anchor onto the stack; that anchor can also be used for debugging and for exception handling. Debugging tools may scan the stack data and use these anchors for back-tracing while locating call sites in the stack, i.e., to display the names of the functions in the order in which they were called.
That's why it is vital for a programmer to understand what a stack frame is in terms of compiler options: the compiler can control whether to generate this code or not.
In some cases, the stack frame (entry and exit code for the routine) can be omitted by the compiler, and the variables will be accessed directly via the stack pointer (SP/ESP/RSP) rather than the convenient base pointer (BP/EBP/RBP).
Conditions for a compiler to omit the stack frames for some functions may be different, for example: (1) the function is a leaf function (i.e., an end-entity that doesn't call other functions); (2) no exceptions are used; (3) no routines are called with outgoing parameters on the stack; (4) the function has no parameters.
Omitting stack frames (entry and exit code for the routine) can make code smaller and faster. Still, it may also negatively affect the debuggers' ability to back-trace the stack's data and display it to the programmer. Compiler options determine which conditions a function must satisfy for the compiler to give it the stack frame entry and exit code. For example, a compiler may have options to add such entry and exit code to functions in the following cases: (a) always, (b) never, (c) when needed (specifying the conditions).
Returning from generalities to particulars: if you use the -fomit-frame-pointer GCC option, you may gain both from the omitted entry and exit code and from having an additional register. If the option is already on, either by default or implicitly through other options, you are already getting the benefit of using the EBP/RBP register, and specifying it explicitly gains nothing more. Please note, however, that in 16-bit and 32-bit modes the BP register does not provide access to its 8-bit parts the way AX does (AL and AH).
Besides allowing the compiler to use EBP as a general-purpose register in optimizations, this option also prevents generating the entry and exit code for the stack frame, which complicates debugging. That is why the GCC documentation explicitly states (unusually, emphasized in bold) that enabling this option makes debugging impossible on some machines.
Please also be aware that other compiler options, related to debugging or optimization, may implicitly turn the -fomit-frame-pointer option ON or OFF.
I didn't find any official information at gcc.gnu.org about how other options affect -fomit-frame-pointer on x86 platforms;
https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html only states the following:
-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging.
So it is not clear from the documentation per se whether -fomit-frame-pointer will be turned on if you compile with just a single -O option on an x86 platform. It may be tested empirically, but in that case there is no commitment from the GCC developers not to change the behavior of this option in the future without notice.
However, Peter Cordes has pointed out in comments that there is a difference for the default settings of the -fomit-frame-pointer between x86-16 platforms and x86-32/64 platforms.
This option – -fomit-frame-pointer – is also relevant to the Intel C++ Compiler 15.0, not only to the GCC:
For the Intel Compiler, this option has an alias /Oy.
Here is what Intel wrote about it:
These options determine whether EBP is used as a general-purpose register in optimizations. Options -fomit-frame-pointer and /Oy allow this use. Options -fno-omit-frame-pointer and /Oy- disallow it.
Some debuggers expect EBP to be used as a stack frame pointer, and cannot produce a stack back-trace unless this is so. The -fno-omit-frame-pointer and /Oy- options direct the compiler to generate code that maintains and uses EBP as a stack frame pointer for all functions so that a debugger can still produce a stack back-trace without doing the following:
For -fno-omit-frame-pointer: turning off optimizations with -O0
For /Oy-: turning off /O1, /O2, or /O3 optimizations
The -fno-omit-frame-pointer option is set when you specify option -O0 or the -g option. The -fomit-frame-pointer option is set when you specify option -O1, -O2, or -O3.
The /Oy option is set when you specify the /O1, /O2, or /O3 option. Option /Oy- is set when you specify the /Od option.
Using the -fno-omit-frame-pointer or /Oy- option reduces the number of available general-purpose registers by 1 and can result in slightly less efficient code.
NOTE For Linux* systems: There is currently an issue with GCC 3.2 exception handling. Therefore, the Intel compiler ignores this option when GCC 3.2 is installed for C++ and exception handling is turned on (the default).
Please be aware that the above quote is only relevant for the Intel C++ 15 compiler, not to GCC.
I haven't come across the term "activation record" before, but I would assume it refers to what is normally called a "stack frame". That is the area on the stack used by the current function.
The frame pointer is a register that holds the address of the current function's stack frame. If a frame pointer is used then on entering the function the old frame pointer is saved to the stack and the frame pointer is set to the stack pointer. On leaving the function the old frame pointer is restored.
Most normal functions don't need a frame pointer for their own operation. The compiler can keep track of the stack pointer offset on all codepaths through the function and generate local variable accesses accordingly.
A frame pointer may be important in some contexts for debugging and exception handling. This is becoming increasingly rare though as modern debugging and exception handling formats are designed to support functions without frame pointers in most cases.
The main time a frame pointer is needed nowadays is if a function uses alloca or variable length arrays. In this case the value of the stack pointer cannot be tracked statically.
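As a rough illustration of that last case, here is a function with a variable length array. The name sum_of_squares is made up. Because the array size is only known at run time, the compiler cannot use a fixed stack-pointer offset for everything in the frame, and GCC will typically keep a frame pointer for such a function even with -fomit-frame-pointer (worth checking with -S on your own target, especially at -O0 or -O1, since an optimizer may remove the array entirely).

```c
#include <stddef.h>

/* Sums 1*1 + 2*2 + ... + n*n using a stack buffer whose size depends on n. */
long sum_of_squares(size_t n)
{
    if (n == 0) {
        return 0;               /* a VLA must have at least one element */
    }

    long squares[n];            /* variable length array: size known only at run time */
    for (size_t i = 0; i < n; ++i) {
        squares[i] = (long)((i + 1) * (i + 1));
    }

    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += squares[i];
    }
    return total;
}
```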
Related
I have a small program running on x64 that calls the system function with a parameter long enough that, as I understand it, it will be passed to the function on the stack.
#include <stdlib.h>
int main(void) {
char command[] = "/bin/sh -c whoami";
system(command);
return EXIT_SUCCESS;
}
When I check in GDB what is happening, I can confirm that my parameter is on the stack, spanning 2 words.
I wonder how the CPU knows that it needs to read 2 words and not continue further. What delimits the function parameter from the rest?
I am asking this question as I am working on a buffer overflow and, while I have the same situation on the stack, the CPU only picks one word (/bin/sh ) instead of the 2 words I would like, outputting sh: line 1: $'Ћ\310\367\377\177': command not found
How does processor know how much to read from the stack for function parameters (x64)
The CPU does not know. By that, I mean it does not receive an instruction that says "retrieve the next argument from the stack, whatever the appropriate size may be." It receives instructions to retrieve data of a specific size from a specific place, and to operate on that data, or put it in a register, or store it in some other place. Those instructions are generated by the compiler, based on the program source code, and they are part of the program binary.
I wonder how the CPU knows that it needs to read 2 words and not continue further. What delimits the function parameter from the rest?
Nothing delimits one function parameter from the next -- neither on the stack nor generally. Programs do not (generally) figure out such things on the fly by introspecting the data. Instead, functions require parameters to be set up in a particular way, which is governed by a set of conventions called an "Application Binary Interface" (ABI), and they operate on the assumption that the data indeed are set up that way. If those assumptions turn out to be invalid then more or less anything can happen.
I am asking this question as I am working on Buffer Overflow and while I have the same situation on the stack, the CPU does only pick one word (/bin/sh ) instead of the 2 words I would like.
The number of words the function will consume from the stack and the significance it will attribute to them is characteristic of the function, not (generally) of the data on the stack.
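For the program in the question specifically, it may help to see what system() actually receives. In C an array argument is converted to a pointer to its first element, so the callee gets one pointer-sized value (the ABI decides where it travels), and the characters themselves stay in main's local array. The callee then reads bytes until it hits the terminating NUL. A small sketch, with made-up function names:

```c
#include <stddef.h>

/* The parameter is a pointer, so this is the size of a pointer,
 * no matter how long the string it points to is. */
size_t argument_size(const char *command)
{
    return sizeof command;
}

/* What a callee effectively does to find the end of the string:
 * walk memory byte by byte until the terminating NUL. */
size_t string_length(const char *command)
{
    size_t length = 0;
    while (command[length] != '\0') {
        ++length;
    }
    return length;
}
```

With char command[] = "/bin/sh -c whoami"; the first function returns sizeof(char *) and the second returns the number of characters before the NUL. If an overflow overwrites bytes in the middle of the string with a NUL or with garbage, the callee simply follows the bytes it finds.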
Processors are very very dumb. All of them. This is like asking how do you steer a train...You do not. It just follows the tracks. The processor just follows the bits in front of it, if they are wrong or do something bad then the processor will crash just like a train will derail if the tracks are bad.
The size of a variable is not determined by the processor type, x86, arm, etc. Nor for C is it determined by the language, the size of an int for x86 is not assumed to be one size. Assumptions like that are bad. The compiler author chooses for that compiler for that target. And no reason to assume any two C compilers for the same target processor use the same sizes.
Likewise the compiler author ultimately decides the calling convention, what goes in registers what goes in stack, what order they are in the stack, what registers, etc.
The compiler author chooses also the alignment or not of the stack.
The compiler author chooses to use a stack frame or not or allows the user to choose, but within either choice, with or without still chooses how to use the stack or stack pointer.
The compiler author, using their calling convention, their choices for the sizes of variables, etc., then decides as part of the compilation process what instructions to use. The instructions should be chosen based on the choices above. So a two-byte variable should be placed on the stack according to decisions made by the compiler, relative to the stack pointer or the frame pointer, depending on compiler choices and possibly user options.
The processor does not know; it simply sucks in bits and does what they say. If the compiler, assembler and linker have done their job (ultimately the programmer's responsibility), then the processor will do what it is told, including reading the proper number of bytes for a certain item.
As beaten to death on this site, examining the stack for main() tends to be confusing because mysterious padding is added; ideally you want to compile this code under some other function name and look at that. Compiler options, such as optimization levels, may also determine how the code is built, what instructions are used, and how much stack is used, if any. There is no reason to assume any two compilers will generate the same code from some C source; likewise, there is no reason to assume one compiler will produce the same code under different compiler options.
So where on the stack, how many bytes on the stack, etc is determined by many layers of you the programmer plus compiler, assembler, and linker.
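To make the point about sizes concrete, here is a small sketch that asks the compiler what it chose for the current target. The struct and function names are made up. As one well-known example of the variation, long is 8 bytes on 64-bit Linux but 4 bytes on 64-bit Windows.

```c
#include <stddef.h>

/* Sizes, in bytes, that this particular compiler picked for this target. */
struct type_sizes {
    size_t short_bytes;
    size_t int_bytes;
    size_t long_bytes;
    size_t pointer_bytes;
};

struct type_sizes query_type_sizes(void)
{
    struct type_sizes sizes;
    sizes.short_bytes   = sizeof(short);
    sizes.int_bytes     = sizeof(int);
    sizes.long_bytes    = sizeof(long);
    sizes.pointer_bytes = sizeof(void *);
    return sizes;
}
```

Code that depends on an exact width is usually better off with the fixed-width types from <stdint.h> than with assumptions about int or long.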
It depends on the calling convention implemented for the function. By specifying none, you let the compiler decide, and it can get creative, sometimes even eliminating the explicit call altogether for the sake of branch-prediction optimization. Otherwise, you can learn precisely what to expect from the numerous sources of documentation that specify how those calling conventions are supposed to work.
If I add -funwind-tables when cross-compiling, I can successfully unwind a backtrace through the interface (_Unwind_Backtrace and _Unwind_VRS_Get) in the libgcc library.
But when I added the -O2 option at cross-compile time, the unwind backtrace failed. I passed -Q -O2 --help=optimizers to print the optimizations and tested with them, but the result differs from that of -O2, which is very strange.
You haven't told us which ARM architecture you are building for - but assuming it's a 32-bit architecture, enabling -O2 has also enabled -fomit-frame-pointer (§ -fomit-frame-pointer)
The frame-pointer
The frame pointer contains the base of the current function's stack frame (and with the knowledge that the caller's frame pointer is stored on the stack, a linked list of all stack frames in the call-tree). It's usually described as fp in documentation - a synonym for r11.
Omitting the frame-pointer
The ARM register file is small at 16 registers - one of which is the program counter.
The frame pointer is one of the 15 that remain and is used only for debugging and diagnostics - specifically to provide stack walk-backs and symbolic debugging.
-fomit-frame-pointer tells the compiler not to maintain a frame pointer, thus liberating r11 for other uses and potentially avoiding spilling variables from registers to the stack. It also saves 4 bytes of stack storage per stack frame, plus a store to and a load from the stack.
Naturally, if fp is used as general purpose register, its contents are undefined and walk-backs won't work.
You probably want to reenable the frame pointer with -fno-omit-frame-pointer for your own sanity.
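For a quick check of whether unwinding still works in a given build, a minimal sketch along these lines may help. It uses _Unwind_Backtrace and _Unwind_GetIP from <unwind.h>; the callback and the helper backtrace_depth are made-up names, and how many frames you get depends on the target, the flags (-funwind-tables, -O2, frame pointer settings) and the unwinder in use.

```c
#include <unwind.h>

/* Callback invoked by the unwinder once per stack frame it can find. */
static _Unwind_Reason_Code count_frame(struct _Unwind_Context *context, void *arg)
{
    int *depth = arg;
    if (_Unwind_GetIP(context) != 0) {
        ++*depth;                /* count frames with a usable program counter */
    }
    return _URC_NO_REASON;       /* tell the unwinder to keep walking */
}

/* Hypothetical helper: number of frames the unwinder managed to walk. */
int backtrace_depth(void)
{
    int depth = 0;
    _Unwind_Backtrace(count_frame, &depth);
    return depth;
}
```

Calling backtrace_depth() from a few levels of nested calls and comparing the result between the working and the failing build should show where the walk stops.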
From the GCC documentation
On the Intel x86, the force_align_arg_pointer attribute may be applied to individual function definitions, generating an alternate prologue and epilogue that realigns the runtime stack. This supports mixing legacy codes that run with a 4-byte aligned stack with modern codes that keep a 16-byte stack for SSE compatibility. The alternate prologue and epilogue are slower and bigger than the regular ones, and the alternate prologue requires a scratch register; this lowers the number of registers available if used in conjunction with the regparm attribute. The force_align_arg_pointer attribute is incompatible with nested functions; this is considered a hard error.
Specifically, I want to know what is a prologue, epilogue, and SSE compatibility?
From gcc manual:
void TARGET_ASM_FUNCTION_PROLOGUE (FILE *file, HOST_WIDE_INT size)
The prologue is responsible for setting up the stack frame, initializing the frame pointer register, saving registers that must be saved, and allocating size additional bytes of storage for the local variables. file is a stdio stream to which the assembler code should be output.
On machines that have “register windows”, the function entry code does not save on the stack the registers that are in the windows, even if they are supposed to be preserved by function calls; instead it takes appropriate steps to “push” the register stack, if any non-call-used registers are used in the function.
On machines where functions may or may not have frame-pointers, the function entry code must vary accordingly; it must set up the frame pointer if one is wanted, and not otherwise. To determine whether a frame pointer is wanted, the macro can refer to the variable frame_pointer_needed. The variable's value will be 1 at run time in a function that needs a frame pointer.
void TARGET_ASM_FUNCTION_EPILOGUE (FILE *file, HOST_WIDE_INT size)
If defined, a function that outputs the assembler code for exit from a function. The epilogue is responsible for restoring the saved registers and stack pointer to their values when the function was called, and returning control to the caller. This macro takes the same arguments as the macro TARGET_ASM_FUNCTION_PROLOGUE, and the registers to restore are determined from regs_ever_live and CALL_USED_REGISTERS in the same way.
SSE (Streaming SIMD Extensions) is an x86 instruction set extension that adds a set of 128-bit CPU registers. Each register can be packed with four 32-bit scalars, after which an operation can be performed on all 4 elements simultaneously. In contrast, it may take 4 or more operations in regular assembly to do the same thing.
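To give an idea of what "one operation on 4 elements" looks like from C, here is a small sketch using the SSE intrinsics from <xmmintrin.h>. The function name add4 is made up, and the scalar fallback is only there so the code still builds on targets without SSE. The unaligned load/store intrinsics are used so the example does not depend on 16-byte alignment; the aligned forms are where the stack alignment mentioned in the quoted documentation starts to matter.

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* out[i] = a[i] + b[i] for i = 0..3 */
void add4(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats, no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* one instruction adds all 4 lanes */
#else
    for (int i = 0; i < 4; ++i) {            /* scalar fallback: 4 separate additions */
        out[i] = a[i] + b[i];
    }
#endif
}
```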
Is there a tool to show where I have spills in my C code?
I mean, to see which blocks of code potentially make a register move to memory.
EDIT: what is a spill:
In the process of compiling your code, at some point the compiler has to do register allocation. The compiler builds an interference graph ("variables" are nodes, and they are connected if they are alive at the same time). From this point there is a linear process that does graph coloring: for each variable, assign a register that won't interfere with other variables... If there are not enough registers to color the graph, the algorithm fails and a variable (register) is spilled (moved to memory).
From a software engineering point of view, this means you should always minimize a variable's live range so you can minimize the chance of having a spill.
When you want to optimize code, you should look for those kinds of things, since a spill costs extra time to read/write memory. I was looking for a tool or a compiler flag that could tell me where the spills are so I can optimize.
I'm aware of no such tool.
Because decisions about spills vary from compiler to compiler, and version of the compiler and even by settings within a given version of a given compiler, any such tool would have to be tightly coupled to a compiler and would likely only support one.
On the other hand, you can always look at the generated assembly yourself and see if a given variable is spilled or not.
Generally either disassemble or compile to assembler instead of an object.
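As a rough sketch of that workflow: the function below keeps eight values live at the same time, which is the kind of code where a spill can show up on a register-poor target. The file name pressure.c and the function combine are made up, and whether anything is actually spilled depends entirely on the target, compiler version and options.

```c
/*
 * Hypothetical file name: pressure.c. Compile to assembly and read it:
 *   gcc -O2 -S -fverbose-asm pressure.c
 * In pressure.s, stores to and reloads from stack-pointer-relative (or
 * frame-pointer-relative) addresses inside the function body are the
 * candidates for spills.
 */
long combine(const long *v)
{
    /* Eight values that are all needed by both expressions below. */
    long x0 = v[0], x1 = v[1], x2 = v[2], x3 = v[3];
    long x4 = v[4], x5 = v[5], x6 = v[6], x7 = v[7];

    long sum         = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7;
    long alternating = x0 - x1 + x2 - x3 + x4 - x5 + x6 - x7;

    return sum * alternating;
}
```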
For specific compilers like gcc and llvm (where you have the source and can easily re-build the compiler), modify the compiler to print some sort of output to indicate how many times it had to spill, as you call it, to memory. Perhaps as you find the register allocation routine, you may find that the compiler already has such output. Personally I just disassemble or compile to assembler.
A generic assembler analysis tool is possible, but is it worth the effort? You would want to know where function/optimization boundaries are. You would want to distinguish volatile variables, or hardware registers where the write to ram was intentional. You could just look for stack based writes only. Or look for cases where there is a write to the stack that is not a push, where the register is destroyed on the next instruction. Actually it would be pretty easy to search for writes to a stack pointer relative address, with the next instruction destroying the register, with that stack based relative address being read back in a relatively nearby execution path where the stack frame has not been cleaned up in that execution path. Do I know of such a tool? Nope.
When compiling shared libraries in gcc the -fPIC option compiles the code as position independent. Is there any reason (performance or otherwise) why you would not compile all code position independent?
It adds an indirection. With position independent code you have to load the address of your function and then jump to it. Normally the address of the function is already present in the instruction stream.
Yes there are performance reasons. Some accesses are effectively under another layer of indirection to get the absolute position in memory.
There is also the GOT (Global offset table) which stores offsets of global variables. To me, this just looks like an IAT fixup table, which is classified as position dependent by wikipedia and a few other sources.
http://en.wikipedia.org/wiki/Position_independent_code
In addition to the accepted answer: one thing that hurts PIC code performance a lot is the lack of "IP relative addressing" on 32-bit x86. With "IP relative addressing" you could ask for data that is X bytes from the current instruction pointer. This would make PIC code a lot simpler.
Jumps and calls are usually EIP relative, so those don't really pose a problem. However, accessing data will require a little extra trickery. Sometimes, a register will be temporarily reserved as a "base pointer" to data that the code requires. For example, a common technique is to abuse the way calls work on x86:
call label_1
.dd 0xdeadbeef
.dd 0xfeedf00d
.dd 0x11223344
label_1:
pop ebp ; now ebp holds the address of the first dataword
; this works because the call pushes the **next**
; instructions address
; real code follows
mov eax, [ebp + 4] ; for example i'm accessing the '0xfeedf00d' in a PIC way
This and other techniques add a layer of indirection to the data accesses. For example, the GOT (Global offset table) used by gcc compilers.
x86-64 added a "RIP relative" mode which makes things a lot simpler.
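To see the difference in compiler output, a tiny sketch like the one below is enough. The file name counter.c and the identifiers are made up. With GCC on x86-64 Linux, the -fPIC build typically reaches the global through the GOT (look for something like call_count@GOTPCREL(%rip)), while the non-PIC build addresses it directly; the exact output depends on the compiler version and target.

```c
/*
 * Hypothetical file name: counter.c. Compare:
 *   gcc -O2 -fPIC    -S -o counter_pic.s   counter.c
 *   gcc -O2 -fno-pic -S -o counter_nopic.s counter.c
 */

/* A global with external linkage, as a shared library might export. */
int call_count;

/* Every access to call_count has to find its address first; how that is
 * done is what changes between the two builds. */
int next_call(void)
{
    return ++call_count;
}
```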
Because implementing completely position independent code adds a constraint to the code generator which can prevent the use of faster operations, or add extra steps to preserve that constraint.
This might be an acceptable trade-off to get multiprocessing without a virtual memory system, where you trust processes to not invade each other's memory and might need to load a particular application at any base address.
In many modern systems the performance trade-offs are different, and a relocating loader is often less expensive (it costs any time code is first loaded) than the best an optimizer can do if it has free rein. Also, the availability of virtual address spaces hides most of the motivation for position independence in the first place.
Position-independent code has a performance overhead on most architectures, because it requires an extra register.
So the reason is performance.
Also, virtual memory hardware in most modern processors (used by most modern OSes) means that lots of code (all user space apps, barring quirky use of mmap or the like) doesn't need to be position independent. Every program gets its own address space which it thinks starts at zero.
Nowadays many operating systems and compilers make code position independent by default. Try compiling without the -fPIC flag: the code will compile fine, but you will just get a warning. OSes like Windows use a technique called memory mapping to achieve this.