From the GCC documentation
On the Intel x86, the force_align_arg_pointer attribute may be applied to individual function definitions, generating an alternate prologue and epilogue that realigns the runtime stack. This supports mixing legacy codes that run with a 4-byte aligned stack with modern codes that keep a 16-byte stack for SSE compatibility. The alternate prologue and epilogue are slower and bigger than the regular ones, and the alternate prologue requires a scratch register; this lowers the number of registers available if used in conjunction with the regparm attribute. The force_align_arg_pointer attribute is incompatible with nested functions; this is considered a hard error.
Specifically, I want to know: what are a prologue, an epilogue, and SSE compatibility?
From gcc manual:
void TARGET_ASM_FUNCTION_PROLOGUE (FILE *file, HOST_WIDE_INT size)
The prologue is responsible for setting up the stack frame, initializing the frame pointer register, saving registers that must be saved, and allocating size additional bytes of storage for the local variables. file is a stdio stream to which the assembler code should be output.
On machines that have “register windows”, the function entry code does not save on the stack the registers that are in the windows, even if they are supposed to be preserved by function calls; instead it takes appropriate steps to “push” the register stack, if any non-call-used registers are used in the function.
On machines where functions may or may not have frame-pointers, the function entry code must vary accordingly; it must set up the frame pointer if one is wanted, and not otherwise. To determine whether a frame pointer is wanted, the macro can refer to the variable frame_pointer_needed. The variable's value will be 1 at run time in a function that needs a frame pointer.
void TARGET_ASM_FUNCTION_EPILOGUE (FILE *file, HOST_WIDE_INT size)
If defined, a function that outputs the assembler code for exit from a function. The epilogue is responsible for restoring the saved registers and stack pointer to their values when the function was called, and returning control to the caller. This macro takes the same arguments as the macro TARGET_ASM_FUNCTION_PROLOGUE, and the registers to restore are determined from regs_ever_live and CALL_USED_REGISTERS in the same way.
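SSE (Streaming SIMD Extensions) is an x86 instruction set extension that adds a set of 128-bit registers (XMM0 and up). Each register can be packed with four 32-bit values, after which a single instruction operates on all four elements simultaneously. In contrast, it may take four or more scalar instructions to do the same thing. "SSE compatibility" in the quoted text is about stack alignment: the aligned SSE load and store instructions require 16-byte aligned addresses, so code that keeps SSE values on the stack expects the stack to be 16-byte aligned.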
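As a rough illustration of the "four elements at once" idea, here is a minimal C sketch. The helper name add4 is made up for this example. It assumes a compiler that ships <xmmintrin.h> and defines __SSE__ when targeting x86 with SSE enabled, and it falls back to a plain loop everywhere else.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps */
#endif

/* add4 is a hypothetical helper name, not part of any library.
   It adds four floats lane by lane: out[i] = a[i] + b[i]. */
void add4(const float *a, const float *b, float *out)
{
    assert(a && b && out);
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats, no alignment needed */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* one packed add covers all 4 lanes */
#else
    for (int i = 0; i < 4; i++)              /* scalar fallback: four separate adds */
        out[i] = a[i] + b[i];
#endif
}
```

The unaligned load/store variants are used so the sketch works with any float array; the aligned variants (_mm_load_ps, _mm_store_ps) require 16-byte aligned addresses.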
As stated, what software-visible processor state needs to go in a jmp_buf on an x86-64 processor when setjmp(jmp_buf env) is called? What processor state does not?
I have been reading a lot about setjmp and longjmp but couldn't find a clear answer to my question. I know it is implementation dependent but I would like to know for the x86_64 architecture.
From the following implementation
it seems that on an x86-64 machine all the callee saved registers (%r12-%r15, %rbp, %rbx) need to be saved as well as the stack pointer, program counter and all the saved arguments of the current environment. However I'm not sure about that, hope someone could clarify that for me.
For example, which x86-64 registers need to be saved? What about condition flags? For example, I think the floating point registers do not need to be saved because they don't contribute to the state of the program.
That's because of the calling convention. setjmp is a function-call that can return multiple times (the first time when you actually call it, later times when a child function calls longjmp), but it's still a function call. Like any function call, the compiler assumes that all call-clobbered registers have been clobbered, so longjmp doesn't need to restore them.
So yes, they're not part of the "program state" on a function call boundary because the compiler-generated asm is definitely not keeping any values in them.
You're looking at glibc's implementation for the x86-64 System V ABI, where all vector / x87 registers are call-clobbered and thus don't have to be saved.
In the Windows x86-64 calling convention, xmm6-15 are call-preserved (just the low 128 bits, not the upper portions of y/zmm6-15), and would have to be part of the jmp_buf.
i.e. it's not the CPU architecture that's relevant here, it's the software calling convention.
Besides the call-preserved registers, one key thing is that it's only legal to longjmp to a jmp_buf saved by a parent function, not from any arbitrary function after the function that called setjmp has returned.
If setjmp had to support that, it would have to save the entire stack frame, or actually (for the function to be able to return, and that parent to be able to return, etc.) the whole stack all the way up to the top. This is obviously insane, so it's clear why longjmp is restricted to jumping to parent / (great) grandparent functions: it just has to restore the stack pointer to point at the still-existing stack frame, along with the call-preserved registers. Local variables of that function that were modified after setjmp are not rolled back; per the C standard, their values are indeterminate after longjmp unless they are volatile.
(On C / C++ implementations on architectures / calling conventions that use something other than a normal call-stack, a similar argument about the jump-target function being able to return still applies.)
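To make the "only jump back to a still-active parent" rule concrete, here is a minimal sketch in standard C. The names env, child and run_with_recovery are invented for this example. The local that changes between setjmp and longjmp is declared volatile so that its value is well defined after the jump.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf env;   /* filled in by setjmp, consumed by longjmp */

/* Hypothetical child function: jumps back to the setjmp call site. */
static void child(void)
{
    longjmp(env, 1);  /* does not return to its caller */
}

/* The frame of this function is still live when child() jumps,
   which is what makes the longjmp legal. */
int run_with_recovery(int code)
{
    volatile int stage = 0;      /* modified between setjmp and longjmp */

    if (setjmp(env) == 0) {      /* direct call: setjmp yields 0 */
        stage = code;
        child();                 /* control reappears at setjmp above */
        return -1;               /* not reached */
    }

    /* Second return from setjmp (nonzero): same frame, same stack pointer. */
    assert(stage == code);
    return stage;
}
```

Calling longjmp(env, ...) after run_with_recovery has returned would be undefined behavior, because the frame that env refers to no longer exists.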
As the jmp_buf is the only place that can be used to restore processor state on a longjmp, it generally holds everything that is needed to restore the full state of the machine as it was when setjmp was called.
This obviously depends very much on the processor and the compiler (what exactly does it use of the CPU's features to store program state):
On an ideal pure-stack machine that holds CPU state nowhere but on the stack, that would be the stack pointer only. Outside of very old or purely academic implementations, such machines rarely exist. You could, however, write a compiler on a modern machine like an x86 that solely uses the stack to store such information. For such a hypothetical compiler, saving the stack pointer only would suffice to restore program state.
On a more common, practical machine, this might be the stack pointer and the full set of registers used to store program status.
On some CPUs that store program status information in other places, for example in a zero page, and with compilers that make use of such CPU features, the jmp_buf would also need to store a copy of this zero page (certain 65xx CPUs or Atmel AVR MCUs and their compilers might use this feature).
Let us say I have three functions, f1(), f2(), and f3(). When f1 is called, it stores information in CPU registers (and I imagine there is other important information as well). Now, depending on a condition that is unknown at compile-time, f1 will call either f2 or f3. f2 and f3 use very different registers, some of which may overlap with those used by f1. Is the following reasoning correct?
The compiler knows which registers a particular function needs during its execution. Therefore, when f1 calls either f2 or f3, the function call code preserves those registers that f2 or f3 use on the stack, regardless of whether or not they are being used by f1.
Or is there some other mechanism by which the compiler preserves registers so that the function that is being returned to doesn't lose its data?
Recall that a programming language is a specification in a document. For C11, read n1570.
Registers do not exist in C (in other words, the nearly obsolete register keyword no longer has much to do with processor registers). They only matter in machine code (often generated by a C compiler).
However, the code generated by a given compiler (for a given instruction set and target system) obeys some conventions, notably the calling conventions and the ABI (read the System V x86-64 ABI governing Linux for an example). These conventions define how registers should be used, and which registers are callee-saved or caller-saved. Register allocation is a difficult part of an optimizing compiler's job.
Often the compiler emits code to spill some registers' contents onto the call stack. And a given register can be used for several things (e.g. it could hold two different variables, if they are live in different places in the same function).
In general the calling convention does not depend on the called function (recall that you can make indirect calls through function pointers), but mostly on its signature.
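A small sketch of the function-pointer point, using the f1/f2/f3 names from the question (the bodies are invented for illustration). At the call site the compiler cannot know which function will run, so the only thing it can rely on is the calling convention for that signature.

```c
#include <assert.h>

/* f2 and f3 are hypothetical callees with the same signature. */
static int f2(int x) { return x * 2; }
static int f3(int x) { return x + 100; }

/* f1 picks its callee at run time. Any value f1 still needs after the
   call (here, base) is kept in a call-preserved register or spilled to
   f1's own stack frame, whichever callee ends up running. */
int f1(int base, int use_f2)
{
    int (*callee)(int) = use_f2 ? f2 : f3;  /* chosen at run time */
    assert(callee != 0);
    int r = callee(base);                   /* indirect call */
    return base + r;                        /* base is still intact here */
}
```

The same code is generated for the call whether callee points at f2 or f3; what differs per function is only which callee-saved registers each one chooses to save in its own prologue.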
"The compiler knows which registers a particular function needs during its execution."
No, it will generally not know this.
For one reason, a function can be from a (third party) library about which the compiler knows nothing. For another reason, what if that function calls another function, and that one yet another, and so on?
The compiler will just push all "suspect" registers onto the stack and pop them before returning.
I think as others have stated the arguments for a function are typically sent down via a number of registers (thereafter on the stack). Which registers are used depends on the compiler – for gcc see GNU C/assembler: http://cs.lmu.edu/~ray/notes/gasexamples/
A number of principles worth noting:
stack frame
caller (the function calling f1) and callee functions (your f1, f2... functions)
volatile and non-volatile registers. For your question you only need to worry about volatile registers.
Each function has a stack frame: an expandable block of the stack that temporarily stores data that needs to be moved in and out of registers.
Before each function call (from the caller to the callee), the values you wish to pass down, i.e. your arguments, are placed in a number of preordained registers (typically 4-6, depending on the compiler; see link). If there are more arguments than preordained registers, the additional values are stored on the stack (typically in the caller's stack frame).
If these preordained registers are in use by the caller, the compiler first pushes their values onto the caller's stack frame, then assigns the arguments to the registers, and only then makes the call to the callee (e.g. your f1 function). Once the called function (callee) returns, these values are restored to their respective registers from the stack.
It doesn't matter how or in what order a series of functions is called; the same system is followed whenever the compiler converts your C code to assembly/opcodes.
I am using MIPS32 and coding in C.
Currently, many functions in my code return the 'int' data type.
Since my development is on resource constrained hardware (even bytes matter) and the return values are just error codes (don't exceed 255), I am planning to shrink the return type to either int8_t or int16_t.
What I am trying to achieve is to reduce the stack/memory usage of caller.
Before I attempt,
Will this result in stack/memory usage reduction in the caller? or
Since I have heard of memory alignment (mostly to 4 bytes) but don't know much about it, will that be a spoilsport here?
Example
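#include <stdint.h>

int8_t callee(void);   /* declared before its first use */

int caller(void) {
    int8_t status;
    status = callee();
    return status;
}

int8_t callee(void) {
    /* ... */
    return -1;
}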
In the example above, does it matter on MIPS32 whether status is declared as int8_t, int16_t, or int?
This will make absolutely no difference to the call stack. An example of the MIPS call stack can be found here: https://courses.cs.washington.edu/courses/cse410/09sp/examples/MIPSCallingConventionsSummary.pdf
$31 ($ra): the Return Address in a subroutine call.
Below that is an image in which you will see the return address, which occupies a full register. Since you are on a 32-bit machine, your registers are 32 bits wide and there is no changing that.
I do have to ask though, what are you doing that requires MIPS? Generally speaking that is an architecture mostly used for teaching purposes, and it doesn't have much in the way of real world practical uses since it has many, many flaws. As an example, x86 has no return address register like this one: the call instruction pushes the return address onto the stack instead.
EDIT:
As pointed out by people below, I have been a bit unfair. Technically these registers also exist:
$2-$3 ($v0-$v1): these registers contain the Returned Value of a subroutine; if the value is 1 word, only $v0 is significant.
Again, though, they have a set size, and from the perspective of the call stack they use one full register. Theoretically I believe MIPS has ways to store 4 bytes inside of one register, but I am unsure of this. More importantly, with the way MIPS works these return registers can ONLY be used if the call is one function deep. If you call a function within a function, this concept falls apart and the return address becomes required, which is why I only showed that one originally.
First of all, "don't exceed 255" means you should be using an unsigned type, not int.
When manually optimizing code for size, you should be using the uint_leastN_t types. These types allow the compiler to pick the smallest available type that is at least N bits wide.
In your case this would be uint_least8_t. Though of course if the compiler always picks a 32 bit type, because that is what is required for aligned access, then the only thing you have gained by replacing int is better portability.
On MIPS32 the first four function parameters (integers or pointers; for simplicity I'm not considering 64-bit ints, floats or structs) arrive in registers a0 through a3. The rest goes on the stack, with each machine word of stack memory holding just one parameter. So, in terms of passing the error codes there will be no difference.
If you have to store error codes in local (automatic) variables, a lot will depend on the code. However, MIPS has plenty of registers and chances are there will be a register available for an error code and hence no stack space for it will be needed.
If you have global variables holding error codes, then definitely there will be a difference between using differently sized types.
Going back to the stack, you should note that there are several other things at play...
First, the stack must be aligned. This is worsened by the fact that modern compilers tend to align the stack pointer not on a multiple of the machine word, but on a multiple of two machine words. So, if you're considering just one error code, it's quite likely that any gains will be undone by the compiler padding the local variables on the stack to make their cumulative size a multiple of two machine words.
Second, the stack pointer is typically decremented by the size of the local and temporary variables just once at the entry of the function (and the reverse is done just once on exit). This means that in some places in the function there may be some unused stack space, which is reserved only to be used in other places of the function. So, calls (especially deep recursive calls) from some places of the function will be unduly wasting stack space.
Third, those four parameters that arrive in a0 through a3 are required by the ABI to have stack memory associated with them, so they can be stored there and addressed by pointers in functions like printf (recall stdarg.h's va_list, va_start(), va_arg(), etc). So, many calls may be wasting those 16 bytes of stack space as well.
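The padding effect from the first point can be sketched with a little arithmetic. The helper round_up is made up for this example, and the 8-byte alignment (two 32-bit machine words) is an assumption taken from the description above, not something read out of a particular compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round the total size of a frame's locals up to
   the next multiple of align (align must be a power of two). */
size_t round_up(size_t bytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (bytes + align - 1) & ~(align - 1);
}
```

With an assumed alignment of 8, a single 1-byte status (round_up(1, 8)) and a single 4-byte status (round_up(4, 8)) both end up reserving 8 bytes, so shrinking one local from int to int8_t changes nothing in that frame.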
Another thing you might want to consider is that when a function returns 8-bit or 16-bit integer types, the caller will need to sign-extend (or zero-extend) those 8/16 bits to the full machine word size (32 bits), meaning that the caller will have to use additional instructions like seb, seh and andi. So, these may affect code size negatively.
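A small sketch of what that looks like from the C side. The function names are invented for this example; the point is only that the narrow result gets converted back to a full int before it is used, which is where the extra extension instructions can come from.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callees returning the same error code in two widths. */
int8_t callee_narrow(void) { return -1; }
int    callee_wide(void)   { return -1; }

/* The caller works on full-width values either way: the int8_t result
   is promoted to int (sign-extended) before the comparison. */
int caller(void)
{
    int8_t status = callee_narrow();
    int    wide   = callee_wide();
    assert(status < 0);
    return status == wide;   /* 1: both compare equal to -1 as int */
}
```

To see whether this costs or saves anything on your target, compile both variants to assembly (for example with the compiler's -S option) and compare the output; the result depends on the compiler and optimization level.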
Ultimately, it depends a lot on your code and on your compiler. You can measure the stack usage using both types and using different optimization options of the compiler and choose the best. You can also experiment with restructuring your code to avoid calls or to make it easier for the compiler to optimize it (e.g. static functions help, as the compiler may deviate from the ABI when calling them and so optimize them, and the passing and returning of values to and from them, more effectively). And this is really what you should do: try different things and choose what you like best.
I asked Google to give me the meaning of the gcc option -fomit-frame-pointer, which redirects me to the below statement.
-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.
As far as I know, for each function call an activation record is created on the stack of the process to hold all local variables and some more information. I assume this frame pointer means the address of the activation record of a function.
In that case, for what kinds of functions does the frame pointer not need to be kept in a register? If I get this information, I will try to design new functions accordingly (if possible), because if the frame pointer is not kept in a register, some instructions will be omitted from the binary. I expect this to noticeably improve performance in an application with many functions.
Most smaller functions don't need a frame pointer - larger functions MAY need one.
It's really about how well the compiler manages to track how the stack is used, and where things are on the stack (local variables, arguments passed to the current function and arguments being prepared for a function about to be called). I don't think it's easy to characterize the functions that need or don't need a frame pointer (technically, NO function HAS to have a frame pointer - it's more a case of "if the compiler deems it necessary to reduce the complexity of other code").
I don't think you should "attempt to make functions not have a frame pointer" as part of your strategy for coding - like I said, simple functions don't need them, so use -fomit-frame-pointer, and you'll get one more register available for the register allocator, and save 1-3 instructions on entry/exit to functions. If your function needs a frame pointer, it's because the compiler decides that's a better option than not using a frame pointer. It's not a goal to have functions without a frame pointer, it's a goal to have code that works both correctly and fast.
Note that "not having a frame pointer" should give better performance, but it's not some magic bullet that gives enormous improvements - particularly not on x86-64, which already has 16 registers to start with. 32-bit x86 only has 8 registers, one of which is the stack pointer, so taking up another as the frame pointer means 25% of the register space is taken. Changing that to 12.5% is quite an improvement. Of course, compiling for 64-bit will help quite a lot too.
This is all about the BP/EBP/RBP register on Intel platforms. This register defaults to stack segment (doesn’t need a special prefix to access stack segment).
The EBP is the best choice of register for accessing data structures, variables and dynamically allocated work space within the stack. EBP is often used to access elements on the stack relative to a fixed point on the stack rather than relative to the current TOS. It typically identifies the base address of the current stack frame established for the current procedure. When EBP is used as the base register in an offset calculation, the offset is calculated automatically in the current stack segment (i.e., the segment currently selected by SS). Because SS does not have to be explicitly specified, instruction encoding in such cases is more efficient. EBP can also be used to index into segments addressable via other segment registers.
( source - http://css.csail.mit.edu/6.858/2017/readings/i386/s02_03.htm )
Since on most 32-bit platforms the data segment and the stack segment are the same, this association of EBP/RBP with the stack is no longer an issue. The same holds on 64-bit platforms: the x86-64 architecture, introduced by AMD in 2003, has largely dropped support for segmentation in 64-bit mode, and the bases of four of the segment registers (CS, SS, DS, and ES) are forced to 0. These circumstances of x86 32-bit and 64-bit platforms essentially mean that the EBP/RBP register can be used, without any prefix, in processor instructions that access memory.
So the compiler option you wrote about allows the BP/EBP/RBP to be used for other means, e.g., to hold a local variable.
By "This avoids the instructions to save, set up and restore frame pointers" is meant avoiding the following code on the entry of each function:
push ebp
mov ebp, esp
or the enter instruction, which was very useful on Intel 80286 and 80386 processors.
Also, before the function return, the following code is used:
mov esp, ebp
pop ebp
or the leave instruction.
Debugging tools may scan the stack data and use these pushed EBP register data while locating call sites, i.e., to display names of the function and the arguments in the order they have been called hierarchically.
Programmers may have questions about stack frames not in a broad term (that it is a single entity in the stack that serves just one function call and keeps return address, arguments and local variables) but in a narrow sense – when the term stack frames is mentioned in the context of compiler options. From the compiler's perspective, a stack frame is just the entry and exit code for the routine, that pushes an anchor to the stack – that can also be used for debugging and for exception handling. Debugging tools may scan the stack data and use these anchors for back-tracing, while locating call sites in the stack, i.e., to display names of the function in the same order they have been called hierarchically.
That's why it is vital for a programmer to understand what a stack frame is in terms of compiler options: the compiler can control whether to generate this code or not.
In some cases, the stack frame (entry and exit code for the routine) can be omitted by the compiler, and the variables will be accessed directly via the stack pointer (SP/ESP/RSP) rather than the convenient base pointer (BP/EBP/RBP).
Conditions for a compiler to omit the stack frames for some functions may be different, for example: (1) the function is a leaf function (i.e., an end-entity that doesn't call other functions); (2) no exceptions are used; (3) no routines are called with outgoing parameters on the stack; (4) the function has no parameters.
Omitting stack frames (entry and exit code for the routine) can make code smaller and faster, but it may also negatively affect the debugger's ability to back-trace the stack and display it to the programmer. Compiler options determine which conditions a function must satisfy for the compiler to give it the stack frame entry and exit code. For example, a compiler may have options to add such entry and exit code to functions in the following cases: (a) always, (b) never, (c) when needed (specifying the conditions).
Returning from generalities to particularities: if you use the -fomit-frame-pointer GCC compiler option, you may win both on the entry and exit code for the routine and on having an additional register. If the option is already turned on by default, either directly or implicitly by other options, you are already benefiting from the use of the EBP/RBP register and gain nothing further by specifying it explicitly. Please note, however, that in 16-bit and 32-bit modes, the BP register doesn't provide access to 8-bit parts of it the way AX does (AL and AH).
Besides allowing the compiler to use EBP as a general-purpose register in optimizations, this option also prevents generating the entry and exit code for the stack frame, which complicates debugging. That's why the GCC documentation explicitly states (unusually emphasizing it in bold) that enabling this option makes debugging impossible on some machines.
Please also be aware that other compiler options, related to debugging or optimization, may implicitly turn the -fomit-frame-pointer option ON or OFF.
I didn't find any official information at gcc.gnu.org about how other options affect -fomit-frame-pointer on x86 platforms; https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html only states the following:
-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging.
So it is not clear from the documentation per se whether -fomit-frame-pointer will be turned on if you just compile with a single `-O' option on x86 platform. It may be tested empirically, but in this case there is no commitment from the GCC developers to not change the behavior of this option in the future without notice.
However, Peter Cordes has pointed out in comments that there is a difference for the default settings of the -fomit-frame-pointer between x86-16 platforms and x86-32/64 platforms.
This option – -fomit-frame-pointer – is also relevant to the Intel C++ Compiler 15.0, not only to the GCC:
For the Intel Compiler, this option has an alias /Oy.
Here is what Intel wrote about it:
These options determine whether EBP is used as a general-purpose register in optimizations. Options -fomit-frame-pointer and /Oy allow this use. Options -fno-omit-frame-pointer and /Oy- disallow it.
Some debuggers expect EBP to be used as a stack frame pointer, and cannot produce a stack back-trace unless this is so. The -fno-omit-frame-pointer and /Oy- options direct the compiler to generate code that maintains and uses EBP as a stack frame pointer for all functions so that a debugger can still produce a stack back-trace without doing the following:
For -fno-omit-frame-pointer: turning off optimizations with -O0
For /Oy-: turning off /O1, /O2, or /O3 optimizations
The -fno-omit-frame-pointer option is set when you specify option -O0 or the -g option. The -fomit-frame-pointer option is set when you specify option -O1, -O2, or -O3.
The /Oy option is set when you specify the /O1, /O2, or /O3 option. Option /Oy- is set when you specify the /Od option.
Using the -fno-omit-frame-pointer or /Oy- option reduces the number of available general-purpose registers by 1 and can result in slightly less efficient code.
NOTE For Linux* systems: There is currently an issue with GCC 3.2 exception handling. Therefore, the Intel compiler ignores this option when GCC 3.2 is installed for C++ and exception handling is turned on (the default).
Please be aware that the above quote is only relevant for the Intel C++ 15 compiler, not to GCC.
I haven't come across the term "activation record" before, but I would assume it refers to what is normally called a "stack frame". That is the area on the stack used by the current function.
The frame pointer is a register that holds the address of the current function's stack frame. If a frame pointer is used then on entering the function the old frame pointer is saved to the stack and the frame pointer is set to the stack pointer. On leaving the function the old frame pointer is restored.
Most normal functions don't need a frame pointer for their own operation. The compiler can keep track of the stack pointer offset on all codepaths through the function and generate local variable accesses accordingly.
A frame pointer may be important in some contexts for debugging and exception handling. This is becoming increasingly rare though as modern debugging and exception handling formats are designed to support functions without frame pointers in most cases.
The main time a frame pointer is needed nowadays is if a function uses alloca or variable length arrays. In this case the value of the stack pointer cannot be tracked statically.
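Here is a minimal sketch of such a function, using a variable length array (optional in C11, but supported by GCC and Clang). The name sum_first_n and the 1024-element bound are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: the size of buf is only known at run time, so
   the distance between the stack pointer and the other locals is not a
   compile-time constant. Compilers typically keep a frame pointer in a
   function like this even when -fomit-frame-pointer is in effect. */
long sum_first_n(size_t n)
{
    assert(n > 0 && n <= 1024);   /* keep the stack allocation bounded */
    long buf[n];                  /* variable length array on the stack */
    long total = 0;

    for (size_t i = 0; i < n; i++)
        buf[i] = (long)i + 1;     /* fill with 1, 2, ..., n */
    for (size_t i = 0; i < n; i++)
        total += buf[i];
    return total;
}
```

Compiling this to assembly with and without -fomit-frame-pointer is an easy way to check what your own compiler does for it.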
The only thing that I know about the mechanism of how C passes values is that it is done either through a register or the stack.
Register or Stack? Exactly how?
Both. And the conventions will vary by platform.
On 32-bit x86, values are usually passed on the stack. On x64, passing in registers is preferred.
In all cases, if you have too many parameters, some will have to be passed by stack.
Refer to x86 calling conventions
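As a small illustration, here is a function with more parameters than the usual register budget. The name sum8 is invented for this example, and the register assignment in the comment describes the x86-64 System V convention and 32-bit cdecl specifically; other platforms differ.

```c
#include <assert.h>

/* Hypothetical function with eight integer parameters. Under the
   x86-64 System V convention the first six integer arguments travel in
   registers (rdi, rsi, rdx, rcx, r8, r9) and g and h go on the stack;
   under the common 32-bit cdecl convention all eight go on the stack.
   The C source is identical in both cases. */
long sum8(long a, long b, long c, long d, long e, long f, long g, long h)
{
    return a + b + c + d + e + f + g + h;
}

/* The call site is written the same way regardless of convention. */
void check_sum8(void)
{
    assert(sum8(1, 2, 3, 4, 5, 6, 7, 8) == 36);
}
```

Where each argument actually ends up is decided entirely by the compiler according to the target's calling convention; nothing in the C code selects it.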
Typically (some compilers will do it differently, as pointed out) for normal function calls they are passed on the stack. That is, there is usually a series of push instructions that just put the data onto the stack.
There are special cases such as system calls where parameters get passed via assembly instructions and registers. In hardware cases they are passed via registers or even certain interrupt signals which consequently write to registers.
On architectures with a large number of registers, such as some RISC and 64-bit architectures, they are usually passed via registers.