Function Prologue and Epilogue removed by GCC Optimization

Taking an empty program
//demo.c
int main(void)
{
}
Compiling the program at default optimization.
gcc -S demo.c -o dasm.asm
I get the assembly output as
//Removed labels and directive which are not relevant
main:
pushl %ebp // prologue of main
movl %esp, %ebp // prologue of main
popl %ebp // epilogue of main
ret
Now Compiling the program at -O2 optimization.
gcc -O2 -S demo.c -o dasm.asm
I get the optimized assembly
main:
rep
ret
In my initial search, I found that the optimization flag -fomit-frame-pointer was responsible for removing the prologue and epilogue.
I found more information about the flag in the GCC manual, but I could not understand the reason the manual gives for removing the prologue and epilogue:
Don't keep the frame pointer in a register for functions that don't
need one.
Is there any other way of putting the above reason?
What is the reason for the "rep" instruction appearing at -O2 optimization?
Why does the main function not require stack frame initialization?
If the frame pointer is not set up from within the main function, then who does this job?
Is it done by the OS, or is it a function of the hardware?

Compilers are getting smart: GCC knew you didn't need a frame pointer kept in a register, because nothing you put into your main() function uses the stack.
As for rep ret:
Here's the principle. The processor tries to fetch the next few
instructions to be executed, so that it can start the process of
decoding and executing them. It even does this with jump and return
instructions, guessing where the program will head next.
What AMD says here is that, if a ret instruction immediately follows a
conditional jump instruction, their predictor cannot figure out where
the ret instruction is going. The pre-fetching has to stop until the
ret actually executes, and only then will it be able to start looking
ahead again.
The "rep ret" trick apparently works around the problem, and lets the
predictor do its job. The "rep" has no effect on the instruction.
Source: Some forum, google a sentence to find it.
One thing to note: the absence of a prologue doesn't mean there is no stack. You can still push and pop with ease; it's just that complex stack manipulation becomes more difficult.
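To see this for yourself, here is a minimal sketch in C. The function names add and sum_squares are made up for illustration and are not from the question. Compiled with gcc -O2 -S, add typically gets no prologue at all, while sum_squares still reserves stack space for its array even though no frame pointer is set up. The exact output depends on your GCC version and target, so treat it as something to experiment with rather than a guaranteed result.

```c
/* Leaf function with no addressable locals: at -O2 GCC typically emits
   just the arithmetic and a ret, with no frame pointer setup. */
int add(int a, int b)
{
    return a + b;
}

/* Needs stack space for buf. Without a frame pointer the compiler can
   still reach buf relative to the stack pointer. volatile keeps the
   array from being optimized away, so the stack use stays visible. */
int sum_squares(int n)
{
    volatile int buf[8];
    int count = n < 0 ? 0 : (n > 8 ? 8 : n);   /* clamp to the array size */
    int total = 0;

    for (int i = 0; i < count; i++)
        buf[i] = i * i;
    for (int i = 0; i < count; i++)
        total += buf[i];
    return total;
}
```

Comparing the output of gcc -O0 -S and gcc -O2 -S on this file should show the pushl %ebp / movl %esp, %ebp pair (or its 64-bit equivalent) in the first case only.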
Functions that don't have a prologue/epilogue are usually dubbed naked. Hackers like to use them a lot because they don't contaminate the stack when you jmp to them; I must confess I know of no other use for them outside optimization. In Visual Studio it's done via:
__declspec(naked)


Linking and calling printf from gas assembly

There are a few related questions to this which I've come across, such as Printf with gas assembly and Calling C printf from assembly but I'm hoping this is a bit different.
I have the following program:
.section .data
format:
.ascii "%d\n"
.section .text
.globl _start
_start:
// print "55"
mov $format, %rdi
mov $55, %rsi
mov $0, %eax
call printf # how to link?
// exit
mov $60, %eax
mov $0, %rdi
syscall
Two questions related to this:
Is it possible to use only as (gas) and ld to link this to the printf function, using _start as the entry point? If so, how could that be done?
If not, other than changing _start to main, what would be the gcc invocation to run things properly?
It is possible to use ld, but not recommended: if you use libc functions, you need to initialise the C runtime. That is done automatically if you let the C compiler provide _start and start your program as main. If you use the libc but not the C runtime initialisation code, it may seem to work, but it can also lead to strange, spurious failures.
If you start your program from main (your second case) instead, it's as simple as doing gcc -o program program.s where program.s is your source file. On some Linux distributions you may also need to supply -no-pie as your program is not written in PIC style (don't worry about this for now).
Note also that I recommend not mixing libc calls with raw system calls. Instead of doing a raw exit system call, call the C library function exit. This lets the C runtime deinitialise itself correctly, including flushing any IO streams.
Now if you assemble and link your program as I said in the first paragraph, you'll notice that it might crash. This is because the stack needs to be aligned to a multiple of 16 bytes on calls to functions. You can ensure this alignment by pushing a qword of data on the stack at the beginning of each of your functions (remember to pop it back off at the end).

Prologue in x86 Assembly and Pushing Callee Save Registers

While studying the disassembly of C code, this struck me. Generally, in the assembly of a function, after saving the frame pointer we push the callee-saved registers and restore them just before returning. The x86 ABI tells us which registers are callee-saved and which are caller-saved. However, my problem starts when I see that the compiler behaves differently when assembling these functions. For example:
Case 1
(gdb) disassemble EVP_CipherInit_ex
Dump of assembler code for function EVP_CipherInit_ex:
0xb1258044 <+0>: push %ebp
0xb1258045 <+1>: mov %esp,%ebp
0xb1258047 <+3>: push %edi
0xb1258048 <+4>: push %esi
0xb1258049 <+5>: push %ebx
Case 2
(gdb) disassemble FIPS_mode
Dump of assembler code for function FIPS_mode:
0xb12614c4 <+0>: push %ebp
0xb12614c5 <+1>: mov %esp,%ebp
0xb12614c7 <+3>: push %ebx
0xb12614c8 <+4>: sub $0x4,%esp
Case 3
(gdb) disassemble OPENSSL_init
Dump of assembler code for function OPENSSL_init:
0xb124fae4 <+0>: push %ebp
0xb124fae5 <+1>: mov %esp,%ebp
0xb124fae7 <+3>: push %ebx
0xb124fae8 <+4>: sub $0x4,%esp
Case 4
(gdb) disassemble FIPS_module_mode
Dump of assembler code for function FIPS_module_mode:
0xb117dfdc <+0>: push %edi
0xb117dfdd <+1>: push %esi
0xb117dfde <+2>: push %ebx
0xb117dfdf <+3>: sub $0x10,%esp
Q1. In the first three cases we save the frame pointer ebp and another common register, ebx, but the rest varies. How does the compiler decide which registers to push and which to skip? Is this some kind of optimization at work? Any pointers on this will be very helpful.
Q2. In the disassembly of FIPS_module_mode we have not even saved the frame pointer ebp. I know that we can save space by optimizing that away with a compiler option. My interest is in understanding whether the absence of the frame pointer is due to that explicit compiler optimization, or whether certain other parameters help decide this.
Q3. How does a debugger like gdb detect, from a core dump, that a frame pointer is omitted for a specific function, as in case 4?
The prototypes of the functions posted are:
int FIPS_module_mode(void);
void OPENSSL_init(void);
int EVP_CipherInit_ex(EVP_CIPHER_CTX *ctx, const EVP_CIPHER *cipher,
ENGINE *impl, const unsigned char *key,
const unsigned char *iv, int enc);
int FIPS_mode(void);
This is running on NetBSD 5, and the core dump was analyzed with gdb.
Q1. gcc (like other optimizing compilers) compiles the whole function, using as many callee-saved registers as is useful, but only as many as needed. The asm isn't generated until gcc is finished optimizing the whole function (or compilation-unit, or program), so gcc knows how many registers it will need when it's emitting the prologue.
Any callee-saved register it uses is pushed in the prologue and popped in the epilogue. In some functions, it uses a callee-saved register simply because it has run out of caller-saved registers it can use without saving (so, purely because of the total number of registers needed). In non-leaf functions, callee-saved registers are also useful for keeping something in a register across a call, which gcc must assume clobbers all the caller-saved registers.
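As a small illustration of the non-leaf case, here is a hedged sketch; the function names scale and scale_and_add are invented for this example. The argument a is needed again after the call, so the compiler has to keep it somewhere the callee will not clobber. With optimization enabled that is typically a call-preserved register such as ebx, which then shows up as a push/pop pair in the prologue and epilogue.

```c
/* noinline keeps this a real call, so the caller must assume that all
   caller-saved registers are clobbered by it. */
__attribute__((noinline)) int scale(int x)
{
    return x * 3;
}

/* a is live across the call to scale(). GCC typically parks it in a
   callee-saved register (saving that register's old value in the
   prologue) or in a stack slot. */
int scale_and_add(int a)
{
    int scaled = scale(a);
    return scaled + a;
}
```

Compiling with gcc -O2 -S and looking at scale_and_add should show which register GCC picked on your target; the choice is up to the compiler and may differ between versions.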
It looks like if gcc only needs one call-preserved register, it chooses ebx. It might use (save/restore) just esi/edi if it wanted to use a rep movs or something, though.
gcc's behaviour is sub-optimal sometimes: some functions have a fast-path that doesn't use many locals, but gcc emits code that pushes before checking, and thus has to pop again. The Linux kernel hints some functions as noinline to keep the fast-path as fast as possible, at the expense of an extra function call in the slow path. As I understand it, this is the main reason for noinline in Linux, rather than code-size bloat.
Q2. Yes, it looks like FIPS_module_mode was compiled with -fomit-frame-pointer (which is the default in newer gcc). If you're looking at a library, the Makefile (or whatever build system) could easily have built different files with different options. Or, even with -fomit-frame-pointer, functions with variable-size local variables do build a stack frame. e.g.
int func(int c) { int tmp[c]; ...; }
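A complete version of that idea, as a sketch only (sum_first is a made-up name, and the body is my own filler, not what the elided code above contained): because the size of tmp is only known at run time, the compiler cannot use fixed offsets from the stack pointer alone, and GCC typically sets up a frame pointer for this function even under -fomit-frame-pointer.

```c
/* Sums 1..c using a variable-length array as scratch space. */
int sum_first(int c)
{
    if (c <= 0)
        return 0;                 /* a VLA must have a positive size */

    int tmp[c];                   /* size known only at run time */
    int total = 0;

    for (int i = 0; i < c; i++)
        tmp[i] = i + 1;
    for (int i = 0; i < c; i++)
        total += tmp[i];
    return total;
}
```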
Q3. I got curious about how modern debuggers do stack backtraces without frame pointers. This blog post sheds some light: there is unwind info in the .eh_frame section (indexed by .eh_frame_hdr). It is not marked as "debug" info, so it doesn't normally get stripped, and you can still backtrace when the call stack goes through a function in a stripped library or something. Use objdump -h to see the size of that section. That data is also used for unwinding the stack if/when a runtime exception is thrown, so that's another reason for not stripping it.
In normal situations (barring bugs that clobber the stack, or compiler / asm-programming errors that mess up the stack pointer), it works without frame pointers, so -fomit-frame-pointer is the default in gcc since 4.6, even for x86. I think it was the default for longer for x86-64.
Without that info, you could scan the stack for values that are in the right range to be return addresses.
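You can watch this kind of unwinding from inside a program. The sketch below assumes glibc, which provides backtrace() in execinfo.h; the function names count_frames and count_frames_one_deeper are invented for illustration. On typical Linux targets glibc walks the stack with the same kind of unwind information, so this should keep working when the code is built with -fomit-frame-pointer, though I haven't checked every platform.

```c
#include <execinfo.h>

#define MAX_FRAMES 32

/* Returns how many return addresses backtrace() could recover,
   starting from this function. */
__attribute__((noinline)) int count_frames(void)
{
    void *frames[MAX_FRAMES];
    return backtrace(frames, MAX_FRAMES);
}

/* Same measurement taken one call level deeper. */
__attribute__((noinline)) int count_frames_one_deeper(void)
{
    volatile int n = count_frames();   /* volatile store blocks a tail call */
    return n;
}
```

Printing both values from main, with and without -fomit-frame-pointer, is a quick way to check that the unwinder does not rely on ebp/rbp chains on your system.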

int 80 doesn't appear in assembly code

Problem
Let's consider:
int main(){
write(1, "hello", 5);
return 0;
}
I am reading a book that suggests the assembly output for the above code should be:
main:
mov $4, %eax
mov $1, %ebx
mov $string, %ecx
mov $len, %edx
int $0x80
(The above code was compiled for a 32-bit architecture. The arguments are passed in registers not because of the 64-bit calling convention, but because we are making a syscall.)
And the output on my 64 bit Ubuntu machine with: gcc -S main.c -m32
is:
pushl $4
pushl $string
pushl $1
call write
My doubts
So it confused me. Why did gcc compile it as a "normal" call, not as a syscall?
In this situation, how do I make the processor invoke a kernel function (like write)?
I am reading a book that suggests the assembly output for the above code should be ...
You shouldn't believe everything you read :-)
There is no requirement that C code be turned into specific assembly code; the only requirement that the C standard mandates is that the resulting code behave in a certain manner.
Whether that's done by directly calling the OS system call with int $0x80 (or sysenter), or whether it's done by calling a library routine write() which eventually calls the OS in a similar fashion, is largely irrelevant.
If you were to locate and disassemble the write() code, you may well find it simply reads those values off the stack into registers and then calls the OS in much the same way as the code you've shown containing int $0x80.
As an aside, what if you wanted to port gcc to a totally different architecture that uses call 5 to do OS-level system calls? If gcc is injecting specific int $0x80 calls into the assembly stream, that's not going to work too well.
But, if it's injecting a call to a write() function, all you have to do is make sure you link it with the correct library containing a modified write() function (one that does call 5 rather than int $0x80).
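If you do want to request the kernel service by number from C, one option on Linux with glibc is the generic syscall() wrapper. This is a sketch under that assumption; the helper names write_via_libc and write_via_syscall are mine. Note that even here the compiler still emits an ordinary call (to syscall), and it is the library that executes the actual trap instruction.

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The usual route: call the C library's write() wrapper. */
long write_via_libc(int fd, const void *buf, size_t len)
{
    return (long)write(fd, buf, len);
}

/* Ask for the write system call by number through glibc's generic
   syscall() function. SYS_write comes from <sys/syscall.h>. */
long write_via_syscall(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Both functions should return the number of bytes written, for example 5 for write_via_syscall(1, "hello", 5) when standard output is writable.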

What is the use of _start() in C?

I learned from my colleague that one can write and execute a C program without writing a main() function. It can be done like this:
my_main.c
/* Compile this with gcc -nostartfiles */
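#include <stdio.h>   /* puts */
#include <stdlib.h>  /* exit */
int my_main(void);   /* declare before use in _start */
void _start(void) {
    int ret = my_main();
    exit(ret);
}
int my_main(void) {
    puts("This is a program without a main() function!");
    return 0;
}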
Compile it with this command:
gcc -o my_main my_main.c -nostartfiles
Run it with this command:
./my_main
When would one need to do this kind of thing? Is there any real world scenario where this would be useful?
The symbol _start is the entry point of your program. That is, the address of that symbol is the address jumped to on program start. Normally, the function with the name _start is supplied by a file called crt0.o which contains the startup code for the C runtime environment. It sets up some stuff, populates the argument array argv, counts how many arguments are there, and then calls main. After main returns, exit is called.
If a program does not want to use the C runtime environment, it needs to supply its own code for _start. For instance, the reference implementation of the Go programming language does so because they need a non-standard threading model which requires some magic with the stack. It's also useful to supply your own _start when you want to write really tiny programs or programs that do unconventional things.
While main is the entry point for your program from a programmer's perspective, _start is the usual entry point from the OS perspective (the first instruction that is executed after your program was started from the OS).
In a typical C and especially C++ program, a lot of work has been done before the execution enters main. Especially stuff like initialization of global variables. Here you can find a good explanation of everything that's going on between _start() and main() and also after main has exited again (see comment below).
The necessary code for that is usually provided by the compiler writers in a startup file, but with the flag -nostartfiles you essentially tell the compiler: "Don't bother giving me the standard startup file, give me full control over what is happening right from the start".
This is sometimes necessary and often used on embedded systems. E.g. if you don't have an OS and you have to manually enable certain parts of your memory system (e.g. caches) before the initialization of your global objects.
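A quick way to see that code runs before main in a normal hosted build is GCC's constructor attribute. This is only a sketch (the names before_main and startup_hook_ran are invented), and it relies on a GCC/Clang extension rather than standard C. The standard startup code arranges for such functions to be called before main. If you replace that code with your own _start, you should not assume they run unless your startup code takes care of it.

```c
static int startup_ran = 0;

/* GCC/Clang extension: the startup code calls this before main(). */
__attribute__((constructor))
static void before_main(void)
{
    startup_ran = 1;
}

/* Reports whether the hook above has already executed. */
int startup_hook_ran(void)
{
    return startup_ran;
}
```

Calling startup_hook_ran() as the first statement of main should return 1 in an ordinary build that uses the default startup files.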
Here is a good overview of what happens during program startup before main. In particular, it shows that _start is the actual entry point to your program from the OS viewpoint.
It is the very first address from which the instruction pointer will start counting in your program.
The code there invokes some C runtime library routines to do some housekeeping, then calls your main, and then tears things down and calls exit with whatever exit code main returned.
When would one need to do this kind of thing?
When you want your own startup code for your program.
main is not the first entry point of a C program; _start is the first entry point, behind the curtain.
Example in Linux:
_start: # _start is the entry point known to the linker
xor %ebp, %ebp # effectively RBP := 0, mark the end of stack frames
mov (%rsp), %edi # get argc from the stack (implicitly zero-extended to 64-bit)
lea 8(%rsp), %rsi # take the address of argv from the stack
lea 16(%rsp,%rdi,8), %rdx # take the address of envp from the stack
xor %eax, %eax # per ABI and compatibility with icc
call main # %edi, %rsi, %rdx are the three args (of which first two are C standard) to main
mov %eax, %edi # transfer the return of main to the first argument of _exit
xor %eax, %eax # per ABI and compatibility with icc
call _exit # terminate the program
Is there any real world scenario where this would be useful?
If you mean, implement our own _start:
Yes, in most of the commercial embedded software I have worked with, we needed to implement our own _start to meet our specific memory and performance requirements.
If you mean, drop the main function and change it to something else:
No, I don't see any benefit in doing that.

GCC not saving/restoring reserved registers on function calls

I have a scenario in GCC causing me problems. The behaviour I get is not the behaviour I expect. To summarise the situation, I am proposing several new instructions for x86-64 which are implemented in a hardware simulator. In order to test these instructions I am taking existing C source code and hand-coding the new instructions in hexadecimal. Because these instructions interact with the existing x86-64 registers, I use the input/output/clobber lists to declare dependencies for GCC.
What's happening is that if I call a function e.g. printf, the dependent registers aren't saved and restored.
For example
register unsigned long r9 asm ("r9") = 101;
printf("foo %s\n", "bar");
asm volatile (".byte 0x00, 0x00, 0x00, 0x00" : /* no output */ : "q" (r9) );
101 was assigned to r9 and the inline assembly (fake in this example) is dependent on r9. This runs correctly in the absence of the printf, but when it is there GCC does not save and restore r9 and another value is there by the time my custom instruction is called.
I thought perhaps that GCC might have secretly changed the assignment to the variable r9, but when I do this
asm volatile (".byte %0" : /* no output */ : "q" (r9) );
and look at the assembly output, it is indeed using %r9.
I am using gcc 4.4.5. What do you think might be happening? I thought GCC would always save and restore registers on function calls. Is there some way I can enforce it?
Thanks!
EDIT: By the way, I'm compiling the program like this
gcc -static -m64 -mmmx -msse -msse2 -O0 test.c -o test
The ABI, section 3.2.1 says:
Registers %rbp, %rbx and %r12 through %r15 “belong” to the calling function and the
called function is required to preserve their values. In other words, a called
function must preserve these registers’ values for its caller. Remaining registers
“belong” to the called function. If a calling function wants to preserve such a
register value across a function call, it must save the value in its local stack frame.
so you shouldn't expect registers other than %rbp, %rbx and %r12 through %r15 to be preserved by a function call.
gcc will not make explicit-register variables like this callee-saved. Basically this register notation you're using makes the variable a direct alias for the register, with the assumption you want to be able to read back the value a callee leaves in the register. If you used a callee-saved register instead of a call-clobbered (caller-saved) register, the problem would go away.
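If the custom instruction really needs its input in r9, another approach (my own suggestion, not something from the question) is to keep the value in an ordinary variable across the call and bind it to the register only right before the asm statement. The sketch below is for x86-64 with GCC or Clang; use_r9_after_call is a hypothetical name, and a plain mov stands in for the custom instruction bytes.

```c
#include <stdio.h>

/* Keeps the value in a normal variable across printf(), then places it
   in r9 immediately before the asm that consumes it. */
unsigned long use_r9_after_call(unsigned long value)
{
    unsigned long seen = value;

    printf("foo %s\n", "bar");    /* free to clobber r9: nothing lives there yet */

#if defined(__x86_64__)
    {
        /* Bind to r9 only now, with no function call before the asm. */
        register unsigned long r9 __asm__("r9") = value;
        /* Stand-in for the custom instruction: copy r9 to the output. */
        __asm__ volatile("mov %1, %0" : "=r"(seen) : "r"(r9));
    }
#endif
    return seen;
}
```

Called with 101, the function should hand 101 back, since the compiler is responsible for keeping the ordinary variable alive across printf and the register binding happens afterwards.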
