I'm trying to create a minimal 64-bit Windows executable to better understand how the Windows executable format works.
I wrote very basic assembly and C code as follows.
hi.s
section .text
hi:
db "hi", 0
global sayHi
align 16
sayHi:
lea rax, [rel hi]
ret
start.c
extern int puts();
extern const char *sayHi();
void start() {
puts(sayHi());
}
compiled with,
nasm -fwin64 hi.s
gcc -c -ostart.obj -O3 -fno-optimize-sibling-calls start.c
# I will explain the flag
and linked with,
golink /fo r.exe /console start.obj hi.obj msvcrt.dll
# create a console application `r.exe`
# the default entry point is `start`
The program runs fine and prints hi, but note the gcc flag -fno-optimize-sibling-calls. That flag disables tail-call optimizations so that the program always allocates stack space and calls a function. Without the flag, the program crashes.
This is the disassembled result without tail-call optimization. Not sure why gcc put a nop there, but otherwise it's very simple and runs fine.
0000000000401000 <.text>:
401000: 48 83 ec 28 sub rsp,0x28
401004: e8 27 00 00 00 call 0x401030 # sayHi
401009: 48 89 c1 mov rcx,rax
40100c: e8 ff 2f 00 00 call 0x404010 # puts
401011: 90 nop
401012: 48 83 c4 28 add rsp,0x28
401016: c3 ret
...
401020: 68 69 00 90 90 push 0xffffffff90900069 # "hi"
...
401030: 48 8d 05 e9 ff ff ff lea rax,[rip+0xffffffffffffffe9] # 0x401020
401037: c3 ret
This is when tail-call opt is enabled, in which the program crashes.
0000000000401000 <.text>:
401000: 48 83 ec 28 sub rsp,0x28
401004: e8 27 00 00 00 call 0x401030 # sayHi
401009: 48 89 c1 mov rcx,rax
40100c: 48 83 c4 28 add rsp,0x28
401010: e9 eb 2f 00 00 jmp 0x404000 # puts
...
401020: 68 69 00 90 90 push 0xffffffff90900069 # "hi"
...
401030: 48 8d 05 e9 ff ff ff lea rax,[rip+0xffffffffffffffe9] # 0x401020
401037: c3 ret
Now the program doesn't allocate stack space before puts and simply does a jmp instead of call.
I investigated further to see where exactly it jumps when calling puts.
In the no-tail-call case, the called address 0x404010 in the .idata section has the instruction jmp QWORD PTR [rip+0xffffffffffffffea] # 0x404000, and 0x404000 seems to contain the address to puts.
However in the tail-call case, the called address 0x404000 has 54 40 00 00 which is no meaningful instruction. The debugger says the program segfaults at 0x404003, so I'm pretty sure the program chokes trying to execute a garbage instruction.
I must be doing something wrong, but I'm not sure which, so could you explain why the tail-call case fails and how to get it work?
The problem was on golink not correctly handling tail-calls. I searched a while to make GNU ld link the program with the same options given to golink.
You can create a console-mode Windows executable by GNU ld with this command.
ld -o... --subsystem=console object-files...
--subsystem console or -subsystem=console also means the same. Use --subsystem=windows to create a GUI application.
GNU ld also handles Windows dll files, so in this case, simply giving ld a copy of msvcrt.dll from the system folder worked.
I have a written piece of code in assembly and at some points of it, I want to jump to a label in C. So I have the following code (shortened version but still, I am having the same problem):
#include <stdio.h>
#define JE asm volatile("jmp end");
int main(){
printf("hi\n");
JE
printf("Invisible\n");
end:
printf("Visible\n");
return 0;
}
This code compiles, but there is no end label in the disassembled version of the code.
If I change the label name from end to any other thing (let's say l1, both in asm code(jmp l1) and in the C code), the compiler says that
main.c:(.text+0x6b): undefined reference to `l1'
collect2: error: ld returned 1 exit status
Makefile:2: recipe for target 'main' failed
make: *** [main] Error 1
I have tried different things(different length, different cases, upper, lower, etc.) and I think it only compiles with end label. And with end label, I am receiving segmentation fault because, there is no end label in the disassembled version.
Compiled with: gcc -O0 main.c -o main
Disassembled code:
000000000000063a <main>:
63a: 55 push %rbp
63b: 48 89 e5 mov %rsp,%rbp
63e: 48 8d 3d af 00 00 00 lea 0xaf(%rip),%rdi # 6f4 <_IO_stdin_used+0x4>
645: e8 c6 fe ff ff callq 510 <puts#plt>
64a: e9 c9 09 20 00 jmpq 201018 <_end> # there is no _end label!
64f: 48 8d 3d a1 00 00 00 lea 0xa1(%rip),%rdi # 6f7 <_IO_stdin_used+0x7>
656: e8 b5 fe ff ff callq 510 <puts#plt>
65b: 48 8d 3d 9f 00 00 00 lea 0x9f(%rip),%rdi # 701 <_IO_stdin_used+0x11>
662: e8 a9 fe ff ff callq 510 <puts#plt>
667: b8 00 00 00 00 mov $0x0,%eax
66c: 5d pop %rbp
66d: c3 retq
66e: 66 90 xchg %ax,%ax
So, the questions are:
Am I doing something wrong? I have seen this kind of jumps (from
assembly to C) in codes. I can provide example links.
Why the compiler/linker cannot find l1 but can find end?
This is what asm goto is for. GCC Inline Assembly: Jump to label outside block
Note that defining a label inside another asm statement will sometimes work (e.g. with optimization disabled) but IS NOT SAFE.
asm("end:"); // BROKEN; NEVER USE
// except for toy experiments to look at compiler output
GNU C does not define the behaviour of jumping from one asm statement to another without asm goto. The compiler is allowed to assume that execution comes out the end of an asm statement and e.g. put a store after it.
The C end: label within a given function won't just have the asm symbol name of end or _end: - that wouldn't make sense because separate C functions are each allowed to have their own end: label. It could be something like main.end but it turns out GCC and clang just use their usual autonumbered labels like .L123.
Then how this code works: https://github.com/IAIK/transientfail/blob/master/pocs/spectre/PHT/sa_oop/main.c
It doesn't; the end label that asm volatile("je end"); references is in the .data section and happens to be defined by the compiler or linker to mark the end of that section.
asm volatile("je end") has no connection to the C label in that function.
I commented out some of the code in other functions to get it to compile without the "cacheutils.h" header but that didn't affect that part of the oop() function; see https://godbolt.org/z/jabYu3 for disassembly of the linked executable with JE_4k changed to JE_16 so it's not huge. It's disassembly of a linked executable so you can see the numeric address of je 6010f0 <_end> while the oop function itself starts at 4006e0 and ends at 400750. (So it doesn't contain the branch target).
If this happens to work for Spectre exploits, that's because apparently the branch is never actually taken.
I know fopen() is in the C standard library, so that I can definitely call the fopen() function in a C program. What I am confused about is why I can call the open() function as well. open() should be a system call, so it is not a C function in the standard library. As I am successfully able to call the open() function, am I calling a C function or a system call?
EJP's comments to the question and Steve Summit's answer are exactly to the point: open() is both a syscall and a function in the standard C library; fopen() is a function in the standard C library, that sets up a file handle -- a data structure of type FILE that contains additional stuff like optional buffering --, and internally calls open() also.
In the hopes to further understanding, I shall show hello.c, an example Hello world -program written in C for Linux on 64-bit x86 (x86-64 AKA AMD64 architecture), which does not use the standard C library at all.
First, hello.c needs to define some macros with inline assembly for us to be able to call the syscalls. These are very architecture- and operating system dependent, which is why this only works in Linux on x86-64 architecture:
/* Freestanding Hello World example in Linux on x86_64/x86.
* Compile using
* gcc -march=x86-64 -mtune=generic -m64 -ffreestanding -nostdlib -nostartfiles hello.c -o hello
*/
#define STDOUT_FILENO 1
#define EXIT_SUCCESS 0
#ifndef __x86_64__
#error This program only works on x86_64 architecture!
#endif
#define SYS_write 1
#define SYS_exit 60
#define SYSCALL1_NORET(nr, arg1) \
__asm__ ( "syscall\n\t" \
: \
: "a" (nr), "D" (arg1) \
: "rcx", "r11" )
#define SYSCALL3(retval, nr, arg1, arg2, arg3) \
__asm__ ( "syscall\n\t" \
: "=a" (retval) \
: "a" (nr), "D" (arg1), "S" (arg2), "d" (arg3) \
: "rcx", "r11" )
The Freestanding in the comment at the beginning of the file refers to "freestanding execution environment"; it is the case when there is no C library available at all. For example, the Linux kernel is written the same way. The normal environment we are familiar with is called "hosted execution environment", by the way.
Next, we can define two functions, or "wrappers", around the syscalls:
static inline void my_exit(int retval)
{
SYSCALL1_NORET(SYS_exit, retval);
}
static inline int my_write(int fd, const void *data, int len)
{
int retval;
if (fd == -1 || !data || len < 0)
return -1;
SYSCALL3(retval, SYS_write, fd, data, len);
if (retval < 0)
return -1;
return retval;
}
Above, my_exit() is roughly equivalent to C standard library exit() function, and my_write() to write().
The C language does not define any kind of a way to do a syscall, so that is why we always need a "wrapper" function of some sort. (The GNU C library does provide a syscall() function for us to do any syscall we wish -- but the point of this example is to not use the C library at all.)
The wrapper functions always involve a bit of (inline) assembly. Again, since C does not have a built-in way to do a syscall, we need to "extend" the language by adding some assembly code. This (inline) assembly, and the syscall numbers, is what makes this example, operating system and architecture dependent. And yes: the GNU C library, for example, contains the equivalent wrappers for quite a few architectures.
Some of the functions in the C library do not use any syscalls. We also need one, the equivalent of strlen():
static inline int my_strlen(const char *str)
{
int len = 0L;
if (!str)
return -1;
while (*str++)
len++;
return len;
}
Note that there is no NULL used anywhere in the above code. It is because it is a macro defined by the C library. Instead, I'm relying on "logical null": (!pointer) is true if and only if pointer is a zero pointer, which is what NULL is on all architectures in Linux. I could have defined NULL myself, but I didn't, in the hopes that somebody might notice the lack of it.
Finally, main() itself is something the GNU C library calls, as in Linux, the actual start point of the binary is called _start. The _start is provided by the hosted runtime environment, and initializes the C library data structures and does other similar preparations. Our example program is so simple we do not need it, so we can just put our simple main program part into _start instead:
void _start(void)
{
const char *msg = "Hello, world!\n";
my_write(STDOUT_FILENO, msg, my_strlen(msg));
my_exit(EXIT_SUCCESS);
}
If you put all of the above together, and compile it using
gcc -march=x86-64 -mtune=generic -m64 -ffreestanding -nostdlib -nostartfiles hello.c -o hello
per the comment at the start of the file, you will end up with a small (about two kilobytes) static binary, that when run,
./hello
outputs
Hello, world!
You can use file hello to examine the contents of the file. You could run strip hello to remove all (unneeded) symbols, reducing the file size further down to about one and a half kilobytes, if file size was really important. (It will make the object dump less interesting, however, so before you do that, check out the next step first.)
We can use objdump -x hello to examine the sections in the file:
hello: file format elf64-x86-64
hello
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x00000000004001e1
Program Header:
LOAD off 0x0000000000000000 vaddr 0x0000000000400000 paddr 0x0000000000400000 align 2**21
filesz 0x00000000000002f0 memsz 0x00000000000002f0 flags r-x
NOTE off 0x0000000000000120 vaddr 0x0000000000400120 paddr 0x0000000000400120 align 2**2
filesz 0x0000000000000024 memsz 0x0000000000000024 flags r--
EH_FRAME off 0x000000000000022c vaddr 0x000000000040022c paddr 0x000000000040022c align 2**2
filesz 0x000000000000002c memsz 0x000000000000002c flags r--
STACK off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4
filesz 0x0000000000000000 memsz 0x0000000000000000 flags rw-
Sections:
Idx Name Size VMA LMA File off Algn
0 .note.gnu.build-id 00000024 0000000000400120 0000000000400120 00000120 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .text 000000d9 0000000000400144 0000000000400144 00000144 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
2 .rodata 0000000f 000000000040021d 000000000040021d 0000021d 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .eh_frame_hdr 0000002c 000000000040022c 000000000040022c 0000022c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 .eh_frame 00000098 0000000000400258 0000000000400258 00000258 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
5 .comment 00000034 0000000000000000 0000000000000000 000002f0 2**0
CONTENTS, READONLY
SYMBOL TABLE:
0000000000400120 l d .note.gnu.build-id 0000000000000000 .note.gnu.build-id
0000000000400144 l d .text 0000000000000000 .text
000000000040021d l d .rodata 0000000000000000 .rodata
000000000040022c l d .eh_frame_hdr 0000000000000000 .eh_frame_hdr
0000000000400258 l d .eh_frame 0000000000000000 .eh_frame
0000000000000000 l d .comment 0000000000000000 .comment
0000000000000000 l df *ABS* 0000000000000000 hello.c
0000000000400144 l F .text 0000000000000016 my_exit
000000000040015a l F .text 000000000000004e my_write
00000000004001a8 l F .text 0000000000000039 my_strlen
0000000000000000 l df *ABS* 0000000000000000
000000000040022c l .eh_frame_hdr 0000000000000000 __GNU_EH_FRAME_HDR
00000000004001e1 g F .text 000000000000003c _start
0000000000601000 g .eh_frame 0000000000000000 __bss_start
0000000000601000 g .eh_frame 0000000000000000 _edata
0000000000601000 g .eh_frame 0000000000000000 _end
The .text section contains our code, and .rodata immutable constants; here, just the Hello, world! string literal. The rest of the sections are stuff the linker adds and the system uses. We can see that we have f(hex) = 15 bytes of read-only data, and d9(hex) = 217 bytes of code; the rest of the file (about a kilobyte or so) is ELF stuff added by the linker for the kernel to use when executing this binary.
We can even examine the actual assembly code contained in hello, by running objdump -d hello:
hello: file format elf64-x86-64
Disassembly of section .text:
0000000000400144 <my_exit>:
400144: 55 push %rbp
400145: 48 89 e5 mov %rsp,%rbp
400148: 89 7d fc mov %edi,-0x4(%rbp)
40014b: b8 3c 00 00 00 mov $0x3c,%eax
400150: 8b 55 fc mov -0x4(%rbp),%edx
400153: 89 d7 mov %edx,%edi
400155: 0f 05 syscall
400157: 90 nop
400158: 5d pop %rbp
400159: c3 retq
000000000040015a <my_write>:
40015a: 55 push %rbp
40015b: 48 89 e5 mov %rsp,%rbp
40015e: 89 7d ec mov %edi,-0x14(%rbp)
400161: 48 89 75 e0 mov %rsi,-0x20(%rbp)
400165: 89 55 e8 mov %edx,-0x18(%rbp)
400168: 83 7d ec ff cmpl $0xffffffff,-0x14(%rbp)
40016c: 74 0d je 40017b <my_write+0x21>
40016e: 48 83 7d e0 00 cmpq $0x0,-0x20(%rbp)
400173: 74 06 je 40017b <my_write+0x21>
400175: 83 7d e8 00 cmpl $0x0,-0x18(%rbp)
400179: 79 07 jns 400182 <my_write+0x28>
40017b: b8 ff ff ff ff mov $0xffffffff,%eax
400180: eb 24 jmp 4001a6 <my_write+0x4c>
400182: b8 01 00 00 00 mov $0x1,%eax
400187: 8b 7d ec mov -0x14(%rbp),%edi
40018a: 48 8b 75 e0 mov -0x20(%rbp),%rsi
40018e: 8b 55 e8 mov -0x18(%rbp),%edx
400191: 0f 05 syscall
400193: 89 45 fc mov %eax,-0x4(%rbp)
400196: 83 7d fc 00 cmpl $0x0,-0x4(%rbp)
40019a: 79 07 jns 4001a3 <my_write+0x49>
40019c: b8 ff ff ff ff mov $0xffffffff,%eax
4001a1: eb 03 jmp 4001a6 <my_write+0x4c>
4001a3: 8b 45 fc mov -0x4(%rbp),%eax
4001a6: 5d pop %rbp
4001a7: c3 retq
00000000004001a8 <my_strlen>:
4001a8: 55 push %rbp
4001a9: 48 89 e5 mov %rsp,%rbp
4001ac: 48 89 7d e8 mov %rdi,-0x18(%rbp)
4001b0: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
4001b7: 48 83 7d e8 00 cmpq $0x0,-0x18(%rbp)
4001bc: 75 0b jne 4001c9 <my_strlen+0x21>
4001be: b8 ff ff ff ff mov $0xffffffff,%eax
4001c3: eb 1a jmp 4001df <my_strlen+0x37>
4001c5: 83 45 fc 01 addl $0x1,-0x4(%rbp)
4001c9: 48 8b 45 e8 mov -0x18(%rbp),%rax
4001cd: 48 8d 50 01 lea 0x1(%rax),%rdx
4001d1: 48 89 55 e8 mov %rdx,-0x18(%rbp)
4001d5: 0f b6 00 movzbl (%rax),%eax
4001d8: 84 c0 test %al,%al
4001da: 75 e9 jne 4001c5 <my_strlen+0x1d>
4001dc: 8b 45 fc mov -0x4(%rbp),%eax
4001df: 5d pop %rbp
4001e0: c3 retq
00000000004001e1 <_start>:
4001e1: 55 push %rbp
4001e2: 48 89 e5 mov %rsp,%rbp
4001e5: 48 83 ec 10 sub $0x10,%rsp
4001e9: 48 c7 45 f8 1d 02 40 movq $0x40021d,-0x8(%rbp)
4001f0: 00
4001f1: 48 8b 45 f8 mov -0x8(%rbp),%rax
4001f5: 48 89 c7 mov %rax,%rdi
4001f8: e8 ab ff ff ff callq 4001a8 <my_strlen>
4001fd: 89 c2 mov %eax,%edx
4001ff: 48 8b 45 f8 mov -0x8(%rbp),%rax
400203: 48 89 c6 mov %rax,%rsi
400206: bf 01 00 00 00 mov $0x1,%edi
40020b: e8 4a ff ff ff callq 40015a <my_write>
400210: bf 00 00 00 00 mov $0x0,%edi
400215: e8 2a ff ff ff callq 400144 <my_exit>
40021a: 90 nop
40021b: c9 leaveq
40021c: c3 retq
The assembly itself is not really that interesting, except that in my_write and my_exit you can see how the inline assembly generated by the SYSCALL...() macro just loads the variables into specific registers, and does the "do syscall" -- which just happens to be an x86-64 assembly instruction also called syscall here; in 32-bit x86 architecture, it is int $80, and yet something else in other architectures.
There is a final wrinkle, related to the reason why I used the prefix my_ for the functions analog to the functions in the C library: the C compiler can provide optimized shortcuts for some C library functions. For GCC, these are listed here; the list includes strlen().
This means we do not actually need the my_strlen() function, because we can use the optimized __builtin_strlen() function GCC provides, even in freestanding environment. The built-ins are usually very optimized; in the case of __builtin_strlen() on x86-64 using GCC-5.4.0, it optimizes to just a couple of register loads and a repnz scasb %es:(%rdi),%al instruction (which looks long, but actually takes just two bytes).
In other words, the final wrinkle is that there is a third type of function, compiler built-ins, that are provided by the compiler (but otherwise just like the functions provided by the C library) in optimized form, depending on the compiler options and architecture used.
If we were to expand the above example so that we'd open a file and write the Hello, world! into it, and compare low-level unistd.h (open()/write()/close()) and standard I/O stdio.h (fopen()/puts()/fclose()) approaches, we'd find that the major difference is in that the FILE handle used by the standard I/O approach contains a lot of extra stuff (that makes the standard file handles quite versatile, just not useful in such a trivial example), most visible in the buffering approach it has. On the assembly level, we'd still see the same syscalls -- open, write, close -- used.
Even though at first glance the ELF format (used for binaries in Linux) contains a lot of "unneeded stuff" (about a kilobyte for our example program above), it is actually a very powerful format. It, and the dynamic loader in Linux, provides a way to auto-load libraries when a program starts (using LD_PRELOAD environment variable), and to interpose functions in other libraries -- essentially, replace them with new ones, but with a way to still be able to call the original interposed version of the function. There are lots of useful tricks, fixes, experiments, and debugging methods these allow.
Although the distinction between "system call" and "library function" can be a useful one to keep in mind, there's the issue that you have to be able to call system calls somehow. In general, then, every system call is present in the C library -- as a thin little library function that does nothing but make the transfer to the system call (however that's implemented).
So, yes, you can call open() from C code if you want to. (And somewhere, perhaps in a file called fopen.c, the author of your C library probably called it too, within the implementation of fopen().)
The starting point for answering your question is to ask another question: What is a system call?
Generally, one thinks of a system call as a procedure that executes at an elevated processor privilege level. Generally, this means switching from user mode to kernel mode (some systems use multiple modes).
The mechanism for and application to enter kernel mode depends upon the system (and one Intel there are multiple ways). The general sequence for invoking a system service is the process executes an instruction that triggers a change processor mode exception. The CPU responds to the exception by invoking the appropriate exception/interrupt handler then dispatches to the appropriate operating system service.
The problem for C programming is that invoking a system service requires executing a specific hardware instruction and setting hardware register values. Operating systems provide wrapper functions that that handle the packing of parameters into registers, triggering the exception, then unpacking the return values from registers.
The open() function usually be a wrapper for high level languages to invoke system services. If you think about, fopen() is generally a "wrapper" for open().
So what we normally think of as a system call is a function that does nothing other than invoke a system service.
I have a question regarding the handling and interpretation of shared libraries.
Suppose, I build a shared object from foo.c using the command:
gcc -shared -fPIC -o libfoo.so foo.c
where foo.c consists of:
#include <stdlib.h>
#include <stdio.h>
int main(void) {
int i;
printf("this is a silly test\n");
if(i)
goto ret;
printf("hello world\n");
ret:
return 0;
}
Now, let's look at the objdump output, specifically that of foo's main:
0000000005ec <main>:
5ec: 55 push %rbp
5ed: 48 89 e5 mov %rsp,%rbp
5f0: 48 83 ec 10 sub $0x10,%rsp
5f4: 48 8d 3d 6b 00 00 00 lea 0x6b(%rip),%rdi # 666 <_fini+0xe>
5fb: e8 00 ff ff ff callq 500 <puts#plt>
600: 83 7d fc 00 cmpl $0x0,-0x4(%rbp)
604: 75 0e jne 614 <main+0x28>
606: 48 8d 3d 65 00 00 00 lea 0x65(%rip),%rdi # 672 <_fini+0x1a>
60d: e8 ee fe ff ff callq 500 <puts#plt>
612: eb 01 jmp 615 <main+0x29>
614: 90 nop
615: b8 00 00 00 00 mov $0x0,%eax
61a: c9 leaveq
61b: c3 retq
61c: 90 nop
61d: 90 nop
61e: 90 nop
61f: 90 nop
I can clearly see that calls to puts are being redirected to the PLT, as expected. However, what I don't understand are the instructions at 604 and 612. They are not relative to the IP, nor a call to the PLT. They use an absolute address, based on the
symbol main.
How could this shared library then possibly be used simultaneously betwen several processes? It could (and should) be loaded at different virtual addresses, but the point is that each process should share the implementation stored in RAM. How can different processes with main loaded at different virtual addresses share the instructions at 604 and 612?
They're not absolute jumps, they're PC relative jumps. In fact ALL direct jump and call instructions are PC-relative on x86 -- there are NO absolute direct jumps (so if you want an absolute jump, it has to be indirect).
The reason the callq instruction use the PLT is because the target symbol might be in a different shared object, so even a relative branch won't work (the other shared object might be loaded at any address, independently of this shared object). The PLT itself is actually a small piece of code within the shared object, with a single (indirect) absolute jump for each remote symbol. When the shared objects a dynamically loaded, the absolute address is set appropriately, so when the code runs, the callq instruction will be a pc-relative branch (call) to the PLT which consists of a single indirect jump to the puts routine.
I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
int a = 0;
a += 1;
return 0;
}
The objdump disassembly has the code, but nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I learned nops do nothing, and since after ret wouldn't even be executed.
My question is: why bother? Couldn't ELF(linux-x86) work with a .text section(+main) of any size?
I'd appreciate any help, just trying to learn.
First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance,
-falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only
if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in
that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be
advantageous to align critical loop entries and subroutine entries by 16 in order to minimize
the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly
language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq
This is done to align the next function by 8, 16 or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A.Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an importantsubroutine entry or jump label happens to be near the end of a 16-byte block then themicroprocessor will only get a few useful bytes of code when fetching that block of code. Itmay have to fetch the next 16 bytes too before it can decode the first instructions after thelabel. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many
NOP
's as needed before thesubroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.
As far as I remember, instructions are pipelined in cpu and different cpu blocks (loader, decoder and such) process subsequent instructions. When RET instructions is being executed, few next instructions are already loaded into cpu pipeline. It's a guess, but you can start digging here and if you find out (maybe the specific number of NOPs that are safe, share your findings please.