I have a simple C function in a separate file, string.c:
void var_init(){
char *hello = "Hello";
}
compiled with:
gcc -ffreestanding -c string.c -o string.o
Then I run
objdump -d string.o
to see the disassembly listing. This is what I got:
string.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <var_init>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # b <var_init+0xb>
b: 48 89 45 f8 mov %rax,-0x8(%rbp)
f: 90 nop
10: 5d pop %rbp
11: c3 retq
I am lost trying to understand this listing. The book "Writing OS from scratch" says something about old disassembly and partly uncovers the mystery, but its listing is completely different, and I don't even see the data interpreted as code in mine, as the author describes.
In addition to the explanation from @VladfromMoscow, I thought it might be helpful for the poster to see what happens when you compile to assembly rather than using objdump, as the data can be seen more plainly then (IMO) and the RIP-relative addressing may make a bit more sense.
gcc -S x.c
Yields
.file "x.c"
.text
.section .rodata
.LC0:
.string "Hello"
.text
.globl var_init
.type var_init, @function
var_init:
.LFB0:
pushq %rbp
movq %rsp, %rbp
leaq .LC0(%rip), %rax
movq %rax, -8(%rbp)
nop
popq %rbp
ret
.LFE0:
.size var_init, .-var_init
.ident "GCC: (Alpine 8.3.0) 8.3.0"
.section .note.GNU-stack,"",@progbits
This instruction
lea 0x0(%rip),%rax
stores the address of the string literal in the register rax.
And this instruction
mov %rax,-0x8(%rbp)
copies the address from the register rax into the stack memory reserved for the local variable. The address occupies 8 bytes, as can be seen from the stack offset -0x8.
This store only happens at all because you compiled without optimization (GCC's default, -O0); it would normally be optimized away. The next thing that happens is that the local vars (in the red zone below the stack pointer) are effectively discarded as the function tears down its stack frame and returns.
The material you're looking at probably included a sub $16, %rsp or similar to allocate space for locals below RBP, and then deallocated that space later; the x86-64 System V ABI doesn't need that in leaf functions (those that don't call any other functions); they can just use the red zone. (See also Where exactly is the red zone on x86-64?). Or compile with gcc -mno-red-zone, which you probably want anyway for freestanding code: Why can't kernel code use a Red Zone
Then it restores the saved value of the caller's RBP (which was earlier set up as a frame pointer; notice that space for locals was addressed relative to RBP).
pop %rbp
and exits, effectively popping the return address into RIP
retq
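To connect the listing back to the C source, here is a minimal sketch. The names hello_address and hello_pointer_size are hypothetical and not from the question. The pointer is returned rather than discarded, so the compiler has to produce the address (the lea) even with optimization enabled. The claim that the string bytes live in .rodata assumes an ELF target like the one in the question.

```c
#include <stddef.h>

/* Hypothetical helper: same literal as in the question, but the pointer is
   returned so the compiler must materialize its address (the lea). */
const char *hello_address(void)
{
    const char *hello = "Hello"; /* the bytes live in .rodata, not on the stack */
    return hello;                /* only the address sits in a register or stack slot */
}

/* Size of the slot needed to hold that pointer; 8 bytes on x86-64,
   which matches the -0x8(%rbp) offset in the listing. */
size_t hello_pointer_size(void)
{
    return sizeof(const char *);
}
```

Compiling this with gcc -S, as shown above, should still produce a leaq of a .LC0-style label, but without the store to the stack once optimization is turned on.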
I have compiled a simple hello world c code with gcc -fpie test.c, and now looking at the binary using objdump:
Disassembly of section __TEXT,__text:
__text:
100000f40: 55 pushq %rbp
100000f41: 48 89 e5 movq %rsp, %rbp
100000f44: 48 83 ec 10 subq $16, %rsp
100000f48: 89 7d fc movl %edi, -4(%rbp)
100000f4b: 8b 75 fc movl -4(%rbp), %esi
100000f4e: 48 8d 3d 5d 00 00 00 leaq 93(%rip), %rdi
100000f55: b0 00 movb $0, %al
...
Are these virtual (runtime) addresses, considering I have compiled with -fpie? What are they used for if the code is position independent?
If I remove -fpie I get the same addresses on the left, and I'm assuming they are the virtual addresses where these instructions will be loaded. Is that correct?
In a PIE (Position Independent Executable), those "addresses" are actually just relative offsets from the base virtual address of your program. When the program is launched, it will be loaded into memory by the kernel loader at some 0x<base_addr>, and your __text section in this case will be at 0x<base_addr> + 0x100000f40.
Note that the base virtual address will change for every execution if you have ASLR (address space layout randomization) enabled, which on any modern system is enabled by default.
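A small sketch of what "relative offsets" means in practice, assuming a platform where a function pointer can be converted to uintptr_t (true on the usual x86-64 Linux and macOS toolchains, but not guaranteed by ISO C). The marker functions are hypothetical. With ASLR the absolute addresses move from run to run, while the distance between two functions in the same image is fixed at link time.

```c
#include <stdint.h>

/* Two hypothetical marker functions; only their addresses matter. */
void marker_a(void) {}
void marker_b(void) {}

/* Distance in bytes between the two functions. The loader shifts both
   addresses by the same base, so this value does not depend on where
   the image ends up in memory. */
intptr_t marker_distance(void)
{
    return (intptr_t)((uintptr_t)&marker_b - (uintptr_t)&marker_a);
}
```

Printing (void *)&marker_a on several runs of a PIE build would typically show different values, while marker_distance() stays the same.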
On 64-bit OS X, compiling a dummy C program like this:
#include <stdio.h>
void foo1() {
}
void foo2() {
}
int main() {
printf("Helloooo!\n");
foo1();
foo2();
return 0;
}
Produces the following ASM code (obtained disassembling the binary with otool):
(__TEXT,__text) section
_foo1:
0000000100000f10 55 pushq %rbp
0000000100000f11 4889e5 movq %rsp, %rbp
0000000100000f14 897dfc movl %edi, -0x4(%rbp)
0000000100000f17 5d popq %rbp
0000000100000f18 c3 retq
0000000100000f19 0f1f8000000000 nopl (%rax)
_foo2:
0000000100000f20 55 pushq %rbp
0000000100000f21 4889e5 movq %rsp, %rbp
0000000100000f24 5d popq %rbp
0000000100000f25 c3 retq
0000000100000f26 662e0f1f840000000000 nopw %cs:(%rax,%rax)
_main:
0000000100000f30 55 pushq %rbp
0000000100000f31 4889e5 movq %rsp, %rbp
0000000100000f34 4883ec10 subq $0x10, %rsp
0000000100000f38 488d3d4b000000 leaq 0x4b(%rip), %rdi ## literal pool for: "Helloooo!\n"
0000000100000f3f c745fc00000000 movl $0x0, -0x4(%rbp)
0000000100000f46 b000 movb $0x0, %al
0000000100000f48 e81b000000 callq 0x100000f68 ## symbol stub for: _printf
0000000100000f4d bf06000000 movl $0x6, %edi
0000000100000f52 8945f8 movl %eax, -0x8(%rbp)
0000000100000f55 e8b6ffffff callq _foo1
0000000100000f5a e8c1ffffff callq _foo2
0000000100000f5f 31c0 xorl %eax, %eax
0000000100000f61 4883c410 addq $0x10, %rsp
0000000100000f65 5d popq %rbp
0000000100000f66 c3 retq
What are the "nop" instructions found right after the "ret" in functions foo1() and foo2()? They are, of course, never executed, since the "ret" instructions return from the function call. Is that some kind of padding, or does it have a different meaning?
From the Assembly language for x86 processors, Kip R. Irvine
The safest (and the most useless) instruction you can write is called NOP (no operation). It takes up 1 byte of program storage and doesn’t do any work. It is sometimes used by compilers and assemblers to align code to even-address boundaries
00000000 66 8B C3 mov ax,bx
00000003 90 nop ; align next instruction
00000004 8B D1 mov edx,ecx
What are the "nop" instructions found right after the "ret" on functions foo1() and foo2()?
The nop is a no-operation instruction (do nothing), from the linked Wikipedia page (emphasis mine)
A NOP is most commonly used for timing purposes, to force memory alignment, to prevent hazards, to occupy a branch delay slot, to render void an existing instruction such as a jump, or as a place-holder to be replaced by active instructions later on in program development (or to replace removed instructions when refactoring would be problematic or time-consuming).
nop is short for No Operation. The nop instructions in this case provide code alignment. Notice that the labels are on 16-byte boundaries. On OS X, the linker (ld) should have a -segalign option that will affect this behavior.
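The padding sizes in the otool listing can be checked with a little arithmetic. This is a sketch with hypothetical helper names (align_up, padding_needed); it assumes the alignment is a power of two, such as the 16 bytes seen here.

```c
#include <stdint.h>

/* Round addr up to the next multiple of alignment (a power of two). */
uint64_t align_up(uint64_t addr, uint64_t alignment)
{
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Number of filler bytes between addr and the next aligned address. */
uint64_t padding_needed(uint64_t addr, uint64_t alignment)
{
    return align_up(addr, alignment) - addr;
}
```

In the listing, foo1's retq is the single byte at 0x100000f18, so the padding starts at 0x100000f19 and padding_needed(0x100000f19, 16) gives 7, the length of the nopl encoding 0f1f8000000000. Likewise, padding_needed(0x100000f26, 16) gives 10, the length of the nopw after foo2.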
I'm analyzing the disassembly of the following (very simple) C program in GDB on X86_64.
int main()
{
int a = 5;
int b = a + 6;
return 0;
}
I understand that on x86-64 the stack grows down. That is, the top of the stack has a lower address than the bottom of the stack. The assembly for the above program is as follows:
Dump of assembler code for function main:
0x0000000000400474 <+0>: push %rbp
0x0000000000400475 <+1>: mov %rsp,%rbp
0x0000000000400478 <+4>: movl $0x5,-0x8(%rbp)
0x000000000040047f <+11>: mov -0x8(%rbp),%eax
0x0000000000400482 <+14>: add $0x6,%eax
0x0000000000400485 <+17>: mov %eax,-0x4(%rbp)
0x0000000000400488 <+20>: mov $0x0,%eax
0x000000000040048d <+25>: leaveq
0x000000000040048e <+26>: retq
End of assembler dump.
I understand that:
We push the base pointer on the stack.
We then copy the value of the stack pointer to the base pointer.
We then copy the value 5 into the address -0x8(%rbp). Since an int is 4 bytes, shouldn't this be at the next address in the stack, which is -0x4(%rbp), rather than -0x8(%rbp)?
We then copy the value of the variable a into %eax, add 6, and then copy the result into the address -0x4(%rbp).
Using this graphic for reference:
(source: thegreenplace.net)
it looks like the stack has the following contents:
|--------------|
| rbp | <-- %rbp
| 11 | <-- -0x4(%rbp)
| 5 | <-- -0x8(%rbp)
when I was expecting this:
|--------------|
| rbp | <-- %rbp
| 5 | <-- -0x4(%rbp)
| 11 | <-- -0x8(%rbp)
which seems to be the case in 7-understanding-c-by-learning-assembly where they show the assembly:
(gdb) disassemble
Dump of assembler code for function main:
0x0000000100000f50 <main+0>: push %rbp
0x0000000100000f51 <main+1>: mov %rsp,%rbp
0x0000000100000f54 <main+4>: mov $0x0,%eax
0x0000000100000f59 <main+9>: movl $0x0,-0x4(%rbp)
0x0000000100000f60 <main+16>: movl $0x5,-0x8(%rbp)
0x0000000100000f67 <main+23>: mov -0x8(%rbp),%ecx
0x0000000100000f6a <main+26>: add $0x6,%ecx
0x0000000100000f70 <main+32>: mov %ecx,-0xc(%rbp)
0x0000000100000f73 <main+35>: pop %rbp
0x0000000100000f74 <main+36>: retq
End of assembler dump.
Why is the value of b being put at a higher memory address in the stack than a, when a is clearly declared and initialized first?
The value of b is put on the stack wherever the compiler feels like it. You have no influence over it. And you shouldn't. It's possible that the order will change between minor versions of the compiler because some internal data structure was changed or some code rearranged. Some compilers will even randomize the layout of the stack on different compilations on purpose because it can make certain bugs harder to exploit.
In fact, the compiler might not use the stack at all. There's no need to. Here's the disassembly of the same program compiled with some optimizations enabled:
$ cat > foo.c
int main()
{
int a = 5;
int b = a + 6;
return 0;
}
$ cc -O -c foo.c
$ objdump -S foo.o
foo.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 31 c0 xor %eax,%eax
2: c3 retq
$
With some simple optimizations the compiler figured out that you don't use the variable 'b', so there's no need to calculate it. And because of that you don't use the variable 'a' either, so there's no need to assign it. Only a compilation with no optimizations (or a very bad compiler) will put anything on the stack here. And even if you do use the values, basic optimizations will put them into registers, because touching the stack is expensive.
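If code genuinely depends on relative placement, the language offers one guarantee that independent locals do not have: members of a struct are laid out at increasing addresses in declaration order. A minimal sketch, with a hypothetical struct and function name:

```c
#include <stddef.h>

/* Hypothetical struct mirroring the two locals from the question. */
struct ordered_pair {
    int a; /* declared first, so it gets the lower offset */
    int b; /* declared second, so it gets a higher offset */
};

/* Returns 1: the C standard requires b to be placed after a. */
int b_is_after_a(void)
{
    return offsetof(struct ordered_pair, b) > offsetof(struct ordered_pair, a);
}
```

No such rule exists for two separate local variables, which is why the two disassemblies in the question can legitimately disagree.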
I'm compiling the following simple demonstration function:
int add(int a, int b) {
return a + b;
}
Naturally this function would be inlined, but let's assume that it's dynamically linked or not inlined for some other reason. With optimization disabled, the compiler produces the expected code:
00000000 <add>:
0: 55 push ebp
1: 89 e5 mov ebp,esp
3: 8b 45 0c mov eax,DWORD PTR [ebp+0xc]
6: 03 45 08 add eax,DWORD PTR [ebp+0x8]
9: 5d pop ebp
a: c3 ret
Since there are no function calls inside this function, the instructions at 0, 1 and 9 seemingly have no purpose. Since optimization is disabled, this is acceptable.
However, when compiling while optimizing for size with -Os -s, the exact same code is produced. It seems rather wasteful to increase the size of the function by 66% with these options.
Why is the code not optimized to the following?
00000000 <add>:
0: 8b 45 0c mov eax,DWORD PTR [esp+0x8]
3: 03 45 08 add eax,DWORD PTR [esp+0x4]
6: c3 ret
Does the compiler just not consider this worth optimizing or is it related to other details like function alignment?
This is done to preserve the ability of the debugger to step through your code.
If you really want to disable this try -fomit-frame-pointer.
Compiling your above code using -Os -fomit-frame-pointer -S -masm=intel gave this:
.file "frame.c"
.intel_syntax noprefix
.text
.globl _add
.def _add; .scl 2; .type 32; .endef
_add:
mov eax, DWORD PTR [esp+8]
add eax, DWORD PTR [esp+4]
ret
.ident "GCC: (rev0, Built by MinGW-builds project) 4.8.0"
The value of EBP is not known when the function enters. Code could use mov eax,dword ptr [esp+8] and not bother with the BP register, but many debugging tools assume that each local variable is at a fixed offset relative to some register. Even if a compiler could keep track of things that were pushed on the stack and adjust indexing offsets appropriately, debuggers would likely be unable to do so.
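To reproduce this comparison without the compiler inlining the function away, one option is the noinline attribute. This is a sketch that assumes GCC or Clang, since the attribute is an extension and not ISO C; call_add is a hypothetical caller added only for illustration.

```c
/* noinline (GCC/Clang extension) keeps add as a real out-of-line function,
   so its prologue and epilogue can be inspected in the disassembly. */
__attribute__((noinline)) int add(int a, int b)
{
    return a + b;
}

/* Hypothetical caller; without noinline the call would likely be folded away. */
int call_add(int x)
{
    return add(x, 6);
}
```

Building this once with -Os -fno-omit-frame-pointer and once with -Os -fomit-frame-pointer, then comparing the -S output, should show the push/mov/pop sequence appear and disappear around the same two instructions.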
I came across a #define in which they use __builtin_expect.
The documentation says:
Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch
prediction information. In general, you should prefer to use actual
profile feedback for this (-fprofile-arcs), as programmers are
notoriously bad at predicting how their programs actually perform.
However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral
expression. The semantics of the built-in are that it is expected that
exp == c. For example:
if (__builtin_expect (x, 0))
foo ();
would indicate that we do not expect to call foo, since we expect x to be zero.
So why not directly use:
if (x)
foo ();
instead of the complicated syntax with __builtin_expect?
Imagine the assembly code that would be generated from:
if (__builtin_expect(x, 0)) {
foo();
...
} else {
bar();
...
}
I guess it should be something like:
cmp $x, 0
jne _foo
_bar:
call bar
...
jmp after_if
_foo:
call foo
...
after_if:
You can see that the instructions are arranged in such an order that the bar case precedes the foo case (as opposed to the C code). This can utilise the CPU pipeline better, since a jump thrashes the already fetched instructions.
Before the jump is executed, the instructions below it (the bar case) are pushed to the pipeline. Since the foo case is unlikely, jumping too is unlikely, hence thrashing the pipeline is unlikely.
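The #define mentioned in the question is often a pair of wrapper macros like the ones below. This is a sketch of that common idiom and not necessarily the exact macro the asker saw; safe_divide is a hypothetical function used only to show the macro in an if statement. It assumes GCC or Clang, since __builtin_expect is a compiler built-in.

```c
/* Common wrappers; !!(x) normalizes any nonzero value to 1 so it can be
   compared against the expected constant. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: division where a zero divisor is the rare case. */
int safe_divide(int numerator, int denominator)
{
    if (unlikely(denominator == 0))
        return 0; /* rare path; the compiler may move it out of the hot path */
    return numerator / denominator;
}
```

The macros do not change what the code computes; they only pass the hint through, so safe_divide(10, 2) still returns 5 and safe_divide(1, 0) returns 0.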
Let's decompile to see what GCC 4.8 does with it
Blagovest mentioned branch inversion to improve the pipeline, but do current compilers really do it? Let's find out!
Without __builtin_expect
#include <stdio.h>
#include <time.h>
int main() {
/* Use time to prevent it from being optimized away. */
int i = !time(NULL);
if (i)
puts("a");
return 0;
}
Compile and decompile with GCC 4.8.2 x86_64 Linux:
gcc -c -O3 -std=gnu11 main.c
objdump -dr main.o
Output:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 75 0a jne 1a <main+0x1a>
10: bf 00 00 00 00 mov $0x0,%edi
11: R_X86_64_32 .rodata.str1.1
15: e8 00 00 00 00 callq 1a <main+0x1a>
16: R_X86_64_PC32 puts-0x4
1a: 31 c0 xor %eax,%eax
1c: 48 83 c4 08 add $0x8,%rsp
20: c3 retq
The instruction order in memory was unchanged: first the puts and then retq return.
With __builtin_expect
Now replace if (i) with:
if (__builtin_expect(i, 0))
and we get:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 74 07 je 17 <main+0x17>
10: 31 c0 xor %eax,%eax
12: 48 83 c4 08 add $0x8,%rsp
16: c3 retq
17: bf 00 00 00 00 mov $0x0,%edi
18: R_X86_64_32 .rodata.str1.1
1c: e8 00 00 00 00 callq 21 <main+0x21>
1d: R_X86_64_PC32 puts-0x4
21: eb ed jmp 10 <main+0x10>
The puts call was moved to the very end of the function, after the retq return!
The new code is basically the same as:
int i = !time(NULL);
if (i)
goto puts;
ret:
return 0;
puts:
puts("a");
goto ret;
This optimization was not done with -O0.
But good luck writing an example that runs faster with __builtin_expect than without; CPUs are really smart these days. My naive attempts are here.
C++20 [[likely]] and [[unlikely]]
C++20 has standardized equivalents of these built-ins as attributes; see: How to use C++20's likely/unlikely attribute in if-else statement. They will likely (a pun!) do the same thing.
The idea of __builtin_expect is to tell the compiler that you'll usually find that the expression evaluates to c, so that the compiler can optimize for that case.
I'd guess that someone thought they were being clever and that they were speeding things up by doing this.
Unfortunately, unless the situation is very well understood (it's likely that they have done no such thing), it may well have made things worse. The documentation even says:
In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
In general, you shouldn't be using __builtin_expect unless:
You have a very real performance issue
You've already optimized the algorithms in the system appropriately
You've got performance data to back up your assertion that a particular case is the most likely
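For the last point, real profile feedback from the compiler is the better tool, but a crude hand-rolled counter can at least show how often a branch is taken. Everything below is a hypothetical sketch (branch_stats, record_branch, taken_ratio and the example workload are all invented for illustration).

```c
/* Hypothetical counters for one branch site. */
struct branch_stats {
    unsigned long taken; /* times the condition was true */
    unsigned long total; /* times the condition was evaluated */
};

/* Record one evaluation and pass the condition through unchanged. */
static int record_branch(struct branch_stats *stats, int condition)
{
    stats->total++;
    if (condition)
        stats->taken++;
    return condition;
}

/* Fraction of evaluations in which the branch was taken (0.0 if never run). */
static double taken_ratio(const struct branch_stats *stats)
{
    return stats->total ? (double)stats->taken / (double)stats->total : 0.0;
}

/* Example workload: how often is i divisible by divisor for i in 1..n?
   divisor must be nonzero. */
double ratio_divisible_by(unsigned n, unsigned divisor)
{
    struct branch_stats stats = {0, 0};
    for (unsigned i = 1; i <= n; i++) {
        if (record_branch(&stats, i % divisor == 0)) {
            /* the branch body under study would go here */
        }
    }
    return taken_ratio(&stats);
}
```

A ratio close to 0 or 1 on representative input is the kind of evidence that would justify a hint; a ratio near 0.5 suggests leaving the branch alone.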
Well, as it says in the description, the first version adds a predictive element to the construction, telling the compiler that the x == 0 branch is the more likely one - that is, it's the branch that will be taken more often by your program.
With that in mind, the compiler can optimize the conditional so that it requires the least amount of work when the expected condition holds, at the expense of maybe having to do more work in case of the unexpected condition.
Take a look at how conditionals are implemented during the compilation phase, and also in the resulting assembly, to see how one branch may be less work than the other.
However, I would only expect this optimization to have noticeable effect if the conditional in question is part of a tight inner loop that gets called a lot, since the difference in the resulting code is relatively small. And if you optimize it the wrong way round, you may well decrease your performance.
I don't see any of the answers addressing the question that I think you were asking, paraphrased:
Is there a more portable way of hinting branch prediction to the compiler?
The title of your question made me think of doing it this way:
if ( !x ) {} else foo();
If the compiler assumes that 'true' is more likely, it could optimize for not calling foo().
The problem here is just that you don't, in general, know what the compiler will assume -- so any code that uses this kind of technique would need to be carefully measured (and possibly monitored over time if the context changes).
I tested it on a Mac, following @Blagovest Buyukliev and @Ciro. The assembly looks clear, and I have added comments.
The commands are:
gcc -c -O3 -std=gnu11 testOpt.c; otool -tVI testOpt.o
With -O3, the output looks the same whether or not __builtin_expect(i, 0) is present.
testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp // open function stack
0000000000000004 xorl %edi, %edi // set time args 0 (NULL)
0000000000000006 callq _time // call time(NULL)
000000000000000b testq %rax, %rax // check time(NULL) result
000000000000000e je 0x14 // jump 0x14 if testq result = 0, namely jump to puts
0000000000000010 xorl %eax, %eax // return 0 , return appear first
0000000000000012 popq %rbp // return 0
0000000000000013 retq // return 0
0000000000000014 leaq 0x9(%rip), %rdi ## literal pool for: "a" // puts part, afterwards
000000000000001b callq _puts
0000000000000020 xorl %eax, %eax
0000000000000022 popq %rbp
0000000000000023 retq
When compiled with -O2, the output differs with and without __builtin_expect(i, 0).
First, without:
testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 xorl %edi, %edi
0000000000000006 callq _time
000000000000000b testq %rax, %rax
000000000000000e jne 0x1c // jump to 0x1c if not zero, then return
0000000000000010 leaq 0x9(%rip), %rdi ## literal pool for: "a" // put part appear first , following jne 0x1c
0000000000000017 callq _puts
000000000000001c xorl %eax, %eax // return part appear afterwards
000000000000001e popq %rbp
000000000000001f retq
Now with __builtin_expect(i, 0):
testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 xorl %edi, %edi
0000000000000006 callq _time
000000000000000b testq %rax, %rax
000000000000000e je 0x14 // jump to 0x14 if zero then put. otherwise return
0000000000000010 xorl %eax, %eax // return appear first
0000000000000012 popq %rbp
0000000000000013 retq
0000000000000014 leaq 0x7(%rip), %rdi ## literal pool for: "a"
000000000000001b callq _puts
0000000000000020 jmp 0x10
To summarize, __builtin_expect works in the last case.
In most cases, you should leave branch prediction alone and not worry about it.
One case where it is beneficial is CPU-intensive algorithms with a lot of branching. In some cases, the jumps can take execution outside the code currently held in the CPU's instruction cache, making the CPU wait for the next part of the program to be fetched. By pushing the unlikely branches to the end, you keep the hot code close together and only jump in the unlikely cases.
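Besides __builtin_expect, GCC and Clang offer a function-level way to express the same idea: the cold attribute. This sketch assumes one of those compilers; handle_rare_error and process are hypothetical names. Per the GCC documentation, a cold function is optimized for size, and paths leading to a call of it are treated as unlikely.

```c
/* cold tells the compiler this function is rarely executed; noinline keeps
   its body out of the caller's hot path. Both are GCC/Clang extensions. */
__attribute__((cold, noinline)) int handle_rare_error(int value)
{
    (void)value; /* a real handler might log the offending value */
    return -1;
}

/* Hypothetical hot function: the error check becomes the unlikely branch. */
int process(int value)
{
    if (value < 0)
        return handle_rare_error(value);
    return value * 2;
}
```

On many targets the cold function is also placed in a separate text subsection, which serves the goal described above of keeping rarely used code away from the hot code.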