What is the advantage of GCC's __builtin_expect in if else statements?

I came across a #define in which they use __builtin_expect.
The documentation says:
Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch
prediction information. In general, you should prefer to use actual
profile feedback for this (-fprofile-arcs), as programmers are
notoriously bad at predicting how their programs actually perform.
However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral
expression. The semantics of the built-in are that it is expected that
exp == c. For example:
if (__builtin_expect (x, 0))
foo ();
would indicate that we do not expect to call foo, since we expect x to be zero.
So why not directly use:
if (x)
foo ();
instead of the complicated syntax with __builtin_expect?

Imagine the assembly code that would be generated from:
if (__builtin_expect(x, 0)) {
    foo();
    ...
} else {
    bar();
    ...
}
I guess it should be something like:
    cmp   $x, 0
    jne   _foo
_bar:
    call  bar
    ...
    jmp   after_if
_foo:
    call  foo
    ...
after_if:
You can see that the instructions are arranged in such an order that the bar case precedes the foo case (as opposed to the C code). This can utilise the CPU pipeline better, since a taken jump throws away instructions that have already been fetched.
Before the jump is executed, the instructions below it (the bar case) are pushed into the pipeline. Since the foo case is unlikely, the jump is unlikely to be taken, hence flushing the pipeline is unlikely.
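For a concrete sketch of how this is typically used, consider an error check on a hot path (a minimal example; err is a hypothetical status value that is nonzero only on rare failures):

#include <stdio.h>

int process(int err)
{
    /* Hint: we expect err == 0, so the error branch is laid out cold. */
    if (__builtin_expect(err != 0, 0)) {
        fprintf(stderr, "error %d\n", err);
        return -1;
    }
    return 0; /* hot path: usually falls straight through, no taken jump */
}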

Let's disassemble to see what GCC 4.8 does with it
Blagovest mentioned branch inversion to improve the pipeline, but do current compilers really do it? Let's find out!
Without __builtin_expect
#include "stdio.h"
#include "time.h"
int main() {
/* Use time to prevent it from being optimized away. */
int i = !time(NULL);
if (i)
puts("a");
return 0;
}
Compile and disassemble with GCC 4.8.2 on x86_64 Linux:
gcc -c -O3 -std=gnu11 main.c
objdump -dr main.o
Output:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 75 0a jne 1a <main+0x1a>
10: bf 00 00 00 00 mov $0x0,%edi
11: R_X86_64_32 .rodata.str1.1
15: e8 00 00 00 00 callq 1a <main+0x1a>
16: R_X86_64_PC32 puts-0x4
1a: 31 c0 xor %eax,%eax
1c: 48 83 c4 08 add $0x8,%rsp
20: c3 retq
The instruction order in memory was unchanged from the source: first the puts call, then the retq return.
With __builtin_expect
Now replace if (i) with:
if (__builtin_expect(i, 0))
and we get:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 74 07 je 17 <main+0x17>
10: 31 c0 xor %eax,%eax
12: 48 83 c4 08 add $0x8,%rsp
16: c3 retq
17: bf 00 00 00 00 mov $0x0,%edi
18: R_X86_64_32 .rodata.str1.1
1c: e8 00 00 00 00 callq 21 <main+0x21>
1d: R_X86_64_PC32 puts-0x4
21: eb ed jmp 10 <main+0x10>
The puts was moved to the very end of the function, past the retq return!
The new code is basically the same as:
int i = !time(NULL);
if (i)
    goto puts;
ret:
    return 0;
puts:
    puts("a");
    goto ret;
This optimization was not done with -O0.
But good luck writing an example that actually runs faster with __builtin_expect than without: CPUs are really smart these days. My naive attempts are here.
C++20 [[likely]] and [[unlikely]]
C++20 has standardized these hints as attributes (see: How to use C++20's likely/unlikely attribute in if-else statement). They will likely (pun intended) do the same thing.

The idea of __builtin_expect is to tell the compiler that you'll usually find that the expression evaluates to c, so that the compiler can optimize for that case.
I'd guess that someone thought they were being clever and that they were speeding things up by doing this.
Unfortunately, unless the situation is very well understood (it's likely that they have done no such thing), it may well have made things worse. The documentation even says:
In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
In general, you shouldn't be using __builtin_expect unless:
You have a very real performance issue
You've already optimized the algorithms in the system appropriately
You've got performance data to back up your assertion that a particular case is the most likely
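For reference, the profile-feedback route that the documentation recommends might look like this with GCC (the flags are standard GCC options; the file name is a placeholder):

gcc -O2 -fprofile-generate main.c -o main
./main                                # run a representative workload; writes *.gcda profile data
gcc -O2 -fprofile-use main.c -o main  # recompile using the measured branch frequencies

The compiler then lays out branches according to measured frequencies instead of your guesses.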

Well, as it says in the description, the first version adds a predictive element to the construction, telling the compiler that the x == 0 branch is the more likely one - that is, it's the branch that will be taken more often by your program.
With that in mind, the compiler can optimize the conditional so that it requires the least amount of work when the expected condition holds, at the expense of maybe having to do more work in case of the unexpected condition.
Take a look at how conditionals are implemented during the compilation phase, and also in the resulting assembly, to see how one branch may be less work than the other.
However, I would only expect this optimization to have noticeable effect if the conditional in question is part of a tight inner loop that gets called a lot, since the difference in the resulting code is relatively small. And if you optimize it the wrong way round, you may well decrease your performance.

I don't see any of the answers addressing the question that I think you were asking, paraphrased:
Is there a more portable way of hinting branch prediction to the compiler?
The title of your question made me think of doing it this way:
if ( !x ) {} else foo();
If the compiler assumes that 'true' is more likely, it could optimize for not calling foo().
The problem here is just that you don't, in general, know what the compiler will assume -- so any code that uses this kind of technique would need to be carefully measured (and possibly monitored over time if the context changes).
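One reasonably portable approach is to hide the builtin behind macros that degrade to a plain condition on compilers that lack it (a common sketch; the !! normalises any truthy value to exactly 1, since __builtin_expect compares the expression against the constant):

#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x) /* no hint available: fall back to the plain condition */
#define UNLIKELY(x) (x)
#endif

A condition then reads as if (UNLIKELY(err)) { ... } and still compiles everywhere, with the hint taking effect only where it is supported.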

I tested this on a Mac, following @Blagovest Buyukliev and @Ciro. The assembly listings look clear, and I have added comments.
Commands are
gcc -c -O3 -std=gnu11 testOpt.c; otool -tVI testOpt.o
When I use -O3, the output looks the same whether __builtin_expect(i, 0) is present or not.
testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp // set up the stack frame
0000000000000004 xorl %edi, %edi // set time's argument to 0 (NULL)
0000000000000006 callq _time // call time(NULL)
000000000000000b testq %rax, %rax // check the result of time(NULL)
000000000000000e je 0x14 // if the result is 0, jump to 0x14, i.e. to the puts part
0000000000000010 xorl %eax, %eax // return 0; the return path comes first
0000000000000012 popq %rbp
0000000000000013 retq
0000000000000014 leaq 0x9(%rip), %rdi ## literal pool for: "a" // the puts part comes afterwards
000000000000001b callq _puts
0000000000000020 xorl %eax, %eax
0000000000000022 popq %rbp
0000000000000023 retq
When compiled with -O2, the output differs depending on whether __builtin_expect(i, 0) is present.
First, without it:
testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 xorl %edi, %edi
0000000000000006 callq _time
000000000000000b testq %rax, %rax
000000000000000e jne 0x1c // if not zero, jump to 0x1c, i.e. skip the puts and return
0000000000000010 leaq 0x9(%rip), %rdi ## literal pool for: "a" // the puts part comes first, right after the jne
0000000000000017 callq _puts
000000000000001c xorl %eax, %eax // the return part comes afterwards
000000000000001e popq %rbp
000000000000001f retq
Now, with __builtin_expect(i, 0):
testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 xorl %edi, %edi
0000000000000006 callq _time
000000000000000b testq %rax, %rax
000000000000000e je 0x14 // if zero, jump to 0x14 (the puts part); otherwise fall through and return
0000000000000010 xorl %eax, %eax // the return path comes first
0000000000000012 popq %rbp
0000000000000013 retq
0000000000000014 leaq 0x7(%rip), %rdi ## literal pool for: "a"
000000000000001b callq _puts
0000000000000020 jmp 0x10
To summarize: in the -O2 case, __builtin_expect makes a difference, moving the unlikely puts branch past the return.

In most cases, you should leave the branch prediction as it is, and you do not need to worry about it.
One case where it is beneficial is CPU-intensive algorithms with a lot of branching. In some cases, the jumps could cause execution to fall outside the code currently held in the CPU's instruction cache, making the CPU wait while the next part of the program is fetched. By pushing the unlikely branches to the end, you keep the hot code close together in memory and only jump in the unlikely cases.
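A related GCC facility: marking a rarely called function with __attribute__((cold)) tells the compiler to optimize it for size and to treat every path that calls it as unlikely, pulling it out of the hot instruction stream (a sketch; report_error is a hypothetical function):

/* GCC treats branches leading to calls of a cold function as unlikely
   and may place the function itself in a separate .text.unlikely section. */
__attribute__((cold)) void report_error(const char *msg);

void process(int status)
{
    if (status != 0) /* implicitly unlikely, because the callee is cold */
        report_error("bad status");
}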

Related

How would you explain this disassembly listing?

I have a simple function in C, in a separate file string.c:
void var_init() {
    char *hello = "Hello";
}
compiled with:
gcc -ffreestanding -c string.c -o string.o
And then I use command
objdump -d string.o
to see disassemble listing. What I got is:
string.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <var_init>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # b <var_init+0xb>
b: 48 89 45 f8 mov %rax,-0x8(%rbp)
f: 90 nop
10: 5d pop %rbp
11: c3 retq
I am lost trying to understand this listing. The book "Writing OS from scratch" says something about disassembly and slightly uncovers the mystery, but its listing is completely different, and I do not even see data interpreted as code in mine, as the author describes.
In addition to the explanation from @VladfromMoscow, I just thought it might be helpful for the poster to see what happens when you compile to assembly rather than inspecting the object file with objdump, as the data can be seen more plainly (IMO) and the RIP-relative addressing may make a bit more sense.
gcc -S x.c
Yields
.file "x.c"
.text
.section .rodata
.LC0:
.string "Hello"
.text
.globl var_init
.type var_init, #function
var_init:
.LFB0:
pushq %rbp
movq %rsp, %rbp
leaq .LC0(%rip), %rax
movq %rax, -8(%rbp)
nop
popq %rbp
ret
.LFE0:
.size var_init, .-var_init
.ident "GCC: (Alpine 8.3.0) 8.3.0"
.section .note.GNU-stack,"",#progbits
This instruction
lea 0x0(%rip),%rax
loads the address of the string literal into the register rax (in the unlinked object file the displacement is still 0x0; the linker, guided by the relocation, fills in the real RIP-relative distance to the string in .rodata).
And this instruction
mov %rax,-0x8(%rbp)
copies the address from the register rax into the stack memory allocated for the local variable. The address occupies 8 bytes, as can be seen from the stack offset -0x8.
This store only happens at all because you compiled without optimization; it would normally be optimized away. The next thing that happens is that the local vars (in the red zone below the stack pointer) are effectively discarded as the function tears down its stack frame and returns.
The material you're looking at probably included a sub $16, %rsp or similar to allocate space for locals below RBP, then deallocated that space later; the x86-64 System V ABI doesn't need that in leaf functions (functions that don't call any other functions): they can just use the red zone. (See also Where exactly is the red zone on x86-64?) Or compile with gcc -mno-red-zone, which you probably want anyway for freestanding code: Why can't kernel code use a Red Zone
Then it restores the saved value of the caller's RBP (which was earlier set up as a frame pointer; notice that space for locals was addressed relative to RBP).
pop %rbp
and exits, effectively popping the return address into RIP
retq
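Incidentally, you can confirm that the store only exists at -O0. Compiling the same file with optimization, e.g.

gcc -O2 -S string.c

should shrink var_init to essentially

var_init:
    ret

because the dead store to the unused local (and, with nothing referencing it, the string itself) is eliminated.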

What are the addresses on the left in the output of objdump on a binary file?

I have compiled a simple hello world c code with gcc -fpie test.c, and now looking at the binary using objdump:
Disassembly of section __TEXT,__text:
__text:
100000f40: 55 pushq %rbp
100000f41: 48 89 e5 movq %rsp, %rbp
100000f44: 48 83 ec 10 subq $16, %rsp
100000f48: 89 7d fc movl %edi, -4(%rbp)
100000f4b: 8b 75 fc movl -4(%rbp), %esi
100000f4e: 48 8d 3d 5d 00 00 00 leaq 93(%rip), %rdi
100000f55: b0 00 movb $0, %al
...
Are these virtual (runtime) addresses, considering I have compiled with -fpie? What are they used for if the code is position independent?
If I remove -fpie I still get the same addresses on the left, and I'm assuming they are the virtual addresses where these instructions will be loaded to. Correct?
In a PIE (Position Independent Executable), those "addresses" are actually just relative offsets from the base virtual address of your program. When the program is launched, it will be loaded into memory by the kernel loader at some 0x<base_addr>, and your __text section in this case will be at 0x<base_addr> + 0x100000f40.
Note that the base virtual address will change for every execution if you have ASLR (address space layout randomization) enabled, which on any modern system is enabled by default.
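A quick way to see this in action is to print a code address at run time and compare it with the fixed offset shown by objdump (a minimal sketch):

#include <stdio.h>

int main(void)
{
    /* With PIE + ASLR this value changes from run to run,
       while the offset printed by objdump stays the same. */
    printf("main is loaded at %p\n", (void *)main);
    return 0;
}

Running the binary several times should print a different address each run; only the low page-offset bits stay in agreement with the static offset.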

does gcc preserve callee save registers

As you might have guessed, the question is: does gcc automatically save callee-saved registers, or should I do it myself? I thought that gcc would do that for me, but when I wrote this code
void foo(void) {
    __asm__ volatile ("mov $123, %rbx");
}

void main(void) {
    foo();
}
after gcc a.c && objdump -d a.out I saw this
00000000004004f6 <foo>:
4004f6: 55 push %rbp
4004f7: 48 89 e5 mov %rsp,%rbp
4004fa: 48 c7 c3 7b 00 00 00 mov $0x7b,%rbx
400501: 90 nop
400502: 5d pop %rbp
400503: c3 retq
0000000000400504 <main>:
400504: 55 push %rbp
400505: 48 89 e5 mov %rsp,%rbp
400508: e8 e9 ff ff ff callq 4004f6 <foo>
40050d: 90 nop
40050e: 5d pop %rbp
40050f: c3 retq
According to the x86-64 ABI, %rbx is a callee-saved register, but in this code gcc did not save it in foo before modifying it. Is it just because I don't use %rbx in main after calling foo(), or is it because gcc does not provide such guarantees and I have to save it myself in foo before modifying it?
Gcc will automatically save and restore all callee-save registers THAT IT KNOWS ARE USED. It knows about registers it uses itself, but it will only know about registers used in inline assembly if you tell it. That's what the 'clobbers' list is for:
void foo(void) {
    __asm__ volatile ("mov $123, %%rbx" : : : "rbx");
}
Now the compiler knows that you are using/modifying rbx, so it will save it if it needs to.
Note that you really want to do it this way rather than trying to save it yourself, as this way it will only be saved once if gcc also wants to use the register for something in this function.
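Where it fits, an even cleaner option is not to hard-code the register at all and to let the compiler pick one through an output constraint (a minimal sketch, separate from whatever the original code intended):

void foo(void)
{
    unsigned long v;
    /* "=r" lets gcc choose any free register, so there is nothing to clobber */
    __asm__ volatile ("mov $123, %0" : "=r"(v));
}

With a constraint, gcc can even prefer a caller-saved scratch register, so no save/restore is needed at all.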
It would be pretty boggy code that saves and restores every register by rote. The compiler saves registers within the C code it has compiled, but you are on your own here, gcc has no idea what your intentions are.
The assembler allows you to get under the hood, but it won't replace the spark plugs for you.

Confusion with system call

I am trying to understand how a system call is made in x86. I am reading Smashing the stack for fun and profit. There is a function given on page 7:
#include <stdio.h>

void main() {
    char *name[2];
    name[0] = "/bin/sh";
    name[1] = NULL;
    execve(name[0], name, NULL);
}
and below the function is given its assembly dump:
Dump of assembler code for function main:
0x8000130 : pushl %ebp
0x8000131 : movl %esp,%ebp
0x8000133 : subl $0x8,%esp
0x8000136 : movl $0x80027b8,0xfffffff8(%ebp)
0x800013d : movl $0x0,0xfffffffc(%ebp)
0x8000144 : pushl $0x0
0x8000146 : leal 0xfffffff8(%ebp),%eax
0x8000149 : pushl %eax
0x800014a : movl 0xfffffff8(%ebp),%eax
0x800014d : pushl %eax
0x800014e : call 0x80002bc <__execve>
0x8000153 : addl $0xc,%esp
0x8000156 : movl %ebp,%esp
0x8000158 : popl %ebp
0x8000159 : ret
Dump of assembler code for function __execve:
0x80002bc <__execve>: pushl %ebp
0x80002bd <__execve+1>: movl %esp,%ebp
0x80002bf <__execve+3>: pushl %ebx
0x80002c0 <__execve+4>: movl $0xb,%eax
0x80002c5 <__execve+9>: movl 0x8(%ebp),%ebx
0x80002c8 <__execve+12>: movl 0xc(%ebp),%ecx
0x80002cb <__execve+15>: movl 0x10(%ebp),%edx
0x80002ce <__execve+18>: int $0x80
0x80002d0 <__execve+20>: movl %eax,%edx
0x80002d2 <__execve+22>: testl %edx,%edx
0x80002d4 <__execve+24>: jnl 0x80002e6 <__execve+42>
0x80002d6 <__execve+26>: negl %edx
0x80002d8 <__execve+28>: pushl %edx
0x80002d9 <__execve+29>: call 0x8001a34 <__normal_errno_location>
0x80002de <__execve+34>: popl %edx
0x80002df <__execve+35>: movl %edx,(%eax)
0x80002e1 <__execve+37>: movl $0xffffffff,%eax
0x80002e6 <__execve+42>: popl %ebx
0x80002e7 <__execve+43>: movl %ebp,%esp
0x80002e9 <__execve+45>: popl %ebp
0x80002ea <__execve+46>: ret
0x80002eb <__execve+47>: nop
However on writing the same code on my machine and compiling with
gcc test.c -m32 -g -o test -fno-stack-protector -static
and generating the dump with
objdump -S test > test.dis
I get the following dump for main:
void main(){
8048e24: 55 push %ebp
8048e25: 89 e5 mov %esp,%ebp
8048e27: 83 e4 f0 and $0xfffffff0,%esp
8048e2a: 83 ec 20 sub $0x20,%esp
char *name[2];
name[0] = "/bin/sh";
8048e2d: c7 44 24 18 e8 de 0b movl $0x80bdee8,0x18(%esp)
8048e34: 08
name[1] = NULL;
8048e35: c7 44 24 1c 00 00 00 movl $0x0,0x1c(%esp)
8048e3c: 00
execve(name[0], name, NULL);
8048e3d: 8b 44 24 18 mov 0x18(%esp),%eax
8048e41: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
8048e48: 00
8048e49: 8d 54 24 18 lea 0x18(%esp),%edx
8048e4d: 89 54 24 04 mov %edx,0x4(%esp)
8048e51: 89 04 24 mov %eax,(%esp)
8048e54: e8 17 34 02 00 call 806c270 <__execve>
}
And for __execve:
0806c270 <__execve>:
806c270: 53 push %ebx
806c271: 8b 54 24 10 mov 0x10(%esp),%edx
806c275: 8b 4c 24 0c mov 0xc(%esp),%ecx
806c279: 8b 5c 24 08 mov 0x8(%esp),%ebx
806c27d: b8 0b 00 00 00 mov $0xb,%eax
806c282: ff 15 f0 99 0e 08 call *0x80e99f0
806c288: 3d 00 f0 ff ff cmp $0xfffff000,%eax
806c28d: 77 02 ja 806c291 <__execve+0x21>
806c28f: 5b pop %ebx
806c290: c3 ret
806c291: c7 c2 e8 ff ff ff mov $0xffffffe8,%edx
806c297: f7 d8 neg %eax
806c299: 65 89 02 mov %eax,%gs:(%edx)
806c29c: 83 c8 ff or $0xffffffff,%eax
806c29f: 5b pop %ebx
806c2a0: c3 ret
806c2a1: 66 90 xchg %ax,%ax
806c2a3: 66 90 xchg %ax,%ax
806c2a5: 66 90 xchg %ax,%ax
806c2a7: 66 90 xchg %ax,%ax
806c2a9: 66 90 xchg %ax,%ax
806c2ab: 66 90 xchg %ax,%ax
806c2ad: 66 90 xchg %ax,%ax
806c2af: 90 nop
I understand that the article is very old, so it may not match current standards exactly. In fact, I am able to make sense of most of the differences. Here is what is bothering me:
From what I know: to make the exec system call I need to put the arguments in specific registers and call the instruction
int 0x80
to send an interrupt. I can see this instruction at address 0x80002ce in the dump given in the article. But I cannot find the same instruction in mine. In place of it I find
call *0x80e99f0
and the address 0x80e99f0 doesn't even exist in my dump. What am I missing here? What is the point of the * before 0x80e99f0? Is the address 0x80e99f0 being dynamically loaded at runtime? If so, what is the use of the -static flag during compilation, and what can I do to make the dump similar to that of the article?
I am running 64-bit Ubuntu 14.04 on an Intel processor.
Edit, after getting the suggestion to run objdump with the -DS flag:
I finally get the hidden address:
080e99f0 <_dl_sysinfo>:
80e99f0: 70 ed jo 80e99df <_dl_load_lock+0x7>
80e99f2: 06 push %es
80e99f3: 08 b0 a6 09 08 07 or %dh,0x70809a6(%eax)
but I still can't make any sense of it.
The address in jo 80e99df points again to something that is hidden in between these lines:
080e99d8 <_dl_load_lock>:
...
80e99e4: 01 00 add %eax,(%eax)
...
As is evident from the answer, the code actually jumps to the address stored at memory location 0x80e99f0, which eventually points to an int $0x80 instruction.
Traditionally, Linux used interrupt 0x80 to invoke system calls. Since the PentiumPro, there is an alternative way to invoke a system call: using the SYSENTER instruction (AMD also has its own SYSCALL instruction). This is a more efficient way to invoke a system call.
Choosing which syscall mechanism to use
The linux kernel and glibc have a mechanism to choose between the different ways to invoke a system call.
The kernel sets up a virtual shared library for each process, called the VDSO (virtual dynamic shared object), which you can see in the output of cat /proc/<pid>/maps:
$ cat /proc/self/maps
08048000-0804c000 r-xp 00000000 03:04 1553592 /bin/cat
0804c000-0804d000 rw-p 00003000 03:04 1553592 /bin/cat
[...]
b7ee8000-b7ee9000 r-xp b7ee8000 00:00 0 [vdso]
[...]
This vdso, among other things, contains an appropriate system call invocation sequence for the CPU in use, e.g:
ffffe414 <__kernel_vsyscall>:
ffffe414: 51 push %ecx ; \
ffffe415: 52 push %edx ; > save registers
ffffe416: 55 push %ebp ; /
ffffe417: 89 e5 mov %esp,%ebp ; save stack pointer
ffffe419: 0f 34 sysenter ; invoke system call
ffffe41b: 90 nop
ffffe41c: 90 nop ; the kernel will usually
ffffe41d: 90 nop ; return to the insn just
ffffe41e: 90 nop ; past the jmp, but if the
ffffe41f: 90 nop ; system call was interrupted
ffffe420: 90 nop ; and needs to be restarted
ffffe421: 90 nop ; it will return to this jmp
ffffe422: eb f3 jmp ffffe417 <__kernel_vsyscall+0x3>
ffffe424: 5d pop %ebp ; \
ffffe425: 5a pop %edx ; > restore registers
ffffe426: 59 pop %ecx ; /
ffffe427: c3 ret ; return to caller
In arch/x86/vdso/vdso32/ there are implementations using int 0x80, sysenter and syscall, the kernel selects the appropriate one.
To let userspace know that there is a vdso, and where it is located, the kernel sets AT_SYSINFO and AT_SYSINFO_EHDR entries in the auxiliary vector (auxv, the 4th argument to main(), after argc, argv, envp, which is used to pass some information from the kernel to newly started processes). AT_SYSINFO_EHDR points to the ELF header of the vdso, AT_SYSINFO points to the vsyscall implementation:
$ LD_SHOW_AUXV=1 id # tell the dynamic linker ld.so to output auxv values
AT_SYSINFO: 0xb7fd4414
AT_SYSINFO_EHDR: 0xb7fd4000
[...]
glibc uses this information to locate the vsyscall. It stores it into the dynamic loader global _dl_sysinfo, e.g.:
glibc-2.16.0/elf/dl-support.c: _dl_aux_init():
#ifdef NEED_DL_SYSINFO
    case AT_SYSINFO:
      GL(dl_sysinfo) = av->a_un.a_val;
      break;
#endif
#if defined NEED_DL_SYSINFO || defined NEED_DL_SYSINFO_DSO
    case AT_SYSINFO_EHDR:
      GL(dl_sysinfo_dso) = (void *) av->a_un.a_val;
      break;
#endif
glibc-2.16.0/elf/dl-sysdep.c:_dl_sysdep_start()
glibc-2.16.0/elf/rtld.c:dl_main:
GLRO(dl_sysinfo) = GLRO(dl_sysinfo_dso)->e_entry + l->l_addr;
and in a field in the header of the TCB (thread control block):
glibc-2.16.0/nptl/sysdeps/i386/tls.h
_head->sysinfo = GLRO(dl_sysinfo)
If the kernel is old and doesn't provide a vdso, glibc provides a default implementation for _dl_sysinfo:
.hidden _dl_sysinfo_int80
_dl_sysinfo_int80:
    int $0x80
    ret
When a program is compiled against glibc, depending on circumstances, a choice is made between different ways of invoking a system call:
glibc-2.16.0/sysdeps/unix/sysv/linux/i386/sysdep.h:
/* The original calling convention for system calls on Linux/i386 is
to use int $0x80. */
#ifdef I386_USE_SYSENTER
# ifdef SHARED
# define ENTER_KERNEL call *%gs:SYSINFO_OFFSET
# else
# define ENTER_KERNEL call *_dl_sysinfo
# endif
#else
# define ENTER_KERNEL int $0x80
#endif
int 0x80 ← the traditional way
call *%gs:offsetof(tcb_head_t, sysinfo) ← %gs points to the TCB, so this jumps indirectly through the pointer to vsyscall stored in the TCB
call *_dl_sysinfo ← this jumps indirectly through the global variable
So, in x86:
system call
↓
int 0x80 / call *%gs:0x10 / call *_dl_sysinfo
│ │
╰─┬──────────┼─────────╮
↓ ↓ ↓
(in vdso) int 0x80 / sysenter / syscall
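To see the traditional mechanism for yourself, you can issue int 0x80 directly from C (a minimal sketch for 32-bit x86 Linux, built with -m32; 11 is execve's syscall number there, matching the movl $0xb,%eax in both dumps):

int main(void)
{
    char *argv[] = { "/bin/sh", 0 };
    /* eax = syscall number, ebx/ecx/edx = arguments, as in the article */
    __asm__ volatile ("int $0x80"
                      : /* no outputs */
                      : "a"(11), "b"(argv[0]), "c"(argv), "d"(0)
                      : "memory");
    return 1; /* only reached if execve failed */
}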
Try to use objdump -DS or objdump -sS to include the address 0x80e99f0 in your dump.
Local example:
0806bf70 <__execve>:
...
806bf82: ff 15 10 a3 0e 08 call *0x80ea310
At address 0x80ea310 (shown with objdump -sS):
80ea310 10ea0608 60a60908 07000000 7f030000
The bytes 10ea0608 are the address 0x0806ea10 stored little-endian in memory.
You will then see that _dl_sysinfo_int80 is located at that address:
0806ea10 <_dl_sysinfo_int80>:
806ea10: cd 80 int $0x80
806ea12: c3 ret
which raises software interrupt 0x80 (executing the syscall) and then returns to the caller.
call *0x80ea310 is therefore really calling 0x806ea10: the * means the call target is read from a pointer stored at that address.

Understanding assembly .long directive

In the Secure Programming Cookbook for C and C++ by John Viega, I came across the following statement:
asm("value_stored: \n"
".long 0xFFFFFFFF \n"
);
I do not really understand the use of the .long directive in assembly, but here it is used to embed a precalculated value in the executable. Can I somehow force the position of these bytes in the executable? I tried to put it at the end of main (thinking that this way it would end up at the end of the .text section), but I got a segmentation fault. Putting it outside of main works.
Even at the end of main, the bytes emitted by the inline assembler land in the middle of the instruction stream, where the CPU will try to execute them. In my environment objdump -d foo.o shows:
00000000004004b4 <main>:
4004b4: 55 push %rbp
4004b5: 48 89 e5 mov %rsp,%rbp
00000000004004b8 <value>:
4004b8: ff (bad)
4004b9: ff (bad)
4004ba: ff (bad)
4004bb: ff (bad)
4004bc: b8 01 00 00 00 mov $0x1,%eax
4004c1: 5d pop %rbp
4004c2: c3 retq
This can be mitigated by jumping over the data:
asm("jmp 1f\n"
    "value: .long 0xffffffff\n"
    "1:\n");
(Note the \n separators: without them, the adjacent string literals would be concatenated into a single invalid line of assembly.) The suffixes f and b on a numeric label (as in 1f and 1b) refer to the nearest local label 1 forward or backward; such numeric labels are temporary and may be reused.
Another option would be to place the value in a named section, which can then be ordered in the linker script as the last part of either .text or .data, as sketched below.
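In C, the same effect can be had without inline assembly by giving the object a named section (a sketch; the section name .checksum is only an example):

/* GCC emits this object into its own section; a linker script can then
   place that section wherever it is needed in the output image. */
const unsigned long value_stored
    __attribute__((section(".checksum"))) = 0xFFFFFFFF;

Because the value lives in a data section rather than among the instructions, nothing ever tries to execute it.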
