Builtins in Clang not so builtin? - c

If I have the following in strlen.c:
int call_strlen(char *s) {
return __builtin_strlen(s);
}
And then compile it with both gcc and clang like this:
gcc -c -o strlen-gcc.o strlen.c
clang -c -o strlen-clang.o strlen.c
I am surprised to see that strlen-clang.o contains a reference to "strlen", whereas gcc has expectedly inlined the function and has no such reference. (see objdumps below). Is this a bug in clang? I have tested it in several versions of the clang compiler, including 3.8.
Edit: the reason this is important for me is that I'm linking with -nostdlib, and the clang-compiled version gives me a link error that strlen is not found.
Clang
#> objdump -d strlen-clang.o
strlen-clang.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <call_strlen>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 10 sub $0x10,%rsp
8: 48 89 7d f8 mov %rdi,-0x8(%rbp)
c: 48 8b 7d f8 mov -0x8(%rbp),%rdi
10: e8 00 00 00 00 callq 15 <call_strlen+0x15>
15: 89 c1 mov %eax,%ecx
17: 89 c8 mov %ecx,%eax
19: 48 83 c4 10 add $0x10,%rsp
1d: 5d pop %rbp
1e: c3 retq
#> objdump -t strlen-clang.o
strlen-clang.o: file format elf64-x86-64
SYMBOL TABLE:
0000000000000000 l df *ABS* 0000000000000000 strlen.c
0000000000000000 l d .text 0000000000000000 .text
0000000000000000 l d .data 0000000000000000 .data
0000000000000000 l d .bss 0000000000000000 .bss
0000000000000000 l d .comment 0000000000000000 .comment
0000000000000000 l d .note.GNU-stack 0000000000000000 .note.GNU-stack
0000000000000000 l d .eh_frame 0000000000000000 .eh_frame
0000000000000000 g F .text 000000000000001f call_strlen
0000000000000000 *UND* 0000000000000000 strlen
GCC
#> objdump -d strlen-gcc.o
strlen-gcc.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <call_strlen>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 89 7d f8 mov %rdi,-0x8(%rbp)
8: 48 8b 45 f8 mov -0x8(%rbp),%rax
c: 48 c7 c1 ff ff ff ff mov $0xffffffffffffffff,%rcx
13: 48 89 c2 mov %rax,%rdx
16: b8 00 00 00 00 mov $0x0,%eax
1b: 48 89 d7 mov %rdx,%rdi
1e: f2 ae repnz scas %es:(%rdi),%al
20: 48 89 c8 mov %rcx,%rax
23: 48 f7 d0 not %rax
26: 48 83 e8 01 sub $0x1,%rax
2a: 5d pop %rbp
2b: c3 retq
#> objdump -t strlen-gcc.o
strlen-gcc.o: file format elf64-x86-64
SYMBOL TABLE:
0000000000000000 l df *ABS* 0000000000000000 strlen.c
0000000000000000 l d .text 0000000000000000 .text
0000000000000000 l d .data 0000000000000000 .data
0000000000000000 l d .bss 0000000000000000 .bss
0000000000000000 l d .note.GNU-stack 0000000000000000 .note.GNU-stack
0000000000000000 l d .eh_frame 0000000000000000 .eh_frame
0000000000000000 l d .comment 0000000000000000 .comment
0000000000000000 g F .text 000000000000002c call_strlen

Just to get optimisation out of the way:
With clang -O0:
t.o:
(__TEXT,__text) section
_call_strlen:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 subq $0x10, %rsp
0000000000000008 movq %rdi, -0x8(%rbp)
000000000000000c movq -0x8(%rbp), %rdi
0000000000000010 callq _strlen
0000000000000015 movl %eax, %ecx
0000000000000017 movl %ecx, %eax
0000000000000019 addq $0x10, %rsp
000000000000001d popq %rbp
000000000000001e retq
With clang -O3
t.o:
(__TEXT,__text) section
_call_strlen:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 popq %rbp
0000000000000005 jmp _strlen
Now, onto the problem:
The clang documentation claims that clang supports all GCC-supported builtins.
However, the GCC documentation seems to treat builtin functions and the names of their library equivalents as synonyms:
Both forms have the same type (including prototype), the same address (when their address is taken), and the same meaning as the C library functions [...].
Also it does not guarantee a builtin function with a library equivalent (as is the case with strlen) to indeed get optimised:
Many of these functions are only optimized in certain cases; if they are not optimized in a particular case, a call to the library function is emitted.
Further, the clang internals manual mentions __builtin_strlen only once:
__builtin_strlen and strlen: These are constant folded as integer constant expressions if the argument is a string literal.
Other than that they seem to make no promises.
Since in your case the argument to __builtin_strlen is not a string literal, and since the GCC documentation allows for calls to builtin functions to be converted to library function calls, clang's behaviour seems perfectly valid.
A "patch for review" on the clang developers mailing list also says:
[...] It will still fall back to runtime use of library strlen, if
compile-time evaluation is not possible/required [...].
That was in 2012, but the text indicates that at least back then, only compile-time evaluation was supported.
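To make the distinction drawn above concrete, here is a minimal sketch of the two situations. The function names literal_len and runtime_len are hypothetical and exist only for illustration. Whether a particular compiler version actually folds the first call or emits a library call for the second is not guaranteed by either manual, so treat the comments as typical behaviour rather than a promise.

```c
#include <stddef.h>

/* String-literal argument: this is the case the clang internals manual
 * describes as constant folded, so usually no call to strlen remains. */
size_t literal_len(void)
{
    return __builtin_strlen("hello");
}

/* Non-literal argument: the compiler is allowed to fall back to a call to
 * the library strlen, which is what the clang object file above shows. */
size_t runtime_len(const char *s)
{
    return __builtin_strlen(s);
}
```

Compiling this with -c and running objdump -t on the result is a quick way to check whether an undefined strlen symbol is present for a given compiler and optimisation level.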
Now, I see two options:
If you only need to compile the program yourself and then use and/or distribute it, I suggest you simply use gcc.
If you need others to be able to compile your code under both gcc and clang, I suggest adding a C library as a dependency for static linking.
I strongly advise against rolling your own implementations of standard library functions, even for seemingly simple cases (if you disagree, try writing up your own strlen implementation, then compare it to the glibc one).

Neither GCC nor Clang promises to inline this builtin. You quoted some GCC documentation seeming to make such a promise:
...GCC built-in functions are always expanded inline...
but this is a sentence fragment pulled out of context. The complete sentence is
With the exception of built-ins that have library equivalents such as the standard C library functions discussed below, or that expand to library calls, GCC built-in functions are always expanded inline and thus do not have corresponding entry points and their address cannot be obtained.
__builtin_strlen has the library equivalent strlen, so this sentence makes no promises about whether it gets inlined.

Related

Why does Apple Clang leave redundant stack push pop instructions under -O1/2/3?

Given a simple function
int add(int a, int b) {
return a + b;
}
Compile it with clang -O3 -c -o test.o test.c. The compiler version is
Apple clang version 11.0.3 (clang-1103.0.32.62)
Target: x86_64-apple-darwin19.5.0
The disassembly of the object file shows
test.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
0000000000000000 _add:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 8d 04 37 leal (%rdi,%rsi), %eax
7: 5d popq %rbp
8: c3 retq
Obviously the pushq, movq and popq instructions do nothing but waste CPU time.
Compiling the same piece of code on Linux with clang version 7.0.1-8 (tags/RELEASE_701/final)
Target: x86_64-pc-linux-gnu yields the truly optimized instructions below.
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <add>:
0: 8d 04 37 lea (%rdi,%rsi,1),%eax
3: c3 retq
Is there anything wrong with Apple Clang?
A related question is here: Apple clang -O1 not optimizing enough?
But the answer there does not address my question.

Loading a dynamic library at run-time yields inconsistent and unexpected results, missing symbols and empty PLT entries. Why?

I've been fighting with this problem for quite some time, and I've been unable to find a solution or even an explanation for it. So sorry if the question is long, but bear with me as I just want to make it 100% clear in the hopes that someone more experienced than me will be able to figure it out.
I'm keeping the C syntax highlight on for all snippets because it makes them a little bit clearer even if not really correct.
What I want to do
I have a C program which uses some functions from a dynamic library (libzip). Here it is boiled down to a minimal reproducible example (it basically does nothing, but it works just fine):
#include <zip.h>
int main(void) {
int err;
zip_t *myzip;
myzip = zip_open("myzip.zip", ZIP_CREATE | ZIP_TRUNCATE, &err);
if (myzip == NULL)
return 1;
zip_close(myzip);
return 0;
}
Normally, to compile it, I would simply do:
gcc -c prog.c
gcc -o prog prog.o -lzip
This creates, as expected, an ELF which requires libzip to run:
$ ldd prog
linux-vdso.so.1 (0x00007ffdafb53000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f81eedc7000)
/lib64/ld-linux-x86-64.so.2 (0x00007f81ef780000)
libzip.so.4 => /usr/lib/x86_64-linux-gnu/libzip.so.4 (0x00007f81ef166000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f81eebad000)
(libz is just a dependency of libzip)
What I really want to do though, is to load the library myself using dlopen(). Pretty simple task, no? Well yes, or at least I thought.
To achieve this, I should just need to call dlopen and let the loader do its job:
#include <zip.h>
#include <dlfcn.h>
int main(void) {
void *lib;
int err;
zip_t *myzip;
lib = dlopen("libzip.so", RTLD_LAZY | RTLD_GLOBAL);
if (lib == NULL)
return 1;
myzip = zip_open("myzip.zip", ZIP_CREATE | ZIP_TRUNCATE, &err);
if (myzip == NULL)
return 1;
zip_close(myzip);
return 0;
}
Of course, since I want to manually load the library myself, I will not link it this time:
# Create prog.o
gcc -c prog.c
# Do a dry-run just to make sure all symbols are resolved
gcc -o /dev/null prog.o -ldl -lzip
# Now recompile only with libdl
gcc -o prog prog.o -ldl -Wl,--unresolved-symbols=ignore-in-object-files
The flag --unresolved-symbols=ignore-in-object-files tells ld to not worry about my prog.o having unresolved symbols at link time (I want to take care of that myself at runtime).
The problem
The above Should Just Work™, and indeed it does seem to... but I have two machines, and being the pedantic nerd I am I just thought "well, better make sure and compile it on both of them".
First machine
x86-64, Linux 4.9, Debian 9, gcc 6.3.0, ld 2.28. Here everything works as expected.
I can clearly see that the symbols are there:
$ readelf --dyn-syms prog
Symbol table '.dynsym' contains 15 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterTMCloneTab
2: 0000000000000000 0 FUNC GLOBAL DEFAULT UND __libc_start_main@GLIBC_2.2.5 (2)
3: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __gmon_start__
===> 4: 0000000000000000 0 FUNC GLOBAL DEFAULT UND zip_close
5: 0000000000000000 0 FUNC GLOBAL DEFAULT UND dlopen@GLIBC_2.2.5 (3)
===> 6: 0000000000000000 0 FUNC GLOBAL DEFAULT UND zip_open
7: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _Jv_RegisterClasses
8: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_registerTMCloneTable
9: 0000000000000000 0 FUNC WEAK DEFAULT UND __cxa_finalize@GLIBC_2.2.5 (2)
10: 0000000000201040 0 NOTYPE GLOBAL DEFAULT 25 _edata
11: 0000000000201048 0 NOTYPE GLOBAL DEFAULT 26 _end
12: 0000000000201040 0 NOTYPE GLOBAL DEFAULT 26 __bss_start
13: 00000000000006a0 0 FUNC GLOBAL DEFAULT 11 _init
14: 0000000000000924 0 FUNC GLOBAL DEFAULT 15 _fini
The PLT entries are also there as expected and look fine:
$ objdump -j .plt -M intel -d prog
Disassembly of section .plt:
00000000000006c0 <.plt>:
6c0: ff 35 42 09 20 00 push QWORD PTR [rip+0x200942] # 201008 <_GLOBAL_OFFSET_TABLE_+0x8>
6c6: ff 25 44 09 20 00 jmp QWORD PTR [rip+0x200944] # 201010 <_GLOBAL_OFFSET_TABLE_+0x10>
6cc: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
00000000000006d0 <zip_close@plt>:
6d0: ff 25 42 09 20 00 jmp QWORD PTR [rip+0x200942] # 201018 <zip_close>
6d6: 68 00 00 00 00 push 0x0
6db: e9 e0 ff ff ff jmp 6c0 <.plt>
00000000000006e0 <dlopen@plt>:
6e0: ff 25 3a 09 20 00 jmp QWORD PTR [rip+0x20093a] # 201020 <dlopen@GLIBC_2.2.5>
6e6: 68 01 00 00 00 push 0x1
6eb: e9 d0 ff ff ff jmp 6c0 <.plt>
00000000000006f0 <zip_open@plt>:
6f0: ff 25 32 09 20 00 jmp QWORD PTR [rip+0x200932] # 201028 <zip_open>
6f6: 68 02 00 00 00 push 0x2
6fb: e9 c0 ff ff ff jmp 6c0 <.plt>
And the program runs without any problem:
$ ./prog
$ echo $?
0
Even looking inside it with a debugger I can clearly see the symbols getting correctly resolved like any normal dynamic symbol:
0x55555555479b <main+43> lea rax, [rbp - 0x14]
0x55555555479f <main+47> mov rdx, rax
0x5555555547a2 <main+50> mov esi, 9
0x5555555547a7 <main+55> lea rdi, [rip + 0xc0] <0x7ffff7ffd948>
0x5555555547ae <main+62> call zip_open@plt <0x555555554620>
|
v ### PLT entry:
0x555555554620 <zip_open@plt> jmp qword ptr [rip + 0x200a02] <0x555555755028>
|
v
0x555555554626 <zip_open@plt+6> push 2
0x55555555462b <zip_open@plt+11> jmp 0x5555555545f0
|
v ### PLT stub:
0x5555555545f0 push qword ptr [rip + 0x200a12] <0x555555755008>
0x5555555545f6 jmp qword ptr [rip + 0x200a14] <0x7ffff7def0d0>
|
v ### Symbol gets correctly resolved
0x7ffff7def0d0 <_dl_runtime_resolve_fxsave> push rbx
0x7ffff7def0d1 <_dl_runtime_resolve_fxsave+1> mov rbx, rsp
0x7ffff7def0d4 <_dl_runtime_resolve_fxsave+4> and rsp, 0xfffffffffffffff0
0x7ffff7def0d8 <_dl_runtime_resolve_fxsave+8> sub rsp, 0x240
Second machine
x86-64, Linux 4.15, Ubuntu 18.04, gcc 7.4, ld 2.30. Here, something really strange is going on.
Compilation doesn't yield any warning or error, but I do not see the symbols:
$ readelf --dyn-syms prog
Symbol table '.dynsym' contains 7 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterTMCloneTab
2: 0000000000000000 0 FUNC GLOBAL DEFAULT UND __libc_start_main@GLIBC_2.2.5 (2)
3: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __gmon_start__
4: 0000000000000000 0 FUNC GLOBAL DEFAULT UND dlopen@GLIBC_2.2.5 (3)
5: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_registerTMCloneTable
6: 0000000000000000 0 FUNC WEAK DEFAULT UND __cxa_finalize@GLIBC_2.2.5 (2)
The PLT entries are there, but they are filled with zeroes, and aren't even recognized by objdump:
$ objdump -j .plt -M intel -d prog
Disassembly of section .plt:
0000000000000560 <.plt>:
560: ff 35 4a 0a 20 00 push QWORD PTR [rip+0x200a4a] # 200fb0 <_GLOBAL_OFFSET_TABLE_+0x8>
566: ff 25 4c 0a 20 00 jmp QWORD PTR [rip+0x200a4c] # 200fb8 <_GLOBAL_OFFSET_TABLE_+0x10>
56c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
...
# ^^^
# Here, these three dots are actually hiding another 0x10+ bytes filled with 0x0
# zip_close@plt should be here instead...
0000000000000580 <dlopen@plt>:
580: ff 25 42 0a 20 00 jmp QWORD PTR [rip+0x200a42] # 200fc8 <dlopen@GLIBC_2.2.5>
586: 68 00 00 00 00 push 0x0
58b: e9 d0 ff ff ff jmp 560 <.plt>
...
# ^^^
# Here, these three dots are actually hiding another 0x10+ bytes filled with 0x0
# zip_open@plt should be here instead...
When the program is run, dlopen() works fine and loads libzip into memory, but then when zip_open() gets called, it just generates a segmentation fault:
$ ./prog
Segmentation fault (core dumped)
Taking a look with a debugger, the issue is even more obvious (in case it wasn't already obvious enough). The PLT entries filled with zeroes just end up decoding to a bunch of add instructions dereferencing rax, which contains an invalid address and makes the program segfault and die:
0x5555555546e5 <main+43> lea rax, [rbp - 0x14]
0x5555555546e9 <main+47> mov rdx, rax
0x5555555546ec <main+50> mov esi, 9
0x5555555546f1 <main+55> lea rdi, [rip + 0xc6]
0x5555555546f8 <main+62> call dlopen@plt+16 <0x555555554590>
|
v ### Broken PLT entry (all 0x0, will cause a segfault):
0x555555554590 <dlopen@plt+16> add byte ptr [rax], al
0x555555554592 <dlopen@plt+18> add byte ptr [rax], al
0x555555554594 <dlopen@plt+20> add byte ptr [rax], al
0x555555554596 <dlopen@plt+22> add byte ptr [rax], al
0x555555554598 <dlopen@plt+24> add byte ptr [rax], al
0x55555555459a <dlopen@plt+26> add byte ptr [rax], al
0x55555555459c <dlopen@plt+28> add byte ptr [rax], al
0x55555555459e <dlopen@plt+30> add byte ptr [rax], al
### Next PLT entry...
0x5555555545a0 <__cxa_finalize@plt> jmp qword ptr [rip + 0x200a52] <0x7ffff7823520>
|
v
0x7ffff7823520 <__cxa_finalize> push r15
0x7ffff7823522 <__cxa_finalize+2> push r14
Questions
1. So, first of all... why is this happening?
2. I thought that this was supposed to work, wasn't it? If not, why? And why only on one of the two machines?
3. But most importantly: how can I fix this?
For question 3 I want to emphasize that the whole point of this is that I want to load the library myself, without linking it, so please refrain from just commenting that this is bad practice, or whatever else.
The above Should Just Work™, and indeed it does seem to...
No, it should not, and if it appears to, that's more of an accident. In general, using --unresolved-symbols=... is a really bad idea™, and will almost never do what you want.
The solution is trivial: you just need to look up zip_open and zip_close, like so:
#include <dlfcn.h>
#include <zip.h>
int main(void) {
void *lib;
/* Function pointers whose types match the libzip prototypes. */
zip_t *(*p_open)(const char *, int, int *);
int (*p_close)(zip_t *);
int err;
zip_t *myzip;
lib = dlopen("libzip.so", RTLD_LAZY | RTLD_GLOBAL);
if (lib == NULL)
return 1;
/* Resolve each symbol explicitly instead of relying on the PLT. */
p_open = (zip_t *(*)(const char *, int, int *))dlsym(lib, "zip_open");
if (p_open == NULL)
return 1;
p_close = (int (*)(zip_t *))dlsym(lib, "zip_close");
if (p_close == NULL)
return 1;
myzip = p_open("myzip.zip", ZIP_CREATE | ZIP_TRUNCATE, &err);
if (myzip == NULL)
return 1;
p_close(myzip);
return 0;
}
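When a lookup like the ones above fails, dlerror() can say why. The following is only a sketch: load_symbol is a hypothetical helper name, not part of libdl or libzip, and it assumes a POSIX system where dlsym() and dlerror() are available (older glibc versions need -ldl at link time).

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: look up `name` in the handle `lib` and print the
 * loader's own error message if the lookup fails. */
void *load_symbol(void *lib, const char *name)
{
    void *sym;

    if (lib == NULL || name == NULL)
        return NULL;

    dlerror();                      /* clear any stale error state */
    sym = dlsym(lib, name);
    if (sym == NULL) {
        const char *msg = dlerror();
        fprintf(stderr, "dlsym(%s): %s\n", name, msg ? msg : "symbol is NULL");
    }
    return sym;
}
```

The returned void * still has to be cast to the right function pointer type before it is called, exactly as with the direct dlsym() calls.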
To add to EmployedRussian's answer, you can achieve what you need with the help of Implib.so tool. It would generate stubs for all library symbols (e.g. zip_open) which would call dlopen/dlsym internally and forward calls from your program to shared library:
$ gcc -c prog.c
$ implib-gen.py path/to/libzip.so
$ gcc -o prog prog.o libzip.tramp.S libzip.init.c -ldl
(note that you no longer need fancy linker flags and linker dry runs).
As a side note, what you are trying to do is called delayed loading and is a standard feature of Windows DLLs.

`.comm` directive in GAS not showing in `.o` file?

This is a purely pedagogical question.
I have the following C code, in a file called comm.c:
#include <stdio.h>
int a;
int main(){
printf("%d", a);
return 0;
}
The code prints "0" (with no trailing newline).
Compiling with gcc -S, I get the following assembly code:
.file "comm.c"
.text
.comm a,4,4
.section .rodata
.LC0:
.string "%d"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl a(%rip), %eax
movl %eax, %esi
leaq .LC0(%rip), %rdi
movl $0, %eax
call printf@PLT
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 9.1.0"
.section .note.GNU-stack,"",@progbits
I am confused as to what .comm a,4,4 is doing. According to 7.96 of
the GNU as manual, the .text directive, it assembles what follows into
the end of the .text section. Thus, I would think that the beginning
of the .text section contains four bytes allocated to storing the
contents of a. This appears to not be the case, because if we
disassemble the .o file, we find:
comm.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # a <main+0xa>
a: 89 c6 mov %eax,%esi
c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 13 <main+0x13>
13: b8 00 00 00 00 mov $0x0,%eax
18: e8 00 00 00 00 callq 1d <main+0x1d>
1d: b8 00 00 00 00 mov $0x0,%eax
22: 5d pop %rbp
23: c3 retq
Why aren't there four extra bytes at the beginning of .text, as is
promised by the .text GAS directive? Of course, that would be stupid, to put data in the text segment.
So I guess my question is: what is .comm doing? Why is it placed under a .text directive?
.comm does not allocate in the .text section, but in the .bss section.
From https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_chapter/as_7.html#SEC76 and https://sourceware.org/binutils/docs/as/Comm.html#Comm:
If ld does not see a definition for the symbol--just one or more common symbols--then it will allocate length bytes of uninitialized memory.
It is the linker's job to allocate and map the .comm symbols.
You can see this when you link the program and read the symbols table:
gcc comm.o -o comm
readelf comm -s
The relevant symbols:
Num: Value Size Type Bind Vis Ndx Name
24: 0000000000004030 0 SECTION LOCAL DEFAULT 24
31: 0000000000004030 1 OBJECT LOCAL DEFAULT 24 completed.7392
57: 0000000000004038 0 NOTYPE GLOBAL DEFAULT 24 _end
59: 0000000000004034 4 OBJECT GLOBAL DEFAULT 24 a
60: 0000000000004030 0 NOTYPE GLOBAL DEFAULT 24 __bss_start
__bss_start (0000000000004030) is the start of the .bss section, and _end (0000000000004038) is the end of the executable (and in this case also the end of the .bss section).
As the 4 bytes of a are at addresses 0000000000004034-0000000000004037, a is obviously in the .bss section.
And it does show in the .o, just not where you were looking for.
You can read the symbols in the .o file the same way and something like this will show up:
$ readelf comm.o -s
Symbol table '.symtab' contains 13 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS comm.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 1
3: 0000000000000000 0 SECTION LOCAL DEFAULT 3
4: 0000000000000000 0 SECTION LOCAL DEFAULT 4
5: 0000000000000000 0 SECTION LOCAL DEFAULT 5
6: 0000000000000000 0 SECTION LOCAL DEFAULT 7
7: 0000000000000000 0 SECTION LOCAL DEFAULT 8
8: 0000000000000000 0 SECTION LOCAL DEFAULT 6
9: 0000000000000004 4 OBJECT GLOBAL DEFAULT COM a
10: 0000000000000000 36 FUNC GLOBAL DEFAULT 1 main
11: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND _GLOBAL_OFFSET_TABLE_
12: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND printf
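To relate the COM entry above back to C source, here is a small sketch contrasting a tentative definition with initialized globals. The names zeroed, answer and sum_globals are hypothetical additions for illustration; only a comes from the question. Where each object lands depends on the compiler and flags: GCC 10 and later default to -fno-common, in which case a is emitted into .bss directly instead of as a .comm symbol.

```c
int a;            /* tentative definition: .comm with -fcommon, .bss otherwise */
int zeroed = 0;   /* explicit zero initializer: typically placed in .bss */
int answer = 42;  /* nonzero initializer: placed in .data */

/* All three objects have static storage duration, so `a` and `zeroed`
 * are guaranteed to start out as 0 regardless of the section used. */
int sum_globals(void)
{
    return a + zeroed + answer;
}
```

Running readelf -s on the resulting object file with and without -fcommon shows the Ndx column for a switching between COM and a section index.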

C Standard Library Functions vs. System Calls. Which is `open()`?

I know fopen() is in the C standard library, so that I can definitely call the fopen() function in a C program. What I am confused about is why I can call the open() function as well. open() should be a system call, so it is not a C function in the standard library. As I am successfully able to call the open() function, am I calling a C function or a system call?
EJP's comments to the question and Steve Summit's answer are exactly to the point: open() is both a syscall and a function in the standard C library; fopen() is a function in the standard C library, that sets up a file handle -- a data structure of type FILE that contains additional stuff like optional buffering --, and internally calls open() also.
In the hope of furthering understanding, I shall show hello.c, an example Hello world program written in C for Linux on 64-bit x86 (x86-64 AKA AMD64 architecture), which does not use the standard C library at all.
First, hello.c needs to define some macros with inline assembly for us to be able to call the syscalls. These are very architecture- and operating system dependent, which is why this only works in Linux on x86-64 architecture:
/* Freestanding Hello World example in Linux on x86_64/x86.
* Compile using
* gcc -march=x86-64 -mtune=generic -m64 -ffreestanding -nostdlib -nostartfiles hello.c -o hello
*/
#define STDOUT_FILENO 1
#define EXIT_SUCCESS 0
#ifndef __x86_64__
#error This program only works on x86_64 architecture!
#endif
#define SYS_write 1
#define SYS_exit 60
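/* __volatile__ and the "memory" clobber stop the compiler from dropping the
 * syscall or reordering memory accesses around it. */
#define SYSCALL1_NORET(nr, arg1) \
__asm__ __volatile__ ( "syscall\n\t" \
: \
: "a" (nr), "D" (arg1) \
: "rcx", "r11", "memory" )
#define SYSCALL3(retval, nr, arg1, arg2, arg3) \
__asm__ __volatile__ ( "syscall\n\t" \
: "=a" (retval) \
: "a" (nr), "D" (arg1), "S" (arg2), "d" (arg3) \
: "rcx", "r11", "memory" )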
The Freestanding in the comment at the beginning of the file refers to "freestanding execution environment"; it is the case when there is no C library available at all. For example, the Linux kernel is written the same way. The normal environment we are familiar with is called "hosted execution environment", by the way.
Next, we can define two functions, or "wrappers", around the syscalls:
static inline void my_exit(int retval)
{
SYSCALL1_NORET(SYS_exit, retval);
}
static inline int my_write(int fd, const void *data, int len)
{
int retval;
if (fd == -1 || !data || len < 0)
return -1;
SYSCALL3(retval, SYS_write, fd, data, len);
if (retval < 0)
return -1;
return retval;
}
Above, my_exit() is roughly equivalent to C standard library exit() function, and my_write() to write().
The C language does not define any kind of a way to do a syscall, so that is why we always need a "wrapper" function of some sort. (The GNU C library does provide a syscall() function for us to do any syscall we wish -- but the point of this example is to not use the C library at all.)
The wrapper functions always involve a bit of (inline) assembly. Again, since C does not have a built-in way to do a syscall, we need to "extend" the language by adding some assembly code. This (inline) assembly, and the syscall numbers, are what make this example operating system and architecture dependent. And yes: the GNU C library, for example, contains the equivalent wrappers for quite a few architectures.
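For comparison, in a hosted environment the same write syscall can be issued through the syscall() function that the GNU C library provides, without any inline assembly. This is only a sketch for Linux with glibc: raw_write is a hypothetical name, and the code deliberately uses the C library, so it does not belong in the freestanding hello.c.

```c
#define _GNU_SOURCE          /* makes unistd.h declare syscall() */
#include <stddef.h>
#include <sys/syscall.h>     /* SYS_write */
#include <unistd.h>          /* syscall() */

/* Hypothetical wrapper: invoke the write syscall by number.
 * Returns the number of bytes written, or -1 with errno set on failure. */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Unlike the hand-written macro, syscall() translates a negative kernel return value into -1 plus errno, which is the convention the usual C library wrappers follow.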
Some of the functions in the C library do not use any syscalls. We also need one, the equivalent of strlen():
static inline int my_strlen(const char *str)
{
int len = 0L;
if (!str)
return -1;
while (*str++)
len++;
return len;
}
Note that there is no NULL used anywhere in the above code. It is because it is a macro defined by the C library. Instead, I'm relying on "logical null": (!pointer) is true if and only if pointer is a zero pointer, which is what NULL is on all architectures in Linux. I could have defined NULL myself, but I didn't, in the hopes that somebody might notice the lack of it.
Finally, main() itself is something the GNU C library calls, as in Linux, the actual start point of the binary is called _start. The _start is provided by the hosted runtime environment, and initializes the C library data structures and does other similar preparations. Our example program is so simple we do not need it, so we can just put our simple main program part into _start instead:
void _start(void)
{
const char *msg = "Hello, world!\n";
my_write(STDOUT_FILENO, msg, my_strlen(msg));
my_exit(EXIT_SUCCESS);
}
If you put all of the above together, and compile it using
gcc -march=x86-64 -mtune=generic -m64 -ffreestanding -nostdlib -nostartfiles hello.c -o hello
per the comment at the start of the file, you will end up with a small (about two kilobytes) static binary, that when run,
./hello
outputs
Hello, world!
You can use file hello to examine the contents of the file. You could run strip hello to remove all (unneeded) symbols, reducing the file size further down to about one and a half kilobytes, if file size was really important. (It will make the object dump less interesting, however, so before you do that, check out the next step first.)
We can use objdump -x hello to examine the sections in the file:
hello: file format elf64-x86-64
hello
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x00000000004001e1
Program Header:
LOAD off 0x0000000000000000 vaddr 0x0000000000400000 paddr 0x0000000000400000 align 2**21
filesz 0x00000000000002f0 memsz 0x00000000000002f0 flags r-x
NOTE off 0x0000000000000120 vaddr 0x0000000000400120 paddr 0x0000000000400120 align 2**2
filesz 0x0000000000000024 memsz 0x0000000000000024 flags r--
EH_FRAME off 0x000000000000022c vaddr 0x000000000040022c paddr 0x000000000040022c align 2**2
filesz 0x000000000000002c memsz 0x000000000000002c flags r--
STACK off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4
filesz 0x0000000000000000 memsz 0x0000000000000000 flags rw-
Sections:
Idx Name Size VMA LMA File off Algn
0 .note.gnu.build-id 00000024 0000000000400120 0000000000400120 00000120 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .text 000000d9 0000000000400144 0000000000400144 00000144 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
2 .rodata 0000000f 000000000040021d 000000000040021d 0000021d 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .eh_frame_hdr 0000002c 000000000040022c 000000000040022c 0000022c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 .eh_frame 00000098 0000000000400258 0000000000400258 00000258 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
5 .comment 00000034 0000000000000000 0000000000000000 000002f0 2**0
CONTENTS, READONLY
SYMBOL TABLE:
0000000000400120 l d .note.gnu.build-id 0000000000000000 .note.gnu.build-id
0000000000400144 l d .text 0000000000000000 .text
000000000040021d l d .rodata 0000000000000000 .rodata
000000000040022c l d .eh_frame_hdr 0000000000000000 .eh_frame_hdr
0000000000400258 l d .eh_frame 0000000000000000 .eh_frame
0000000000000000 l d .comment 0000000000000000 .comment
0000000000000000 l df *ABS* 0000000000000000 hello.c
0000000000400144 l F .text 0000000000000016 my_exit
000000000040015a l F .text 000000000000004e my_write
00000000004001a8 l F .text 0000000000000039 my_strlen
0000000000000000 l df *ABS* 0000000000000000
000000000040022c l .eh_frame_hdr 0000000000000000 __GNU_EH_FRAME_HDR
00000000004001e1 g F .text 000000000000003c _start
0000000000601000 g .eh_frame 0000000000000000 __bss_start
0000000000601000 g .eh_frame 0000000000000000 _edata
0000000000601000 g .eh_frame 0000000000000000 _end
The .text section contains our code, and .rodata immutable constants; here, just the Hello, world! string literal. The rest of the sections are stuff the linker adds and the system uses. We can see that we have f(hex) = 15 bytes of read-only data, and d9(hex) = 217 bytes of code; the rest of the file (about a kilobyte or so) is ELF stuff added by the linker for the kernel to use when executing this binary.
We can even examine the actual assembly code contained in hello, by running objdump -d hello:
hello: file format elf64-x86-64
Disassembly of section .text:
0000000000400144 <my_exit>:
400144: 55 push %rbp
400145: 48 89 e5 mov %rsp,%rbp
400148: 89 7d fc mov %edi,-0x4(%rbp)
40014b: b8 3c 00 00 00 mov $0x3c,%eax
400150: 8b 55 fc mov -0x4(%rbp),%edx
400153: 89 d7 mov %edx,%edi
400155: 0f 05 syscall
400157: 90 nop
400158: 5d pop %rbp
400159: c3 retq
000000000040015a <my_write>:
40015a: 55 push %rbp
40015b: 48 89 e5 mov %rsp,%rbp
40015e: 89 7d ec mov %edi,-0x14(%rbp)
400161: 48 89 75 e0 mov %rsi,-0x20(%rbp)
400165: 89 55 e8 mov %edx,-0x18(%rbp)
400168: 83 7d ec ff cmpl $0xffffffff,-0x14(%rbp)
40016c: 74 0d je 40017b <my_write+0x21>
40016e: 48 83 7d e0 00 cmpq $0x0,-0x20(%rbp)
400173: 74 06 je 40017b <my_write+0x21>
400175: 83 7d e8 00 cmpl $0x0,-0x18(%rbp)
400179: 79 07 jns 400182 <my_write+0x28>
40017b: b8 ff ff ff ff mov $0xffffffff,%eax
400180: eb 24 jmp 4001a6 <my_write+0x4c>
400182: b8 01 00 00 00 mov $0x1,%eax
400187: 8b 7d ec mov -0x14(%rbp),%edi
40018a: 48 8b 75 e0 mov -0x20(%rbp),%rsi
40018e: 8b 55 e8 mov -0x18(%rbp),%edx
400191: 0f 05 syscall
400193: 89 45 fc mov %eax,-0x4(%rbp)
400196: 83 7d fc 00 cmpl $0x0,-0x4(%rbp)
40019a: 79 07 jns 4001a3 <my_write+0x49>
40019c: b8 ff ff ff ff mov $0xffffffff,%eax
4001a1: eb 03 jmp 4001a6 <my_write+0x4c>
4001a3: 8b 45 fc mov -0x4(%rbp),%eax
4001a6: 5d pop %rbp
4001a7: c3 retq
00000000004001a8 <my_strlen>:
4001a8: 55 push %rbp
4001a9: 48 89 e5 mov %rsp,%rbp
4001ac: 48 89 7d e8 mov %rdi,-0x18(%rbp)
4001b0: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
4001b7: 48 83 7d e8 00 cmpq $0x0,-0x18(%rbp)
4001bc: 75 0b jne 4001c9 <my_strlen+0x21>
4001be: b8 ff ff ff ff mov $0xffffffff,%eax
4001c3: eb 1a jmp 4001df <my_strlen+0x37>
4001c5: 83 45 fc 01 addl $0x1,-0x4(%rbp)
4001c9: 48 8b 45 e8 mov -0x18(%rbp),%rax
4001cd: 48 8d 50 01 lea 0x1(%rax),%rdx
4001d1: 48 89 55 e8 mov %rdx,-0x18(%rbp)
4001d5: 0f b6 00 movzbl (%rax),%eax
4001d8: 84 c0 test %al,%al
4001da: 75 e9 jne 4001c5 <my_strlen+0x1d>
4001dc: 8b 45 fc mov -0x4(%rbp),%eax
4001df: 5d pop %rbp
4001e0: c3 retq
00000000004001e1 <_start>:
4001e1: 55 push %rbp
4001e2: 48 89 e5 mov %rsp,%rbp
4001e5: 48 83 ec 10 sub $0x10,%rsp
4001e9: 48 c7 45 f8 1d 02 40 movq $0x40021d,-0x8(%rbp)
4001f0: 00
4001f1: 48 8b 45 f8 mov -0x8(%rbp),%rax
4001f5: 48 89 c7 mov %rax,%rdi
4001f8: e8 ab ff ff ff callq 4001a8 <my_strlen>
4001fd: 89 c2 mov %eax,%edx
4001ff: 48 8b 45 f8 mov -0x8(%rbp),%rax
400203: 48 89 c6 mov %rax,%rsi
400206: bf 01 00 00 00 mov $0x1,%edi
40020b: e8 4a ff ff ff callq 40015a <my_write>
400210: bf 00 00 00 00 mov $0x0,%edi
400215: e8 2a ff ff ff callq 400144 <my_exit>
40021a: 90 nop
40021b: c9 leaveq
40021c: c3 retq
The assembly itself is not really that interesting, except that in my_write and my_exit you can see how the inline assembly generated by the SYSCALL...() macro just loads the variables into specific registers and then performs the system call. On x86-64 that happens to be an assembly instruction also called syscall; on 32-bit x86 it is int $0x80, and yet something else on other architectures.
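As a rough sketch of what such inline assembly can look like, here is a raw write system call for x86-64 Linux only. The name raw_write is hypothetical (it is not the SYSCALL...() macro from the answer), and GCC or Clang extended asm syntax is assumed:

```c
#include <stddef.h>

#if defined(__linux__) && defined(__x86_64__)
/* Hypothetical sketch: raw write system call on x86-64 Linux.
   The syscall number (1 = write) goes in rax, and the arguments go in
   rdi, rsi and rdx, matching the register loads in my_write above.
   The syscall instruction itself clobbers rcx and r11. */
static long raw_write(int fd, const void *buf, size_t len)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "a" (1L), "D" ((long)fd), "S" (buf), "d" (len)
                      : "rcx", "r11", "memory");
    return ret; /* bytes written, or a negative errno value on failure */
}
#endif
```

A call such as raw_write(1, "Hello, world!\n", 14) would write to standard output; unlike the C library's write(), errors come back as a negative value in the return register rather than through errno.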
There is a final wrinkle, related to the reason why I used the prefix my_ for the functions analogous to the functions in the C library: the C compiler can provide optimized shortcuts for some C library functions. For GCC, these are listed here; the list includes strlen().
This means we do not actually need the my_strlen() function, because we can use the optimized __builtin_strlen() function GCC provides, even in a freestanding environment. The built-ins are usually very optimized; in the case of __builtin_strlen() on x86-64 using GCC-5.4.0, it optimizes to just a couple of register loads and a repnz scasb %es:(%rdi),%al instruction (which looks long, but actually takes just two bytes).
In other words, the final wrinkle is that there is a third type of function, compiler built-ins, that are provided by the compiler (but otherwise just like the functions provided by the C library) in optimized form, depending on the compiler options and architecture used.
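A minimal sketch of using the built-in directly, assuming GCC or Clang (the helper name builtin_length is made up for illustration):

```c
#include <stddef.h>

/* Hypothetical helper: string length via the compiler built-in.
   __builtin_strlen is a GCC/Clang extension, not standard C. */
static size_t builtin_length(const char *s)
{
    return __builtin_strlen(s);
}
```

For example, builtin_length("Hello, world!\n") evaluates to 14. Whether the compiler expands this inline or still emits a call to strlen() depends on the compiler and the options used, so in a -nostdlib build it is worth checking the object file for an undefined strlen symbol.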
If we were to expand the above example so that we'd open a file and write the Hello, world! into it, and compare the low-level unistd.h (open()/write()/close()) and standard I/O stdio.h (fopen()/fputs()/fclose()) approaches, we'd find that the major difference is that the FILE handle used by the standard I/O approach contains a lot of extra stuff (which makes the standard file handles quite versatile, just not useful in such a trivial example), most visibly the buffering it does. On the assembly level, we'd still see the same syscalls -- open, write, close -- used.
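The two approaches could look roughly like this on a POSIX system. This is only a sketch: the function names are made up, and error handling is kept to a minimum (a short write is simply treated as a failure):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Low-level approach: each call is a thin wrapper around a system call.
   Returns 0 on success, -1 on failure. */
static int write_with_unistd(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;
    size_t len = strlen(msg);
    ssize_t written = write(fd, msg, len);
    int close_result = close(fd);
    return (written == (ssize_t)len && close_result == 0) ? 0 : -1;
}

/* Standard I/O approach: the FILE handle buffers the data, and the
   underlying write happens when the buffer is flushed, here at fclose(). */
static int write_with_stdio(const char *path, const char *msg)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = (fputs(msg, f) != EOF);
    if (fclose(f) == EOF)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling either function with the same path and the string "Hello, world!\n" should leave a file with identical contents; the difference is only in how much library machinery sits between the call and the system call.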
Even though at first glance the ELF format (used for binaries in Linux) contains a lot of "unneeded stuff" (about a kilobyte for our example program above), it is actually a very powerful format. It, and the dynamic loader in Linux, provides a way to auto-load libraries when a program starts (using LD_PRELOAD environment variable), and to interpose functions in other libraries -- essentially, replace them with new ones, but with a way to still be able to call the original interposed version of the function. There are lots of useful tricks, fixes, experiments, and debugging methods these allow.
Although the distinction between "system call" and "library function" can be a useful one to keep in mind, there's the issue that you have to be able to call system calls somehow. In general, then, every system call is present in the C library -- as a thin little library function that does nothing but make the transfer to the system call (however that's implemented).
So, yes, you can call open() from C code if you want to. (And somewhere, perhaps in a file called fopen.c, the author of your C library probably called it too, within the implementation of fopen().)
The starting point for answering your question is to ask another question: What is a system call?
Generally, one thinks of a system call as a procedure that executes at an elevated processor privilege level. Generally, this means switching from user mode to kernel mode (some systems use multiple modes).
The mechanism for an application to enter kernel mode depends upon the system (and on Intel there are multiple ways). The general sequence for invoking a system service is that the process executes an instruction that triggers a change-processor-mode exception. The CPU responds to the exception by invoking the appropriate exception/interrupt handler, which then dispatches to the appropriate operating system service.
The problem for C programming is that invoking a system service requires executing a specific hardware instruction and setting hardware register values. Operating systems provide wrapper functions that handle the packing of parameters into registers, triggering the exception, then unpacking the return values from registers.
The open() function is usually such a wrapper, letting high-level languages invoke system services. If you think about it, fopen() is generally a "wrapper" for open().
So what we normally think of as a system call is a function that does nothing other than invoke a system service.

What is the advantage of GCC's __builtin_expect in if else statements?

I came across a #define in which they use __builtin_expect.
The documentation says:
Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch
prediction information. In general, you should prefer to use actual
profile feedback for this (-fprofile-arcs), as programmers are
notoriously bad at predicting how their programs actually perform.
However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral
expression. The semantics of the built-in are that it is expected that
exp == c. For example:
if (__builtin_expect (x, 0))
foo ();
would indicate that we do not expect to call foo, since we expect x to be zero.
So why not directly use:
if (x)
foo ();
instead of the complicated syntax with __builtin_expect?
Imagine the assembly code that would be generated from:
if (__builtin_expect(x, 0)) {
foo();
...
} else {
bar();
...
}
I guess it should be something like:
cmp $x, 0
jne _foo
_bar:
call bar
...
jmp after_if
_foo:
call foo
...
after_if:
You can see that the instructions are arranged in such an order that the bar case precedes the foo case (as opposed to the C code). This can utilise the CPU pipeline better, since a jump thrashes the already fetched instructions.
Before the jump is executed, the instructions below it (the bar case) are pushed to the pipeline. Since the foo case is unlikely, jumping too is unlikely, hence thrashing the pipeline is unlikely.
Let's decompile to see what GCC 4.8 does with it
Blagovest mentioned branch inversion to improve the pipeline, but do current compilers really do it? Let's find out!
Without __builtin_expect
#include <stdio.h>
#include <time.h>

int main() {
    /* Use time to prevent it from being optimized away. */
    int i = !time(NULL);
    if (i)
        puts("a");
    return 0;
}
Compile and decompile with GCC 4.8.2 x86_64 Linux:
gcc -c -O3 -std=gnu11 main.c
objdump -dr main.o
Output:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 75 0a jne 1a <main+0x1a>
10: bf 00 00 00 00 mov $0x0,%edi
11: R_X86_64_32 .rodata.str1.1
15: e8 00 00 00 00 callq 1a <main+0x1a>
16: R_X86_64_PC32 puts-0x4
1a: 31 c0 xor %eax,%eax
1c: 48 83 c4 08 add $0x8,%rsp
20: c3 retq
The instruction order in memory was unchanged: first the puts and then retq return.
With __builtin_expect
Now replace if (i) with:
if (__builtin_expect(i, 0))
and we get:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 74 07 je 17 <main+0x17>
10: 31 c0 xor %eax,%eax
12: 48 83 c4 08 add $0x8,%rsp
16: c3 retq
17: bf 00 00 00 00 mov $0x0,%edi
18: R_X86_64_32 .rodata.str1.1
1c: e8 00 00 00 00 callq 21 <main+0x21>
1d: R_X86_64_PC32 puts-0x4
21: eb ed jmp 10 <main+0x10>
The puts was moved to the very end of the function, after the retq return!
The new code is basically the same as:
int i = !time(NULL);
if (i)
goto puts;
ret:
return 0;
puts:
puts("a");
goto ret;
This optimization was not done with -O0.
But good luck writing an example that runs faster with __builtin_expect than without; CPUs are really smart these days. My naive attempts are here.
C++20 [[likely]] and [[unlikely]]
C++20 has standardized these hints as the [[likely]] and [[unlikely]] attributes; see "How to use C++20's likely/unlikely attribute in if-else statement". They will likely (a pun!) do the same thing.
The idea of __builtin_expect is to tell the compiler that you'll usually find that the expression evaluates to c, so that the compiler can optimize for that case.
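In practice the built-in is usually hidden behind a pair of macros. The following is a minimal sketch assuming GCC or Clang; the macro names likely/unlikely follow a common convention, and sum_or_error is a made-up example function:

```c
#include <stddef.h>

/* The !!(x) normalizes any non-zero value to 1, so that comparing the
   expression against the expected constant behaves as intended. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: sum an array, treating a NULL pointer as a
   rare error case that the compiler may move off the hot path. */
static long sum_or_error(const int *values, size_t count)
{
    if (unlikely(values == NULL))
        return -1;

    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

The hint changes only how the compiler may lay out the code, never the result: sum_or_error(NULL, 0) still returns -1, it is just expected to be rare.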
I'd guess that someone thought they were being clever and that they were speeding things up by doing this.
Unfortunately, unless the situation is very well understood (it's likely that they have done no such thing), it may well have made things worse. The documentation even says:
In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
In general, you shouldn't be using __builtin_expect unless:
You have a very real performance issue
You've already optimized the algorithms in the system appropriately
You've got performance data to back up your assertion that a particular case is the most likely
Well, as it says in the description, the first version adds a predictive element to the construction, telling the compiler that the x == 0 branch is the more likely one - that is, it's the branch that will be taken more often by your program.
With that in mind, the compiler can optimize the conditional so that it requires the least amount of work when the expected condition holds, at the expense of maybe having to do more work in case of the unexpected condition.
Take a look at how conditionals are implemented during the compilation phase, and also in the resulting assembly, to see how one branch may be less work than the other.
However, I would only expect this optimization to have noticeable effect if the conditional in question is part of a tight inner loop that gets called a lot, since the difference in the resulting code is relatively small. And if you optimize it the wrong way round, you may well decrease your performance.
I don't see any of the answers addressing the question that I think you were asking, paraphrased:
Is there a more portable way of hinting branch prediction to the compiler?
The title of your question made me think of doing it this way:
if ( !x ) {} else foo();
If the compiler assumes that 'true' is more likely, it could optimize for not calling foo().
The problem here is just that you don't, in general, know what the compiler will assume -- so any code that uses this kind of technique would need to be carefully measured (and possibly monitored over time if the context changes).
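Another way to keep the code portable is to hide the built-in behind macros that fall back to the plain expression on compilers that lack it. This is a sketch only; the names EXPECT_TRUE, EXPECT_FALSE and clamp_nonnegative are made up for illustration:

```c
/* Use the hint where the GNU extension is available (Clang also
   defines __GNUC__); elsewhere the macros are a plain truth test. */
#if defined(__GNUC__)
#  define EXPECT_TRUE(x)  __builtin_expect(!!(x), 1)
#  define EXPECT_FALSE(x) __builtin_expect(!!(x), 0)
#else
#  define EXPECT_TRUE(x)  (!!(x))
#  define EXPECT_FALSE(x) (!!(x))
#endif

/* Hypothetical example: negative inputs are assumed to be rare. */
static int clamp_nonnegative(int value)
{
    if (EXPECT_FALSE(value < 0))
        return 0;
    return value;
}
```

The behaviour is identical with or without the hint, so the same measuring caveat applies: the macros only express an assumption, and it still has to be checked against real data.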
I tested it on a Mac, following @Blagovest Buyukliev and @Ciro. The assembly looks clear, and I have added comments.
Commands are
gcc -c -O3 -std=gnu11 testOpt.c; otool -tVI testOpt.o
When I use -O3, the output looks the same whether or not __builtin_expect(i, 0) is present.
testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp // open function stack
0000000000000004 xorl %edi, %edi // set time args 0 (NULL)
0000000000000006 callq _time // call time(NULL)
000000000000000b testq %rax, %rax // check time(NULL) result
000000000000000e je 0x14 // jump 0x14 if testq result = 0, namely jump to puts
0000000000000010 xorl %eax, %eax // return 0 , return appear first
0000000000000012 popq %rbp // return 0
0000000000000013 retq // return 0
0000000000000014 leaq 0x9(%rip), %rdi ## literal pool for: "a" // puts part, afterwards
000000000000001b callq _puts
0000000000000020 xorl %eax, %eax
0000000000000022 popq %rbp
0000000000000023 retq
When compiled with -O2, the output differs with and without __builtin_expect(i, 0).
First without
testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 xorl %edi, %edi
0000000000000006 callq _time
000000000000000b testq %rax, %rax
000000000000000e jne 0x1c // jump to 0x1c if not zero, then return
0000000000000010 leaq 0x9(%rip), %rdi ## literal pool for: "a" // put part appear first , following jne 0x1c
0000000000000017 callq _puts
000000000000001c xorl %eax, %eax // return part appear afterwards
000000000000001e popq %rbp
000000000000001f retq
Now with __builtin_expect(i, 0)
testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 xorl %edi, %edi
0000000000000006 callq _time
000000000000000b testq %rax, %rax
000000000000000e je 0x14 // jump to 0x14 if zero then put. otherwise return
0000000000000010 xorl %eax, %eax // return appear first
0000000000000012 popq %rbp
0000000000000013 retq
0000000000000014 leaq 0x7(%rip), %rdi ## literal pool for: "a"
000000000000001b callq _puts
0000000000000020 jmp 0x10
To summarize, __builtin_expect only makes a difference in the last case (-O2); with -O3 the output is the same either way.
In most of the cases, you should leave the branch prediction as it is and you do not need to worry about it.
One case where it is beneficial is CPU-intensive algorithms with a lot of branching. In some cases, the jumps could lead the code to exceed the current CPU instruction cache, making the CPU wait for the next part of the program to be loaded. By pushing the unlikely branches to the end, you keep the hot code close together in memory and only jump for unlikely cases.

Resources