I have recently been made aware of GCC's built-in functions for some of the C library's memory management functions, specifically __builtin_malloc() and related built-ins (see https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html). Upon learning about __builtin_malloc(), I was wondering how it might work to provide performance improvements over the plain malloc() related library routines.
For example, if the function succeeds, it has to provide a block that can be freed by a call to plain free() since the pointer might be freed by a module that was compiled without __builtin_malloc() or __builtin_free() enabled (or am I wrong about this,and if __builtin_malloc() is used, the builtins must be globally used?). Therefore the allocated object has to be something that can be managed with the data structures that plain malloc() and free() deal with.
I can't find any details of how __builtin_malloc() works or what it does exactly (I'm not a compiler dev, so spelunking through GCC source code isn't in my wheelhouse). In some simple tests where I've tried calling __builtin_malloc() directly, it simply ends up being emitted in the object code as a call to plain malloc(). However, there might be subtlety or platform detail that I'm not providing in these simple tests.
What kinds of performance improvements can __builtin_malloc() provide over a call to plain malloc()? Does __builtin_malloc() have a dependency on the rather complex data structures that glibc's malloc() implementation use? Or conversely, does glibc's malloc()/free() have some code to deal with blocks that might be allocated by __builtin_malloc()?
Basically, how does it work?
I believe there is no special GCC-internal implementation of __builtin_malloc(). Rather, it exists as a builtin only so it can be optimized away under certain circumstances.
Take this example:
#include <stdlib.h>
int main(void)
{
int *p = malloc(4);
*p = 7;
free(p);
return 0;
}
If we disable builtins (with -fno-builtins) and look at the generated output:
$ gcc -fno-builtins -O1 -Wall -Wextra builtin_malloc.c && objdump -d -Mintel a.out
0000000000400580 <main>:
400580: 48 83 ec 08 sub rsp,0x8
400584: bf 04 00 00 00 mov edi,0x4
400589: e8 f2 fe ff ff call 400480 <malloc#plt>
40058e: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
400594: 48 89 c7 mov rdi,rax
400597: e8 b4 fe ff ff call 400450 <free#plt>
40059c: b8 00 00 00 00 mov eax,0x0
4005a1: 48 83 c4 08 add rsp,0x8
4005a5: c3 ret
Calls to malloc/free are emitted, as expected.
However, by allowing malloc to be a builtin,
$ gcc -O1 -Wall -Wextra builtin_malloc.c && objdump -d -Mintel a.out
00000000004004f0 <main>:
4004f0: b8 00 00 00 00 mov eax,0x0
4004f5: c3 ret
All of main() was optimized away!
Essentially, by allowing malloc to be a builtin, GCC is free to eliminate calls if its result is never used, because there are no additional side-effects.
It's the same mechanism that allows "wasteful" calls to printf to be changed to calls to puts:
#include <stdio.h>
int main(void)
{
printf("hello\n");
return 0;
}
Builtins disabled:
$ gcc -fno-builtin -O1 -Wall builtin_printf.c && objdump -d -Mintel a.out
0000000000400530 <main>:
400530: 48 83 ec 08 sub rsp,0x8
400534: bf e0 05 40 00 mov edi,0x4005e0
400539: b8 00 00 00 00 mov eax,0x0
40053e: e8 cd fe ff ff call 400410 <printf#plt>
400543: b8 00 00 00 00 mov eax,0x0
400548: 48 83 c4 08 add rsp,0x8
40054c: c3 ret
Builtins enabled:
gcc -O1 -Wall builtin_printf.c && objdump -d -Mintel a.out
0000000000400530 <main>:
400530: 48 83 ec 08 sub rsp,0x8
400534: bf e0 05 40 00 mov edi,0x4005e0
400539: e8 d2 fe ff ff call 400410 <puts#plt>
40053e: b8 00 00 00 00 mov eax,0x0
400543: 48 83 c4 08 add rsp,0x8
400547: c3 ret
Related
While answering another question on Stack, I came upon a question myself. With the following code in shared.c:
#include "stdio.h"
int glob_var = 0;
void shared_func(){
printf("Hello World!");
glob_var = 3;
}
Compiled with:
gcc -shared -fPIC shared.c -oshared.so
If I disassemble the code with objdump -D shared.so, I get the following for shared_func:
0000000000001119 <shared_func>:
1119: f3 0f 1e fa endbr64
111d: 55 push %rbp
111e: 48 89 e5 mov %rsp,%rbp
1121: 48 8d 3d d8 0e 00 00 lea 0xed8(%rip),%rdi # 2000 <_fini+0xebc>
1128: b8 00 00 00 00 mov $0x0,%eax
112d: e8 1e ff ff ff callq 1050 <printf#plt>
1132: 48 8b 05 9f 2e 00 00 mov 0x2e9f(%rip),%rax # 3fd8 <glob_var##Base-0x54>
1139: c7 00 03 00 00 00 movl $0x3,(%rax)
113f: 90 nop
1140: 5d pop %rbp
1141: c3 retq
Using readelf -S shared.so, I get (for the GOT):
[22] .got PROGBITS 0000000000003fd8 00002fd8
0000000000000028 0000000000000008 WA 0 0 8
Correct me if I'm wrong but, looking at this, the relocation for accessing glob_var seems to be through the GOT. As I read on some websites, this is due to limitations in x86-64 machine code where RIP-relative addressing is limited to a 32 bits offset. This explanation is not satisfactory to me because, since the global variable is in the same shared object, then it is guaranteed to be found in its own data segment. The global variable could thus be accessed using RIP-relative addressing without an issue.
I would understand the GOT relocation if glob_var had been declared extern but, in this case, it is defined in the same shared object. Why does gcc require a relocation through the GOT? Is it because it is not able to detect that the global variable is defined within the same shared object?
Related: Why are nonstatic global variables defined in shared objects referenced using GOT?
The above is 11 years old and doesn't answer my question because there doesn't seem to be an appropriate answer there.
Bonus: what does <glob_var##Base-0x54> mean in the disassembly of shared_func? Why it isn't <glob_var#GOT>?
Thanks for any help!
When I compile my code below it prints
I am running :)
forever(Until I send KeyboardInterrupt signal to the program),
but when I uncomment // printf("done:%d\n", done);, recompile and run it, it will print only two times, prints done: 1 and then returns.
I'm new to ucontext.h and I'm very confused about how this code is working and
why a single printf is changing whole behavior of the code, if you replace printf with done++; it would do the same but if you replace it with done = 2; it does not affect anything and works as we had the printf commented at first place.
Can anyone explain:
Why is this code acting like this and what's the logic behind it?
Sorry for my bad English,
Thanks a lot.
#include <ucontext.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
register int done = 0;
ucontext_t one;
ucontext_t two;
getcontext(&one);
printf("I am running :)\n");
sleep(1);
if (!done)
{
done = 1;
swapcontext(&two, &one);
}
// printf("done:%d\n", done);
return 0;
}
This is a compiler optimization "problem". When the "printf()" is commented, the compiler deduces that "done" will not be used after the "if (!done)", so it does not set it to 1 as it is not worth. But when the "printf()" is present, "done" is used after "if (!done)", so the compiler sets it.
Assembly code with the "printf()":
$ gcc ctx.c -o ctx -g
$ objdump -S ctx
[...]
int main(void)
{
11e9: f3 0f 1e fa endbr64
11ed: 55 push %rbp
11ee: 48 89 e5 mov %rsp,%rbp
11f1: 48 81 ec b0 07 00 00 sub $0x7b0,%rsp
11f8: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
11ff: 00 00
1201: 48 89 45 f8 mov %rax,-0x8(%rbp)
1205: 31 c0 xor %eax,%eax
register int done = 0;
1207: c7 85 5c f8 ff ff 00 movl $0x0,-0x7a4(%rbp) <------- done set to 0
120e: 00 00 00
ucontext_t one;
ucontext_t two;
getcontext(&one);
1211: 48 8d 85 60 f8 ff ff lea -0x7a0(%rbp),%rax
1218: 48 89 c7 mov %rax,%rdi
121b: e8 c0 fe ff ff callq 10e0 <getcontext#plt>
1220: f3 0f 1e fa endbr64
printf("I am running :)\n");
1224: 48 8d 3d d9 0d 00 00 lea 0xdd9(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
122b: e8 70 fe ff ff callq 10a0 <puts#plt>
sleep(1);
1230: bf 01 00 00 00 mov $0x1,%edi
1235: e8 b6 fe ff ff callq 10f0 <sleep#plt>
if (!done)
123a: 83 bd 5c f8 ff ff 00 cmpl $0x0,-0x7a4(%rbp)
1241: 75 27 jne 126a <main+0x81>
{
done = 1;
1243: c7 85 5c f8 ff ff 01 movl $0x1,-0x7a4(%rbp) <----- done set to 1
124a: 00 00 00
swapcontext(&two, &one);
124d: 48 8d 95 60 f8 ff ff lea -0x7a0(%rbp),%rdx
1254: 48 8d 85 30 fc ff ff lea -0x3d0(%rbp),%rax
125b: 48 89 d6 mov %rdx,%rsi
125e: 48 89 c7 mov %rax,%rdi
1261: e8 6a fe ff ff callq 10d0 <swapcontext#plt>
1266: f3 0f 1e fa endbr64
}
printf("done:%d\n", done);
126a: 8b b5 5c f8 ff ff mov -0x7a4(%rbp),%esi
1270: 48 8d 3d 9d 0d 00 00 lea 0xd9d(%rip),%rdi # 2014 <_IO_stdin_used+0x14>
1277: b8 00 00 00 00 mov $0x0,%eax
127c: e8 3f fe ff ff callq 10c0 <printf#plt>
return 0;
Assembly code without the "printf()":
$ gcc ctx.c -o ctx -g
$ objdump -S ctx
[...]
int main(void)
{
11c9: f3 0f 1e fa endbr64
11cd: 55 push %rbp
11ce: 48 89 e5 mov %rsp,%rbp
11d1: 48 81 ec b0 07 00 00 sub $0x7b0,%rsp
11d8: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
11df: 00 00
11e1: 48 89 45 f8 mov %rax,-0x8(%rbp)
11e5: 31 c0 xor %eax,%eax
register int done = 0;
11e7: c7 85 5c f8 ff ff 00 movl $0x0,-0x7a4(%rbp) <------ done set to 0
11ee: 00 00 00
ucontext_t one;
ucontext_t two;
getcontext(&one);
11f1: 48 8d 85 60 f8 ff ff lea -0x7a0(%rbp),%rax
11f8: 48 89 c7 mov %rax,%rdi
11fb: e8 c0 fe ff ff callq 10c0 <getcontext#plt>
1200: f3 0f 1e fa endbr64
printf("I am running :)\n");
1204: 48 8d 3d f9 0d 00 00 lea 0xdf9(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
120b: e8 80 fe ff ff callq 1090 <puts#plt>
sleep(1);
1210: bf 01 00 00 00 mov $0x1,%edi
1215: e8 b6 fe ff ff callq 10d0 <sleep#plt>
if (!done)
121a: 83 bd 5c f8 ff ff 00 cmpl $0x0,-0x7a4(%rbp)
1221: 75 1d jne 1240 <main+0x77>
{
done = 1; <------------- done is no set here (it is optimized by the compiler)
swapcontext(&two, &one);
1223: 48 8d 95 60 f8 ff ff lea -0x7a0(%rbp),%rdx
122a: 48 8d 85 30 fc ff ff lea -0x3d0(%rbp),%rax
1231: 48 89 d6 mov %rdx,%rsi
1234: 48 89 c7 mov %rax,%rdi
1237: e8 74 fe ff ff callq 10b0 <swapcontext#plt>
123c: f3 0f 1e fa endbr64
}
//printf("done:%d\n", done);
return 0;
1240: b8 00 00 00 00 mov $0x0,%eax
}
1245: 48 8b 4d f8 mov -0x8(%rbp),%rcx
1249: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx
1250: 00 00
1252: 74 05 je 1259 <main+0x90>
1254: e8 47 fe ff ff callq 10a0 <__stack_chk_fail#plt>
1259: c9 leaveq
125a: c3 retq
125b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
To disable the optimization on "done", add the "volatile" keyword in its definition:
volatile register int done = 0;
This makes the program work in both cases.
(There is some overlap with Rachid K's answer as it was posted while I was writing this.)
I am guessing you are declaring done as register in hopes that it will actually be put in a register, so that its value will be saved and restored by the context switch. But the compiler is never obliged to honor this; most modern compilers ignore register declarations completely and make their own decisions about register usage. And in particular, gcc without optimizations will nearly always put local variables in memory on the stack.
As such, in your test case, the value of done is not restored by the context switch. So when getcontext returns for the second time, done has the same value as when swapcontext was called.
When the printf is present, as Rachid also points out, the done = 1 is actually stored before the swapcontext, so on the second return of getcontext, done has the value 1, the if block is skipped, and the program prints done:1 and exits.
However, when the printf is absent, the compiler notices that the value of done is never used after its assignment (since it assumes swapcontext is a normal function and doesn't know that it will actually return somewhere else), so it optimizes out the dead store (yes, even though optimizations are off). Thus we have done == 0 when getcontext returns the second time, and you get an infinite loop. This is maybe what you were expecting if you thought done would be placed in a register, but if so, you got the "right" behavior for the wrong reason.
If you enable optimizations, you'll see something else again: the compiler notices that done can't be affected by the call to getcontext (again assuming it's a normal function call) and therefore it is guaranteed to be 0 at the if. So the test need not be done at all, because it will always be true. The swapcontext is then executed unconditionally, and as for done, it's optimized completely out of existence, because it no longer has any effect on the code. You'll again see an infinite loop.
Because of this issue, you really can't make any safe assumptions about local variables that have been modified in between the getcontext and swapcontext. When getcontext returns for the second time, you might or might not see the changes. There are further issues if the compiler chose to reorder some of your code around the function call (which it knows no reason not to do, since again it thinks these are ordinary function calls that can't see your local variables).
The only way to get any certainty is to declare a variable volatile. Then you can be sure that intermediate changes will be seen, and the compiler will not assume that getcontext can't change it. The value seen at the second return of getcontext will be the same as at the call to swapcontext. If you write volatile int done = 0; you ought to see just two "I am running" messages, regardless of other code or optimization settings.
Background
I'm making an app that needs to run several tasks concurrently. I can't use threads and such because the app should work without any OS (i.e. straight from the bootsector). Using x86 tasks looks like an overkill (both logically and performance-wise). Thus, I decided to implement a task-switching utility myself. I would save processor state, make a call to the task code and then restore the previous state. So I have to make the call from inline assembly.
Problem
Here's some example code:
#include <stdio.h>
void func() {
printf("Hello, world!\n");
}
void (*funcptr)();
int main() {
funcptr = func;
asm(
"call *%0;"
:
:"r"(funcptr)
);
return 0;
}
It compiles perfectly under icc with no options, gcc and clang and yields "Hello, world!" when run. However, if I compile it with icc main.c -ipo, it segfaults.
I disassembled the code that was generated by icc main.c and got the following:
0000000000401220 <main>:
401220: 55 push %rbp
401221: 48 89 e5 mov %rsp,%rbp
401224: 48 83 e4 80 and $0xffffffffffffff80,%rsp
401228: 48 81 ec 80 00 00 00 sub $0x80,%rsp
40122f: bf 03 00 00 00 mov $0x3,%edi
401234: 33 f6 xor %esi,%esi
401236: e8 45 00 00 00 callq 401280 <__intel_new_feature_proc_init>
40123b: 0f ae 1c 24 stmxcsr (%rsp)
40123f: 48 c7 05 f6 78 00 00 movq $0x401270,0x78f6(%rip) # 408b40 <funcptr>
401246: 70 12 40 00
40124a: b8 70 12 40 00 mov $0x401270,%eax
40124f: 81 0c 24 40 80 00 00 orl $0x8040,(%rsp)
401256: 0f ae 14 24 ldmxcsr (%rsp)
40125a: ff d0 callq *%rax
40125c: 33 c0 xor %eax,%eax
40125e: 48 89 ec mov %rbp,%rsp
401261: 5d pop %rbp
401262: c3 retq
401263: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401268: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40126f: 00
0000000000401270 <func>:
401270: bf 04 40 40 00 mov $0x404004,%edi
401275: e9 e6 fd ff ff jmpq 401060 <puts#plt>
40127a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
On the other hand, icc main.c -ipo yields:
0000000000401210 <main>:
401210: 55 push %rbp
401211: 48 89 e5 mov %rsp,%rbp
401214: 48 83 e4 80 and $0xffffffffffffff80,%rsp
401218: 48 81 ec 80 00 00 00 sub $0x80,%rsp
40121f: bf 03 00 00 00 mov $0x3,%edi
401224: 33 f6 xor %esi,%esi
401226: e8 25 00 00 00 callq 401250 <__intel_new_feature_proc_init>
40122b: 0f ae 1c 24 stmxcsr (%rsp)
40122f: 81 0c 24 40 80 00 00 orl $0x8040,(%rsp)
401236: 48 8b 05 cb 2d 00 00 mov 0x2dcb(%rip),%rax # 404008 <funcptr_2.dp.0>
40123d: 0f ae 14 24 ldmxcsr (%rsp)
401241: ff d0 callq *%rax
401243: 33 c0 xor %eax,%eax
401245: 48 89 ec mov %rbp,%rsp
401248: 5d pop %rbp
401249: c3 retq
40124a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
So, while -ipo didn't remove funcptr variable (see address 401236), it did remove assignment. I guess that icc noticed that func is not called from C code so it can be safely removed, so funcptr is allowed to contain garbage. However, it didn't notice that I'm calling func indirectly via assembly.
What I tried
Replacing "r"(funcptr) with "r"(func) works but I can't hardcode a specific function (see background).
Calling funcptr and/or func before and/or after inline assembly block don't help because icc just inlines printf("Hello, world!\n");.
I can't get rid of inline assembly because I have to do low-level register, flags and stack manipulation before and after call.
Making funcptr volatile yields the following warning but still segfaults:
a value of type "void (*)()" cannot be assigned to an entity of type "volatile void (*)()"
Adding volatile to almost every other word doesn't help either.
Moving func and/or funcptr to other source files and then linking them together doesn't help.
Moving inline assembly to a separate function doesn't work.
Am I doing something wrong or is it an icc bug? If the former, how do I fix the code? If the latter, is there any workaround and should I report the bug?
$ icc --version
icc (ICC) 19.1.0.166 20191121
Copyright (C) 1985-2019 Intel Corporation. All rights reserved.
Here gdb does not stop at Line:4.
Next,
Without hitting the declaration line at Line:5, variable x is existing and initialized.
Next,
But here it shows out of scope (yes it should according to me).
Now, I have the following doubts regarding this particular instance of c program.
When exactly the memory for variable x in P1() gets created and initialized?
why gdb did not stop at static declaration statement in inside P1() in the first example?
If we call P1() again will the program control simply skip the declaration statement?
It has already been explained (in related topics linked in comments below question) how static variables work.
Here is actual code generated by a gcc for your p1 function (by gcc -c -O0 -fomit-frame-pointer -g3 staticvar.c -o staticvar.o) then disassembled with related source.
Disassembly of section .text:
0000000000000000 <p1>:
#include <stdio.h>
void p1(void)
{
0: 48 83 ec 08 sub $0x8,%rsp
static int x = 10;
x += 5;
4: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # a <p1+0xa>
a: 83 c0 05 add $0x5,%eax
d: 89 05 00 00 00 00 mov %eax,0x0(%rip) # 13 <p1+0x13>
printf("%d\n", x);
13: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # 19 <p1+0x19>
19: 89 c6 mov %eax,%esi
1b: bf 00 00 00 00 mov $0x0,%edi
20: b8 00 00 00 00 mov $0x0,%eax
25: e8 00 00 00 00 callq 2a <p1+0x2a>
}
2a: 90 nop
2b: 48 83 c4 08 add $0x8,%rsp
2f: c3 retq
So, as you see there is no code for declaration of x. GDB can only break on actual machine code instruction and as there is none, it breaks on next instruction (mov), which matches line 5.
[Update] 2016.07.02
main.c
#include <stdio.h>
#include <string.h>
size_t strlen(const char *str) {
printf("%s\n", str);
return 99;
}
int main() {
const char *str = "AAA";
size_t a = strlen(str);
strlen(str);
size_t b = strlen("BBB");
return 0;
}
The expected output is
AAA
AAA
BBB
Compile with gcc -O0 -o main main.c :
AAA
Compile with gcc -O3 -o main main.c
AAA
AAA
The corresponding asm code with flag -O0
000000000040057c <main>:
40057c: 55 push %rbp
40057d: 48 89 e5 mov %rsp,%rbp
400580: 48 83 ec 20 sub $0x20,%rsp
400584: 48 c7 45 e8 34 06 40 movq $0x400634,-0x18(%rbp)
40058b: 00
40058c: 48 8b 45 e8 mov -0x18(%rbp),%rax
400590: 48 89 c7 mov %rax,%rdi
400593: e8 c5 ff ff ff callq 40055d <strlen>
400598: 48 89 45 f0 mov %rax,-0x10(%rbp)
40059c: 48 c7 45 f8 03 00 00 movq $0x3,-0x8(%rbp)
4005a3: 00
4005a4: b8 00 00 00 00 mov $0x0,%eax
4005a9: c9 leaveq
4005aa: c3 retq
4005ab: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
and with -O3 :
0000000000400470 <main>:
400470: 48 83 ec 08 sub $0x8,%rsp
400474: bf 24 06 40 00 mov $0x400624,%edi
400479: e8 c2 ff ff ff callq 400440 <puts#plt>
40047e: bf 24 06 40 00 mov $0x400624,%edi
400483: e8 b8 ff ff ff callq 400440 <puts#plt>
400488: 31 c0 xor %eax,%eax
40048a: 48 83 c4 08 add $0x8,%rsp
40048e: c3 retq
With the flag -O0, why is the second and third call to strlen did not call the user defined strlen ?
And with -O3, why is the third call to strlen been optimized out?
GCC recognizes strlen() has a builtin and replaced it builtins. Hence, your version of strlen() isn't called.
I compiled your code with -fno-builtin which disables the builtins and I get the Log in strlen output twice from the 1st and 3rd strlen() calls. Probably the second strlen() gets optimized away as it's return value isn't used. This probably happens before GCC recognizes that it can't use the builtin for strlen(). Otherwise, it can't optimize away 2nd strlen() call because it has the side effect of printing the message.
Similarly, if store the result of 2nd strlen() call with something like:
size_t b = strlen(str); // call 2
then I see, "Log in strlen()" getting printed 3 times.
If I compile with -O3 (either with or without -fno-builtin), I get no output at all because, as said before, GCC optimizes away the whole
program.
This is not an issue with GCC because redefining a standard function is technically undefined behaviour and hence GCC has the liberty handle it in anyway.
Probably build in function enabled by default. If you wise call user define function, then change return type of strlen function.So,
in strlen_adder.h
int strlen(const char*);
in strlen_adder.c
int strlen(const char *s)
Also remove #include<string.h> in main.c file.