[Update] 2016.07.02
main.c
#include <stdio.h>
#include <string.h>
size_t strlen(const char *str) {
printf("%s\n", str);
return 99;
}
int main() {
const char *str = "AAA";
size_t a = strlen(str);
strlen(str);
size_t b = strlen("BBB");
return 0;
}
The expected output is
AAA
AAA
BBB
Compile with gcc -O0 -o main main.c :
AAA
Compile with gcc -O3 -o main main.c
AAA
AAA
The corresponding asm code with flag -O0
000000000040057c <main>:
40057c: 55 push %rbp
40057d: 48 89 e5 mov %rsp,%rbp
400580: 48 83 ec 20 sub $0x20,%rsp
400584: 48 c7 45 e8 34 06 40 movq $0x400634,-0x18(%rbp)
40058b: 00
40058c: 48 8b 45 e8 mov -0x18(%rbp),%rax
400590: 48 89 c7 mov %rax,%rdi
400593: e8 c5 ff ff ff callq 40055d <strlen>
400598: 48 89 45 f0 mov %rax,-0x10(%rbp)
40059c: 48 c7 45 f8 03 00 00 movq $0x3,-0x8(%rbp)
4005a3: 00
4005a4: b8 00 00 00 00 mov $0x0,%eax
4005a9: c9 leaveq
4005aa: c3 retq
4005ab: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
and with -O3 :
0000000000400470 <main>:
400470: 48 83 ec 08 sub $0x8,%rsp
400474: bf 24 06 40 00 mov $0x400624,%edi
400479: e8 c2 ff ff ff callq 400440 <puts#plt>
40047e: bf 24 06 40 00 mov $0x400624,%edi
400483: e8 b8 ff ff ff callq 400440 <puts#plt>
400488: 31 c0 xor %eax,%eax
40048a: 48 83 c4 08 add $0x8,%rsp
40048e: c3 retq
With the flag -O0, why is the second and third call to strlen did not call the user defined strlen ?
And with -O3, why is the third call to strlen been optimized out?
GCC recognizes strlen() has a builtin and replaced it builtins. Hence, your version of strlen() isn't called.
I compiled your code with -fno-builtin which disables the builtins and I get the Log in strlen output twice from the 1st and 3rd strlen() calls. Probably the second strlen() gets optimized away as it's return value isn't used. This probably happens before GCC recognizes that it can't use the builtin for strlen(). Otherwise, it can't optimize away 2nd strlen() call because it has the side effect of printing the message.
Similarly, if store the result of 2nd strlen() call with something like:
size_t b = strlen(str); // call 2
then I see, "Log in strlen()" getting printed 3 times.
If I compile with -O3 (either with or without -fno-builtin), I get no output at all because, as said before, GCC optimizes away the whole
program.
This is not an issue with GCC because redefining a standard function is technically undefined behaviour and hence GCC has the liberty handle it in anyway.
Probably build in function enabled by default. If you wise call user define function, then change return type of strlen function.So,
in strlen_adder.h
int strlen(const char*);
in strlen_adder.c
int strlen(const char *s)
Also remove #include<string.h> in main.c file.
Related
I've created the following function in c as a demonstration/small riddle about how the stack works in c:
#include "stdio.h"
int* func(int i)
{
int j = 3;
j += i;
return &j;
}
int main()
{
int *tmp = func(4);
printf("%d\n", *tmp);
func(5);
printf("%d\n", *tmp);
}
It's obviously undefined behavior and the compiler also produces a warning about that. However unfortunately the compilation didn't quite work out. For some reason gcc replaces the returned pointer by NULL (see line 6d6).
00000000000006aa <func>:
6aa: 55 push %rbp
6ab: 48 89 e5 mov %rsp,%rbp
6ae: 48 83 ec 20 sub $0x20,%rsp
6b2: 89 7d ec mov %edi,-0x14(%rbp)
6b5: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
6bc: 00 00
6be: 48 89 45 f8 mov %rax,-0x8(%rbp)
6c2: 31 c0 xor %eax,%eax
6c4: c7 45 f4 03 00 00 00 movl $0x3,-0xc(%rbp)
6cb: 8b 55 f4 mov -0xc(%rbp),%edx
6ce: 8b 45 ec mov -0x14(%rbp),%eax
6d1: 01 d0 add %edx,%eax
6d3: 89 45 f4 mov %eax,-0xc(%rbp)
6d6: b8 00 00 00 00 mov $0x0,%eax
6db: 48 8b 4d f8 mov -0x8(%rbp),%rcx
6df: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx
6e6: 00 00
6e8: 74 05 je 6ef <func+0x45>
6ea: e8 81 fe ff ff callq 570 <__stack_chk_fail#plt>
6ef: c9 leaveq
6f0: c3 retq
This is the disassembly of the binary compiled with gcc version 7.5.0 and the -O0-flag; no other flags were used. This behavior makes the entire code pointless, since it's supposed to show how the stack behaves across function-calls. Is there any way to achieve a more literal compilation of this code with a at least somewhat up-to-date version of gcc?
And just for the sake of curiosity: what's the point of changing the behavior of the code like this in the first place?
Putting the return value in a pointer variable seems to change the behavior of the compiler and it generates the assembly code that returns a pointer to stack:
int* func(int i) {
int j = 3;
j += i;
int *p = &j;
return p;
}
I know and understand the purpose of volatile variables and optimisation in general (well, I think I do!). This question relates specifically to what happens if a variable is accessed outside the module it is declared in.
In the following scenario, if funcThatWaits was called inside bar.c, it could be optimised and not fetch the value of sTheVar each loop iteration.
However, when GetTheVar is called externally could the same optimisation apply or does the function call ensure sTheVar will always be read each loop iteration?
I am not suggesting this is good code or practice, but an example for the sake of the question.
bar.h
int GetTheVar(void);
bar.c
static /*volatile*/ int sTheVar;
int GetTheVar(void)
{
return sTheVar;
}
static void someISROrFuncCalledFromAnotherThread(void)
{
sTheVar = 1;
}
foo.c
#include "bar.h"
void funcThatWaits(void)
{
while(GetTheVar() != 1) {}
}
when GetTheVar is called externally could the same optimisation apply or does the function call ensure sTheVar will always be read each loop iteration?
The same optimization may apply. For instance, if you are using LTO (Link-Time Optimization), then the compiler knows everything about GetTheVar and will likely decide funcThatWaits is an infinite loop (which, by the way, would be UB).
Function calls are not going to be optimized away since, for all the caller knows, the function being called could depend on some exogenous state.
I compiled the following three files using gcc:
foo.c
#include "bar.h"
void funcThatWaits(void) {
while ( getVar() != 1 );
}
bar.c
#include "foo.h"
static int theVar;
int getTheVar(void) {
return theVar;
}
void theFunc(void) {
funcThatWaits();
}
test.c
#include "bar.h"
int main() {
theFunc();
return 0;
}
Compiling those three into a.out and running objdump -d a.out, the following comes out:
00000000000005fa <main>:
5fa: 55 push %rbp
5fb: 48 89 e5 mov %rsp,%rbp
5fe: e8 25 00 00 00 callq 628 <theFunc>
603: b8 00 00 00 00 mov $0x0,%eax
608: 5d pop %rbp
609: c3 retq
000000000000060a <funcThatWaits>:
60a: 55 push %rbp
60b: 48 89 e5 mov %rsp,%rbp
60e: 90 nop
60f: e8 08 00 00 00 callq 61c <getTheVar>
614: 83 f8 01 cmp $0x1,%eax
617: 75 f6 jne 60f <funcThatWaits+0x5>
619: 90 nop
61a: 5d pop %rbp
61b: c3 retq
000000000000061c <getTheVar>:
61c: 55 push %rbp
61d: 48 89 e5 mov %rsp,%rbp
620: 8b 05 ee 09 20 00 mov 0x2009ee(%rip),%eax # 201014 <theVar>
626: 5d pop %rbp
627: c3 retq
0000000000000628 <theFunc>:
628: 55 push %rbp
629: 48 89 e5 mov %rsp,%rbp
62c: e8 d9 ff ff ff callq 60a <funcThatWaits>
631: 90 nop
632: 5d pop %rbp
633: c3 retq
634: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
63b: 00 00 00
63e: 66 90 xchg %ax,%ax
I have the following code:
void cp(void *a, const void *b, int n) {
for (int i = 0; i < n; ++i) {
((char *) a)[i] = ((const char *) b)[i];
}
}
void _start(void) {
char buf[20];
const char m[] = "123456789012345";
cp(buf, m, 15);
register int rax __asm__ ("rax") = 60; // exit
register int rdi __asm__ ("rdi") = 0; // status
__asm__ volatile (
"syscall" :: "r" (rax), "r" (rdi) : "cc", "rcx", "r11"
);
__builtin_unreachable();
}
If I compile it with gcc -nostdlib -O1 "./a.c" -o "./a", I get a functioning program, but if I compile it with -O2, I get a program that generates a segmentation fault.
This is the generated code with -O1:
0000000000001000 <cp>:
1000: b8 00 00 00 00 mov $0x0,%eax
1005: 0f b6 14 06 movzbl (%rsi,%rax,1),%edx
1009: 88 14 07 mov %dl,(%rdi,%rax,1)
100c: 48 83 c0 01 add $0x1,%rax
1010: 48 83 f8 0f cmp $0xf,%rax
1014: 75 ef jne 1005 <cp+0x5>
1016: c3 retq
0000000000001017 <_start>:
1017: 48 83 ec 30 sub $0x30,%rsp
101b: 48 b8 31 32 33 34 35 movabs $0x3837363534333231,%rax
1022: 36 37 38
1025: 48 ba 39 30 31 32 33 movabs $0x35343332313039,%rdx
102c: 34 35 00
102f: 48 89 04 24 mov %rax,(%rsp)
1033: 48 89 54 24 08 mov %rdx,0x8(%rsp)
1038: 48 89 e6 mov %rsp,%rsi
103b: 48 8d 7c 24 10 lea 0x10(%rsp),%rdi
1040: ba 0f 00 00 00 mov $0xf,%edx
1045: e8 b6 ff ff ff callq 1000 <cp>
104a: b8 3c 00 00 00 mov $0x3c,%eax
104f: bf 00 00 00 00 mov $0x0,%edi
1054: 0f 05 syscall
And this is the generated code with -O2:
0000000000001000 <cp>:
1000: 31 c0 xor %eax,%eax
1002: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
1008: 0f b6 14 06 movzbl (%rsi,%rax,1),%edx
100c: 88 14 07 mov %dl,(%rdi,%rax,1)
100f: 48 83 c0 01 add $0x1,%rax
1013: 48 83 f8 0f cmp $0xf,%rax
1017: 75 ef jne 1008 <cp+0x8>
1019: c3 retq
101a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000001020 <_start>:
1020: 48 8d 44 24 d8 lea -0x28(%rsp),%rax
1025: 48 8d 54 24 c9 lea -0x37(%rsp),%rdx
102a: b9 31 00 00 00 mov $0x31,%ecx
102f: 66 0f 6f 05 c9 0f 00 movdqa 0xfc9(%rip),%xmm0 # 2000 <_start+0xfe0>
1036: 00
1037: 48 8d 70 0f lea 0xf(%rax),%rsi
103b: 0f 29 44 24 c8 movaps %xmm0,-0x38(%rsp)
1040: eb 0d jmp 104f <_start+0x2f>
1042: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
1048: 0f b6 0a movzbl (%rdx),%ecx
104b: 48 83 c2 01 add $0x1,%rdx
104f: 88 08 mov %cl,(%rax)
1051: 48 83 c0 01 add $0x1,%rax
1055: 48 39 f0 cmp %rsi,%rax
1058: 75 ee jne 1048 <_start+0x28>
105a: b8 3c 00 00 00 mov $0x3c,%eax
105f: 31 ff xor %edi,%edi
1061: 0f 05 syscall
The crash happens at 103b, instruction movaps %xmm0,-0x38(%rsp).
I noticed that if m contains less than 15 characters, then the generated code is different and the crash does not happen.
What am I doing wrong?
_start is not a function. It's not called by anything, and on entry the stack is 16-byte aligned, not (as the ABI requires) 8 bytes away from 16-byte alignment.
(The ABI requires 16-byte alignment before a call, and call pushes an 8-byte return address. So on function entry RSP-8 and RSP+8 are 16-byte aligned.)
At -O2 GCC uses alignment-required 16-byte instructions to implement the copy done by cp(), copying the "123456789012345" from static storage to the stack.
At -O1, GCC just uses two mov r64, imm64 instructions to get bytes into integer regs for 8-byte stores. These don't require alignment.
Workarounds
Just write a main in C like a normal person if you want everything to work.
Or if you're trying to microbenchmark something light-weight in asm, you can use gcc -nostdlib -O3 -mincoming-stack-boundary=3 (docs) to tell GCC that functions can't assume they're called with more than 8-byte alignment. Unlike -mpreferred-stack-boundary=3, this will still align by 16 before making further calls. So if you have other non-leaf functions, you might want to just use an attribute on your hacky C _start() instead of affecting the whole file.
A worse, more hacky way would be to try putting
asm("push %rax"); at the very top of _start to modify RSP by 8, where GCC hopefully runs it before doing anything else with the stack. GNU C Basic asm statements are implicitly volatile so you don't need asm volatile, although that wouldn't hurt.
You're 100% on your own and responsible for correctly tricking the compiler by using inline asm that works for whatever optimization level you're using.
Another safer way would be write your own light-weight _start that calls main:
// at global scope:
asm(
".globl _start \n"
"_start: \n"
" mov (%rsp), %rdi \n" // argc
" lea 8(%rsp), %rsi \n" // argv
" lea 8(%rsi, %rdi, 8), %rdx \n" // envp
" call main \n"
// NOT DONE: stdio cleanup or other atexit stuff
// DO NOT USE WITH GLIBC; use libc's CRT code if you use libc
" mov %eax, %edi \n"
" mov $231, %eax \n"
" syscall" // exit_group( main() )
);
int main(int argc, char**argv, char**envp) {
... your code here
return 0;
}
If you didn't want main to return, you could just pop %rdi; mov %rsp, %rsi ; jmp main to give it argc and argv without a return address.
Then main can exit via inline asm, or by calling exit() or _exit() if you link libc. (But if you link libc, you should usually use its _start.)
See also: How Get arguments value using inline assembly in C without Glibc? for other hand-rolled _start versions; this is pretty much like #zwol's there.
I have recently been made aware of GCC's built-in functions for some of the C library's memory management functions, specifically __builtin_malloc() and related built-ins (see https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html). Upon learning about __builtin_malloc(), I was wondering how it might work to provide performance improvements over the plain malloc() related library routines.
For example, if the function succeeds, it has to provide a block that can be freed by a call to plain free() since the pointer might be freed by a module that was compiled without __builtin_malloc() or __builtin_free() enabled (or am I wrong about this,and if __builtin_malloc() is used, the builtins must be globally used?). Therefore the allocated object has to be something that can be managed with the data structures that plain malloc() and free() deal with.
I can't find any details of how __builtin_malloc() works or what it does exactly (I'm not a compiler dev, so spelunking through GCC source code isn't in my wheelhouse). In some simple tests where I've tried calling __builtin_malloc() directly, it simply ends up being emitted in the object code as a call to plain malloc(). However, there might be subtlety or platform detail that I'm not providing in these simple tests.
What kinds of performance improvements can __builtin_malloc() provide over a call to plain malloc()? Does __builtin_malloc() have a dependency on the rather complex data structures that glibc's malloc() implementation use? Or conversely, does glibc's malloc()/free() have some code to deal with blocks that might be allocated by __builtin_malloc()?
Basically, how does it work?
I believe there is no special GCC-internal implementation of __builtin_malloc(). Rather, it exists as a builtin only so it can be optimized away under certain circumstances.
Take this example:
#include <stdlib.h>
int main(void)
{
int *p = malloc(4);
*p = 7;
free(p);
return 0;
}
If we disable builtins (with -fno-builtins) and look at the generated output:
$ gcc -fno-builtins -O1 -Wall -Wextra builtin_malloc.c && objdump -d -Mintel a.out
0000000000400580 <main>:
400580: 48 83 ec 08 sub rsp,0x8
400584: bf 04 00 00 00 mov edi,0x4
400589: e8 f2 fe ff ff call 400480 <malloc#plt>
40058e: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
400594: 48 89 c7 mov rdi,rax
400597: e8 b4 fe ff ff call 400450 <free#plt>
40059c: b8 00 00 00 00 mov eax,0x0
4005a1: 48 83 c4 08 add rsp,0x8
4005a5: c3 ret
Calls to malloc/free are emitted, as expected.
However, by allowing malloc to be a builtin,
$ gcc -O1 -Wall -Wextra builtin_malloc.c && objdump -d -Mintel a.out
00000000004004f0 <main>:
4004f0: b8 00 00 00 00 mov eax,0x0
4004f5: c3 ret
All of main() was optimized away!
Essentially, by allowing malloc to be a builtin, GCC is free to eliminate calls if its result is never used, because there are no additional side-effects.
It's the same mechanism that allows "wasteful" calls to printf to be changed to calls to puts:
#include <stdio.h>
int main(void)
{
printf("hello\n");
return 0;
}
Builtins disabled:
$ gcc -fno-builtin -O1 -Wall builtin_printf.c && objdump -d -Mintel a.out
0000000000400530 <main>:
400530: 48 83 ec 08 sub rsp,0x8
400534: bf e0 05 40 00 mov edi,0x4005e0
400539: b8 00 00 00 00 mov eax,0x0
40053e: e8 cd fe ff ff call 400410 <printf#plt>
400543: b8 00 00 00 00 mov eax,0x0
400548: 48 83 c4 08 add rsp,0x8
40054c: c3 ret
Builtins enabled:
gcc -O1 -Wall builtin_printf.c && objdump -d -Mintel a.out
0000000000400530 <main>:
400530: 48 83 ec 08 sub rsp,0x8
400534: bf e0 05 40 00 mov edi,0x4005e0
400539: e8 d2 fe ff ff call 400410 <puts#plt>
40053e: b8 00 00 00 00 mov eax,0x0
400543: 48 83 c4 08 add rsp,0x8
400547: c3 ret
I often notice severely mangled output with mixed assembly and C instructions in the output of objdump -S. This seems to happen only for binaries built with debug info. Is there any way to fix this?
To illustrate the issue i have written a simple program :
/* test.c */
#include <stdio.h>
int main()
{
static int i = 0;
while(i < 0x1000000) {
i++;
}
return 0;
}
The above program was built with/without debug info as follows :
$ gcc test.c -o test-release
$ gcc test.c -g -o test-debug
Disassembling the test-release binary works fine.
$ objdump -S test-release
produces the following clear and concise snippet for the main() function.
080483b4 <main>:
80483b4: 55 push %ebp
80483b5: 89 e5 mov %esp,%ebp
80483b7: eb 0d jmp 80483c6 <main+0x12>
80483b9: a1 18 a0 04 08 mov 0x804a018,%eax
80483be: 83 c0 01 add $0x1,%eax
80483c1: a3 18 a0 04 08 mov %eax,0x804a018
80483c6: a1 18 a0 04 08 mov 0x804a018,%eax
80483cb: 3d ff ff ff 00 cmp $0xffffff,%eax
80483d0: 7e e7 jle 80483b9 <main+0x5>
80483d2: b8 00 00 00 00 mov $0x0,%eax
80483d7: 5d pop %ebp
80483d8: c3 ret
But $ objdump -S test-debug
produces the following mangled snippet for the same main() function.
080483b4 <main>:
#include <stdio.h>
int main()
{
80483b4: 55 push %ebp
80483b5: 89 e5 mov %esp,%ebp
static int i = 0;
while(i < 0x1000000) {
80483b7: eb 0d jmp 80483c6 <main+0x12>
i++;
80483b9: a1 18 a0 04 08 mov 0x804a018,%eax
80483be: 83 c0 01 add $0x1,%eax
80483c1: a3 18 a0 04 08 mov %eax,0x804a018
int main()
{
static int i = 0;
while(i < 0x1000000) {
80483c6: a1 18 a0 04 08 mov 0x804a018,%eax
80483cb: 3d ff ff ff 00 cmp $0xffffff,%eax
80483d0: 7e e7 jle 80483b9 <main+0x5>
i++;
}
return 0;
80483d2: b8 00 00 00 00 mov $0x0,%eax
}
80483d7: 5d pop %ebp
80483d8: c3 ret
I do understand that as the debug binary contains additional symbol info, the C code is displayed interlaced with the assembly instructions. But this makes it a tad difficult to follow the flow of code.
Is there any way to instruct objdump to output pure assembly and not interlace debug symbols into the output even if encountered in a binary?
Use -d instead of -S. objdump is doing exactly what you are telling it to. The -S option implies -d but also displays the C source if debugging information is available.