I have the following code:
void cp(void *a, const void *b, int n) {
for (int i = 0; i < n; ++i) {
((char *) a)[i] = ((const char *) b)[i];
}
}
void _start(void) {
char buf[20];
const char m[] = "123456789012345";
cp(buf, m, 15);
register int rax __asm__ ("rax") = 60; // exit
register int rdi __asm__ ("rdi") = 0; // status
__asm__ volatile (
"syscall" :: "r" (rax), "r" (rdi) : "cc", "rcx", "r11"
);
__builtin_unreachable();
}
If I compile it with gcc -nostdlib -O1 "./a.c" -o "./a", I get a functioning program, but if I compile it with -O2, I get a program that generates a segmentation fault.
This is the generated code with -O1:
0000000000001000 <cp>:
1000: b8 00 00 00 00 mov $0x0,%eax
1005: 0f b6 14 06 movzbl (%rsi,%rax,1),%edx
1009: 88 14 07 mov %dl,(%rdi,%rax,1)
100c: 48 83 c0 01 add $0x1,%rax
1010: 48 83 f8 0f cmp $0xf,%rax
1014: 75 ef jne 1005 <cp+0x5>
1016: c3 retq
0000000000001017 <_start>:
1017: 48 83 ec 30 sub $0x30,%rsp
101b: 48 b8 31 32 33 34 35 movabs $0x3837363534333231,%rax
1022: 36 37 38
1025: 48 ba 39 30 31 32 33 movabs $0x35343332313039,%rdx
102c: 34 35 00
102f: 48 89 04 24 mov %rax,(%rsp)
1033: 48 89 54 24 08 mov %rdx,0x8(%rsp)
1038: 48 89 e6 mov %rsp,%rsi
103b: 48 8d 7c 24 10 lea 0x10(%rsp),%rdi
1040: ba 0f 00 00 00 mov $0xf,%edx
1045: e8 b6 ff ff ff callq 1000 <cp>
104a: b8 3c 00 00 00 mov $0x3c,%eax
104f: bf 00 00 00 00 mov $0x0,%edi
1054: 0f 05 syscall
And this is the generated code with -O2:
0000000000001000 <cp>:
1000: 31 c0 xor %eax,%eax
1002: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
1008: 0f b6 14 06 movzbl (%rsi,%rax,1),%edx
100c: 88 14 07 mov %dl,(%rdi,%rax,1)
100f: 48 83 c0 01 add $0x1,%rax
1013: 48 83 f8 0f cmp $0xf,%rax
1017: 75 ef jne 1008 <cp+0x8>
1019: c3 retq
101a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000001020 <_start>:
1020: 48 8d 44 24 d8 lea -0x28(%rsp),%rax
1025: 48 8d 54 24 c9 lea -0x37(%rsp),%rdx
102a: b9 31 00 00 00 mov $0x31,%ecx
102f: 66 0f 6f 05 c9 0f 00 movdqa 0xfc9(%rip),%xmm0 # 2000 <_start+0xfe0>
1036: 00
1037: 48 8d 70 0f lea 0xf(%rax),%rsi
103b: 0f 29 44 24 c8 movaps %xmm0,-0x38(%rsp)
1040: eb 0d jmp 104f <_start+0x2f>
1042: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
1048: 0f b6 0a movzbl (%rdx),%ecx
104b: 48 83 c2 01 add $0x1,%rdx
104f: 88 08 mov %cl,(%rax)
1051: 48 83 c0 01 add $0x1,%rax
1055: 48 39 f0 cmp %rsi,%rax
1058: 75 ee jne 1048 <_start+0x28>
105a: b8 3c 00 00 00 mov $0x3c,%eax
105f: 31 ff xor %edi,%edi
1061: 0f 05 syscall
The crash happens at 103b, instruction movaps %xmm0,-0x38(%rsp).
I noticed that if m contains fewer than 15 characters, then the generated code is different and the crash does not happen.
What am I doing wrong?
_start is not a function. It's not called by anything, and on entry the stack is 16-byte aligned, not (as the ABI requires) 8 bytes away from 16-byte alignment.
(The ABI requires 16-byte alignment before a call, and call pushes an 8-byte return address. So on function entry RSP-8 and RSP+8 are 16-byte aligned.)
At -O2 GCC uses alignment-required 16-byte instructions to implement the copy done by cp(), copying the "123456789012345" from static storage to the stack.
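The arithmetic can be checked with a small sketch. The helper names and the sample RSP value used with them are made up for illustration; only the 8-byte return address and the -0x38 offset come from the ABI rule and the disassembly above.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of align (align must be a power of two). */
bool is_aligned(uintptr_t addr, uintptr_t align) {
    return (addr & (align - 1)) == 0;
}

/* RSP as seen on entry to a normal function: the caller had RSP 16-byte
   aligned, then `call` pushed an 8-byte return address. */
uintptr_t rsp_on_function_entry(uintptr_t rsp_before_call) {
    return rsp_before_call - 8;
}

/* Address written by `movaps %xmm0,-0x38(%rsp)` for a given RSP. */
uintptr_t movaps_target(uintptr_t rsp) {
    return rsp - 0x38;
}
```

With a 16-byte aligned RSP before the call, a called function sees RSP-0x38 as a multiple of 16, which is what GCC assumes. At process entry RSP itself is the aligned value, so the same offset lands 8 bytes off and movaps faults.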
At -O1, GCC just uses two mov r64, imm64 instructions to get bytes into integer regs for 8-byte stores. These don't require alignment.
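As a rough illustration of where the two immediates in the -O1 listing come from, the string bytes can be packed by hand. pack_le64 is a hypothetical helper written for this sketch, not something GCC exposes.

```c
#include <stdint.h>

/* Pack 8 bytes into a 64-bit value with bytes[0] in the lowest byte,
   which is how an 8-byte store lays them out on little-endian x86-64. */
uint64_t pack_le64(const char bytes[8]) {
    uint64_t value = 0;
    for (int i = 0; i < 8; ++i) {
        value |= (uint64_t)(unsigned char)bytes[i] << (8 * i);
    }
    return value;
}
```

Packing "12345678" gives 0x3837363534333231, and packing "9012345" together with its terminating zero byte gives 0x35343332313039, matching the two movabs operands.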
Workarounds
Just write a main in C like a normal person if you want everything to work.
Or if you're trying to microbenchmark something light-weight in asm, you can use gcc -nostdlib -O3 -mincoming-stack-boundary=3 (docs) to tell GCC that functions can't assume they're called with more than 8-byte alignment. Unlike -mpreferred-stack-boundary=3, this will still align by 16 before making further calls. So if you have other non-leaf functions, you might want to just use an attribute on your hacky C _start() instead of affecting the whole file.
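One attribute that fits here is GCC's x86 function attribute force_align_arg_pointer, which makes the function realign the stack in its own prologue. The sketch below is a minimal illustration under assumptions: the REALIGN_STACK macro and the function name are made up, and the attribute is applied to an ordinary function so the example links with a normal C runtime. In the _start scenario it would go on _start instead.

```c
#include <stdint.h>

/* force_align_arg_pointer is an x86-specific attribute; the guard keeps the
   sketch compiling on other targets. REALIGN_STACK is a hypothetical name. */
#if defined(__x86_64__) || defined(__i386__)
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK
#endif

/* Returns 1 if a local that requests 16-byte alignment really got it. */
REALIGN_STACK
int aligned_local_check(void) {
    _Alignas(16) volatile unsigned char buf[16];
    buf[0] = 0;
    return ((uintptr_t)buf % 16) == 0;
}
```

Because the realignment happens inside the marked function, the rest of the file keeps the normal ABI assumptions.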
A worse, more hacky way would be to try putting
asm("push %rax"); at the very top of _start to modify RSP by 8, where GCC hopefully runs it before doing anything else with the stack. GNU C Basic asm statements are implicitly volatile so you don't need asm volatile, although that wouldn't hurt.
You're 100% on your own and responsible for correctly tricking the compiler by using inline asm that works for whatever optimization level you're using.
Another safer way would be to write your own light-weight _start that calls main:
// at global scope:
asm(
".globl _start \n"
"_start: \n"
" mov (%rsp), %rdi \n" // argc
" lea 8(%rsp), %rsi \n" // argv
" lea 8(%rsi, %rdi, 8), %rdx \n" // envp
" call main \n"
// NOT DONE: stdio cleanup or other atexit stuff
// DO NOT USE WITH GLIBC; use libc's CRT code if you use libc
" mov %eax, %edi \n"
" mov $231, %eax \n"
" syscall" // exit_group( main() )
);
int main(int argc, char**argv, char**envp) {
// ... your code here
return 0;
}
If you didn't want main to return, you could just pop %rdi; mov %rsp, %rsi ; jmp main to give it argc and argv without a return address.
Then main can exit via inline asm, or by calling exit() or _exit() if you link libc. (But if you link libc, you should usually use its _start.)
See also: How Get arguments value using inline assembly in C without Glibc? for other hand-rolled _start versions; this is pretty much like @zwol's there.
I want to understand AFL's code instrumentation in detail.
Compiling a sample program sample.c
int main(int argc, char **argv) {
int ret = 0;
if(argc > 1) {
ret = 7;
} else {
ret = 12;
}
return ret;
}
with gcc -c -o obj/sample-gcc.o src/sample.c and afl-gcc -c -o obj/sample-afl-gcc.o src/sample.c and disassembling both with objdump -d leads to different Assembly code:
[GCC]
0000000000000000 <main>:
0: f3 0f 1e fa endbr64
4: 55 push %rbp
5: 48 89 e5 mov %rsp,%rbp
8: 89 7d ec mov %edi,-0x14(%rbp)
b: 48 89 75 e0 mov %rsi,-0x20(%rbp)
f: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
16: 83 7d ec 01 cmpl $0x1,-0x14(%rbp)
1a: 7e 09 jle 25 <main+0x25>
1c: c7 45 fc 07 00 00 00 movl $0x7,-0x4(%rbp)
23: eb 07 jmp 2c <main+0x2c>
25: c7 45 fc 0c 00 00 00 movl $0xc,-0x4(%rbp)
2c: 8b 45 fc mov -0x4(%rbp),%eax
2f: 5d pop %rbp
30: c3 retq
[AFL-GCC]
0000000000000000 <main>:
0: 48 8d a4 24 68 ff ff lea -0x98(%rsp),%rsp
7: ff
8: 48 89 14 24 mov %rdx,(%rsp)
c: 48 89 4c 24 08 mov %rcx,0x8(%rsp)
11: 48 89 44 24 10 mov %rax,0x10(%rsp)
16: 48 c7 c1 0e ff 00 00 mov $0xff0e,%rcx
1d: e8 00 00 00 00 callq 22 <main+0x22>
22: 48 8b 44 24 10 mov 0x10(%rsp),%rax
27: 48 8b 4c 24 08 mov 0x8(%rsp),%rcx
2c: 48 8b 14 24 mov (%rsp),%rdx
30: 48 8d a4 24 98 00 00 lea 0x98(%rsp),%rsp
37: 00
38: f3 0f 1e fa endbr64
3c: 31 c0 xor %eax,%eax
3e: 83 ff 01 cmp $0x1,%edi
41: 0f 9e c0 setle %al
44: 8d 44 80 07 lea 0x7(%rax,%rax,4),%eax
48: c3 retq
AFL (usually) adds a trampoline in front of every basic block to track executed paths [https://github.com/mirrorer/afl/blob/master/afl-as.h#L130]
-> Instruction 0x00 lea until 0x30 lea
AFL (usually) adds a main payload to the program (which I excluded for simplicity) [https://github.com/mirrorer/afl/blob/master/afl-as.h#L381]
AFL claims to use a wrapper for GCC, so I expected everything else to be equal. Why is the if-else-condition still compiled differently?
Bonus question: If a binary without source code available should be instrumented manually without using AFL's QEMU-mode or Unicorn-mode, can this be achieved by (naively) adding the main payload and each trampoline manually to the binary file or are there better approaches?
Re: Why the compilation with gcc and with afl-gcc is different, a short look at the afl-gcc source shows that by default it modifies the compiler parameters, setting -O3 -funroll-loops (as well as defining __AFL_COMPILER and FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION).
According to the documentation (docs/env_variables.txt):
By default, the wrapper appends -O3 to optimize builds. Very rarely,
this will cause problems in programs built with -Werror, simply
because -O3 enables more thorough code analysis and can spew out
additional warnings. To disable optimizations, set AFL_DONT_OPTIMIZE.
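To see that the optimized AFL-GCC body still computes the same result as the unoptimized one, the setle/lea pair from the listing can be written out in C. This is only a reading of the disassembly shown in the question; the function names are made up.

```c
/* The source-level logic from sample.c. */
int ret_branchy(int argc) {
    int ret = 0;
    if (argc > 1) {
        ret = 7;
    } else {
        ret = 12;
    }
    return ret;
}

/* What the optimized sequence computes:
   setle                -> flag = (argc <= 1)
   lea 0x7(%rax,%rax,4) -> flag + flag * 4 + 7 */
int ret_branchless(int argc) {
    int flag = (argc <= 1);
    return flag + flag * 4 + 7;
}
```

Both functions return 12 when argc is at most 1 and 7 otherwise, so the difference in the listings is the optimization level rather than a change in behavior.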
I've created the following function in c as a demonstration/small riddle about how the stack works in c:
#include <stdio.h>
int* func(int i)
{
int j = 3;
j += i;
return &j;
}
int main()
{
int *tmp = func(4);
printf("%d\n", *tmp);
func(5);
printf("%d\n", *tmp);
}
It's obviously undefined behavior and the compiler also produces a warning about that. However unfortunately the compilation didn't quite work out. For some reason gcc replaces the returned pointer by NULL (see line 6d6).
00000000000006aa <func>:
6aa: 55 push %rbp
6ab: 48 89 e5 mov %rsp,%rbp
6ae: 48 83 ec 20 sub $0x20,%rsp
6b2: 89 7d ec mov %edi,-0x14(%rbp)
6b5: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
6bc: 00 00
6be: 48 89 45 f8 mov %rax,-0x8(%rbp)
6c2: 31 c0 xor %eax,%eax
6c4: c7 45 f4 03 00 00 00 movl $0x3,-0xc(%rbp)
6cb: 8b 55 f4 mov -0xc(%rbp),%edx
6ce: 8b 45 ec mov -0x14(%rbp),%eax
6d1: 01 d0 add %edx,%eax
6d3: 89 45 f4 mov %eax,-0xc(%rbp)
6d6: b8 00 00 00 00 mov $0x0,%eax
6db: 48 8b 4d f8 mov -0x8(%rbp),%rcx
6df: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx
6e6: 00 00
6e8: 74 05 je 6ef <func+0x45>
6ea: e8 81 fe ff ff callq 570 <__stack_chk_fail@plt>
6ef: c9 leaveq
6f0: c3 retq
This is the disassembly of the binary compiled with gcc version 7.5.0 and the -O0 flag; no other flags were used. This behavior makes the entire code pointless, since it's supposed to show how the stack behaves across function calls. Is there any way to achieve a more literal compilation of this code with an at least somewhat up-to-date version of gcc?
And just for the sake of curiosity: what's the point of changing the behavior of the code like this in the first place?
Putting the return value in a pointer variable seems to change the behavior of the compiler and it generates the assembly code that returns a pointer to stack:
int* func(int i) {
int j = 3;
j += i;
int *p = &j;
return p;
}
Background
I'm making an app that needs to run several tasks concurrently. I can't use threads and such because the app should work without any OS (i.e. straight from the bootsector). Using x86 tasks looks like an overkill (both logically and performance-wise). Thus, I decided to implement a task-switching utility myself. I would save processor state, make a call to the task code and then restore the previous state. So I have to make the call from inline assembly.
Problem
Here's some example code:
#include <stdio.h>
void func() {
printf("Hello, world!\n");
}
void (*funcptr)();
int main() {
funcptr = func;
asm(
"call *%0;"
:
:"r"(funcptr)
);
return 0;
}
It compiles perfectly under icc with no options, gcc and clang and yields "Hello, world!" when run. However, if I compile it with icc main.c -ipo, it segfaults.
I disassembled the code that was generated by icc main.c and got the following:
0000000000401220 <main>:
401220: 55 push %rbp
401221: 48 89 e5 mov %rsp,%rbp
401224: 48 83 e4 80 and $0xffffffffffffff80,%rsp
401228: 48 81 ec 80 00 00 00 sub $0x80,%rsp
40122f: bf 03 00 00 00 mov $0x3,%edi
401234: 33 f6 xor %esi,%esi
401236: e8 45 00 00 00 callq 401280 <__intel_new_feature_proc_init>
40123b: 0f ae 1c 24 stmxcsr (%rsp)
40123f: 48 c7 05 f6 78 00 00 movq $0x401270,0x78f6(%rip) # 408b40 <funcptr>
401246: 70 12 40 00
40124a: b8 70 12 40 00 mov $0x401270,%eax
40124f: 81 0c 24 40 80 00 00 orl $0x8040,(%rsp)
401256: 0f ae 14 24 ldmxcsr (%rsp)
40125a: ff d0 callq *%rax
40125c: 33 c0 xor %eax,%eax
40125e: 48 89 ec mov %rbp,%rsp
401261: 5d pop %rbp
401262: c3 retq
401263: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401268: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40126f: 00
0000000000401270 <func>:
401270: bf 04 40 40 00 mov $0x404004,%edi
401275: e9 e6 fd ff ff jmpq 401060 <puts@plt>
40127a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
On the other hand, icc main.c -ipo yields:
0000000000401210 <main>:
401210: 55 push %rbp
401211: 48 89 e5 mov %rsp,%rbp
401214: 48 83 e4 80 and $0xffffffffffffff80,%rsp
401218: 48 81 ec 80 00 00 00 sub $0x80,%rsp
40121f: bf 03 00 00 00 mov $0x3,%edi
401224: 33 f6 xor %esi,%esi
401226: e8 25 00 00 00 callq 401250 <__intel_new_feature_proc_init>
40122b: 0f ae 1c 24 stmxcsr (%rsp)
40122f: 81 0c 24 40 80 00 00 orl $0x8040,(%rsp)
401236: 48 8b 05 cb 2d 00 00 mov 0x2dcb(%rip),%rax # 404008 <funcptr_2.dp.0>
40123d: 0f ae 14 24 ldmxcsr (%rsp)
401241: ff d0 callq *%rax
401243: 33 c0 xor %eax,%eax
401245: 48 89 ec mov %rbp,%rsp
401248: 5d pop %rbp
401249: c3 retq
40124a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
So, while -ipo didn't remove funcptr variable (see address 401236), it did remove assignment. I guess that icc noticed that func is not called from C code so it can be safely removed, so funcptr is allowed to contain garbage. However, it didn't notice that I'm calling func indirectly via assembly.
What I tried
Replacing "r"(funcptr) with "r"(func) works but I can't hardcode a specific function (see background).
Calling funcptr and/or func before and/or after the inline assembly block doesn't help because icc just inlines printf("Hello, world!\n");.
I can't get rid of inline assembly because I have to do low-level register, flags and stack manipulation before and after call.
Making funcptr volatile yields the following warning but still segfaults:
a value of type "void (*)()" cannot be assigned to an entity of type "volatile void (*)()"
Adding volatile to almost every other word doesn't help either.
Moving func and/or funcptr to other source files and then linking them together doesn't help.
Moving inline assembly to a separate function doesn't work.
Am I doing something wrong or is it an icc bug? If the former, how do I fix the code? If the latter, is there any workaround and should I report the bug?
$ icc --version
icc (ICC) 19.1.0.166 20191121
Copyright (C) 1985-2019 Intel Corporation. All rights reserved.
I compiled a call to printf with different kinds of args.
Here's the code + generated asm:
int main(int argc, char const *argv[]){
// 0: 55 push rbp
// 1: 48 89 e5 mov rbp,rsp
// 4: 48 83 ec 20 sub rsp,0x20
// 8: 89 7d fc mov DWORD PTR [rbp-0x4],edi
// b: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
printf("%s %f %d %f\n", "aye u gonna get some", 133.7f, 420, 69.f);
// f: f2 0f 10 05 00 00 00 00 movsd xmm0,QWORD PTR [rip+0x0] # 17 <main+0x17> 13: R_X86_64_PC32 .rodata+0x2c 69
// 17: 48 8b 05 00 00 00 00 mov rax,QWORD PTR [rip+0x0] # 1e <main+0x1e> 1a: R_X86_64_PC32 .rodata+0x34 133.7
// 1e: 66 0f 28 c8 movapd xmm1,xmm0
// 22: ba a4 01 00 00 mov edx,0x1a4 (420)
// 27: 48 89 45 e8 mov QWORD PTR [rbp-0x18],rax
// 2b: f2 0f 10 45 e8 movsd xmm0,QWORD PTR [rbp-0x18]
// 30: 48 8d 35 00 00 00 00 lea rsi,[rip+0x0] # 37 <main+0x37> 33: R_X86_64_PC32 .rodata-0x4 "aye u gonna get some"
// 37: 48 8d 3d 00 00 00 00 lea rdi,[rip+0x0] # 3e <main+0x3e> 3a: R_X86_64_PC32 .rodata+0x18 "%s %f %d %f\n"
// 3e: b8 02 00 00 00 mov eax,0x2
// 43: e8 00 00 00 00 call 48 <main+0x48> 44: R_X86_64_PLT32 printf-0x4
return 0;
// 48: b8 00 00 00 00 mov eax,0x0
// 4d: c9 leave
// 4e: c3 ret
}
Most stuff here makes sense to me.
In fact everything here makes some level of sense to me.
"%s %f %d %f\n" -> rdi
"aye u gonna get some" -> rsi
133.7 -> xmm0
420 -> rdx
69 -> xmm1
2 -> rax (to indicate there are 2 floating point arguments)
Now what I don't understand is how printf (or any other varargs function) would figure out the position of these floating point arguments among the others.
It can't be compiler magic either since it's dynamically linked.
So the only thing I can think of is maybe it's just va_arg internals, and how when you provide a type, if it's floating point, it must get from the xmms (or stack) instead of otherwise.
Is that correct? If not, how does the other side know where to get 'em? Thanks in advance.
For printf the format string indicates the type of the remaining arguments.
The implementation of va_arg knows the type as it is an argument of va_arg, and the correct register can be deduced from the types.
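A minimal sketch of the same idea: a variadic function that, like printf, learns each argument's type from a descriptor string and passes that type to va_arg. The function name and the one-letter format ('d' for int, 'f' for double) are made up for illustration.

```c
#include <stdarg.h>

/* Hypothetical mini-format: 'd' means the next argument is an int,
   'f' means the next argument is a double. Other characters are ignored. */
double sum_by_format(const char *fmt, ...) {
    va_list ap;
    double total = 0.0;

    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0'; ++p) {
        if (*p == 'd') {
            total += va_arg(ap, int);      /* taken from the integer argument area */
        } else if (*p == 'f') {
            total += va_arg(ap, double);   /* taken from the floating-point argument area */
        }
    }
    va_end(ap);
    return total;
}
```

A call such as sum_by_format("dfd", 1, 2.5, 3) mixes the two kinds freely. The callee never needs to know the position of the floating-point values among the others, only the order within each kind, which the type given to va_arg determines.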
I have tried the following code on gcc 4.4.5 on Linux and gcc-llvm on Mac OS X (Xcode 4.2.1) and this. Below are the source and the generated disassembly of the relevant functions. (Added: compiled with gcc -O2 main.c)
#include <stdio.h>
__attribute__((noinline))
static void g(long num)
{
long m, n;
printf("%p %ld\n", &m, n);
return g(num-1);
}
__attribute__((noinline))
static void h(long num)
{
long m, n;
printf("%ld %ld\n", m, n);
return h(num-1);
}
__attribute__((noinline))
static void f(long * num)
{
scanf("%ld", num);
g(*num);
h(*num);
return f(num);
}
int main(void)
{
printf("int:%lu long:%lu unsigned:%lu\n", sizeof(int), sizeof(long), sizeof(unsigned));
long num;
f(&num);
return 0;
}
08048430 <g>:
8048430: 55 push %ebp
8048431: 89 e5 mov %esp,%ebp
8048433: 53 push %ebx
8048434: 89 c3 mov %eax,%ebx
8048436: 83 ec 24 sub $0x24,%esp
8048439: 8d 45 f4 lea -0xc(%ebp),%eax
804843c: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
8048443: 00
8048444: 89 44 24 04 mov %eax,0x4(%esp)
8048448: c7 04 24 d0 85 04 08 movl $0x80485d0,(%esp)
804844f: e8 f0 fe ff ff call 8048344 <printf@plt>
8048454: 8d 43 ff lea -0x1(%ebx),%eax
8048457: e8 d4 ff ff ff call 8048430 <g>
804845c: 83 c4 24 add $0x24,%esp
804845f: 5b pop %ebx
8048460: 5d pop %ebp
8048461: c3 ret
8048462: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
8048469: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
08048470 <h>:
8048470: 55 push %ebp
8048471: 89 e5 mov %esp,%ebp
8048473: 83 ec 18 sub $0x18,%esp
8048476: 66 90 xchg %ax,%ax
8048478: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
804847f: 00
8048480: c7 44 24 04 00 00 00 movl $0x0,0x4(%esp)
8048487: 00
8048488: c7 04 24 d8 85 04 08 movl $0x80485d8,(%esp)
804848f: e8 b0 fe ff ff call 8048344 <printf@plt>
8048494: eb e2 jmp 8048478 <h+0x8>
8048496: 8d 76 00 lea 0x0(%esi),%esi
8048499: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
080484a0 <f>:
80484a0: 55 push %ebp
80484a1: 89 e5 mov %esp,%ebp
80484a3: 53 push %ebx
80484a4: 89 c3 mov %eax,%ebx
80484a6: 83 ec 14 sub $0x14,%esp
80484a9: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
80484b0: 89 5c 24 04 mov %ebx,0x4(%esp)
80484b4: c7 04 24 e1 85 04 08 movl $0x80485e1,(%esp)
80484bb: e8 94 fe ff ff call 8048354 <__isoc99_scanf@plt>
80484c0: 8b 03 mov (%ebx),%eax
80484c2: e8 69 ff ff ff call 8048430 <g>
80484c7: 8b 03 mov (%ebx),%eax
80484c9: e8 a2 ff ff ff call 8048470 <h>
80484ce: eb e0 jmp 80484b0 <f+0x10>
We can see that g() and h() are mostly identical except for the & (address-of) operator on the argument m of printf() (and the irrelevant difference between %ld and %p).
However, h() is tail-call optimized and g() is not. Why?
In g(), you're taking the address of a local variable and passing it to a function. A "sufficiently smart compiler" should realize that printf does not store that pointer. Instead, gcc and llvm assume that printf might store the pointer somewhere, so the call frame containing m might need to be "live" further down in the recursion. Therefore, no TCO.
It's the & that does it. It tells the compiler that m should be stored on the stack. Even though it is passed to printf, the compiler has to assume that it might be accessed by somebody else and thus must be cleaned from the stack after the call to g.
In this particular case, as printf is known by the compiler (and it knows that it does not save pointers), it could probably be taught to perform this optimization.
For more info on this, look up 'escape analysis'.
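A small sketch of why an escaping address blocks the optimization. Here stash() is a hypothetical stand-in for a callee the compiler cannot see into: it keeps the first pointer it receives, and the innermost call later reads through it. All names are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical opaque callee: remembers the first pointer it is given. */
static const long *saved;

void stash(const long *p) {
    if (saved == NULL) {
        saved = p;
    }
}

long walk(long depth) {
    long m = depth;
    stash(&m);                 /* the address of m escapes here */
    if (depth <= 0) {
        return *saved;         /* reads m of the outermost call, which is still live */
    }
    return walk(depth - 1);    /* tail position, but the frame cannot be reused */
}

long walk_from(long depth) {
    saved = NULL;
    long result = walk(depth);
    saved = NULL;              /* do not keep a pointer to a dead frame */
    return result;
}
```

walk_from(3) has to return 3, the value stored in the outermost frame. If the recursive call reused the caller's frame, that m would already have been overwritten, so the compiler must keep every frame alive whenever it cannot prove the callee drops the pointer.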