Minimal 64-bit Windows executable crashes with tail-call optimization enabled by gcc - c

I'm trying to create a minimal 64-bit Windows executable to better understand how the Windows executable format works.
I wrote very basic assembly and C code as follows.
hi.s
section .text
hi:
db "hi", 0
global sayHi
align 16
sayHi:
lea rax, [rel hi]
ret
start.c
extern int puts();
extern const char *sayHi();
void start() {
puts(sayHi());
}
compiled with,
nasm -fwin64 hi.s
gcc -c -ostart.obj -O3 -fno-optimize-sibling-calls start.c
# I will explain the flag
and linked with,
golink /fo r.exe /console start.obj hi.obj msvcrt.dll
# create a console application `r.exe`
# the default entry point is `start`
The program runs fine and prints hi, but note the gcc flag -fno-optimize-sibling-calls. That flag disables tail-call optimizations so that the program always allocates stack space and calls a function. Without the flag, the program crashes.
This is the disassembled result without tail-call optimization. Not sure why gcc put a nop there, but otherwise it's very simple and runs fine.
0000000000401000 <.text>:
401000: 48 83 ec 28 sub rsp,0x28
401004: e8 27 00 00 00 call 0x401030 # sayHi
401009: 48 89 c1 mov rcx,rax
40100c: e8 ff 2f 00 00 call 0x404010 # puts
401011: 90 nop
401012: 48 83 c4 28 add rsp,0x28
401016: c3 ret
...
401020: 68 69 00 90 90 push 0xffffffff90900069 # "hi"
...
401030: 48 8d 05 e9 ff ff ff lea rax,[rip+0xffffffffffffffe9] # 0x401020
401037: c3 ret
This is when tail-call opt is enabled, in which the program crashes.
0000000000401000 <.text>:
401000: 48 83 ec 28 sub rsp,0x28
401004: e8 27 00 00 00 call 0x401030 # sayHi
401009: 48 89 c1 mov rcx,rax
40100c: 48 83 c4 28 add rsp,0x28
401010: e9 eb 2f 00 00 jmp 0x404000 # puts
...
401020: 68 69 00 90 90 push 0xffffffff90900069 # "hi"
...
401030: 48 8d 05 e9 ff ff ff lea rax,[rip+0xffffffffffffffe9] # 0x401020
401037: c3 ret
Now the program doesn't allocate stack space before puts and simply does a jmp instead of call.
I investigated further to see where exactly it jumps when calling puts.
In the no-tail-call case, the called address 0x404010 in the .idata section has the instruction jmp QWORD PTR [rip+0xffffffffffffffea] # 0x404000, and 0x404000 seems to contain the address to puts.
However in the tail-call case, the called address 0x404000 has 54 40 00 00 which is no meaningful instruction. The debugger says the program segfaults at 0x404003, so I'm pretty sure the program chokes trying to execute a garbage instruction.
I must be doing something wrong, but I'm not sure which, so could you explain why the tail-call case fails and how to get it work?

The problem was on golink not correctly handling tail-calls. I searched a while to make GNU ld link the program with the same options given to golink.
You can create a console-mode Windows executable by GNU ld with this command.
ld -o... --subsystem=console object-files...
--subsystem console or -subsystem=console also means the same. Use --subsystem=windows to create a GUI application.
GNU ld also handles Windows dll files, so in this case, simply giving ld a copy of msvcrt.dll from the system folder worked.

Related

Jump to a label from inline assembly to C

I have a written piece of code in assembly and at some points of it, I want to jump to a label in C. So I have the following code (shortened version but still, I am having the same problem):
#include <stdio.h>
#define JE asm volatile("jmp end");
int main(){
printf("hi\n");
JE
printf("Invisible\n");
end:
printf("Visible\n");
return 0;
}
This code compiles, but there is no end label in the disassembled version of the code.
If I change the label name from end to any other thing (let's say l1, both in asm code(jmp l1) and in the C code), the compiler says that
main.c:(.text+0x6b): undefined reference to `l1'
collect2: error: ld returned 1 exit status
Makefile:2: recipe for target 'main' failed
make: *** [main] Error 1
I have tried different things(different length, different cases, upper, lower, etc.) and I think it only compiles with end label. And with end label, I am receiving segmentation fault because, there is no end label in the disassembled version.
Compiled with: gcc -O0 main.c -o main
Disassembled code:
000000000000063a <main>:
63a: 55 push %rbp
63b: 48 89 e5 mov %rsp,%rbp
63e: 48 8d 3d af 00 00 00 lea 0xaf(%rip),%rdi # 6f4 <_IO_stdin_used+0x4>
645: e8 c6 fe ff ff callq 510 <puts#plt>
64a: e9 c9 09 20 00 jmpq 201018 <_end> # there is no _end label!
64f: 48 8d 3d a1 00 00 00 lea 0xa1(%rip),%rdi # 6f7 <_IO_stdin_used+0x7>
656: e8 b5 fe ff ff callq 510 <puts#plt>
65b: 48 8d 3d 9f 00 00 00 lea 0x9f(%rip),%rdi # 701 <_IO_stdin_used+0x11>
662: e8 a9 fe ff ff callq 510 <puts#plt>
667: b8 00 00 00 00 mov $0x0,%eax
66c: 5d pop %rbp
66d: c3 retq
66e: 66 90 xchg %ax,%ax
So, the questions are:
Am I doing something wrong? I have seen this kind of jumps (from
assembly to C) in codes. I can provide example links.
Why the compiler/linker cannot find l1 but can find end?
This is what asm goto is for. GCC Inline Assembly: Jump to label outside block
Note that defining a label inside another asm statement will sometimes work (e.g. with optimization disabled) but IS NOT SAFE.
asm("end:"); // BROKEN; NEVER USE
// except for toy experiments to look at compiler output
GNU C does not define the behaviour of jumping from one asm statement to another without asm goto. The compiler is allowed to assume that execution comes out the end of an asm statement and e.g. put a store after it.
The C end: label within a given function won't just have the asm symbol name of end or _end: - that wouldn't make sense because separate C functions are each allowed to have their own end: label. It could be something like main.end but it turns out GCC and clang just use their usual autonumbered labels like .L123.
Then how this code works: https://github.com/IAIK/transientfail/blob/master/pocs/spectre/PHT/sa_oop/main.c
It doesn't; the end label that asm volatile("je end"); references is in the .data section and happens to be defined by the compiler or linker to mark the end of that section.
asm volatile("je end") has no connection to the C label in that function.
I commented out some of the code in other functions to get it to compile without the "cacheutils.h" header but that didn't affect that part of the oop() function; see https://godbolt.org/z/jabYu3 for disassembly of the linked executable with JE_4k changed to JE_16 so it's not huge. It's disassembly of a linked executable so you can see the numeric address of je 6010f0 <_end> while the oop function itself starts at 4006e0 and ends at 400750. (So it doesn't contain the branch target).
If this happens to work for Spectre exploits, that's because apparently the branch is never actually taken.

Stripped binary shows "_cxa_finalize" instead of "libc_start_main"

Why stripped binary shows _cxa_finalize instead of libc_start_main?
I am trying to locate and disassemble main() in a very simple C program on Linux (Ubuntu). The binary is stripped. Below you can see disassembly (not stripped) vs disassembly (stripped) of the same instructions.
Question: what is _cxa_finalize in the stripped version? Why is libc_start_main is replaced by _cxa_finalize?
Not stripped:
106d: 48 8d 3d c1 00 00 00 lea rdi,[rip+0xc1] # 1135 <main>
1074: ff 15 66 2f 00 00 call QWORD PTR [rip+0x2f66] # 3fe0 <__libc_start_main#GLIBC_2.2.5>
Stripped:
106d: 48 8d 3d c1 00 00 00 lea rdi,[rip+0xc1] # 1135 <__cxa_finalize#plt+0xf5>
1074: ff 15 66 2f 00 00 call QWORD PTR [rip+0x2f66] # 3fe0 <__cxa_finalize#plt+0x2fa0>
It's not __cxa_finalize. It's __cxa_finalize#plt+0xf5 and __cxa_finalize#plt+0x2fa0 (notice the significant offsets). The disassembler has no information about the symbol main or __libc_start_main because you removed the symbol table, but for technical reasons it is still aware of the symbols assocated with PLT thunks (because they're needed for binding at dynamic linking time, and the disassembler probably falls back to using that information when it lacks s symbol table). In general, the disassembler works backward from an address until it finds an address named by a symbol, and assumes (wrongly, here) that the address being disassembled is part of that function.

Does the main function of a C program ever reclaim the stack?

I am working through the OverTheWire wargames and one of my exploits overwrites the return address of main with the address of system. I have then used the fact that at the point main returns, esp is still pointing at one of my local variables and hence I can fill it with the command I want system to run (e.g. sh;#).
My confusion comes from that I thought functions in C reclaim the stack before returning and hence at the point the return address is called the stack pointer would be pointing at the return address rather than at the local variables. However, my exploit works so it seems that my stack pointer is pointing at the local variables when the return address is called.
The main thing I have noticed about this particular challenge compared to others is that it calls exit(0) at the end, instead of just ending, so the assembly doesn't end with leave, which may be the reason for this behaviour.
I haven't included the actual code since it's quite long and I was hoping there was a general explanation for what I am seeing, but please let me know if the assembly would be useful.
#include <stdio.h>
int main ( void )
{
printf("hello\n");
return(0);
}
the interesting relevant parts.
0000000000400430 <main>:
400430: 48 83 ec 08 sub $0x8,%rsp
400434: bf d4 05 40 00 mov $0x4005d4,%edi
400439: e8 c2 ff ff ff callq 400400 <puts#plt>
40043e: 31 c0 xor %eax,%eax
400440: 48 83 c4 08 add $0x8,%rsp
400444: c3 retq
400445: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40044c: 00 00 00
40044f: 90 nop
0000000000400450 <_start>:
400450: 31 ed xor %ebp,%ebp
400452: 49 89 d1 mov %rdx,%r9
400455: 5e pop %rsi
400456: 48 89 e2 mov %rsp,%rdx
400459: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
40045d: 50 push %rax
40045e: 54 push %rsp
40045f: 49 c7 c0 c0 05 40 00 mov $0x4005c0,%r8
400466: 48 c7 c1 50 05 40 00 mov $0x400550,%rcx
40046d: 48 c7 c7 30 04 40 00 mov $0x400430,%rdi
400474: e8 97 ff ff ff callq 400410 <__libc_start_main#plt>
400479: f4 hlt
40047a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
For the most part there is nothing special about main nor printf, etc these are just functions that conform to the calling convention. As re-asked SO questions will show sometimes the compiler will add extra stack or other calls when it sees a main() that it doesnt otherwise. but still it is a function that needs to conform to the calling convention. As seen in this case where the stack pointer is put back where it was found.
Before an operating system (Linux, Windows, MacOS, etc) can even think about running a program it needs to allocate some space for that program and tag that memory for that program in some way depending on the features of the processor and the OS, etc. Then you load the program from whatever media, and launch it at the binary file specified and/or well known entry point. A clean exit of the program will cause the operating system to free that memory, which the .text, .data, .bss and stack are the trivial/obvious ones that just go away as their memory just goes away. Other items that may have been allocated and associated with this program, open files, runtime allocated (not stack) memory, etc can/should also be freed, depends on the design of the os and/or the C library as to how that happens.
In the above case we see the bootstrap calls main and main returns then hlt is hit which this is an application not kernel code so that should cause a trap that causes the OS to clean up. An explicit exit() should be no different than a printf() or puts() or fopen() or any other function that ultimately makes one or more syscalls to the operating system. All that you can possibly find for these types of operating systems (Linux, Windows, MacOS) is the syscall. The release of memory happens outside the program as the program does not have control over it, would be a chicken and egg problem, program frees mmu tables that is using to free mmu tables...
compile and disassemble the object for main rather than the whole program
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 callq e <main+0xe>
e: 31 c0 xor %eax,%eax
10: 48 83 c4 08 add $0x8,%rsp
14: c3 retq
no surprise there same as before, all we needed to see to understand that the stack was cleaned up before return. and that main is not special:
#include <stdio.h>
int notmain ( void )
{
printf("hello\n");
return(0);
}
0000000000000000 <notmain>:
0: 48 83 ec 08 sub $0x8,%rsp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 callq e <notmain+0xe>
e: 31 c0 xor %eax,%eax
10: 48 83 c4 08 add $0x8,%rsp
14: c3 retq
Now if you are asking if there is an exit() within main then sure it wont hit the return point in main so the stack pointer is offset by whatever amount. but if main calls some function and that function calls some function then that function calls exit() then the stack pointer is left at the stack frame point of function number two plus whatever the call (this is an x86) plus the exit() stack frame adds to it. You cannot simply assume that when exit() is called, if it is called, what the stack pointer is pointing at. You would have to examine the disassembly around that call to exit() plus the exit() code and anything it calls, to figure this out.

Generating simple shell binary code to be copied to the stack for stack overflow

I am trying to implement the buffer overflow attack, but I need to generate instruction code in binary so I can put it on the stack to be executed. My problem is that the instructions that I am getting now have jumps to different parts of a program which becomes hard to put on the stack. So I have this simple piece of code (not the code to be exploited) that I want to put on the stack to spawn a new shell.
#include <stdio.h>
int main( ) {
char *buf[2];
buf[0] = "/bin/bash";
buf[1] = NULL;
execve(buf[0], buf, NULL);
}
The code is being compiled with gcc with the following flags:
CFLAGS = -Wall -Wextra -g -fno-stack-protector -m32 -z execstack
LDFLAGS = -fno-stack-protector -m32 -z execstack
Finally using objdump -d -S, I get the following code (parts of it) in hex:
....
....
08048320 <execve#plt>:
8048320: ff 25 08 a0 04 08 jmp *0x804a008
8048326: 68 10 00 00 00 push $0x10
804832b: e9 c0 ff ff ff jmp 80482f0 <_init+0x3c>
....
....
int main( ) {
80483e4: 55 push %ebp
80483e5: 89 e5 mov %esp,%ebp
80483e7: 83 e4 f0 and $0xfffffff0,%esp
80483ea: 83 ec 20 sub $0x20,%esp
char *buf[2];
buf[0] = "/bin/bash";
80483ed: c7 44 24 18 f0 84 04 movl $0x80484f0,0x18(%esp)
80483f4: 08
buf[1] = NULL;
80483f5: c7 44 24 1c 00 00 00 movl $0x0,0x1c(%esp)
80483fc: 00
execve(buf[0], buf, NULL);
80483fd: 8b 44 24 18 mov 0x18(%esp),%eax
8048401: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
8048408: 00
8048409: 8d 54 24 18 lea 0x18(%esp),%edx
804840d: 89 54 24 04 mov %edx,0x4(%esp)
8048411: 89 04 24 mov %eax,(%esp)
8048414: e8 07 ff ff ff call 8048320 <execve#plt>
}
8048419: c9 leave
804841a: c3 ret
804841b: 90 nop
As you can see this code is hard to copy onto the stack. execve jumps to a different part of the assembly code to be executed. Is there a way to nicely get a program which can be put compactly on the stack without too much space and branches being used?
If you want a clean assembly code without coding it yourself, do the following:
Move your code to a separate function
Pass the -O0 compilation flag to gcc in order to prevent optimizations
Compile just an object file - use gcc [input file] -o [output file]
Follow these steps, and use the assembly code generated from objdump.
Please keep in mind, you have an external dependency for execve:
8048414: e8 07 ff ff ff call 8048320 <execve#plt>
so you must either include its implementation explicitly and remove the call, or know in advance that the process you want to attack has this function in its address space, and modify the call address to match the process execve address.
Welcome to Stack Overflow (pun intended).
This is an unusual request. Are you trying to get hired by the NSA?
Compilers don't structure assembly code in a very human friendly form, obviously. And their idea of optimization might be for performance rather than for compactness. Therefore, you might consider hand-coding it in assembler, using the compiler output as a guideline to achieve the effect you're going for.
What you have there may not be the best representation of what the compiler can do to inform your investigation in any case. Put the code into a function other than main, so you can just get the minimal stack setup and argument handling necessary, and try compiling it with all the different optimization levels to see what it does.
You're probably getting some extra setup overhead for putting it in main(), because it's the main entry point to your program and has to interface with libc and the OS (just guessing), and may be setting things up for the program's operational context in general (which you could presume would have been done for whichever executable you were inserting the code into so it would be redundant).

Absolute Jumps Within Shared Object Code Unix

I have a question regarding the handling and interpretation of shared libraries.
Suppose, I build a shared object from foo.c using the command:
gcc -shared -fPIC -o libfoo.so foo.c
where foo.c consists of:
#include <stdlib.h>
#include <stdio.h>
int main(void) {
int i;
printf("this is a silly test\n");
if(i)
goto ret;
printf("hello world\n");
ret:
return 0;
}
Now, let's look at the objdump output, specifically that of foo's main:
0000000005ec <main>:
5ec: 55 push %rbp
5ed: 48 89 e5 mov %rsp,%rbp
5f0: 48 83 ec 10 sub $0x10,%rsp
5f4: 48 8d 3d 6b 00 00 00 lea 0x6b(%rip),%rdi # 666 <_fini+0xe>
5fb: e8 00 ff ff ff callq 500 <puts#plt>
600: 83 7d fc 00 cmpl $0x0,-0x4(%rbp)
604: 75 0e jne 614 <main+0x28>
606: 48 8d 3d 65 00 00 00 lea 0x65(%rip),%rdi # 672 <_fini+0x1a>
60d: e8 ee fe ff ff callq 500 <puts#plt>
612: eb 01 jmp 615 <main+0x29>
614: 90 nop
615: b8 00 00 00 00 mov $0x0,%eax
61a: c9 leaveq
61b: c3 retq
61c: 90 nop
61d: 90 nop
61e: 90 nop
61f: 90 nop
I can clearly see that calls to puts are being redirected to the PLT, as expected. However, what I don't understand are the instructions at 604 and 612. They are not relative to the IP, nor a call to the PLT. They use an absolute address, based on the
symbol main.
How could this shared library then possibly be used simultaneously betwen several processes? It could (and should) be loaded at different virtual addresses, but the point is that each process should share the implementation stored in RAM. How can different processes with main loaded at different virtual addresses share the instructions at 604 and 612?
They're not absolute jumps, they're PC relative jumps. In fact ALL direct jump and call instructions are PC-relative on x86 -- there are NO absolute direct jumps (so if you want an absolute jump, it has to be indirect).
The reason the callq instruction use the PLT is because the target symbol might be in a different shared object, so even a relative branch won't work (the other shared object might be loaded at any address, independently of this shared object). The PLT itself is actually a small piece of code within the shared object, with a single (indirect) absolute jump for each remote symbol. When the shared objects a dynamically loaded, the absolute address is set appropriately, so when the code runs, the callq instruction will be a pc-relative branch (call) to the PLT which consists of a single indirect jump to the puts routine.

Resources