Attempting to make my own loader, but cannot implement the data section

Attempting to make my own loader, but cannot implement the data section - c

I am trying to implement my own binary loader for learning purposes, but cannot figure out the data segment.
section .data
helloworld db "hello world", 10
section .text
global _start
test: ;just for testing
ret
_start:
call test
mov rax, 1
mov rbx, 1
mov rcx, helloworld
mov rdx, 11
syscall
mov rax, 60
mov rdi, 0
syscall
This is my assembly program that I am trying to run. I compiled with nasm -f elf64 test.s -o test.o && ld test.o -o test.bin
My loader looks like this:
int main(int argc, char** argv) {
char* bin = argv[1];
struct ElfLib lib = read_elf(bin); //just reading the elf library into the default structures (Elf64_Ehdr, Elf64_Phdr, etc...)
unsigned char* exec = mmap(NULL, DEFAULT_MEM_SIZ, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); //allocating the virtual memory
memset(exec, 0, DEFAULT_MEM_SIZ);
for (int i = 0; i < lib.elf_header.e_phnum; i++) {
Elf64_Phdr phdr = lib.program_headers[i];
fseek(lib.execfile, phdr.p_offset, SEEK_SET);
switch (phdr.p_type) {
case PT_LOAD: {
//load the memory at the file offset into the virtual address of exec
fread(exec + phdr.p_vaddr, sizeof(unsigned char), phdr.p_memsz, lib.execfile);
break;
}
}
int flags = PROT_NONE;
#define HASFLAG(flag) if (phdr.p_flags & flag) flags|=flag
HASFLAG(PROT_EXEC); //execute flag on
HASFLAG(PROT_WRITE); //write flag on
HASFLAG(PROT_READ); //read flag on
mprotect(exec + phdr.p_vaddr, phdr.p_memsz, flags);
}
void (*ex)() = (void*)(exec + lib.elf_header.e_entry);
ex(); //call the _start function in the virtual memory
}
But when I run it, nothing gets printed.
I tried running it under GDB, and the program promptly exits after the exit syscall, with mov rax, 60 and mov rdi, 0, so I know the system call part works. I think that the issue is in the address of helloworld in the hello world program. GDB says that it is still under address 0x402000, which probably is not the same address under the virtual memory. Surprisingly, the test function is at 0x401000 with objdump, but at a completely different one when running with GDB, which does get called. Does anyone have an idea on how to go about implementing this?
I'm not sure how much this will help, but I'm running using x64 Linux under intel.

nasm -f elf64 test.s -o test.o
ld test.o -o test.bin
Unfortunately, I don't have NASM, but if I use GNU assembler instead of NASM, the lines above result in a position-dependent file.
This means that phdr.p_vaddr does not specify a value that is relative to the variable exec, but phdr.p_vaddr specifies an absolute address that must not be changed.
Assuming the symbol helloworld is located at the start of the data segment, the instruction mov rcx, helloworld will simply load the value phdr.p_vaddr into the register rcx - and not the value exec + phdr.p_vaddr.
However, because the address phdr.p_vaddr may already be used, you cannot simply load your code there!
The only possibility that you have if you want to load code from an already running program is so-called "position independent code" that can be loaded at different addresses in memory...
By the way:
64-bit x86 Linux does not take the parameters in rbx, rcx and rdx, but in rdi, rsi and rdx.

Related

Difference between running an assembly program and running the disassembled code in shellcode.c

I am currently working on 'Pentester Academy's x86_64 Assembly Language and Shellcoding on Linux' course (www.pentesteracademy.com/course?id=7). I have one simple question that I can't quite figure out: what is the exact difference between running an assembly program that has been assembled and linked with NASM and ld vs. running the same disassembled program in the classic shellcode.c program (written below). Why use one method over the other?
As an example, when following the first method, I use the commands :
nasm -f elf64 -o execve_stack.o execve_stack.asm
ld -o execve_stack execve_stack.o
./execve_stack
When using the second method, I insert the disassembled shellcode in the shellcode.c program:
#include <stdio.h>
#include <string.h>
unsigned char code[] = \
"\x48\x31\xc0\x50\x48\x89\xe2\x48\xbb\x2f\x62\x69\x6e\x2f\x2f\x73\x68\x53\x48\x89\xe7\x50\x57\x48\x89\xe6\xb0\x3b\x0f\x05";
int main(void) {
printf("Shellcode length: %d\n", (int)strlen(code));
int (*ret)() = (int(*)())code;
ret();
return 0;
}
... and use the commands:
gcc -fno-stack-protector -z execstack -o shellcode shellcode.c
./shellcode
I have analyzed both programs in GDB and found that addresses stored in certain registers differ. I have also read the answer to the following question (C code explanation), which helped me understand the way the shellcode.c program works. Having said that, I still don't fully understand the exact way in which these two methods differ.

There is no theoretical difference between the two methods. In both you end up executing a bunch of assembly instructions on the processor.
The shellcode.c program is there to just demonstrate what would happen if you run the assembly defined as an array of bytes in the unsigned char code[] variable.
Why use one method over the other?
I think you don't understand the purpose of shellcodes and the reasoning behind the shellcode.c program (why it shows what happens when an arbitrary sequence of bytes you have control on is executed on the processor).
A shellcode is a small piece of assembly code that is used to exploit a software vulnerability. An attacker usually injects a shellcode into software by taking advantage of common programming errors such as buffer overflows and then tries to make the software execute that injected shellcode.
A good article showing a step-by-step tutorial on how to generate a shell by performing shellcode injection using buffer overflows can be found here.
Here is how a classic shellcode \x83\xec\x48\x31\xc0\x31\xd2\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53\x89\xe1\xb0\x0b\xcd\x80 looks like in assembler:
sub esp, 72
xor eax, eax
xor edx, edx
push eax
push 0x68732f2f ; "hs//" (/ is doubled because you need to push 4 bytes on the stack)
push 0x6e69622f ; "nib/"
mov ebx, esp ; EBX = address of string "/bin//sh"
push eax
push ebx
mov ecx, esp
mov al, 0xb ; EAX = 11 (which is the ID of the sys_execve Linux system call)
int 0x80
In an x86 environment, this does an execve system call with the "/bin/sh" string as parameter.

How to change interpreter path and pass command line arguments to an "executable" shared library on Linux?

Here is a minimal example for an "executable" shared library (assumed file name: mini.c):
// Interpreter path is different on some systems
//+definitely different for 32-Bit machines
const char my_interp[] __attribute__((section(".interp")))
= "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#include <stdio.h>
#include <stdlib.h>
int entry() {
printf("WooFoo!\n");
exit (0);
}
If one compiles it with e.g.: gcc -fPIC -o mini.so -shared -Wl,-e,entry mini.c. "Running" the resulting .so will look like this:
confus#confusion:~$ ./mini.so
WooFoo!
My question is now:
How do I have to change the above program to pass command line arguments to a call of the .so-file? An example shell session after the change might e.g. look like this:
confus#confusion:~$ ./mini.so 2 bar
1: WooFoo! bar!
2: WooFoo! bar!
confus#confusion:~$ ./mini.so 3 bla
1: WooFoo! bla!
2: WooFoo! bla!
3: WooFoo! bla!
5: WooFoo! Bar!
It would also be nice to detect on compile time, wheter the target is a 32-Bit or 64-Bit binary to change the interpreter string accordingly. Otherwise one gets a "Accessing a corrupted shared library" warning. Something like:
#ifdef SIXTY_FOUR_BIT
const char my_interp[] __attribute__((section(".interp"))) = "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#else
const char my_interp[] __attribute__((section(".interp"))) = "/lib/ld-linux.so.2";
#endif
Or even better, to detect the appropriate path fully automatically to ensure it is right for the system the library is compiled on.

How do I have to change the above program to pass command line arguments to a call of the .so-file?
When you run your shared library, argc and argv will be passed to your entry function on the stack.
The problem is that the calling convention used when you compile your shared library on x86_64 linux is going to be that of the System V AMD64 ABI, which doesn't take arguments on the stack but in registers.
You'll need some ASM glue code that fetches argument from the stack and puts them into the right registers.
Here's a simple .asm file you can save as entry.asm and just link with:
global _entry
extern entry, _GLOBAL_OFFSET_TABLE_
section .text
BITS 64
_entry:
mov rdi, [rsp]
mov rsi, rsp
add rsi, 8
call .getGOT
.getGOT:
pop rbx
add rbx,_GLOBAL_OFFSET_TABLE_+$$-.getGOT wrt ..gotpc
jmp entry wrt ..plt
That code copies the arguments from the stack into the appropriate registers, and then calls your entry function in a position-independent way.
You can then just write your entry as if it was a regular main function:
// Interpreter path is different on some systems
//+definitely different for 32-Bit machines
const char my_interp[] __attribute__((section(".interp")))
= "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#include <stdio.h>
#include <stdlib.h>
int entry(int argc, char* argv[]) {
printf("WooFoo! Got %d args!\n", argc);
exit (0);
}
And this is how you would then compile your library:
nasm entry.asm -f elf64
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry.o
The advantage is that you won't have inline asm statements mixed with your C code, instead your real entry point is cleanly abstracted away in a start file.
It would also be nice to detect on compile time, wheter the target is a 32-Bit or 64-Bit binary to change the interpreter string accordingly.
Unfortunately, there's no completely clean, reliable way to do that. The best you can do is rely on your preferred compiler having the right defines.
Since you use GCC you can write your C code like this:
#if defined(__x86_64__)
const char my_interp[] __attribute__((section(".interp")))
= "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#elif defined(__i386__)
const char my_interp[] __attribute__((section(".interp")))
= "/lib/ld-linux.so.2";
#else
#error Architecture or compiler not supported
#endif
#include <stdio.h>
#include <stdlib.h>
int entry(int argc, char* argv[]) {
printf("%d: WooFoo!\n", argc);
exit (0);
}
And have two different start files.
One for 64bit:
global _entry
extern entry, _GLOBAL_OFFSET_TABLE_
section .text
BITS 64
_entry:
mov rdi, [rsp]
mov rsi, rsp
add rsi, 8
call .getGOT
.getGOT:
pop rbx
add rbx,_GLOBAL_OFFSET_TABLE_+$$-.getGOT wrt ..gotpc
jmp entry wrt ..plt
And one for 32bit:
global _entry
extern entry, _GLOBAL_OFFSET_TABLE_
section .text
BITS 32
_entry:
mov edi, [esp]
mov esi, esp
add esi, 4
call .getGOT
.getGOT:
pop ebx
add ebx,_GLOBAL_OFFSET_TABLE_+$$-.getGOT wrt ..gotpc
push edi
push esi
jmp entry wrt ..plt
Which means you now have two slightly different ways to compile your library for each target.
For 64bit:
nasm entry.asm -f elf64
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry.o -m64
And for 32bit:
nasm entry32.asm -f elf32
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry32.o -m32
So to sum it up you now have two start files entry.asm and entry32.asm, a set of defines in your mini.c that picks the right interpreter automatically, and two slightly different ways of compiling your library depending on the target.
So if we really want to go all the way, all that's left is to create a Makefile that detects the right target and builds your library accordingly.Let's do just that:
ARCH := $(shell getconf LONG_BIT)
all: build_$(ARCH)
build_32:
nasm entry32.asm -f elf32
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry32.o -m32
build_64:
nasm entry.asm -f elf64
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry.o -m64
And we're done here. Just run make to build your library and let the magic happen.

Add
int argc;
char **argv;
asm("mov 8(%%rbp), %0" : "=&r" (argc));
asm("mov %%rbp, %0\n"
"add $16, %0" : "=&r" (argv));
to the top of your entry function. On x86_64 platforms, this will give you access to the arguments.
The LNW article that John Bollinger linked to in the comments explains why this code works. It might interest you why this is not required when you write a normal C program, or rather, why it does not suffice do just give your entry function the two usual int argc, char **argv arguments: The entry point for a C program normally is not the main function, but instead an assembler function by glibc that does some preparations for you - among others fetch the arguments from the stack - and that eventually (via some intermediate functions) calls your main function. Note that this also means that you might experience other problems, since you skip this initialization! For some history, the cdecl wikipedia page, especially on the difference between x86 and x86_64, might be of further interest.

Passing parameters when compile C and nasm

When compiling C and nasm on Mac OS X, I found it is different to Linux when passing parameters and making system call. My code works but I'm really confused with it.
I write a function myprint in nasm to print string passed from C.
Here is the C code main.c
#include <stdio.h>
void myprint(char* msg, int len);
int main(void){
myprint("hello\n",6);
return 0;
}
Here is the nasm code myprint.asm
section .text
global _myprint
_syscall:
int 0x80
ret
_myprint:
push dword [esp+8] ;after push, esp-4
push dword [esp+8]
push dword 1
mov eax,4
call _syscall
add esp,12
ret
Compile and link them:
nasm -f macho -o myprint.o myprint.asm
gcc -m32 -o main main.c myprint.o
it prints "hello" correctly.
As you can see, OS X(FreeBSD) use push to pass parameters to sys call, but the parameters char* and int are already pushed into the stack and their addresses are esp+4 and esp+8. However, I have to read them from the stack and push them into stack again to make it work.
If I delete
push dword [esp+8] ;after push, esp-4
push dword [esp+8]
it will print lots of error codes and Bus error: 10, like this:
???]?̀?j????????
?hello
`44?4
__mh_execute_headerm"ain1yprint6???;??
<??
(
libSystem.B?
`%?. _syscall__mh_execute_header_main_myprintdyld_stub_binder ??z0&?z?&?z?&?z۽??۽????N?R?N?o?N???N??N?e?N?h?N?0?zR?N???N???t??N?N???N?????#?`#b?`?`?a#c aaU??]N?zBus error: 10
Why it needs to push the parameters into stack again? How can I pass these parameters already in stack to syscall without pushing again?

This is normal for even plain C function calls. The problem is that the return address is still on the stack before the arguments. If you don't push them again below the return address you will have 2 return addresses on the stack before the arguments (the first to main and the second to myprint) but the syscall only expects 1.

Running linked C object files(compiled from C and x86-64 assembly) on Cygwin shell, no output?

Trying to deal with my assignment of Assembly Language...
There are two files, hello.c and world.asm, the professor ask us to compile the two file using gcc and nasm and link the object code together.
I can do it under 64 bit ubuntu 12.10 well, with native gcc and nasm.
But when I try same thing on 64 bit Win8 via cygwin 1.7 (first I try to use gcc but somehow the -m64 option doesn't work, and since the professor ask us to generate the code in 64-bit, I googled and found a package called mingw-w64 which has a compiler x86_64-w64-mingw32-gcc that I can use -m64 with), I can get the files compiled to mainhello.o and world.o and link them to a main.out file, but somehow when I type " ./main.out" and wait for the "Hello world", nothing happens, no output no error message.
New user thus can't post image, sorry about that, here is the screenshot of what happens in the Cygwin shell:
I'm just a newbie to everything, I know I can do the assignment under ubuntu, but I'm just being curious about what's going on here?
Thank you guys
hello.c
//Purpose: Demonstrate outputting integer data using the format specifiers of C.
//
//Compile this source file: gcc -c -Wall -m64 -o mainhello.o hello.c
//Link this object file with all other object files:
//gcc -m64 -o main.out mainhello.o world.o
//Execute in 64-bit protected mode: ./main.out
//
#include <stdio.h>
#include <stdint.h> //For C99 compatability
extern unsigned long int sayhello();
int main(int argc, char* argv[])
{unsigned long int result = -999;
printf("%s\n\n","The main C program will now call the X86-64 subprogram.");
result = sayhello();
printf("%s\n","The subprogram has returned control to main.");
printf("%s%lu\n","The return code is ",result);
printf("%s\n","Bye");
return result;
}
world.asm
;Purpose: Output the famous Hello World message.
;Assemble: nasm -f elf64 -l world.lis -o world.o world.asm
;===== Begin code area
extern printf ;This function will be linked into the executable by the linker
global sayhello
segment .data ;Place initialized data in this segment
welcome db "Hello World", 10, 0
specifierforstringdata db "%s", 10,
segment .bss
segment .text
sayhello:
;Output the famous message
mov qword rax, 0
mov rdi, specifierforstringdata
mov rsi, welcome
call printf
;Prepare to exit from this function
mov qword rax, 0
ret;
;===== End of function sayhello

;Purpose: Output the famous Hello World message.
;Assemble: nasm -f win64 -o world.o world.asm
;===== Begin code area
extern _printf ;This function will be linked into the executable by the linker
global _sayhello
segment .data ;Place initialized data in this segment
welcome db "Hello World", 0
specifierforstringdata db "%s", 10, 0
segment .text
_sayhello:
;Output the famous message
sub rsp, 40 ; shadow space and stack alignment
mov rcx, specifierforstringdata
mov rdx, welcome
call _printf
add rsp, 40 ; clean up stack
;Prepare to exit from this function
mov qword rax, 0
ret
;===== End of function sayhello
When linked together with the C wrapper in the question and run in plain cmd window it works fine:
Depending on your toolchain you might need to remove the leading underscores from symbols.

How to get c code to execute hex machine code?

I want a simple C method to be able to run hex bytecode on a Linux 64 bit machine. Here's the C program that I have:
char code[] = "\x48\x31\xc0";
#include <stdio.h>
int main(int argc, char **argv)
{
int (*func) ();
func = (int (*)()) code;
(int)(*func)();
printf("%s\n","DONE");
}
The code that I am trying to run ("\x48\x31\xc0") I obtained by writting this simple assembly program (it's not supposed to really do anything)
.text
.globl _start
_start:
xorq %rax, %rax
and then compiling and objdump-ing it to obtain the bytecode.
However, when I run my C program I get a segmentation fault. Any ideas?

Machine code has to be in an executable page. Your char code[] is in the read+write data section, without exec permission, so the code cannot be executed from there.
Here is a simple example of allocating an executable page with mmap:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
int (*sum) (int, int) = NULL;
// allocate executable buffer
sum = mmap (0, sizeof(code), PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// copy code to buffer
memcpy (sum, code, sizeof(code));
// doesn't actually flush cache on x86, but ensure memcpy isn't
// optimized away as a dead store.
__builtin___clear_cache (sum, sum + sizeof(sum)); // GNU C
// run code
int a = 2;
int b = 3;
int c = sum (a, b);
printf ("%d + %d = %d\n", a, b, c);
}
See another answer on this question for details about __builtin___clear_cache.

Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack - that would make all pages executable, including read-only data (.rodata), and read-write data (.data) where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays. i.e. move char code[] = ... into main.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC process for that program. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But probably best to use -z execstack and a local array.
Two problems in the question:
exec permission on the page, because you used an array that will go in the noexec read+write .data section.
your machine code doesn't end with a ret instruction so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is totally redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
You need the page containing the machine code to have execute permission. x86-64 page tables have a separate bit for execute separate from read permission, unlike legacy 386 page tables.
The easiest way to get static arrays to be in read+exec memory was to compile with gcc -z execstack. (Used to make the stack and other sections executable, now only the stack).
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
// TODO: use char code[] = {...} inside main, with -z execstack, for current Linux
// Broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (Works because of const on global/static arrays)
On Linux before 5.5 (or so) gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code
(I actually used gcc -no-pie -O3 -zexecstack shellcode.c hence the addresses near 401000
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can instead use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable. (Including pages holding static data.) You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
On Windows you can use VirtualAlloc or VirtualProtect.
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on #AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code. (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc.)
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC does dead-store elimination into just the 2nd store after the call. This results in no store before the first call so it executes the 00 00 add [rax], al bytes, segfaulting.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables.
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to. (Escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & -4095ULL; // round down
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3;, I would need __builtin___clear_cache(ret0_code+2, ret0_code+2) after that to make sure the store wasn't optimized away, but if I don't modify the static arrays then it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).

You need to include the assembly in-line via a special compiler directive so that it'll properly end up in a code segment. See this guide, for example: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

Your machine code may be all right, but your CPU objects.
Modern CPUs manage memory in segments. In normal operation, the operating system loads a new program into a program-text segment and sets up a stack in a data segment. The operating system tells the CPU never to run code in a data segment. Your code is in code[], in a data segment. Thus the segfault.

This will take some effort.
Your code variable is stored in the .data section of your executable:
$ readelf -p .data exploit
String dump of section '.data':
[ 10] H1À
H1À is the value of your variable.
The .data section is not executable:
$ readelf -S exploit
There are 30 section headers, starting at offset 0x1150:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[24] .data PROGBITS 0000000000601010 00001010
0000000000000014 0000000000000000 WA 0 0 8
All 64-bit processors I'm familiar with support non-executable pages natively in the pagetables. Most newer 32-bit processors (the ones that support PAE) provide enough extra space in their pagetables for the operating system to emulate hardware non-executable pages. You'll need to run either an ancient OS or an ancient processor to get a .data section marked executable.
Because these are just flags in the executable, you ought to be able to set the X flag through some other mechanism, but I don't know how to do so. And your OS might not even let you have pages that are both writable and executable.

You may need to set the page executable before you may call it.
On MS-Windows, see the VirtualProtect -function.
URL: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx

Sorry, I couldn't follow above examples which are complicated.
So, I created an elegant solution for executing hex code from C.
Basically, you could use asm and .word keywords to place your instructions in hex format.
See below example:
asm volatile(".rept 1024\n"
CNOP
".endr\n");
where CNOP is defined as below:
#define ".word 0x00010001 \n"
Basically, c.nop instruction was not supported by my current assembler. So, I defined CNOP as the hex equivalent of c.nop with proper syntax and used inside asm, with which I was aware of.
.rept <NUM> .endr will basically, repeat the instruction NUM times.
This solution is working and verified.