Function body on heap - c

A program has three sections: text, data and stack. The function body lives in the text section. Can we let a function body live on heap? Because we can manipulate memory on heap more freely, we may gain more freedom to manipulate functions.
In the following C code, I copy the text of hello function onto heap and then point a function pointer to it. The program compiles fine by gcc but gives "Segmentation fault" when running.
Could you tell me why?
If my program can not be repaired, could you provide a way to let a function live on heap?
Thanks!
Turing.robot
#include "stdio.h"
#include "stdlib.h"
#include "string.h"
void
hello()
{
printf( "Hello World!\n");
}
int main(void)
{
void (*fp)();
int size = 10000; // large enough to contain hello()
char* buffer;
buffer = (char*) malloc ( size );
memcpy( buffer,(char*)hello,size );
fp = buffer;
fp();
free (buffer);
return 0;
}

My examples below are for Linux x86_64 with gcc, but similar considerations should apply on other systems.
Can we let a function body live on heap?
Yes, absolutely we can. But usually that is called JIT (Just-in-time) compilation. See this for basic idea.
Because we can manipulate memory on heap more freely, we may gain more freedom to manipulate functions.
Exactly, that's why higher level languages like JavaScript have JIT compilers.
In the following C code, I copy the text of hello function onto heap and then point a function pointer to it. The program compiles fine by gcc but gives "Segmentation fault" when running.
Actually you have multiple "Segmentation fault"s in that code.
The first one comes from this line:
int size = 10000; // large enough to contain hello()
If you look at the x86_64 machine code that gcc generates for your
hello function, it compiles down to a mere 17 bytes:
0000000000400626 <hello>:
400626: 55 push %rbp
400627: 48 89 e5 mov %rsp,%rbp
40062a: bf 98 07 40 00 mov $0x400798,%edi
40062f: e8 9c fe ff ff call 4004d0 <puts@plt>
400634: 90 nop
400635: 5d pop %rbp
400636: c3 retq
So, when you try to copy 10,000 bytes, you read far past the end of the
function, run into memory that does not exist, and get a "Segmentation fault".
Secondly, you allocate memory with malloc, which gives you a slice of
memory that is marked non-executable on Linux x86_64, so jumping into it
would give you another "Segmentation fault".
Under the hood malloc uses system calls like brk, sbrk, and mmap to allocate memory. What you need to do is allocate executable memory using the mmap system call with PROT_EXEC protection.
Thirdly, when gcc compiles your hello function, you don't really know what optimisations it will use and what the resulting machine code looks like.
For example, look at line 4 of the compiled hello function:
40062f: e8 9c fe ff ff call 4004d0 <puts@plt>
gcc optimised it to use the puts function instead of printf, but that is
not even the main problem.
On x86 architectures you normally call functions with the call assembly
mnemonic. However, call is not a single instruction: there are several different machine encodings it can assemble to; see the Intel manual, Vol. 2A 3-123, for reference.
In your case the compiler has chosen relative addressing for the call instruction.
You can see that because your call instruction has the e8 opcode:
E8 - Call near, relative, displacement relative to next instruction. 32-bit displacement sign extended to 64-bits in 64-bit mode.
This means the instruction pointer jumps by the given number of bytes, counted from the address of the next instruction.
Now, when you relocate your code to the heap with memcpy, you copy that relative call unchanged. It will now jump relative to wherever the copy sits in the heap, so the target is no longer puts. That memory will most likely not exist, and you will get another "Segmentation fault".
If my program can not be repaired, could you provide a way to let a function live on heap? Thanks!
Below is working code. Here is what I do:
- Execute printf once to make sure gcc includes it in our binary.
- Copy the correct number of bytes to the heap, so as not to access memory that does not exist. (This relies on gcc placing heap_function_end directly after heap_function, which is not guaranteed.)
- Allocate executable memory with mmap and the PROT_EXEC option.
- Pass the printf function as an argument to our heap_function, to make sure that gcc calls it through an absolute address instead of a relative call.
Here is the code:
#include "stdio.h"
#include "string.h"
#include <stdint.h>
#include <sys/mman.h>
typedef int (*printf_t)(char* format, char* string);
typedef int (*heap_function_t)(printf_t myprintf, char* str, int a, int b);
int heap_function(printf_t myprintf, char* str, int a, int b) {
myprintf("%s", str);
return a + b;
}
int heap_function_end() {
return 0;
}
int main(void) {
// By printing something here, `gcc` will include `printf`
// function at some address (`0x4004d0` in my case) in our binary,
// with `printf_t` two argument signature.
printf("%s", "Just including printf in binary\n");
// Allocate the correct size of
// executable `PROT_EXEC` memory.
size_t size = (size_t) ((intptr_t) heap_function_end - (intptr_t) heap_function);
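char* buffer = (char*) mmap(0, size,
PROT_EXEC | PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (buffer == (char*) MAP_FAILED) {
perror("mmap"); // e.g. the system refuses writable+executable mappings
return 1;
}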
memcpy(buffer, (char*)heap_function, size);
// Call our function
heap_function_t fp = (heap_function_t) buffer;
int res = fp((void*) printf, "Hello world, from heap!\n", 1, 2);
printf("a + b = %i\n", res);
}
Save in main.c and run with:
gcc -o main main.c && ./main

In principle it is doable. However... You are copying from "hello", which contains machine instructions that may call, reference or jump to other addresses. Some of these addresses get fixed up when the application loads, so just copying the code and calling into it would crash. Also, some systems, like Windows, have data execution prevention, which stops code stored as data from being executed, as a security measure. Also, how large is "hello"? Trying to copy past the end of it would likely also crash. And you are also dependent on how the compiler implements "hello". Needless to say, this would be very compiler and platform dependent, if it worked.

I can imagine that this might work on a very simple architecture or with a compiler designed to make it easy.
A few of the many requirements for this to work:
- All memory references would need to be absolute ... no pc-relative addresses, except . . .
- Certain control transfers would need to be pc-relative (so your copied function's local branches work), but it would be nice if other ones just happened to be absolute, so your module's external control transfers, like printf(), would work.
There are more requirements. Add to this the weirdness of doing this in what is likely to already be a highly complex dynamically linked environment (did you static link it?) and you simply are not ever going to get this to work.
And as Adam points out, there are security mechanisms in place, at least for the stack, to prevent dynamically constructed code from executing at all. You may need to figure out how to turn these off.
You might also be getting clobbered with the memcpy().
You might learn something by tracing this through step-by-step and watching it shoot itself in the head. If the memcpy hack is the problem, perhaps try something like:
f() {
...
}
g() {
...
}
memcpy(dst, f, (intptr_t)g - (intptr_t)f)

Your program is segfaulting because you're memcpy'ing more than just "hello"; that function is not 10000 bytes long, so as soon as you get past hello itself, you segfault because you're accessing memory that doesn't belong to you.
You probably also need to use mmap() at some point to make sure the memory location you're trying to call is actually executable.
There are many systems that do what you seem to want (e.g., Java's JIT compiler creates native code in the heap and executes it), but your example will be way more complicated than that because there's no easy way to know the size of your function at runtime (and it's even harder at compile time, when the compiler hasn't yet decided what optimizations to apply). You can probably do what objdump does and read the executable at runtime to find the right "size", but I don't think that's what you're actually trying to achieve here.

After malloc you should check that the pointer is not null, before the memcpy:
buffer = (char*) malloc ( size );
memcpy( buffer,(char*)hello,size );
A failed allocation might be your problem, since you are trying to allocate a big area of memory. Can you check that?

memcpy( buffer,(char*)hello,size );
hello is not data meant to be copied into buffer. You are cheating the compiler, and it is taking its revenge at run time. By casting hello to char*, the program makes the compiler believe it points to ordinary data, which is not actually the case. Never try to outsmart the compiler.

Related

Trying to implement enable_execute_stack (Mac OS X)

I have downloaded and compiled Apple's source and added it to Xcode.app/Contents/Developer/usr/bin/include/c++/v1. Now how do I go about implementing it in C? The code I am working with is from this post about Hackaday's shellcode executer. My code is currently like so:
#include <stdio.h>
#include <stdlib.h>
unsigned char shellcode[] = "\x31\xFA......";
int main()
{
int *ret;
ret = (int *)&ret + 2;
(*ret) = (int)shellcode;
printf("2\n");
}
I have compiled with both:
gcc -fno-stack-protector shell.c
clang -fno-stack-protector shell.c
I guess my final question is, how do I tell the compiler to implement "__enable_execute_stack"?
The stack protector is different from an executable stack. That introduces canaries to detect when the stack has been corrupted.
To get an executable stack, you have to link saying to use an executable stack. It goes without saying that this is a bad idea as it makes attacks easier.
The option for the linker is -allow_stack_execute, which turns into the gcc/clang command line:
clang -Wl,-allow_stack_execute -fno-stack-protector shell.c
Your code, however, does not try to execute code on the stack. Instead it changes a small amount of the stack content, trying to accomplish a return to the shellcode, which is one of the most common ROP attacks.
On a typically compiled OSX 32bit environment this would be attempting to overwrite what is called the linkage area (this is the address of the next instruction that will be called upon function return). This assumes that the code was not compiled with -fomit-frame-pointer. If it's compiled with this option, then you're actually moving one extra address up.
On OSX 64bit it uses the 64bit ABI, the registers are 64bit, and all the values would need to be referenced by long rather than by int, however the manner is similar.
The shellcode you've got there, though, is actually in the data segment of your program (because it's a char [], it is readable/writable, not readable-executable). You would need to either mmap it (like nneonneo's answer) or copy it onto the now-executable stack, get its address and call it that way.
However, if you're just trying to get code to run, then nneonneo's answer makes it pretty easy, but if you're trying to experiment with exploit-y code, then you're going to have to do a little more work. Because of the non-executable stack, the new kids use return-to-library mechanisms, trying to get the return to call, say, one of the exec/system calls with data from the stack.
With modern execution protections in place, it's a bit tricky to get shellcode to run like this. Note that your code is not attempting to execute code on the stack; rather, it is storing the address of the shellcode on the stack, and the actual code is in the program's data segment.
You've got a couple options to make it work:
Put the shellcode in an actual executable section, so it is executable code. You can do this with __attribute__((section("name"))) with GCC and Clang. On OS X:
const char code[] __attribute__((section("__TEXT,__text"))) = "...";
followed by a
((void (*)(void))code)();
works great. On Linux, use the section name ".text" instead.
Use mmap to create a read-write section of memory, copy your shellcode, then mprotect it so it has read-execute permissions, then execute it. This is how modern JITs execute dynamically-generated code. An example:
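#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
void execute_code(const void *code, size_t codesize) {
    // Round the size up to a whole number of pages.
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t mapsize = (codesize + page - 1) & ~(page - 1);
    void *chunk = mmap(NULL, mapsize, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
    if(chunk == MAP_FAILED) return;
    memcpy(chunk, code, codesize);
    // Flip the mapping from writable to executable before jumping into it.
    if(mprotect(chunk, mapsize, PROT_READ|PROT_EXEC) == 0)
        ((void (*)(void))chunk)();
    munmap(chunk, mapsize);
}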
Neither of these methods requires you to specify any special compiler flags to work properly, and neither of them require fiddling with the saved EIP on the stack.

What does the stack frame look like in my function?

I am a beginner at assembly, and I am curious to know what the stack frame looks like here, so that I can access the argument through understanding rather than by following a recipe.
P.S.: the assembly function is process
#include <stdio.h>
# define MAX_LEN 120 // Maximal line size
extern int process(char*);
int main(void) {
char buf[MAX_LEN];
int str_len = 0;
printf("Enter a string:");
fgets(buf, MAX_LEN, stdin);
str_len = process(buf);
So, I know that when I want to access the process function's argument, which is in assembly, I have to do the following:
push ebp
mov ebp, esp ; now ebp is pointing to the same address as esp
pushad
mov ebx, dword [ebp+8]
Now I also would like someone to correct me on things I think are correct:
At the start, esp is pointing to the return address of the function, and [esp+8] is the slot in the stack under it, which is the function's argument
Since the function process has one argument and no inner declarations (not sure about the declarations) then the stack frame, from high to low, is 8 bytes for the argument, 8 bytes for the return address.
Thank you.
There's no way to tell other than by means of debugger. You are using ia32 conventions (ebp, esp) instead of x64 (rbp, rsp), but expecting int / addresses to be 64 bit. It's possible, but not likely.
Compile the program (gcc -O -g foo.c), then run with gdb a.out
#include <stdio.h>
int process(char* a) { printf("%p\n", (void*)a); return 0; }
int main()
{
process((char *)0xabcd1234);
}
Break at process; run; disassemble; inspect registers values and dump the stack.
- break process
- run
- disassemble
- info frame
- info args
- info registers
- x/32x $sp - 16 // dump 32 words of the stack, starting 16 bytes below the stack pointer
Then add more parameters, a second subroutine or local variables with known values. Single step to the printf routine. What does the stack look like there?
You can also use gdb as a calculator: what is the difference between sp and rax?
It's print $sp - $rax, if you ever want to know.
Tickle your compiler to produce assembler output (on Unixy systems usually with the -S flag). Play around with debugging/non-debugging flags; the extra hints for the debugger might help in referring back to the source. Don't give optimization flags, as the reorganizing done by the compiler can lead to thorough confusion. Add a simple function call to your code to see how a frame is set up and torn down, too.

Print out value of stack pointer

How can I print out the current value at the stack pointer in C in Linux (Debian and Ubuntu)?
I tried google but found no results.
One trick, which is not portable or really even guaranteed to work, is to simply print out the address of a local as a pointer.
void print_stack_pointer() {
void* p = NULL;
printf("%p", (void*)&p);
}
This will essentially print out the address of p, which is a good approximation of the current stack pointer.
There is no portable way to do that.
In GNU C, this may work for target ISAs that have a register named SP, including x86 where gcc recognizes "SP" as short for ESP or RSP.
// broken with clang, but usually works with GCC
register void *sp asm ("sp");
printf("%p", sp);
This usage of local register variables is now deprecated by GCC:
The only supported use for this feature is to specify registers for input and output operands when calling Extended asm
Defining a register variable does not reserve the register. Other than when invoking the Extended asm, the contents of the specified register are not guaranteed. For this reason, the following uses are explicitly not supported. If they appear to work, it is only happenstance, and may stop working as intended due to (seemingly) unrelated changes in surrounding code, or even minor changes in the optimization of a future version of gcc. ...
It's also broken in practice with clang where sp is treated like any other uninitialized variable.
In addition to duedl0r's answer with specifically GCC you could use __builtin_frame_address(0) which is GCC specific (but not x86 specific).
This should also work on Clang (but there are some bugs about it).
Taking the address of a local (as JaredPar answered) is also a solution.
Notice that AFAIK the C standard does not require any call stack in theory.
Remember Appel's paper: garbage collection can be faster than stack allocation; A very weird C implementation could use such a technique! But AFAIK it has never been used for C.
One could dream of other techniques. And you could have split stacks (at least on recent GCC), in which case the very notion of a stack pointer makes much less sense (because then the stack is not contiguous, and could be made of many segments of a few call frames each).
On Linux you can use the proc pseudo-filesystem to print the stack pointer.
Have a look here, at the /proc/your-pid/stat pseudo-file, at the fields 28, 29.
startstack %lu
The address of the start (i.e., bottom) of the
stack.
kstkesp %lu
The current value of ESP (stack pointer), as found
in the kernel stack page for the process.
You just have to parse these two values!
You can also use an extended assembler instruction, for example:
#include <stdint.h>
uint64_t getsp( void )
{
uint64_t sp;
asm( "mov %%rsp, %0" : "=rm" ( sp ));
return sp;
}
For a 32 bit system, 64 has to be replaced with 32, and rsp with esp.
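You can also find that information in the file /proc/<your-process-id>/maps, on the line where the string [stack] appears (so it is independent of the compiler or machine). Note that this gives the address range of the stack mapping rather than the current stack pointer. A process can read its own file through /proc/self/maps; reading the file of a process owned by another user requires root.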
Try lldb or gdb. For example we can set backtrace format in lldb.
settings set frame-format "frame #${frame.index}: ${ansi.fg.yellow}${frame.pc}: {pc:${frame.pc},fp:${frame.fp},sp:${frame.sp}} ${ansi.normal}{ ${module.file.basename}{\`${function.name-with-args}{${frame.no-debug}${function.pc-offset}}}}{ at ${ansi.fg.cyan}${line.file.basename}${ansi.normal}:${ansi.fg.yellow}${line.number}${ansi.normal}{:${ansi.fg.yellow}${line.column}${ansi.normal}}}{${function.is-optimized} [opt]}{${frame.is-artificial} [artificial]}\n"
So we can print the bp , sp in debug such as
frame #10: 0x208895c4: pc:0x208895c4,fp:0x01f7d458,sp:0x01f7d414 UIKit`-[UIApplication _handleDelegateCallbacksWithOptions:isSuspended:restoreState:] + 376
Look more at https://lldb.llvm.org/use/formatting.html
You can use setjmp. The exact details are implementation dependent, look in the header file.
#include <setjmp.h>
jmp_buf jmp;
setjmp(jmp);
printf("%08x\n", jmp[0].j_esp);
This is also handy when executing unknown code. You can check the sp before and after and do a longjmp to clean up.
If you are using msvc you can use the provided function _AddressOfReturnAddress()
It'll return the address of the return address, which is the value of RSP at the function's entry. Once you return from that function, RSP will have increased by 8, since the return address has been popped off.
Using that information, you can write a simple function that return the current address of the stack pointer like this:
#include <intrin.h>
#include <stdint.h>
#include <stdio.h>
uintptr_t GetStackPointer() {
    return (uintptr_t)_AddressOfReturnAddress() + 0x8;
}
int main(int argc, const char *argv[]) {
    uintptr_t rsp = GetStackPointer();
    printf("Stack pointer: %p\n", (void *)rsp);
}
On an ARM Cortex-M target with the CMSIS headers, you may use the following:
uint32_t msp_value = __get_MSP(); // Read Main Stack pointer
In the same way, if you want to get the PSP value:
uint32_t psp_value = __get_PSP(); // Read Process Stack pointer
If you want to use assembly language, you can also read MSP and PSP directly:
MRS R0, MSP // Read Main Stack pointer to R0
MRS R0, PSP // Read Process Stack pointer to R0

Casting buffer to function and executing in OS X

I'm trying to run some assembly code saved in a buffer on OS X, but I keep getting a segmentation fault. The code looks like this:
int main()
{
unsigned char buff[] = "\x66\x6a\x7f\x66\xb8\x01\x00\x00\x00\x66\x83\xec\x04\xcd\x80";
( void (*)()buff )(); /* same as calling return 127 */
return 0; /* program should never reach here */
}
The code in buff is generated by nasm and it works, it causes the program to return 127. When running through a c program like so though, I get a segmentation fault. Is there a different way to do this in OS X?
First, this will not compile, because you are missing the parentheses necessary to make void (*)() a cast. The line should be ((void (*)())buff)();.
Second, if you compile without optimization, buff is likely constructed on the stack, and execution will fail because Mac OS X marks the stack as not executable.
Third, if you compile with optimization, buff is likely prepared in some data segment, and you may be able to execute it. But the instructions you have are inappropriate for the Mac OS X platform, and you get a normal access exception. You could step through the instructions in the debugger to figure out what is wrong.
The behavior of converting an object pointer to a function pointer and calling the function is not defined by the C standard. You should not rely on it to work.
Among the errors in the assembly code:
It moves one to the %ax register, which is the low two bytes of the %rax register. This leaves the high six bytes uncontrolled. Then it attempts to use %rax as an address. This fails because the value in the %rax register is not pointing at accessible memory.
It attempts to execute the instruction int $0x80. This is some Microsoft Windows, DOS, or Linux service call. On Mac OS X, it is an illegal instruction.
The stack is non-executable by default; you need to mark a page as executable with mprotect(2) before you can run code from it. Making the stack executable is strongly discouraged, so if you want to run code generated at runtime, you should allocate memory on the heap instead.
For example:
#include <sys/mman.h>
#include <unistd.h>
...
// Error checking omitted for expository purposes
// Allocate 1 page of read-write memory
size_t page_size = getpagesize();
void *mem = mmap(NULL, page_size,
PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE,
-1, 0);
// Copy the shell code into the memory
char shellcode[] = "...";
memcpy(mem, shellcode, sizeof(shellcode));
// Change memory to executable and non-writable
mprotect(mem, page_size, PROT_READ | PROT_EXEC);
// Run the code
((void (*)())mem)();
// Free the memory
munmap(mem, page_size);

How to get c code to execute hex machine code?

I want a simple C method to be able to run hex bytecode on a Linux 64 bit machine. Here's the C program that I have:
char code[] = "\x48\x31\xc0";
#include <stdio.h>
int main(int argc, char **argv)
{
int (*func) ();
func = (int (*)()) code;
(int)(*func)();
printf("%s\n","DONE");
}
The code that I am trying to run ("\x48\x31\xc0") I obtained by writing this simple assembly program (it's not supposed to really do anything)
.text
.globl _start
_start:
xorq %rax, %rax
and then compiling and objdump-ing it to obtain the bytecode.
However, when I run my C program I get a segmentation fault. Any ideas?
Machine code has to be in an executable page. Your char code[] is in the read+write data section, without exec permission, so the code cannot be executed from there.
Here is a simple example of allocating an executable page with mmap:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
int (*sum) (int, int) = NULL;
// allocate executable buffer
sum = mmap (0, sizeof(code), PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// copy code to buffer
memcpy (sum, code, sizeof(code));
// doesn't actually flush cache on x86, but ensure memcpy isn't
// optimized away as a dead store.
__builtin___clear_cache ((char*)sum, (char*)sum + sizeof(code)); // GNU C
// run code
int a = 2;
int b = 3;
int c = sum (a, b);
printf ("%d + %d = %d\n", a, b, c);
}
See another answer on this question for details about __builtin___clear_cache.
Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack - that would make all pages executable, including read-only data (.rodata), and read-write data (.data) where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays. i.e. move char code[] = ... into main.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC process for that program. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But probably best to use -z execstack and a local array.
Two problems in the question:
exec permission on the page, because you used an array that will go in the noexec read+write .data section.
your machine code doesn't end with a ret instruction so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is totally redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
You need the page containing the machine code to have execute permission. x86-64 page tables have a separate bit for execute separate from read permission, unlike legacy 386 page tables.
The easiest way to get static arrays to be in read+exec memory was to compile with gcc -z execstack. (Used to make the stack and other sections executable, now only the stack).
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
// TODO: use char code[] = {...} inside main, with -z execstack, for current Linux
// Broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (Works because of const on global/static arrays)
On Linux before 5.5 (or so) gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code
(I actually used gcc -no-pie -O3 -zexecstack shellcode.c, hence the addresses near 401000.)
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can instead use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable. (Including pages holding static data.) You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
On Windows you can use VirtualAlloc or VirtualProtect.
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on @AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code. (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc.)
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC's dead-store elimination keeps only the 2nd store, the one after the call. That leaves no store before the first call, so the call executes the still-zeroed bytes of the fresh mapping (00 00 decodes as add [rax], al) and segfaults.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables.
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack.
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to. (Escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & -4096ULL; // round down to a 4k page boundary (mask is ~4095)
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
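The example above hard-codes a 4k page and assumes the code fits in one page. A minimal sketch that drops both assumptions could query the page size at run time and cover every page the byte range touches. The names page_floor and protect_range are hypothetical helpers, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round addr down to a multiple of pagesize (pagesize must be a power of 2). */
uintptr_t page_floor(uintptr_t addr, uintptr_t pagesize)
{
    return addr & ~(pagesize - 1);
}

/* Hypothetical helper: apply `prot` to every page overlapping
   [addr, addr + len). Returns 0 on success, -1 on failure. */
int protect_range(const void *addr, size_t len, int prot)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;

    uintptr_t pagesize = (uintptr_t)ps;
    uintptr_t start = page_floor((uintptr_t)addr, pagesize);
    /* Round the end of the range up to the next page boundary. */
    uintptr_t end = page_floor((uintptr_t)addr + len + pagesize - 1, pagesize);

    return mprotect((void *)start, end - start, prot);
}
```

With this, the mprotect call in the example could become `protect_range(code, sizeof code, PROT_READ|PROT_WRITE|PROT_EXEC)`, which still works if the array happens to straddle a page boundary.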
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3; (after making the array non-const), I would need __builtin___clear_cache(ret0_code+2, ret0_code+3) after that to make sure the store wasn't optimized away, but if I don't modify the static arrays then it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).
You need to include the assembly in-line via a special compiler directive so that it'll properly end up in a code segment. See this guide, for example: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
Your machine code may be all right, but your CPU objects.
Modern CPUs manage memory in segments. In normal operation, the operating system loads a new program into a program-text segment and sets up a stack in a data segment. The operating system tells the CPU never to run code in a data segment. Your code is in code[], in a data segment. Thus the segfault.
This will take some effort.
Your code variable is stored in the .data section of your executable:
$ readelf -p .data exploit
String dump of section '.data':
[ 10] H1À
H1À is the value of your variable.
The .data section is not executable:
$ readelf -S exploit
There are 30 section headers, starting at offset 0x1150:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[24] .data PROGBITS 0000000000601010 00001010
0000000000000014 0000000000000000 WA 0 0 8
All 64-bit processors I'm familiar with support non-executable pages natively in the pagetables. Most newer 32-bit processors (the ones that support PAE) provide enough extra space in their pagetables for the operating system to emulate hardware non-executable pages. You'll need to run either an ancient OS or an ancient processor to get a .data section marked executable.
Because these are just flags in the executable, you ought to be able to set the X flag through some other mechanism, but I don't know how to do so. And your OS might not even let you have pages that are both writable and executable.
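If you want to check from inside the running process whether a given address sits in an executable mapping, one Linux-specific option is to read /proc/self/maps. The sketch below assumes that file is available and uses a made-up helper name, mapping_is_executable.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, Linux only: find the mapping that contains `addr`
   in /proc/self/maps. Returns 1 if it has the x permission, 0 if it does
   not, and -1 if the file can't be read or no mapping contains addr. */
int mapping_is_executable(const void *addr)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps)
        return -1;

    uintptr_t target = (uintptr_t)addr;
    char line[512];
    int result = -1;

    while (fgets(line, sizeof line, maps)) {
        unsigned long start, end;
        char perms[5];   /* e.g. "r-xp" plus the terminating NUL */

        /* Each line starts with "start-end perms ...", addresses in hex. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        if (target >= start && target < end) {
            result = (perms[2] == 'x');
            break;
        }
    }
    fclose(maps);
    return result;
}
```

Passing the address of a variable like code[] lets you see whether its page is currently executable, and calling it again after an mprotect shows whether the permission change took effect.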
You may need to mark the page executable before you can call into it.
On MS-Windows, see the VirtualProtect function.
URL: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
Sorry, I couldn't follow the examples above, which are complicated.
So, I created an elegant solution for executing hex code from C.
Basically, you can use asm with the .word directive to place your instructions in hex format.
See the example below:
asm volatile(".rept 1024\n"
CNOP
".endr\n");
where CNOP is defined as below:
#define CNOP ".word 0x00010001 \n"
The c.nop instruction was not supported by my assembler, so I defined CNOP as the hex encoding of c.nop, written in the proper syntax, and used it inside asm.
.rept <NUM> ... .endr repeats the enclosed instruction NUM times.
This solution is working and verified.
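For readers on x86 rather than RISC-V, the same idea can be sketched with the .byte directive. This is an adaptation, not the code from this answer: RAW_NOP and run_nop_sled are hypothetical names, 0x90 is the one-byte x86 NOP, and a GNU-style assembler is assumed.

```c
/* Assumption: x86 or x86-64 with a GNU-style assembler.
   0x90 is the encoding of the one-byte NOP instruction. */
#define RAW_NOP ".byte 0x90\n"

/* Hypothetical demo: emit 16 raw NOP bytes in line and fall through them.
   Returns how many NOPs were emitted (0 on other architectures). */
int run_nop_sled(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile(".rept 16\n"
                     RAW_NOP
                     ".endr\n");
    return 16;
#else
    return 0;   /* nothing emitted on non-x86 targets */
#endif
}
```

Because the bytes are assembled into the function itself, they end up in the text section and no mmap or mprotect is needed. The trade-off is that the code is fixed at build time.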
