I need a function which can calculate the length of an x86-64 instruction.
For example, it would be usable like so:
char ret[] = { 0xc3 };
size_t length = instructionLength(ret);
length would be set to 1 in this example.
I do not want to include an entire disassembly library, since the only information I require is the length of the instruction.
I am looking for a minimalist approach, written in C, and ideally as small as possible.
100% complete x86-64 instruction set is not strictly necessary (very obscure ones such as vector register set instructions can be omitted).
A similar answer to what I am looking for (but for the wrong architecture):
Get size of assembly instructions
There is XED library from Intel to work with x86/x86_64 instructions: https://github.com/intelxed/xed, and it is the only correct way to work with intel machine codes.
xed_decode function will provide you all information about instruction: https://intelxed.github.io/ref-manual/group__DEC.html
https://intelxed.github.io/ref-manual/group__DEC.html#ga9a27c2bb97caf98a6024567b261d0652
And xed_ild_decode is for instruction length decoding:
https://intelxed.github.io/ref-manual/group__DEC.html#ga4bef6152f61997a47c4e0fe4327a3254
XED_DLL_EXPORT xed_error_enum_t xed_ild_decode ( xed_decoded_inst_t * xedd,
const xed_uint8_t * itext,
const unsigned int bytes
)
This function just does instruction length decoding.
It does not return a fully decoded instruction.
Parameters
xedd the decoded instruction of type xed_decoded_inst_t . Mode/state sent in via xedd; See the xed_state_t .
itext the pointer to the array of instruction text bytes
bytes the length of the itext input array. 1 to 15 bytes, anything more is ignored.
Returns:
xed_error_enum_t indiciating success (XED_ERROR_NONE) or
failure. Only two failure codes are valid for this function:
XED_ERROR_BUFFER_TOO_SHORT and XED_ERROR_GENERAL_ERROR. In general
this function cannot tell if the instruction is valid or not. For
valid instructions, XED can figure out if enough bytes were provided
to decode the instruction. If not enough were provided, XED returns
XED_ERROR_BUFFER_TOO_SHORT. From this function, the
XED_ERROR_GENERAL_ERROR is an indication that XED could not decode the
instruction's length because the instruction was so invalid that even
its length may across implmentations.
To get length from xedd filled by xed_ild_decode, use xed_decoded_inst_get_length: https://intelxed.github.io/ref-manual/group__DEC.html#gad1051f7b86c94d5670f684a6ea79fcdf
static XED_INLINE xed_uint_t xed_decoded_inst_get_length ( const xed_decoded_inst_t * p )
Return the length of the decoded instruction in bytes.
Example code ("Apache License, Version 2.0", by Intel 2016): https://github.com/intelxed/xed/blob/master/examples/xed-ex-ild.c
#include "xed/xed-interface.h"
#include <stdio.h>
int main()
{
xed_bool_t long_mode = 1;
xed_decoded_inst_t xedd;
xed_state_t dstate;
unsigned char itext[15] = { 0xf2, 0x2e, 0x4f, 0x0F, 0x85, 0x99,
0x00, 0x00, 0x00 };
xed_tables_init(); // one time per process
if (long_mode)
dstate.mmode=XED_MACHINE_MODE_LONG_64;
else
dstate.mmode=XED_MACHINE_MODE_LEGACY_32;
xed_decoded_inst_zero_set_mode(&xedd, &dstate);
xed_ild_decode(&xedd, itext, XED_MAX_INSTRUCTION_BYTES);
printf("length = %u\n",xed_decoded_inst_get_length(&xedd));
return 0;
}
If you're on Windows, you can just use IDebugControl::Disassemble(..., &end_address) from dbgeng.dll. See this question for example usage.
Related
I wrote some code for initializing the IDT, which stores 32-bit addresses in two non-adjacent 16-bit halves. The IDT can be stored anywhere, and you tell the CPU where by running the LIDT instruction.
This is the code for initializing the table:
void idt_init(void) {
/* Unfortunately, we can't write this as loops. The first option,
* initializing the IDT with the addresses, here looping over it, and
* reinitializing the descriptors didn't work because assigning a
* a uintptr_t (from (uintptr_t) handler_func) to a descr (a.k.a.
* uint64_t), according to the compiler, "isn't computable at load
* time."
* The second option, storing the addresses as a local array, simply is
* inefficient (took 0.020ms more when profiling with the "time" command
* line program!).
* The third option, storing the addresses as a static local array,
* consumes too much space (the array will probably never be used again
* during the whole kernel runtime).
* But IF my argument against the third option will be invalidated in
* the future, THEN it's the best option I think. */
/* Initialize descriptors of exception handlers. */
idt[EX_DE_VEC] = idt_trap(ex_de);
idt[EX_DB_VEC] = idt_trap(ex_db);
idt[EX_NMI_VEC] = idt_trap(ex_nmi);
idt[EX_BP_VEC] = idt_trap(ex_bp);
idt[EX_OF_VEC] = idt_trap(ex_of);
idt[EX_BR_VEC] = idt_trap(ex_br);
idt[EX_UD_VEC] = idt_trap(ex_ud);
idt[EX_NM_VEC] = idt_trap(ex_nm);
idt[EX_DF_VEC] = idt_trap(ex_df);
idt[9] = idt_trap(ex_res); /* unused Coprocessor Segment Overrun */
idt[EX_TS_VEC] = idt_trap(ex_ts);
idt[EX_NP_VEC] = idt_trap(ex_np);
idt[EX_SS_VEC] = idt_trap(ex_ss);
idt[EX_GP_VEC] = idt_trap(ex_gp);
idt[EX_PF_VEC] = idt_trap(ex_pf);
idt[15] = idt_trap(ex_res);
idt[EX_MF_VEC] = idt_trap(ex_mf);
idt[EX_AC_VEC] = idt_trap(ex_ac);
idt[EX_MC_VEC] = idt_trap(ex_mc);
idt[EX_XM_VEC] = idt_trap(ex_xm);
idt[EX_VE_VEC] = idt_trap(ex_ve);
/* Initialize descriptors of reserved exceptions.
* Thankfully we compile with -std=c11, so declarations within
* for-loops are possible! */
for (size_t i = 21; i < 32; ++i)
idt[i] = idt_trap(ex_res);
/* Initialize descriptors of hardware interrupt handlers (ISRs). */
idt[INT_8253_VEC] = idt_int(int_8253);
idt[INT_8042_VEC] = idt_int(int_8042);
idt[INT_CASC_VEC] = idt_int(int_casc);
idt[INT_SERIAL2_VEC] = idt_int(int_serial2);
idt[INT_SERIAL1_VEC] = idt_int(int_serial1);
idt[INT_PARALL2_VEC] = idt_int(int_parall2);
idt[INT_FLOPPY_VEC] = idt_int(int_floppy);
idt[INT_PARALL1_VEC] = idt_int(int_parall1);
idt[INT_RTC_VEC] = idt_int(int_rtc);
idt[INT_ACPI_VEC] = idt_int(int_acpi);
idt[INT_OPEN2_VEC] = idt_int(int_open2);
idt[INT_OPEN1_VEC] = idt_int(int_open1);
idt[INT_MOUSE_VEC] = idt_int(int_mouse);
idt[INT_FPU_VEC] = idt_int(int_fpu);
idt[INT_PRIM_ATA_VEC] = idt_int(int_prim_ata);
idt[INT_SEC_ATA_VEC] = idt_int(int_sec_ata);
for (size_t i = 0x30; i < IDT_SIZE; ++i)
idt[i] = idt_trap(ex_res);
}
The macros idt_trap and idt_int, and are defined as follows:
#define idt_entry(off, type, priv) \
((descr) (uintptr_t) (off) & 0xffff) | ((descr) (KERN_CODE & 0xff) << \
0x10) | ((descr) ((type) & 0x0f) << 0x28) | ((descr) ((priv) & \
0x03) << 0x2d) | (descr) 0x800000000000 | \
((descr) ((uintptr_t) (off) & 0xffff0000) << 0x30)
#define idt_int(off) idt_entry(off, 0x0e, 0x00)
#define idt_trap(off) idt_entry(off, 0x0f, 0x00)
idt is an array of uint64_t, so these macros are implicitly cast to that type. uintptr_t is the type guaranteed to be capable of holding pointer values as integers and on 32-bit systems usually 32 bits wide. (A 64-bit IDT has 16-byte entries; this code is for 32-bit).
I get the warning that the initializer element is not constant due to the address modification in play.
It is absolutely sure that the address is known at linking time.
Is there anything I can do to make this work? Making the idt array automatic would work but this would require the whole kernel to run in the context of one function and this would be some bad hassle, I think.
I could make this work by some additional work at runtime (as Linux 0.01 also does) but it just annoys me that something technically feasible at linking time is actually infeasible.
Related: Solution needed for building a static IDT and GDT at assemble/compile/link time - a linker script for ld can shift and mask to break up link-time-constant addresses. No earlier step has the final addresses, and relocation entries are limited in what they can represent in a .o.
The main problem is that function addresses are link-time constants, not strictly compile time constants. The compiler can't just get 32b binary integers and stick that into the data segment in two separate pieces. Instead, it has to use the object file format to indicate to the linker where it should fill in the final value (+ offset) of which symbol when linking is done. The common cases are as an immediate operand to an instruction, a displacement in an effective address, or a value in the data section. (But in all those cases it's still just filling in 32-bit absolute address so all 3 use the same ELF relocation type. There's a different relocation for relative displacements for jump / call offsets.)
It would have been possible for ELF to have been designed to store a symbol reference to be substituted at link time with a complex function of an address (or at least high / low halves like on MIPS for lui $t0, %hi(symbol) / ori $t0, $t0, %lo(symbol) to build address constants from two 16-bit immediates). But in fact the only function allowed is addition/subtraction, for use in things like mov eax, [ext_symbol + 16].
It is of course possible for your OS kernel binary to have a static IDT with fully resolved addresses at build time, so all you need to do at runtime is execute a single lidt instruction. However, the standard
build toolchain is an obstacle. You probably can't achieve this without post-processing your executable.
e.g. you could write it this way, to produce a table with the full padding in the final binary, so the data can be shuffled in-place:
#include <stdint.h>
#define PACKED __attribute__((packed))
// Note, this is the 32-bit format. 64-bit is larger
typedef union idt_entry {
// we will postprocess the linker output to have this format
// (or convert at runtime)
struct PACKED runtime { // from OSdev wiki
uint16_t offset_1; // offset bits 0..15
uint16_t selector; // a code segment selector in GDT or LDT
uint8_t zero; // unused, set to 0
uint8_t type_attr; // type and attributes, see below
uint16_t offset_2; // offset bits 16..31
} rt;
// linker output will be in this format
struct PACKED compiletime {
void *ptr; // offset bits 0..31
uint8_t zero;
uint8_t type_attr;
uint16_t selector; // to be swapped with the high16 of ptr
} ct;
} idt_entry;
// #define idt_ct_entry(off, type, priv) { .ptr = off, .type_attr = type, .selector = priv }
#define idt_ct_trap(off) { .ct = { .ptr = off, .type_attr = 0x0f, .selector = 0x00 } }
// generate an entry in compile-time format
extern void ex_de(); // these are the raw interrupt handlers, written in ASM
extern void ex_db(); // they have to save/restore *all* registers, and end with iret, rather than the usual C ABI.
// it might be easier to use asm macros to create this static data,
// just so it can be in the same file and you don't need cross-file prototypes / declarations
// (but all the same limitations about link-time constants apply)
static idt_entry idt[] = {
idt_ct_trap(ex_de),
idt_ct_trap(ex_db),
// ...
};
// having this static probably takes less space than instructions to write it on the fly
// but not much more. It would be easy to make a lidt function that took a struct pointer.
static const struct PACKED idt_ptr {
uint16_t len; // encoded as bytes - 1, so 0xffff means 65536
void *ptr;
} idt_ptr = { sizeof(idt) - 1, idt };
/****** functions *********/
// inline
void load_static_idt(void) {
asm volatile ("lidt %0"
: // no outputs
: "m" (idt_ptr));
// memory operand, instead of writing the addressing mode ourself, allows a RIP-relative addressing mode in 64bit mode
// also allows it to work with -masm=intel or not.
}
// Do this once at at run-time
// **OR** run this to pre-process the binary, after link time, as part of your build
void idt_convert_to_runtime(void) {
#ifdef DEBUG
static char already_done = 0; // make sure this only runs once
if (already_done)
error;
already_done = 1;
#endif
const int count = sizeof idt / sizeof idt[0];
for (int i=0 ; i<count ; i++) {
uint16_t tmp1 = idt[i].rt.selector;
uint16_t tmp2 = idt[i].rt.offset_2;
idt[i].rt.offset_2 = tmp1;
idt[i].rt.selector = tmp2;
// or do this swap in fewer insns with SSE or MMX pshufw, but using vector instructions before setting up the IDT may be insane.
}
}
This does compile. See a diff of the -m32 and -m64 asm output on the Godbolt compiler explorer. Look at the layout in the data section (note that .value is a synonym for .short, and is 16 bits.) (But note that the IDT table format is different for 64-bit mode.)
I think I have the size calculation correct (bytes - 1), as documented in http://wiki.osdev.org/Interrupt_Descriptor_Table. Minimum value 100h bytes long (encoded as 0x99). See also https://en.wikibooks.org/wiki/X86_Assembly/Global_Descriptor_Table. (lgdt size/pointer works the same way, although the table itself has a different format.)
The other option, instead of having the IDT static in the data section, is to have it in the bss section, with the data stored as immediate constants in a function that will initialize it (or in an array read by that function).
Either way, that function (and its data) can be in a .init section whose memory you re-use after it's done. (Linux does this to reclaim memory from code and data that's only needed once, at startup.) This would give you the optimal tradeoff of small binary size (since 32b addresses are smaller than 64b IDT entries), and no runtime memory wasted on code to set up the IDT. A small loop that runs once at startup is negligible CPU time. (The version on Godbolt fully unrolls because I only have 2 entries, and it embeds the address into each instruction as a 32-bit immediate, even with -Os. With a large enough table (just copy/paste to duplicate a line) you get a compact loop even at -O3. The threshold is lower for -Os.)
Without memory-reuse haxx, probably a tight loop to rewrite 64b entries in place is the way to go. Doing it at build time would be even better, but then you'd need a custom tool to run the tranformation on the kernel binary.
Having the data stored in immediates sounds good in theory, but the code for each entry would probably total more than 64b, because it couldn't loop. The code to split an address into two would have to be fully unrolled (or placed in a function and called). Even if you had a loop to store all the same-for-multiple-entries stuff, each pointer would need a mov r32, imm32 to get the address in a register, then mov word [idt+i + 0], ax / shr eax, 16 / mov word [idt+i + 6], ax. That's a lot of machine-code bytes.
One way would be to use an intermediate jump table that is located at a fixed address. You could initialize the idt with addresses of the locations in this table (which will be compile time constant). The locations in the jump table will contain jump instructions to the actual isr routines.
The dispatch to an isr will be indirect as follows:
trap -> jump to intermediate address in the idt -> jump to isr
One way to create a jump table at a fixed address is as follows.
Step 1: Put jump table in a section
// this is a jump table at a fixed address
void jump(void) __attribute__((section(".si.idt")));
void jump(void) {
asm("jmp isr0"); // can also be asm("call ...") depending on need
asm("jmp isr1");
asm("jmp isr2");
}
Step 2: Instruct the linker to locate the section at a fixed address
SECTIONS
{
.so.idt 0x600000 :
{
*(.si.idt)
}
}
Put this in the linker script right after the .text section. This will make sure that the executable code in the table will go into a executable memory region.
You can instruct the linker to use your script as follows using the --script option in the Makefile.
LDFLAGS += -Wl,--script=my_script.lds
The following macro gives you the address of the location which contains the jump (or call) instruction to the corresponding isr.
// initialize the idt at compile time with const values
// you can find a cleaner way to generate offsets
#define JUMP_ADDR(off) ((char*)0x600000 + 4 + (off * 5))
You will then initialize the idt as follows using modified macros.
// your real idt will be initialized as follows
#define idt_entry(addr, type, priv) \
( \
((descr) (uintptr_t) (addr) & 0xffff) | \
((descr) (KERN_CODE & 0xff) << 0x10) | \
((descr) ((type) & 0x0f) << 0x28) | \
((descr) ((priv) & 0x03) << 0x2d) | \
((descr) 0x1 << 0x2F) | \
((descr) ((uintptr_t) (addr) & 0xffff0000) << 0x30) \
)
#define idt_int(off) idt_entry(JUMP_ADDR(off), 0x0e, 0x00)
#define idt_trap(off) idt_entry(JUMP_ADDR(off), 0x0f, 0x00)
descr idt[] =
{
...
idt_trap(ex_de),
...
idt_int(int_casc),
...
};
A demo working example is below, which shows dispatch to a isr with a non-fixed address from a instruction at a fixed address.
#include <stdio.h>
// dummy isrs for demo
void isr0(void) {
printf("==== isr0\n");
}
void isr1(void) {
printf("==== isr1\n");
}
void isr2(void) {
printf("==== isr2\n");
}
// this is a jump table at a fixed address
void jump(void) __attribute__((section(".si.idt")));
void jump(void) {
asm("jmp isr0"); // can be asm("call ...")
asm("jmp isr1");
asm("jmp isr2");
}
// initialize the idt at compile time with const values
// you can find a cleaner way to generate offsets
#define JUMP_ADDR(off) ((char*)0x600000 + 4 + (off * 5))
// dummy idt for demo
// see below for the real idt
char* idt[] =
{
JUMP_ADDR(0),
JUMP_ADDR(1),
JUMP_ADDR(2),
};
int main(int argc, char* argv[]) {
int trap;
char* addr = idt[trap = argc - 1];
printf("==== idt[%d]=%p\n", trap, addr);
asm("jmp *%0\n" : :"m"(addr));
}
I am running a bare metal embedded system with an ARM Cortex-M3 (STM32F205). When I try to use snprintf() with float numbers, e.g.:
float f;
f = 1.23;
snprintf(s, 20, "%5.2f", f);
I get garbage into s. The format seems to be honored, i.e. the garbage is a well-formed string with digits, decimal point, and two trailing digits. However, if I repeat the snprintf, the string may change between two calls.
Floating point mathematics seems to work otherwise, and snprintf works with integers, e.g.:
snprintf(s, 20, "%10d", 1234567);
I use the newlib-nano implementation with the -u _printf_float linker switch. The compiler is arm-none-eabi-gcc.
I do have a strong suspicion of memory allocation problems, as integers are printed without any hiccups, but floats act as if they got corrupted in the process. The printf family functions call malloc with floats, not with integers.
The only piece of code not belonging to newlib I am using in this context is my _sbrk(), which is required by malloc.
caddr_t _sbrk(int incr)
{
extern char _Heap_Begin; // Defined by the linker.
extern char _Heap_Limit; // Defined by the linker.
static char* current_heap_end;
char* current_block_address;
// first allocation
if (current_heap_end == 0)
current_heap_end = &_Heap_Begin;
current_block_address = current_heap_end;
// increment and align to 4-octet border
incr = (incr + 3) & (~3);
current_heap_end += incr;
// Overflow?
if (current_heap_end > &_Heap_Limit)
{
errno = ENOMEM;
current_heap_end = current_block_address;
return (caddr_t) - 1;
}
return (caddr_t)current_block_address;
}
As far as I have been able to track, this should work. It seems that no-one ever calls it with negative increments, but I guess that is due to the design of the newlib malloc. The only slightly odd thing is that the first call to _sbrk has a zero increment. (But this may be just malloc's curiosity about the starting address of the heap.)
The stack should not collide with the heap, as there is around 60 KiB RAM for the two. The linker script may be insane, but at least the heap and stack addresses seem to be correct.
As it may happen that someone else gets bitten by the same bug, I post an answer to my own question. However, it was #Notlikethat 's comment which suggested the correct answer.
This is a lesson of Thou shall not steal. I borrowed the gcc linker script which came with the STMCubeMX code generator. Unfortunately, the script along with the startup file is broken.
The relevant part of the original linker script:
_estack = 0x2000ffff;
and its counterparts in the startup script:
Reset_Handler:
ldr sp, =_estack /* set stack pointer */
...
g_pfnVectors:
.word _estack
.word Reset_Handler
...
The first interrupt vector position (at 0) should always point to the startup stack top. When the reset interrupt is reached, it also loads the stack pointer. (As far as I can say, the latter one is unnecessary as the HW anyway reloads the SP from the 0th vector before calling the reset handler.)
The Cortex-M stack pointer should always point to the last item in the stack. At startup there are no items in the stack and thus the pointer should point to the first address above the actual memory, 0x020010000 in this case. With the original linker script the stack pointer is set to 0x0200ffff, which actually results in sp = 0x0200fffc (the hardware forces word-aligned stack). After this the stack is misaligned by 4.
I changed the linker script by removing the constant definition of _estack and replacing it by _stacktop as shown below. The memory definitions were there before. I changed the name just to see where the value is used.
MEMORY
{
FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 128K
RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 64K
}
_stacktop = ORIGIN(RAM) + LENGTH(RAM);
After this the value of _stacktop is 0x20010000, and my numbers float beautifully... The same problem could arise with any external (library) function using double length parameters, as the ARM Cortex ABI states that the stack must be aligned to 8 octets when calling external functions.
snprintf accepts size as second argument. You might want to go through this example http://www.cplusplus.com/reference/cstdio/snprintf/
/* snprintf example */
#include <stdio.h>
int main ()
{
char buffer [100];
int cx;
cx = snprintf ( buffer, 100, "The half of %d is %d", 60, 60/2 );
snprintf ( buffer+cx, 100-cx, ", and the half of that is %d.", 60/2/2 );
puts (buffer);
return 0;
}
I'm trying to hot patch an exe in memory, the source is available but I'm doing this for learning purposes. (so please no comments suggesting i modify the original source or use detours or any other libs)
Below are the functions I am having problems with.
vm_t* VM_Create( const char *module, intptr_t (*systemCalls)(intptr_t *), vmInterpret_t interpret )
{
MessageBox(NULL, L"Oh snap! We hooked VM_Create!", L"Success!", MB_OK);
return NULL;
}
void Hook_VM_Create(void)
{
DWORD dwBackup;
VirtualProtect((void*)0x00477C3E, 7, PAGE_EXECUTE_READWRITE, &dwBackup);
//Patch the original VM_Create to jump to our detoured one.
BYTE *jmp = (BYTE*)malloc(5);
uint32_t offset = 0x00477C3E - (uint32_t)&VM_Create; //find the offset of the original function from our own
memset((void*)jmp, 0xE9, 1);
memcpy((void*)(jmp+1), &offset, sizeof(offset));
memcpy((void*)0x00477C3E, jmp, 5);
free(jmp);
}
I have a function VM_Create that I want to be called instead of the original function. I have not yet written a trampoline so it crashes (as expected). However the message box does not popup that I have detoured the original VM create to my own. I believe it is the way I'm overwriting the original instructions.
I can see a few issues.
I assume that 0x00477C3E is the address of the original VM_Create function. You really should not hard code this. Use &VM_Create instead. Of course this will mean that you need to use a different name for your replacement function.
The offset is calculated incorrectly. You have the sign wrong. What's more the offset is applied to the instruction pointer at the end of the instruction and not the beginning. So you need to shift it by 5 (the size of the instruction). The offset should be a signed integer also.
Ideally, if you take into account my first point the code would look like this:
int32_t offset = (int32_t)&New_VM_Create - ((int32_t)&VM_Create+5);
Thanks to Hans Passant for fixing my own silly sign error in the original version!
If you are working on a 64 bit machine you need to do your arithmetic in 64 bits and, once you have calculated the offset, truncate it to a 32 bit offset.
Another nuance is that you should reset the memory to being read-only after having written the new JMP instruction, and call FlushInstructionCache.
I'm writing a program where a constant is needed but the value for the constant will be determined during run time. I have an array of op codes from which I want to randomly select one and _emit it into the program's code. Here is an example:
unsigned char opcodes[] = {
0x60, // pushad
0x61, // popad
0x90 // nop
}
int random_byte = rand() % sizeof(opcodes);
__asm _emit opcodes[random_byte]; // optimal goal, but invalid
However, it seems _emit can only take a constant value. E.g, this is valid:
switch(random_byte) {
case 2:
__asm _emit 0x90
break;
}
But this becomes unwieldy if the opcodes array grows to any considerable length, and also essentially eliminates the worth of the array since it would have to be expressed in a less attractive manner.
Is there any way to neatly code this to facilitate the growth of the opcodes array? I've tried other approaches like:
#define OP_0 0x60
#define OP_1 0x61
#define OP_2 0x90
#define DO_EMIT(n) __asm _emit OP_##n
// ...
unsigned char abyte = opcodes[random_byte];
DO_EMIT(abyte)
In this case, the translation comes out as OP_abyte, so it would need a call like DO_EMIT(2), which forces me back to the switch statement and enumerating every element in the array.
It is also quite possible that I have an entirely invalid approach here. Helpful feedback is appreciated.
I'm not sure what compiler/assembler you are using, but you could do what you're after in GCC using a label. At the asm site, you'd write it as:
asm (
"target_opcode: \n"
".byte 0x90\n" ); /* Placeholder byte */
...and at the place where you want to modify that code, you'd use:
extern volatile unsigned char target_opcode[];
int random_byte = rand() % sizeof(opcodes);
target_opcode[0] = random_byte;
Perhaps you can translate this into your compiler's dialect of asm.
Note that all the usual caveats about self-modifying code apply: the code segment might not be writeable, and you may have to flush the I-cache before executing the modified code.
You won't be able to do any randomness in the C preprocessor AFAIK. The closest you could get is generating the random value outside. For instance:
cpp -DRND_VAL=$RANDOM ...
(possibly with a modulus to maintain the value within a range), at least in UNIX-based systems. Then, you can use the definition value, that will be essentially random.
How about
char operation[4]; // is it really only 1 byte all the time?
operation[0] = random_whatever();
operation[1] = 0xC3; // RET
void (*func)() = &operation[0];
func();
Note that in this example you'd need to add a RET instruction to the buffer, so that in the end you end up at the right instruction after calling func().
Using an _emit at runtime into your program code is kind of like compiling the program you're running while the program is running.
You should describe your end-goal rather than just your idea of using _emit at runtime- there might be abetter way to accomplish what you want. Maybe you can write your opcodes to a regular data array and somehow make that bit of memory executable. That might be a little tricky due to security considerations, but it can be done.
I'm attempting to write a simple buffer overflow using C on Mac OS X 10.6 64-bit. Here's the concept:
void function() {
char buffer[64];
buffer[offset] += 7; // i'm not sure how large offset needs to be, or if
// 7 is correct.
}
int main() {
int x = 0;
function();
x += 1;
printf("%d\n", x); // the idea is to modify the return address so that
// the x += 1 expression is not executed and 0 gets
// printed
return 0;
}
Here's part of main's assembler dump:
...
0x0000000100000ebe <main+30>: callq 0x100000e30 <function>
0x0000000100000ec3 <main+35>: movl $0x1,-0x8(%rbp)
0x0000000100000eca <main+42>: mov -0x8(%rbp),%esi
0x0000000100000ecd <main+45>: xor %al,%al
0x0000000100000ecf <main+47>: lea 0x56(%rip),%rdi # 0x100000f2c
0x0000000100000ed6 <main+54>: callq 0x100000ef4 <dyld_stub_printf>
...
I want to jump over the movl instruction, which would mean I'd need to increment the return address by 42 - 35 = 7 (correct?). Now I need to know where the return address is stored so I can calculate the correct offset.
I have tried searching for the correct value manually, but either 1 gets printed or I get abort trap – is there maybe some kind of buffer overflow protection going on?
Using an offset of 88 works on my machine. I used Nemo's approach of finding out the return address.
This 32-bit example illustrates how you can figure it out, see below for 64-bit:
#include <stdio.h>
void function() {
char buffer[64];
char *p;
asm("lea 4(%%ebp),%0" : "=r" (p)); // loads address of return address
printf("%d\n", p - buffer); // computes offset
buffer[p - buffer] += 9; // 9 from disassembling main
}
int main() {
volatile int x = 7;
function();
x++;
printf("x = %d\n", x); // prints 7, not 8
}
On my system the offset is 76. That's the 64 bytes of the buffer (remember, the stack grows down, so the start of the buffer is far from the return address) plus whatever other detritus is in between.
Obviously if you are attacking an existing program you can't expect it to compute the answer for you, but I think this illustrates the principle.
(Also, we are lucky that +9 does not carry out into another byte. Otherwise the single byte increment would not set the return address how we expected. This example may break if you get unlucky with the return address within main)
I overlooked the 64-bitness of the original question somehow. The equivalent for x86-64 is 8(%rbp) because pointers are 8 bytes long. In that case my test build happens to produce an offset of 104. In the code above substitute 8(%%rbp) using the double %% to get a single % in the output assembly. This is described in this ABI document. Search for 8(%rbp).
There is a complaint in the comments that 4(%ebp) is just as magic as 76 or any other arbitrary number. In fact the meaning of the register %ebp (also called the "frame pointer") and its relationship to the location of the return address on the stack is standardized. One illustration I quickly Googled is here. That article uses the terminology "base pointer". If you wanted to exploit buffer overflows on other architectures it would require similarly detailed knowledge of the calling conventions of that CPU.
Roddy is right that you need to operate on pointer-sized values.
I would start by reading values in your exploit function (and printing them) rather than writing them. As you crawl past the end of your array, you should start to see values from the stack. Before long you should find the return address and be able to line it up with your disassembler dump.
Disassemble function() and see what it looks like.
Offset needs to be negative positive, maybe 64+8, as it's a 64-bit address. Also, you should do the '+7' on a pointer-sized object, not on a char. Otherwise if the two addresses cross a 256-byte boundary you will have exploited your exploit....
You might try running your code in a debugger, stepping each assembly line at a time, and examining the stack's memory space as well as registers.
I always like to operate on nice data types, like this one:
struct stackframe {
char *sf_bp;
char *sf_return_address;
};
void function() {
/* the following code is dirty. */
char *dummy;
dummy = (char *)&dummy;
struct stackframe *stackframe = dummy + 24; /* try multiples of 4 here. */
/* here starts the beautiful code. */
stackframe->sf_return_address += 7;
}
Using this code, you can easily check with the debugger whether the value in stackframe->sf_return_address matches your expectations.